Monday, 10 February 2020

Network Disturbance - Defective QSFP+ transceiver

Source: https://status.combahton.net/incident/182

We have identified a defective QSFP+ transceiver as the cause of the issues of the past few days. The transceiver sporadically causes a control process within Juniper JunOS to hang, which in turn triggers a kernel panic on our core router at Interxion Frankfurt.

Fortunately, we were able to identify the issue. The previous equipment did not log any errors related to the module, even though it had the same syslog configuration.

As a temporary solution, we have disabled the affected port and the other members of the same LACP channel. Staff are currently on their way to the datacenter to remove the broken transceiver. Network availability is ensured by the remaining uplinks.

Update 19:26: Staff have arrived on site and are removing the transceiver in question.

Update 19:42: We have identified the defective optics.

Update 20:18: The optics have been swapped. We will stay on site for some time to be able to react quickly.

Update 11 February 2020, 23:20: We were able to reproduce the issue of the past few days on a test device. Putting ~35 Gbit/s of traffic on those optics causes the device to kernel panic after around 3 hours. The problem can now be clearly traced to the optics. The optics even appear to have died during the load test; the device no longer recognized them.
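
For a sense of scale, the load test figures can be turned into an approximate data volume. This is only a rough sketch: it assumes the "~35 Gbit" figure is a sustained rate of about 35 Gbit/s held for the full 3 hours, and the function name is made up for illustration.

```python
# Assumption: "~35 Gbit" means a sustained rate of roughly 35 Gbit/s,
# held constant for about 3 hours until the kernel panic occurred.
RATE_GBIT_PER_S = 35
DURATION_HOURS = 3


def transferred_terabytes(rate_gbit_per_s: float, hours: float) -> float:
    """Return the data volume in decimal terabytes (1 TB = 10**12 bytes)."""
    bits = rate_gbit_per_s * 1e9 * hours * 3600  # rate in bit/s times seconds
    return bits / 8 / 1e12  # bits -> bytes -> terabytes


print(f"{transferred_terabytes(RATE_GBIT_PER_S, DURATION_HOURS):.2f} TB")
```

Under these assumptions, roughly 47 TB passed through the optics before the test device failed, which may explain why the fault only appeared sporadically under production load.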