Monday, 17 February 2020

Current status - network problems / 2

Despite all efforts, the known network disruptions are still occurring, although somewhat less frequently. We have already put enormous effort into solving the problem. In addition to low-level hardware debugging, we tried to rule out hardware faults by replacing every component (including the fiber optic cables). Even a complete reset of the devices followed by reconfiguration from scratch did not help. We are currently investigating BGP sessions as the cause. Since the problem first occurred, we have been working on eliminating it with all the resources available to us.

Update 2020-02-17 @ 16:07 After checking the BGP sessions, we have deactivated our free BGP router for the time being. We currently suspect that a customer's session is flapping and pushing too many updates onto the core router, which in turn causes the routing equipment at Interxion to struggle with the load.

Update 2020-02-17 @ 21:51 We have continued our investigation throughout the day and identified a Layer 3 issue that, under certain conditions, causes traffic destined for a specific host to be handled by the Routing Engine. This causes high load, which in turn makes BGP sessions flap between states; this is noticeable as short downtime or packet loss. We have implemented further measures to mitigate the impact for now and will keep monitoring the router. Tomorrow morning we will implement a hardware-based measure by installing a separate switch to separate the traffic.
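For readers who want to check their own BGP sessions for the kind of flapping described above, the idea can be sketched in a few lines of Python. This is an illustration only, not the tooling used during this incident. It assumes a hypothetical, already-parsed list of (timestamp, peer, state) events; the function name, the five-minute window and the threshold of three drops are arbitrary choices for the example.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_flapping_peers(events, window=timedelta(minutes=5), threshold=3):
    """Return the peers whose session left Established at least
    `threshold` times within any `window`-long period.

    `events` is an iterable of (timestamp, peer, new_state) tuples,
    ordered by timestamp (a hypothetical format, not a router log).
    """
    recent_drops = defaultdict(deque)  # peer -> timestamps of recent drops
    last_state = {}                    # peer -> last state seen
    flapping = set()

    for ts, peer, state in events:
        # A "drop" is a transition from Established to any other state.
        if last_state.get(peer) == "Established" and state != "Established":
            drops = recent_drops[peer]
            drops.append(ts)
            # Forget drops that are older than the sliding window.
            while drops and ts - drops[0] > window:
                drops.popleft()
            if len(drops) >= threshold:
                flapping.add(peer)
        last_state[peer] = state

    return flapping


if __name__ == "__main__":
    start = datetime(2020, 2, 17, 16, 0)
    sample = [(start, "192.0.2.2", "Established")]  # stable peer
    for i in range(3):
        # This peer comes up and drops again once per minute.
        sample.append((start + timedelta(seconds=60 * i), "192.0.2.1", "Established"))
        sample.append((start + timedelta(seconds=60 * i + 30), "192.0.2.1", "Idle"))
    sample.sort()
    print(find_flapping_peers(sample))
```

A sliding window is used rather than a simple total so that a peer which dropped a few times over several days is not reported alongside one that drops several times within minutes.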

Update 2020-02-18 @ 11:30 The separate switch has been installed.

Update 2020-02-19 @ 12:01 The network has been stable for the past 24 hours. We found a bug in Juniper JunOS, or at least in the chipset of the QFX devices we use, which led to the repeated issues. We have made extensive changes to mitigate the bug.