Friday, 7 February 2020

Short interruption with packet loss


At the specified time there was a brief network dropout with packet loss. The cause is a router that we will restart in the next few days. Once it has been restarted, no further interruptions are expected.

Update 17:40 UTC+1: The issue occurred a second time. We are rebooting the affected FPC right now.

Update 17:58 UTC+1: FPC 0 is back online. We are watching the situation and hope that the restart has resolved the issue. Most customers saw only a small amount of packet loss; a small portion without redundant connectivity was offline for around 10-15 minutes.

Update 18:57 UTC+1: We have identified an issue with the sFlow daemon on FPC 1. The service has been restarted. DDoS detection may have reacted more slowly before the restart.

Update 07.02.2020 - 22:58 UTC+1: The same issue occurred again. We are looking into it.

Update 07.02.2020 - 23:07 UTC+1: We have implemented additional measures to resolve the issue and will continue monitoring the operation closely.

Update 07.02.2020 - 23:25 UTC+1: According to our latest analysis, today's issue was unrelated to yesterday's. Juniper JunOS provides a so-called "ddos-protection" feature, whose main purpose is to rate-limit certain kinds of traffic towards the control plane. In this case the so-called "ddos-protection", which is not really what anyone should call DDoS protection but rather some senseless rate-limiting, overreacted and generated a high CPU load while randomly dropping packets between the virtual chassis members. We have disabled the mechanism, as it is absolutely useless and provides no advantage over the loopback firewall filters we already have in place. This is rather frustrating, as it caused repeated packet loss of about 20-30 seconds over the last two days. We hope to have the gear under control now. If the issue persists, we will upgrade the firmware image.
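
For readers unfamiliar with control-plane policing, the rate-limiting idea mentioned in the update above can be illustrated with a token bucket. The following is a minimal Python sketch and not how Junos implements the feature; the class name, the helper function, and the rate and burst values are assumptions chosen purely for illustration.

```python
class TokenBucket:
    """Hypothetical policer: admits packets while tokens remain, drops the rest."""

    def __init__(self, rate_pps: float, burst: int) -> None:
        self.rate_pps = rate_pps      # tokens refilled per second (assumed value)
        self.burst = burst            # maximum number of stored tokens
        self.tokens = float(burst)    # start with a full bucket
        self.last = 0.0               # timestamp of the previous packet

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0        # spend one token for this packet
            return True
        return False                  # bucket empty: the packet is dropped


def simulate(rate_pps: float, burst: int, arrivals: list) -> tuple:
    """Return (passed, dropped) for packets arriving at the given timestamps."""
    bucket = TokenBucket(rate_pps, burst)
    passed = sum(1 for t in arrivals if bucket.allow(t))
    return passed, len(arrivals) - passed


# A flow of 1000 packets within one second against a 100 pps policer.
fast_flow = [i / 1000 for i in range(1000)]
fast_passed, fast_dropped = simulate(100, 10, fast_flow)

# A flow of 10 packets spaced 100 ms apart against the same policer.
slow_flow = [i * 0.1 for i in range(10)]
slow_passed, slow_dropped = simulate(100, 10, slow_flow)
```

In this sketch the policer drops most of the fast flow and none of the slow one. A policer that is set too tightly for legitimate control traffic, or that consumes significant CPU while policing, can cause the kind of collateral packet loss described above.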

Update 08.02.2020 - 16:20 UTC+1: The issue occurred again. This time both devices are showing FPGA-related issues, so we are now rebooting both of them.

Update 08.02.2020 - 16:44 UTC+1: Reboot has been carried out. Both devices appear to be stable for now. However, the planned firmware upgrade tomorrow morning will still be carried out.

Update 08.02.2020 - 21:19 UTC+1: Since the previous measures were unsuccessful, we will act immediately and check the situation on site. To that end, we will bring forward the announced firmware upgrades and, if necessary, replace the devices completely. Because the problem has recurred so often in the past few days and none of the measures taken so far has worked, we feel compelled to take this step in the interest of our customers. The maintenance work will be carried out immediately after we arrive in Frankfurt (around 09.02.2020, 00:00-02:00 UTC+1).

Update 09.02.2020 - 01:30 UTC+1: Maintenance work has been under way since 01:20. The latest firmware has been installed and a restart carried out.

Update 09.02.2020 - 02:00 UTC+1: All routing instances are properly booted again.

Update 09.02.2020 - 15:00 UTC+1: The router was affected again. We have now shut down the virtual chassis member in question and will move the uplinks to the remaining device.

Update 09.02.2020 - 17:53 UTC+1: As the problem still persists, we are now migrating all links to a replacement device. This will reduce redundancy for now, but should resolve the issue. Additional replacement gear will be ordered tomorrow by express delivery to restore redundancy.

Update 09.02.2020 - 20:30 UTC+1: Most uplinks have been migrated, and we are working on the remaining ones. External connectivity is fully restored.

Update 09.02.2020 - 21:08 UTC+1: Maintenance work is finished, all equipment appears to be stable.