sw1-phl1 (Netrality MMR) - I2C Bus Errors
Incident Report for PhillyIX
Resolved
At approximately 1:30pm on Friday, September 13, 2019, our monitoring system alerted us to an issue with sw1-phl1.phillyix.net, the main peering switch in our Netrality MMR colo space. Traffic continued to forward through the switch, but several essential system daemons had crashed or were stuck. We immediately opened a ticket with the switch vendor, and upon investigation, it appears that a bad optic may have locked up the I2C bus on the switch. The I2C bus is the communications channel between the kernel and system components such as the transceivers (including the EEPROM data stored on them), the power supplies, and the fans.
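
To illustrate why one bad optic can affect unrelated components such as fans and power supplies, here is a toy model of a shared bus. It is a sketch only: the class, device names, and payloads are hypothetical, and it ignores details of real hardware (for example, switches often place multiplexers between bus segments). It is not a description of how this switch's software works.

```python
class SharedBus:
    """Toy model of a shared I2C-style bus (hypothetical, for illustration)."""

    def __init__(self):
        # Maps device name -> (payload, stuck flag).
        self.devices = {}

    def attach(self, name, payload, stuck=False):
        self.devices[name] = (payload, stuck)

    def detach(self, name):
        self.devices.pop(name, None)

    def read(self, name):
        # The data line is shared, so a single stuck device blocks
        # every transaction, not just reads addressed to itself.
        if any(stuck for _, stuck in self.devices.values()):
            raise TimeoutError("bus is held by a stuck device")
        return self.devices[name][0]


bus = SharedBus()
bus.attach("psu1", "ok")
bus.attach("fan1", "ok")
bus.attach("optic-port7", "eeprom-data", stuck=True)  # hypothetical bad optic

try:
    bus.read("psu1")
    psu_readable_while_stuck = True
except TimeoutError:
    psu_readable_while_stuck = False

bus.detach("optic-port7")  # pulling the bad optic frees the bus
psu_after_removal = bus.read("psu1")
```

In this model, the power supply cannot be read while the stuck optic is attached, even though the power supply itself is healthy. That matches the symptom described above, where daemons that depend on the bus crashed or hung while traffic kept forwarding.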

Because the switch is in this half-broken state, we will need to perform a full power cycle and reinsert each optic one at a time until we find the one that is causing the error. We have scheduled this emergency maintenance for tonight, Saturday, September 14, 2019, at 8:30pm (local time). We expect the maintenance to take one hour, and we recommend that members either shut down all their BGP sessions or administratively shut down their port to avoid routing issues caused by flapping connectivity during the maintenance. A follow-up message will be posted when the maintenance is complete. We apologize for any inconvenience.
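
The one-at-a-time reinsertion described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: `find_faulty_optic`, the port names, and the three callbacks are hypothetical stand-ins for manual steps and vendor diagnostics, not real tooling.

```python
from typing import Callable, Iterable, Optional


def find_faulty_optic(
    ports: Iterable[str],
    insert_optic: Callable[[str], None],
    remove_optic: Callable[[str], None],
    bus_healthy: Callable[[], bool],
) -> Optional[str]:
    """Insert optics one at a time; return the first port that breaks the bus."""
    for port in ports:
        insert_optic(port)
        if not bus_healthy():
            # Pull the suspect optic again so the bus can recover.
            remove_optic(port)
            return port
    return None  # every optic was inserted without an error


# Simulated run: the optic in "port3" is the (hypothetical) bad one.
inserted = set()
BAD_PORT = "port3"

suspect = find_faulty_optic(
    ["port1", "port2", "port3", "port4"],
    insert_optic=inserted.add,
    remove_optic=inserted.discard,
    bus_healthy=lambda: BAD_PORT not in inserted,
)
print(suspect)
```

The loop stops at the first optic that triggers the error, so in this simulation `port4` is never inserted. In practice the remaining optics would still need to be reinserted, and checked the same way, once the suspect has been set aside.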

UPDATE Sep 14 @ 9:24pm:
----------------------
The switch was rebooted, and a faulty optic was identified and replaced. We will continue to pursue a root cause analysis (RCA) with our switch vendor, but we now consider this incident closed. Thank you for your patience.
Posted Sep 14, 2019 - 13:30 EDT