Supposedly it was a problem with the management network used by their optical gear, but it looks a lot like a layer-2 network spanning 15 data centers and no control-plane policing on the managed devices… proving yet again that large-scale layer-2 networks are a really bad idea.
Please note that it doesn’t matter whether they had problems with a stretched Ethernet segment or something else. According to their explanation a single device broadcasting packets was able to affect devices across multiple locations – as I’m trying to explain for years (not that many people would listen and/or care), a single broadcast domain is a single failure domain no matter what $vendor PowerPoints or whitepapers claim, and it’s not a question of whether the concoction will fail but when. Keep that in mind the next time your $vendor rep brings dancing unicorns into the room.
On a tangential note, cloud providers that know what they’re doing don’t support anything else but unicast routing for a really good reason – check out the details in AWS Networking webinar.
Finally, just in case you think failures like this one are a black swan event, check the list of post-mortems and associated lessons learned collected by Dan Luu… keeping in mind that most of the failures are never reported.
Thanks to Ivan Pepelnjak (see source)