Monday, January 7, 2019

Large Layer-2 Domains Strike Again…

I started January 2018 blogging with a major service provider failure. Why should 2019 be any different? Here’s what CenturyLink claimed was causing their two-day outage (more comments here).

Supposedly it was a problem with the management network used by their optical gear, but it looks a lot like a layer-2 network spanning 15 data centers with no control-plane policing on the managed devices… proving yet again that large-scale layer-2 networks are a really bad idea.

Please note that it doesn’t matter whether they had problems with a stretched Ethernet segment or something else. According to their explanation, a single device broadcasting packets was able to affect devices across multiple locations. As I’ve been trying to explain for years (not that many people would listen and/or care), a single broadcast domain is a single failure domain no matter what $vendor PowerPoints or whitepapers claim, and it’s not a question of whether the concoction will fail but when. Keep that in mind the next time your $vendor rep brings dancing unicorns into the room.
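
To put rough numbers on the "single broadcast domain is a single failure domain" argument, here is a toy Python sketch. It is not a model of the actual network involved: the 15 sites come from the outage description, while the four devices per site, the device names and the `blast_radius` helper are all made up for illustration. It simply counts how many other devices receive a broadcast from one misbehaving device when the sites share one stretched layer-2 domain versus when every site is its own routed segment.

```python
SITES = 15      # number of locations mentioned in the outage description
PER_SITE = 4    # hypothetical number of managed devices per site


def build(stretched):
    """Map each device name to a broadcast-domain ID.

    stretched=True  -> every device sits in one domain (ID 0)
    stretched=False -> each site is its own domain (routed between sites)
    """
    devices = {}
    for site in range(SITES):
        for n in range(PER_SITE):
            devices[f"s{site}-d{n}"] = 0 if stretched else site
    return devices


def blast_radius(devices, source):
    """Return the other devices that receive a broadcast sent by `source`."""
    domain = devices[source]
    return sorted(
        name for name, d in devices.items() if d == domain and name != source
    )


stretched = blast_radius(build(stretched=True), "s0-d0")
routed = blast_radius(build(stretched=False), "s0-d0")

print(len(stretched))  # 59: every other device in all 15 sites
print(len(routed))     # 3: only the neighbors in the same site
```

With these made-up numbers the misbehaving device reaches 59 others in the stretched design and 3 in the routed one. The exact figures are irrelevant; the point is that the blast radius grows with the size of the broadcast domain, not with the size of the site.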

On a tangential note, cloud providers that know what they’re doing support nothing but unicast routing for a really good reason – check out the details in the AWS Networking webinar.

Finally, just in case you think failures like this one are a black swan event, check the list of post-mortems and associated lessons learned collected by Dan Luu… keeping in mind that most of the failures are never reported.

Thanks to Ivan Pepelnjak (see source)
