[AusNOG] Outage that costs Millions

John Edwards john at netniche.com.au
Thu Jul 1 10:56:36 EST 2010


On 01/07/2010, at 9:50 AM, Matt Carter wrote:

> I've seen news articles citing 20,000 DSL tails offline. I know for a fact that more than 20k DSL tails were affected. The same article also cited "hardware failure", but with diverse fibre this and diverse switch that, and supposed N+1 across the wazoo, I fail to see how that is possible. What I believe I saw was a complete collapse of a large (or total) portion of the MAN. If you consider the experience of Tatt's and others, this _seems_ to be consistent in the post-event analysis.


Similar meltdowns have happened in the past due to MAC address table limitations. Once upon a time VTP misadventures would also have been a safe bet. I have heard that the network in question once had issues with an ISP running DSL/PPPoE backhaul across it, and as a result enforced a limit on the number of MAC addresses per customer. Fast forward to 2010, and there are probably more than enough devices on this network to reach the limits of most switching hardware in this country.

It's quite possible that following a network event (or, say, a lottery agency enabling additional virtual servers and terminals to deal with demand), spanning tree could have chosen a bridge somewhere in the network that exceeded its MAC table limit and resorted to flooding rather than switching. I'm not aware of any feature in spanning tree that changes the decision process when the switch knows the MAC table is full, which may mean that when the network event was resolved the "spanning tree" issue remained.
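
To make the flooding behaviour concrete, here is a rough Python sketch of a learning bridge with a bounded MAC table. It is a toy model, not any vendor's implementation: the class name LearningSwitch, the table_limit parameter and the short MAC strings are all made up for illustration, and it assumes a full table simply stops learning (no ageing, no eviction). Real hardware may behave differently when its table fills.

```python
class LearningSwitch:
    """Toy L2 learning bridge with a bounded MAC table (hypothetical model)."""

    def __init__(self, ports, table_limit):
        self.ports = list(ports)
        self.table_limit = table_limit  # assumed hard cap on learned MACs
        self.mac_table = {}             # MAC address -> port it was learned on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) the source only if it is already known
        # or the table still has room. Assumption: no ageing or eviction.
        if src_mac in self.mac_table or len(self.mac_table) < self.table_limit:
            self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood out of every port except the ingress.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination lives on the ingress segment: filter the frame.
            return []
        return [out_port]


sw = LearningSwitch(ports=[1, 2, 3], table_limit=2)
first = sw.receive(1, "aa", "bb")      # "bb" not learned yet, so flooded
switched = sw.receive(2, "bb", "aa")   # "aa" is known, so switched to port 1
sw.receive(3, "cc", "aa")              # table is full, "cc" is never learned
flooded = sw.receive(1, "aa", "cc")    # every frame to "cc" is flooded
print(first, switched, flooded)
```

In this model, once the table is full every frame to an unlearned host is flooded indefinitely, and nothing in the forwarding logic feeds back into the choice of topology, which is the point made above about spanning tree.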

John



