[AusNOG] Outage that costs Millions

Adrian Chadd adrian at creative.net.au
Thu Jul 1 12:44:51 EST 2010

On Thu, Jul 01, 2010, Narelle wrote:
> On Thu, Jul 1, 2010 at 10:20 AM, Matt Carter <matt at iseek.com.au> wrote:
> >
> > As others have pointed out a "spanning tree issue" doesn't tear down your
> > network for 90 minutes, it *prevents* it from being torn down for 90 minutes,
> > it could be thought of as a "last resort safety" so to assert a spanning tree
> > issue caused this problem, in my mind, is to assert a lack of spanning tree,
> > meaning the required last resort safety mechanisms were either not in place,
> > or not configured properly. (if they were, how could it be a spanning tree issue??)
> I have seen failures of this duration in large spanning tree networks
> before. The reason for the lengthy time to restore is a) it can be
> really tricky to find the 'root' of the problem, b) people had
> forgotten life way back when before routers were everywhere (and over
> the last few years have been relearning all this), and c) in large
> carriers people start to get scared when restoring traffic as the
> rerouting of traffic gets complex / ports need to be identified,
> records updated etc etc.
> It just ain't as simple as it looks...

+1. In fact, this happened in IP networks 10+ years ago, with BGP announcements
"echoing" around the internet for hours after the triggering event (e.g. a
single withdraw/readvertise) occurred.

STP behaviour over LAN != STP behaviour over WAN != STP behaviour over low
bandwidth WAN, etc.

You can start having fun things occur like your STP never hitting steadyish
state because of things like flapping roots, and these can occur because of
full pipes triggering packet loss which causes differing paths to be preferred,
then lots more learning needs to occur; then ARP timeouts may occur and
retransmission of ARP happens; then your switch/router CPU maxes out and you
miss a link/routing protocol heartbeat; then more flaps occur, etc, etc.
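
To make that feedback loop concrete, here is a toy sketch. It is not a model
of real STP or of any vendor's control plane. Every name and number in it
(base_load, flap_cost, capacity, decay) is made up for illustration. The only
thing it shows is that the same single flap either dies out or sustains
itself forever, depending on how close the CPU already was to its limit.

```python
def simulate(base_load, flap_cost, capacity=1.0, decay=0.5,
             steps=50, initial_flaps=1):
    """Toy feedback loop: return the number of flaps seen at each tick.

    All parameters are hypothetical, unitless knobs:
      base_load  - steady control-plane CPU load with no flaps
      flap_cost  - extra load one flap adds (relearning, ARP retries)
      capacity   - load above which a heartbeat is missed
      decay      - fraction of relearning backlog carried to the next tick
    """
    backlog = 0.0
    flaps = initial_flaps
    history = []
    for _ in range(steps):
        # Each flap adds relearning work; old work drains away over time.
        backlog = backlog * decay + flaps * flap_cost
        load = base_load + backlog
        # An overloaded CPU misses a heartbeat, which causes another flap.
        flaps = 1 if load > capacity else 0
        history.append(flaps)
    return history


if __name__ == "__main__":
    quiet = simulate(base_load=0.5, flap_cost=0.3)
    busy = simulate(base_load=0.8, flap_cost=0.3)
    print("lightly loaded: total flaps =", sum(quiet))
    print("heavily loaded: total flaps =", sum(busy))
```

With these made-up numbers the lightly loaded box absorbs the initial flap
and goes quiet, while the heavily loaded one misses a heartbeat on every tick
and never settles. Nothing is "broken" in either case; the only difference is
how much headroom there was when the first flap arrived.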

This happens even now in the IP world. Why oh why do you think spanning tree
is any different? :)


