[AusNOG] Strange BGP issues (Quagga/Zebra)

Nathan Brookfield nathan.brookfield at serversaustralia.com.au
Mon Mar 19 11:25:42 EST 2012


Hi Mat;

Our Vyatta boxes are all running 0.99.18, and although we saw the problem on
two boxes, we have several others that were not affected in any way. The
unusual part of this issue was that every session went down at once, as if
the entire process had hung and then recovered by itself.

This occurred twice: once at around 8pm, as per my original e-mail, and
again early this morning at 12:21am. The symptoms were the same, except that
the second time the problem disappeared within 15 minutes. I am looking
through BGPlay to see if I can find anything unusual, but I'm not hopeful of
catching anything.

We use quite powerful boxes to handle Denial of Service traffic etc., so even
while routing 500 Mbit/s and while routing tables are reconverging the CPU
usually stays quite low unless we are receiving high PPS. At present the load
average is only 0.00, 0.01, 0.05 and the box is routing a few hundred Mbit/s,
so the low figure does not overly surprise me.

Many Thanks;
Nathan

-----Original Message-----
From: ausnog-bounces at lists.ausnog.net
[mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Mattia Rossi
Sent: Monday, 19 March 2012 11:10 AM
To: ausnog at lists.ausnog.net
Subject: Re: [AusNOG] Strange BGP issues (Quagga/Zebra)

On 18/03/2012 21:50, Nathan Brookfield wrote:
> Hi Guys;
>
> I'm wondering if anyone else taking a full BGP feed on Quagga noticed any
> unusual behaviour this evening, with BGPd crashing at around 2000hrs. I
> have absolutely no explanation for it, and it occurred on two separate
> Vyatta boxes that take multiple full BGP feeds.
>
> It caused BGPd to drop all of our upstream and downstream sessions
> simultaneously. Restarting the daemon, and rebooting one of the physical
> boxes, did not fix the problem; then, 40 or so minutes later, it just
> fixed itself. Everything has been running fine for an hour and a half
> since.
>

I'm running Quagga 0.99.17 and no problem has been logged. I'm only getting
one single full BGP feed from APNIC though.

> I have seen this issue previously with malformed communities or 4-byte
> AS numbers, but there are no known bugs I am aware of that would
> explain a similar issue this evening.
>

I've discovered that between 0.99.17 and the current Quagga from the git
repository (so I don't know exactly when the change was committed), there
has been a substantial change in how 4-byte AS numbers are handled.
0.99.17 was quite liberal in accepting the encoding of the AS path
(e.g. 4-byte was negotiated, but 2-byte was sent; I know that shouldn't
happen) and wouldn't reset the connection upon AS path weirdness, but would
rather ignore the problem and drop the update. The current code resets the
connection, so a malformed packet will cause a reset.
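
To illustrate the difference, here is a minimal Python sketch of the two
behaviours. It is not Quagga's code: parse_as_path and handle_as_path are
hypothetical names, and the returned action strings are made up for the
example. It only assumes the standard AS_PATH layout (a one-byte segment
type, a one-byte AS count, then the AS numbers, each 2 or 4 bytes wide
depending on what was negotiated).

```python
import struct

AS_SET, AS_SEQUENCE = 1, 2  # the two basic AS_PATH segment types


def parse_as_path(data: bytes, asn_size: int):
    """Parse an AS_PATH attribute value into (segment_type, [asns]) tuples.

    asn_size is 2 or 4 depending on whether 4-byte AS support was
    negotiated. Raises ValueError if the bytes do not fit that encoding.
    """
    fmt = {2: "H", 4: "I"}[asn_size]
    segments = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 2:
            raise ValueError("truncated segment header")
        seg_type, count = data[offset], data[offset + 1]
        offset += 2
        if seg_type not in (AS_SET, AS_SEQUENCE):
            raise ValueError(f"unknown segment type {seg_type}")
        end = offset + count * asn_size
        if end > len(data):
            raise ValueError("segment overruns attribute length")
        asns = list(struct.unpack(f"!{count}{fmt}", data[offset:end]))
        segments.append((seg_type, asns))
        offset = end
    return segments


def handle_as_path(data: bytes, four_byte_negotiated: bool,
                   reset_on_error: bool) -> str:
    """Return the (hypothetical) action taken for one update's AS_PATH."""
    try:
        parse_as_path(data, 4 if four_byte_negotiated else 2)
    except ValueError:
        # Lenient policy: ignore the bad update. Strict policy: reset.
        return "reset-session" if reset_on_error else "drop-update"
    return "accept"


# A path encoded with 2-byte ASNs, sent on a session that negotiated 4-byte.
two_byte_path = struct.pack("!BBHH", AS_SEQUENCE, 2, 64512, 64513)
print(handle_as_path(two_byte_path, True, reset_on_error=False))
print(handle_as_path(two_byte_path, True, reset_on_error=True))
```

In this example the declared AS count no longer fits the attribute length
once each AS is read as 4 bytes, so the parse fails. The lenient policy
loses only that one update, while the strict one takes the whole session
(and every route learned over it) down.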

> During the issue BGPd would not respond to any commands on the CLI, yet
> the CPU load on our boxes was 0.01. The only other symptom was, of course,
> complete traffic loss with no explanation. The affected peers were spread
> over multiple routers, multiple locations and multiple NICs.
>

The weird bit is the 0.01 CPU part. I would have expected high CPU usage,
as each Quagga would try to reconnect to each peer and then re-send the
full routing tables, consuming a massive amount of CPU.

If the CPU never went up, then there's something suspect with your OS and
how the CPU info is updated/displayed.
Also, the fact that you got complete traffic loss suggests that there was no
route left in the table, so at some point the table had to be refilled,
which uses CPU.
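
One way to cross-check what the OS reports, assuming a Linux box that
exposes /proc/stat (Vyatta is Linux-based), is to sample the aggregate "cpu"
line twice and compute the busy fraction directly. This is only a sketch:
cpu_busy_fraction, parse_cpu_line and sample are hypothetical helper names,
and the field order (user, nice, system, idle, iowait, irq, softirq, steal)
follows the proc(5) layout.

```python
import time


def parse_cpu_line(line: str):
    """Return (total_jiffies, idle_jiffies) from a /proc/stat 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:9]]  # user .. steal
    if len(values) < 4:
        raise ValueError("too few fields in 'cpu' line")
    # Count iowait as idle time when the kernel reports it.
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return sum(values), idle


def cpu_busy_fraction(before: str, after: str) -> float:
    """Fraction of time the CPUs were busy between two 'cpu' samples."""
    total0, idle0 = parse_cpu_line(before)
    total1, idle1 = parse_cpu_line(after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    return (delta_total - (idle1 - idle0)) / delta_total


def sample(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Read the aggregate line twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = f.readline()
    time.sleep(interval)
    with open(path) as f:
        after = f.readline()
    return cpu_busy_fraction(before, after)
```

Running sample() in a loop while the sessions re-establish would show
whether the table refill really registers as CPU time, independently of
whatever tool printed the 0.01 figure.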

Mat
_______________________________________________
AusNOG mailing list
AusNOG at lists.ausnog.net
http://lists.ausnog.net/mailman/listinfo/ausnog



