[AusNOG] Strange BGP issues (Quagga/Zebra)

Mon Mar 19 11:10:00 EST 2012

On 18/03/2012 21:50, Nathan Brookfield wrote:
> Hi Guys;
>
> I’m wondering if anyone else with a full BGP feed noticed any unusual
> behaviour this evening with their BGPd daemon crashing at around 2000hrs
> running on the Quagga platform. I have absolutely no explanation for it
> and it occurred on two separate Vyatta boxes that take multiple full BGP
> feeds.
>
> It caused the BGP Daemon to kill all of our Upstream & Downstream
> sessions simultaneously and then magically 40 or so minutes later after
> a restart of the daemon and one of the physical boxes not fixing the
> problem, it just fixed itself and the issue completely disappeared with
> everything running fine for an hour and a half now after the problem.
>

I'm running Quagga 0.99.17 and no problem has been logged. I'm only
getting a single full BGP feed from APNIC though.

> I have seen this issue previously with malformed communities or 4 Byte
> AS Numbers but there are no known bugs I am aware of that would explain
> a similar issue this evening.
>

I've discovered that somewhere between 0.99.17 and the current Quagga
from the git repository (so I don't know exactly when that change was
committed), there has been a substantial change in how 4-byte AS numbers
are handled. 0.99.17 was quite liberal in accepting the encoding of the
ASPath (e.g. 4-byte was negotiated, but 2-byte sent; I know that
shouldn't happen): it wouldn't reset the connection upon ASPath
weirdness, but would rather ignore the problem and drop the update. Now
it resets the connection, so a single malformed packet will cause a
reset.
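
To make the 2-byte/4-byte mismatch concrete, here is a rough Python
sketch. It is not Quagga's code: the function names and the "strict"
switch are made up for illustration, and it only handles AS_SET and
AS_SEQUENCE segments. It encodes an AS path with 2-byte ASNs, parses it
as if 4-byte had been negotiated, and shows the two reactions described
above (drop the update vs. reset the session).

```python
import struct

# AS_PATH segment types from the BGP spec (confederation types omitted here).
AS_SET, AS_SEQUENCE = 1, 2


class MalformedASPath(ValueError):
    """Raised when the AS_PATH bytes do not parse with the given ASN width."""


def encode_as_path(segments, asn_size):
    """Encode [(seg_type, [asn, ...]), ...] using 2- or 4-byte ASNs."""
    fmt = {2: "!H", 4: "!I"}[asn_size]
    out = bytearray()
    for seg_type, asns in segments:
        out += struct.pack("!BB", seg_type, len(asns))  # type, ASN count
        for asn in asns:
            out += struct.pack(fmt, asn)
    return bytes(out)


def parse_as_path(data, asn_size):
    """Parse AS_PATH bytes, assuming every ASN is asn_size bytes wide."""
    fmt = {2: "!H", 4: "!I"}[asn_size]
    segments, offset = [], 0
    while offset < len(data):
        if len(data) - offset < 2:
            raise MalformedASPath("truncated segment header")
        seg_type, count = struct.unpack_from("!BB", data, offset)
        offset += 2
        if seg_type not in (AS_SET, AS_SEQUENCE):
            raise MalformedASPath(f"unknown segment type {seg_type}")
        need = count * asn_size
        if len(data) - offset < need:
            raise MalformedASPath("segment runs past end of attribute")
        asns = [struct.unpack_from(fmt, data, offset + i * asn_size)[0]
                for i in range(count)]
        offset += need
        segments.append((seg_type, asns))
    return segments


def handle_update(data, asn_size, strict):
    """Hypothetical policy switch: lenient drops the update, strict resets."""
    try:
        return ("accept", parse_as_path(data, asn_size))
    except MalformedASPath as exc:
        return ("reset_session" if strict else "drop_update", str(exc))


# Peer sends 2-byte ASNs although 4-byte was negotiated.
wire = encode_as_path([(AS_SEQUENCE, [64512, 64513, 64514])], asn_size=2)
lenient = handle_update(wire, asn_size=4, strict=False)
strict = handle_update(wire, asn_size=4, strict=True)
```

With a full feed, the strict variant means one bad update from one peer
is enough to tear down that session, which is the behaviour change I
mean.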

> During the issue BGPd would not respond to any commands on the CLI yet
> the CPU on our boxes was 0.01, the only unusual symptom is of course,
> complete traffic loss with no explanation. All peers impacted were
> spread over multiple routers in multiple locations and multiple NIC’s.
>

The weird bit is the 0.01 CPU part. I would have expected high CPU
usage, as each Quagga would try to reconnect to each peer and then
re-send the full routing tables, consuming a massive amount of CPU.

If the CPU never went up, then there's something suspect with your OS
and how the CPU info is updated/displayed.
Also, the fact that you got complete traffic loss suggests that there
was no route left in the table, so at some point the table had to be
refilled, which uses CPU.

Mat