[AusNOG] Cloudflare offline

Paul Wallace paul.wallace at mtgi.com.au
Tue Mar 5 12:11:21 EST 2013

It takes the referred to level of noise to take the focus away from Juniper's gross unreliability!

From: ausnog-bounces at lists.ausnog.net [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Ben Dale
Sent: Tuesday, March 05, 2013 9:42 AM
To: ausnog at lists.ausnog.net
Subject: Re: [AusNOG] Cloudflare offline

Just to close this one off for everyone who hit me up offline, it looks like they hit an existing PR 734453.

After upgrading to the release suggested, lab tests confirm the issue is no longer happening (though abnormally high values are still accepted).

You may now return to your regular arguments over historical decisions.



On 04/03/2013, at 4:35 PM, Ben Dale <bdale at comlinx.com.au> wrote:

On 04/03/2013, at 12:08 AM, Damian Guppy <the.damo at gmail.com> wrote:

They have now put up an incident report. The cause was a combination of a bad rule applied via Flowspec to all edge routers across all 23 global datacenters and a bug in Junos that caused the routers to leak memory and crash when they processed the rule. To top things off, their automated recovery tools couldn't reboot/recover the vast majority of the routers, and the ones they could recover got flooded with all the traffic the rest of them would normally handle. They ended up having to get people onsite at all datacenters to physically hard reboot the routers.

Poor guys



Bug looks to be pretty easy to reproduce too (in an arbitrary version):

bdale at mx80-bng1> show route table inetflow.0

inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

*,port=53,len=99971,=99985/term:N/A
                   *[Flow/5] 00:01:37

bdale at mx80-bng1> show chassis routing-engine | match Mem
    Memory utilization          37 percent

... after taking a swig of beverage

bdale at mx80-bng1> show chassis routing-engine | match Mem
    Memory utilization          97 percent
bdale at mx80-bng1> show chassis routing-engine | match Mem
    Memory utilization          99 percent

It also pegs the CPU up to maximum during this time.

Anyone using Flowspec out there might want to take a good hard look at their validation until this is addressed (a commit script would do the trick). That goes especially for those receiving Flowspec via BGP from external sources, e.g. Team Cymru (should be just prefixes) or Arbor (Roland may have more insight on sizing validation), as there appears to be no way to filter/validate specific rules (just the sources you learnt them from).
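
As a rough illustration of the kind of sanity check meant here, the sketch below flags packet-length matches that no real IP packet could satisfy, since the IPv4 Total Length field is 16 bits and tops out at 65535. It is Python meant to run off-box before rules are pushed, not a Junos commit script, and the "low-high" string format and the function names are made up for the example rather than taken from any Junos or Flowspec API.

```python
# Upper bound on a real packet length: IPv4 Total Length is a 16-bit field.
MAX_IP_PACKET_LENGTH = 65535


def parse_length_spec(spec):
    """Parse '53' or '99971-99985' into an inclusive (low, high) tuple.

    The string format is an assumption made for this sketch only.
    """
    low, sep, high = spec.strip().partition("-")
    if not sep:
        high = low  # single value: treat as a one-element range
    return int(low), int(high)


def invalid_length_specs(specs, max_length=MAX_IP_PACKET_LENGTH):
    """Return the specs that describe impossible packet lengths."""
    bad = []
    for spec in specs:
        try:
            low, high = parse_length_spec(spec)
        except ValueError:
            bad.append(spec)  # not numeric, or a negative/empty bound
            continue
        if low < 0 or high > max_length or low > high:
            bad.append(spec)
    return bad


if __name__ == "__main__":
    # The Cloudflare-style range is rejected, the sane one passes.
    print(invalid_length_specs(["64-1500", "99971-99985"]))
```

Run as a script, this prints ['99971-99985']. A rule generator could refuse to announce anything the check returns.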

I've tried a few other "illegal" values (e.g. 65537, 65555) for packet length, but nothing kicks it off like the Cloudflare sizes (the rate at which memory is consumed *may* be proportional to the size of the packet described). Removing the route prior to topping out doesn't reclaim the memory either :(


