[AusNOG] Cloudflare offline
Paul Wallace
paul.wallace at mtgi.com.au
Tue Mar 5 12:11:21 EST 2013
It takes that level of noise to take the focus away from Juniper's gross unreliability!
From: ausnog-bounces at lists.ausnog.net [mailto:ausnog-bounces at lists.ausnog.net] On Behalf Of Ben Dale
Sent: Tuesday, March 05, 2013 9:42 AM
To: ausnog at lists.ausnog.net
Subject: Re: [AusNOG] Cloudflare offline
Just to close this one off for everyone who hit me up offline, it looks like they hit an existing PR, 734453.
After upgrading to the release suggested, lab tests confirm the issue is no longer happening (though abnormally high values are still accepted).
You may now return to your regular arguments over historical decisions.
Cheers,
Ben
On 04/03/2013, at 4:35 PM, Ben Dale <bdale at comlinx.com.au> wrote:
On 04/03/2013, at 12:08 AM, Damian Guppy <the.damo at gmail.com> wrote:
They have now put up an incident report. The cause was a combination of a bad rule being applied via flowspec to all edge routers across all 23 global datacenters, and a bug in Junos that caused the routers to leak memory and crash when they processed the rule. To top things off, their automated recovery tools couldn't reboot/recover the vast majority of the routers automatically, and the ones they could recover got flooded with all the traffic the rest of them would normally handle. They ended up having to get people onsite at all datacenters to physically hard reboot the routers.
Poor guys
http://blog.cloudflare.com/todays-outage-post-mortem-82515
--Damian
Bug looks to be pretty easy to reproduce too (in an arbitrary version):
bdale@mx80-bng1> show route table inetflow.0
inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
173.2.3.4,*,port=53,len=99971,=99985/term:N/A
*[Flow/5] 00:01:37
Fictitious
bdale@mx80-bng1> show chassis routing-engine | match Mem
Memory utilization 37 percent
... after taking a swig of beverage
bdale@mx80-bng1> show chassis routing-engine | match Mem
Memory utilization 97 percent
bdale@mx80-bng1> show chassis routing-engine | match Mem
Memory utilization 99 percent
It also pegs the CPU up to maximum during this time.
Anyone using Flowspec out there might want to take a good hard look at your validation until this is addressed (a commit script would do the trick). This applies especially to those receiving Flowspec via BGP from external sources, e.g. Team Cymru (should be just prefixes), Arbor (Roland may have more insight on sizing validation) etc., as there appears to be no way to filter/validate specific rules (just the sources you learnt them from).
I've tried a few other "illegal" values (e.g. 65537, 65555) for packet length, but nothing kicks it off like the Cloudflare sizes (the rate at which memory is consumed *may* be proportional to the size of the packet described). Removing the route prior to topping out doesn't reclaim the memory either :(
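For anyone wanting a quick off-box sanity check while sorting out proper validation, here is a rough Python sketch. It is not a Junos commit script and uses no Junos API; it only scans a rule string and flags packet-length values above 65535 (the largest value the 16-bit IPv4 total length field can hold). The string format is an assumption based purely on the inetflow.0 output above, and the function names are made up for illustration.

```python
import re
from typing import List

# IPv4 total length is a 16-bit field, so nothing larger can be a real packet.
MAX_IP_PACKET_LENGTH = 65535


def packet_lengths(rule: str) -> List[int]:
    """Extract packet-length values from a flow route key.

    Assumes (hypothetically) the comma-separated form seen in the
    inetflow.0 output, e.g. '173.2.3.4,*,port=53,len=99971,=99985/term:N/A',
    where a field starting with an operator continues the previous component.
    """
    key = rule.split("/term", 1)[0]  # drop the trailing '/term:...' part
    lengths: List[int] = []
    in_len = False
    for field in key.split(","):
        if field.startswith("len"):
            in_len = True
        elif not field or field[0] not in "=<>!&":
            in_len = False  # a new, non-length component begins here
        if in_len:
            lengths.extend(int(n) for n in re.findall(r"\d+", field))
    return lengths


def invalid_packet_lengths(rule: str) -> List[int]:
    """Return the packet-length values in `rule` that exceed 65535."""
    return [n for n in packet_lengths(rule) if n > MAX_IP_PACKET_LENGTH]


if __name__ == "__main__":
    rule = "173.2.3.4,*,port=53,len=99971,=99985/term:N/A"
    print(invalid_packet_lengths(rule))
```

Anything returned by invalid_packet_lengths() is a value that should never have been accepted in the first place, so a rule that produces a non-empty list is worth rejecting before it goes anywhere near a router.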
Ben