[AusNOG] Best practices on speeding up BGP convergence times
rhys at nexusone.com.au
Tue Feb 27 13:12:13 EST 2018
Thanks David for confirming BFD is the way to go here. Luckily, I have been able to enable BFD on all my transit links so far, so the time to detect peer failure has been quick.
And thanks Geoff for your detailed reply. From some off-list discussions, I think that I first need to apply some of the configs (like Add-Path) that I mentioned originally and see how I go from there, and also need to pinpoint with more certainty where the issue is occurring.
I know that I’ve mentioned primary/secondary transit links, but I actually _am_ announcing all prefixes on all transit links, and I’m only using AS path prepending to try to optimise routing for prefixes that are in VIC vs NSW. So this is not a case of conditionally advertising routes. I did also try advertising more specific prefixes (e.g. /22 at NSW and /24 in VIC), but I found anecdotally that with AS path prepending the inbound traffic converged faster during failover.
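For anyone following along, the difference between the two approaches can be sketched roughly as follows. This is a simplified illustration, not how any particular router implements it: the prefixes, AS path lengths and the `select_route` helper are all made up for the example. The point is that a more specific prefix wins on longest-prefix match before AS path length is even considered, whereas prepending only matters between routes for the same prefix.

```python
import ipaddress

# Hypothetical routes as a remote network might see them. The addresses
# and AS path lengths are illustrative, not real announcements.
routes = [
    {"prefix": ipaddress.ip_network("192.0.0.0/22"), "site": "NSW", "as_path_len": 2},
    {"prefix": ipaddress.ip_network("192.0.2.0/24"), "site": "VIC", "as_path_len": 4},
]


def select_route(dest, routes):
    """Pick a route for dest: longest prefix first, then shortest AS path.

    Simplified on purpose: real BGP best-path selection has many more
    steps, and it only compares routes for the *same* prefix.
    """
    addr = ipaddress.ip_address(dest)
    matching = [r for r in routes if addr in r["prefix"]]
    if not matching:
        return None
    # Negative prefix length so that the longest prefix sorts first.
    return min(matching, key=lambda r: (-r["prefix"].prefixlen, r["as_path_len"]))


# Inside the /24: VIC wins despite its longer (prepended) AS path.
print(select_route("192.0.2.10", routes)["site"])
# Outside the /24 but inside the /22: only NSW matches.
print(select_route("192.0.1.10", routes)["site"])
```

If the /24 is withdrawn, traffic for it falls back to the covering /22, which is why more specifics behave like a hard preference rather than the hint that prepending provides.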
So in a sense, I _am_ talking about MRAI timers, which I totally understand is just not a valid discussion to be having in the context of the general internet, and it’s likely that yes, the outage window I’m seeing when a prefix is announced over a new transit path is totally reasonable. BUT where I start to run into a problem is that the outcome is still the same when I have multiple links with a single transit provider. For example:
* I have a cross-connect directly between one of my transit edge routers and one of their routers.
* I have another cross-connect directly between another of my transit edge routers and another of their routers (this is not intended as a backup path; I send traffic out active/active).
* Both links are to the same transit provider, in the same POP.
* I am advertising the same prefixes over both links, no AS path prepending, so the announcements are basically identical.
* My transit provider in Sydney uses localpref on their side to designate one session as “primary” and I am not able to change that. But I can and do send traffic out on both links as equal cost.
* As far as the rest of the internet is concerned my prefixes are still being announced from the same transit provider, so there shouldn’t be a need to propagate routing changes beyond my directly adjacent peer and their internal network. This is primarily why I am expecting not to see any impact in this scenario.
* Given that I have adjusted my MRAI timer down to 0 with my adjacent transit peers, and have BFD enabled, they should be able to switch over to the alternate link fairly quickly.
* And yet, I see a 20 second outage window even in this scenario when I ping from an external connection into one of my prefixes announced over this transit.
The scenario above is my main concern: I didn’t expect much, if any, service impact, since I would have thought the path across the internet in general would remain unchanged up to my transit provider’s internal network.
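The expectation in the bullet points can be sketched from the provider's side. This is a minimal model under stated assumptions: the `Path` class, the session names and the local-pref values (200 and 100) are hypothetical, and the selection is reduced to two steps (highest local-pref, then shortest AS path), which is far less than a real BGP decision process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    session: str      # which of the two cross-connects the route came in on
    local_pref: int   # set by the provider; values here are assumptions
    as_path_len: int


def best_path(paths):
    """Simplified selection: highest local-pref, then shortest AS path."""
    if not paths:
        return None
    return min(paths, key=lambda p: (-p.local_pref, p.as_path_len))


# Same prefix learned over both sessions, no prepending on either.
primary = Path(session="link-A", local_pref=200, as_path_len=1)
backup = Path(session="link-B", local_pref=100, as_path_len=1)

rib = [primary, backup]
before = best_path(rib).session

# BFD tears down the primary session, so its path is withdrawn.
rib.remove(primary)
after = best_path(rib).session

print(before, "->", after)
```

In this model the provider already holds the alternate path, so the switch is a local re-selection inside their network and nothing needs to change in what the rest of the internet sees. Where the 20 seconds actually goes (their internal convergence, FIB programming, or something on my side) is exactly what the sketch cannot tell you.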
Regarding what you listed as problem b), I totally understand this, and I would expect some kind of delay when re-announcing via another transit since, as you say, this has to propagate through countless upstreams throughout the internet - naturally this will take time. It’s good to hear you say 20-30 seconds is a good number in terms of getting everyone to re-learn routes. That’s really helpful.
In terms of time it takes to learn a new outbound path, I don’t see this as an issue given the options I have to announce multiple paths over iBGP and use of BFD – this should be possible to make quick by tuning my internal peer configs.
Thanks everyone for your experiences and insights. Based on some of the replies I got, it seems like it is reasonable to expect that in the scenario described in the bullet points above, it’s possible to see very little if any forwarding loss. And only once I am forced to advertise via a new transit would I expect to see the 20-30 second window as everyone on the internet learns a new path. I do need to improve my iBGP convergence and actually implement some of the methods I mentioned originally, and re-evaluate so as to rule out my iBGP convergence time as the issue I’m currently seeing for the scenario in the bullet points above.
Thanks everyone for your help.
Chief Information Officer
Nexus One Pty Ltd
E: support at nexusone.com.au
P: +61 2 9191 0606
M: PO Box 127, Royal Exchange NSW 1225
A: Level 10 307 Pitt St, Sydney NSW 2000
From: AusNOG <ausnog-bounces at lists.ausnog.net> on behalf of David Hughes <david at hughes.com.au>
Date: Tuesday, 27 February 2018 at 9:39 am
To: Geoff Huston <gih at apnic.net>
Cc: "ausnog at lists.ausnog.net" <ausnog at lists.ausnog.net>
Subject: Re: [AusNOG] Best practices on speeding up BGP convergence times
On 26 Feb 2018, at 9:52 pm, Geoff Huston <gih at apnic.net> wrote:
a) detecting link down quickly
You can adjust your BGP session keepalive timers to smaller values and make the session more sensitive to outages as a result. I also thought that these days you can get the interface status to directly map to the session state, but it’s been a while since I’ve done this in anger and frankly I have NFC how to do that, even if I used to know! Maybe you are already doing that anyway.
This is the scenario I was talking about (references below). You can easily have link on a northbound interface even if the peer isn’t there (you hit a layer-2 agg switch on the way, for example). If the peer fails but you still have link on the interface, you’ll be blindly forwarding packets to it, even though it’s not there anymore, until the BGP timers expire. That was the point of the lightning talk I gave way back then. Default timers aren’t helpful in this situation.
Fast forward to this decade and you have routing protocols that are “BFD-aware”, so you have sub-second link failure detection. That allows the control plane to pull down the peer session and remove paths to that peer from the FIB. You can only run BFD if your upstream runs it as well, so you know they will dump the prefixes from that peer session as quickly as you will. It makes failing over to a secondary link within the same upstream provider pretty seamless.
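To put rough numbers on the difference, here is a back-of-the-envelope sketch. The BFD values (300 ms interval, multiplier of 3) are assumptions for illustration, not anyone's actual config, and the 90 second hold time is the value suggested in RFC 4271; many implementations default to something else. Detection time is approximated as interval times multiplier, which glosses over how the two ends negotiate the interval.

```python
def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """Approximate BFD detection time: interval x multiplier.

    Simplification: ignores per-direction interval negotiation and jitter.
    """
    return tx_interval_ms * detect_multiplier


# Assumed values for illustration only.
BFD_INTERVAL_MS = 300
BFD_MULTIPLIER = 3
BGP_HOLD_TIME_S = 90  # RFC 4271 suggested value; vendor defaults differ

bfd_ms = bfd_detection_time_ms(BFD_INTERVAL_MS, BFD_MULTIPLIER)
hold_ms = BGP_HOLD_TIME_S * 1000

print(f"BFD detection time:          {bfd_ms} ms")
print(f"BGP hold timer (worst case): {hold_ms} ms")
print(f"Hold timer is roughly {hold_ms // bfd_ms}x slower")
```

With these assumed values BFD notices a dead peer in under a second, while relying on the hold timer alone could leave you forwarding into a black hole for up to the full hold time.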