[AusNOG] Crashes all round on Tuesday

Mark Smith markzzzsmith at gmail.com
Wed Jul 1 15:11:33 EST 2015


On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
>>
>>
>> I had several links go down at 10:00 (give or take a few seconds) - well,
>> not mine so much as my upstream's - and it's been blamed on this issue.
>>
>
> So from a little bit of Human Computer Interaction (HCI) I studied
> many years ago, I remember that humans will wait for some sort of
> response for between 3 and 5 seconds. So if the period of your packet
> loss and the retransmission to recover from it is short enough, the
> humans affected may notice a slight delay, but they won't take any
> remedial actions themselves (i.e., they won't push the submit button
> again, and won't complain about it).
>

This can also be particularly useful to know when cutting a set of
links over from an old piece of equipment to a new one. If 3 to 5
seconds is a bit tight to move the link, you can push people's response
expectations out in the outage notice (e.g., "between 7 and 8 am, we
will be conducting network maintenance. During this period, you may
encounter system delays of 5 to 10 seconds"). I think asking people to
wait any longer than 10 seconds means this is a service impacting
outage, and it should be scheduled outside normal operating hours.

Also make sure that any protocol or feature that may cause the new
equipment to take longer than 3 to 5 seconds to bring up the link is
temporarily or permanently switched off. Traditional STP would be a
prime example (make sure there isn't a loop in the network topology at
all, or at least not during the cut-over window if you're going to
switch STP back on later). Bear in mind that your "working-to-working"
window is the 5 to 10 seconds (or 3 to 5 normally). So, for example,
BGP sessions might come up within a few seconds, but if downloading the
full route table, resolving the routes and putting them into the FIB is
going to take more than 10 seconds, you'll have to do a proper service
impacting outage at an appropriate time.
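
As a rough illustration of the budget reasoning above, here's a minimal
Python sketch that adds up per-step time estimates and compares the
total against the two windows. The step names and timings are made-up
placeholders, not measurements, and classify_cutover() is a hypothetical
helper. The 30 second STP figure assumes traditional STP with the
default 15 second forward delay (listening plus learning).

```python
# Rough "working-to-working" budget check for a cut-over.
# All step names and timings below are illustrative assumptions.

NORMAL_WINDOW_S = 5.0      # upper end of the 3 to 5 second tolerance
ANNOUNCED_WINDOW_S = 10.0  # upper end of a pre-announced 5 to 10 second delay


def classify_cutover(step_seconds):
    """Return (total_seconds, verdict) for a dict of per-step estimates."""
    total = sum(step_seconds.values())
    if total <= NORMAL_WINDOW_S:
        return total, "fits the normal window"
    if total <= ANNOUNCED_WINDOW_S:
        return total, "needs an advance notice of delays"
    return total, "service impacting outage, schedule out of hours"


# Hypothetical estimates, in seconds.
steps = {"move cable": 3.0, "link negotiation": 2.0, "bgp session up": 3.0}
print(classify_cutover(steps))

# Same cut-over with traditional STP left on: listening + learning at the
# default 15 second forward delay adds about 30 seconds before forwarding.
steps_with_stp = dict(steps, stp_listening_learning=30.0)
print(classify_cutover(steps_with_stp))
```

The point is only that every step between "working" and "working"
counts against the same window, so one slow step decides which kind of
outage you are running.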

Finally, remember that UDP and DCCP don't do recovery from packet
loss, so if your apps are using them, they'll have to either tolerate
packet loss lasting up to 10 (or 3 to 5) seconds, do recovery
themselves, or be rewritten to use TCP or SCTP.
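
To make "do recovery themselves" a bit more concrete, here's a minimal
sketch of an app retransmitting a UDP datagram until it is acknowledged
or a 10 second deadline passes. Everything here is an assumption for
illustration: the b"ACK" reply, the send_with_retry() and
lossy_ack_server() helpers, and the retransmission timeout are all
invented, and the "loss" is simulated by a toy peer on localhost that
ignores the first two datagrams.

```python
import socket
import threading
import time


def send_with_retry(sock, payload, addr, deadline_s=10.0, rto_s=0.5):
    """Resend payload until an ACK arrives or deadline_s elapses.

    Returns the number of transmissions used, or None if we gave up.
    """
    start = time.monotonic()
    attempts = 0
    while time.monotonic() - start < deadline_s:
        sock.sendto(payload, addr)
        attempts += 1
        sock.settimeout(rto_s)
        try:
            data, _ = sock.recvfrom(1024)
            if data == b"ACK":
                return attempts
        except socket.timeout:
            continue  # treat the silence as loss and retransmit
    return None


def lossy_ack_server(sock, drop_first):
    """Toy peer: ignores the first drop_first datagrams, then ACKs one."""
    seen = 0
    while True:
        _, peer = sock.recvfrom(1024)
        seen += 1
        if seen > drop_first:
            sock.sendto(b"ACK", peer)
            return


# Simulate an outage that eats the first two datagrams.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
threading.Thread(target=lossy_ack_server, args=(server, 2), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
attempts = send_with_retry(client, b"hello", server.getsockname(), rto_s=0.2)
print("delivered after", attempts, "transmissions")

client.close()
server.close()
```

A real application would also need sequence numbers and duplicate
detection, at which point TCP or SCTP is usually the easier answer.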

<snip>

