[AusNOG] Crashes all round on Tuesday

Wed Jul 1 15:17:50 EST 2015

On 1 July 2015 at 15:11, Mark Smith <markzzzsmith at gmail.com> wrote:
> On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
>> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
>>>
>>>
>>> I had several links go down at 10:00 (give or take a few seconds) - well,
>>> not mine so much as my upstream - and it's been blamed on this issue.
>>>
>>
>> So from a little bit of Human Computer Interaction (HCI) I studied
>> many years ago, I remember that humans will wait for some sort of
>> response for between 3 and 5 seconds. So if the period of your packet
>> loss and the retransmission to recover from it is short enough, the
>> humans affected may notice a slight delay, but they won't take any
>> remedial actions themselves (i.e., they won't push the submit button
>> again, and won't complain about it.)
>>
>
> This can also be particularly useful to know when cutting a set of
> links over from an old piece of equipment to a new one. If 3 to 5
> seconds is a bit tight to move the link, you can push people's response
> expectations out in the outage notice (e.g., "between 7 and 8 am, we
> will be conducting network maintenance. During this period, you may
> encounter system delays of up to 5 to 10 seconds"). I think asking
> people to wait any longer than 10 seconds means this is a service
> impacting outage and should be scheduled out of normal operating
> hours.
>
> Also make sure that anything/any protocols that may cause the new
> equipment to take longer than 3 to 5 seconds to bring up the link is
> temporarily or permanently switched off. Traditional STP would be a
> prime example (make sure there isn't a loop in the network topology at
> all, or at least during the cut-over window if you're going to switch
> STP back on later). Bear in mind that your window from
> "working-to-working" is the 5 to 10 seconds (or 3 to 5 normally), so
> e.g., BGP sessions might come up within a few seconds, but if
> downloading the full route table, resolving the routes and putting
> them into the FIB is going to take more than 10 seconds, you'll have
> to do a proper service impacting outage at an appropriate time.
>
> Finally, remember that UDP and DCCP don't do recovery from packet
> loss, so if your apps are using them, they'll either have to be
> tolerant of packet loss of up to 10 (or 3 to 5) seconds, do recovery
> themselves, or should be rewritten to use TCP or SCTP.
>
> <snip>

One last thing: you also need to know the characteristics of your
reliable protocols, and how persistent they are, when attempting to
recover from packet loss. If your reliable protocol gives up within the
3 to 5 or 5 to 10 second window, your customers/users will suffer an
outage. TCP, for example, doesn't give up easily. If I recall correctly,
it will try for up to around 9 minutes, retrying at doubling intervals
up to 64 seconds and then every 64 seconds, i.e., at intervals of 1, 2,
4, 8, 16, 32, 64, 64, 64, ... seconds.
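
As a rough sketch of the retry schedule described above, the following
Python works out when each retry would fall and how many of them land
inside a given cut-over window. The 1 second starting interval, 64
second cap and 9 minute give-up time are taken from the recollection
above and are assumptions, not values read from any particular TCP
stack; real stacks derive their timers from measured round-trip time
and are usually tunable. The function names are made up for
illustration.

```python
def backoff_schedule(initial=1, cap=64, give_up_after=9 * 60):
    """Return (wait, elapsed) pairs for retries at doubling intervals.

    All three defaults are assumptions taken from the description above,
    not values from a specific TCP implementation.
    """
    schedule = []
    wait, elapsed = initial, 0
    # Keep retrying while the next attempt still falls inside the
    # overall give-up time.
    while elapsed + wait <= give_up_after:
        elapsed += wait
        schedule.append((wait, elapsed))
        wait = min(wait * 2, cap)  # double the interval, capped at `cap`
    return schedule


def retries_within(window, schedule):
    """Count the retries that happen no later than `window` seconds in."""
    return sum(1 for _, elapsed in schedule if elapsed <= window)


if __name__ == "__main__":
    sched = backoff_schedule()
    for wait, elapsed in sched:
        print(f"wait {wait:>2}s -> retry at t+{elapsed}s")
    print("retries inside a 10 s window:", retries_within(10, sched))
```

With these assumed numbers, only three retries (at 1, 3 and 7 seconds
after the original send) fall inside a 10 second window, so a link that
comes back just after one retry may leave the user waiting until the
next one.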