[AusNOG] Crashes all round on Tuesday

Paul Wilkins paulwilkins369 at gmail.com
Wed Jul 1 16:40:23 EST 2015


Mark,
It's implementation specific (depends what options you pass to
setsockopt/sockstream).

There are problems that result from having TIME_WAIT too short, with
wandering duplicates. Supposedly RFC1337 says TIME_WAIT should be at least
2 minutes, but on my Linux box, I just timed a dropped socket, and it timed
out after one minute.
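
As a rough illustration of the kind of per-socket options meant above, here
is a minimal Python sketch. The function name and the 5000 ms value are
hypothetical, chosen only for the example. SO_REUSEADDR is the usual way to
rebind a listening port while old connections still sit in TIME_WAIT, and
TCP_USER_TIMEOUT is a Linux-specific option that bounds how long unacknowledged
data may stay outstanding before the connection is dropped; it is guarded
because other platforms may not offer it.

```python
import socket


def tcp_socket_with_options(user_timeout_ms=None):
    """Create a TCP socket and apply two options relevant to this thread.

    Hypothetical helper for illustration only, not taken from any poster's setup.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Allow bind() to succeed even if earlier connections on the same
    # local port are still in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    # Linux-specific: cap how long transmitted data may remain
    # unacknowledged before the kernel gives up on the connection.
    # Skipped where the constant is missing or the kernel refuses it.
    if user_timeout_ms is not None and hasattr(socket, "TCP_USER_TIMEOUT"):
        try:
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                         user_timeout_ms)
        except OSError:
            pass

    return s


if __name__ == "__main__":
    sock = tcp_socket_with_options(user_timeout_ms=5000)
    print("SO_REUSEADDR set:",
          sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0)
    sock.close()
```

Neither option changes the length of TIME_WAIT itself; on the systems I know
of, that is fixed by the stack rather than set per socket.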

Paul Wilkins

On 1 July 2015 at 15:17, Mark Smith <markzzzsmith at gmail.com> wrote:

> On 1 July 2015 at 15:11, Mark Smith <markzzzsmith at gmail.com> wrote:
> > On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
> >> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
> >>>
> >>>
> >>> I had several links go down at 10:00 (give or take a few seconds) -
> >>> well, not mine so much as my upstream - and it's been blamed on this
> >>> issue.
> >>>
> >>
> >> So from a little bit of Human Computer Interaction (HCI) I studied
> >> many years ago, I remember that humans will wait for some sort of
> >> response for between 3 and 5 seconds. So if the period of your packet
> >> loss and the retransmission to recover from it is short enough, the
> >> humans affected may notice a slight delay, but they won't take any
> >> remedial actions themselves (i.e., they won't push the submit button
> >> again, and won't complain about it).
> >>
> >
> > This can also be particularly useful to know when cutting a set of
> > links over from an old piece of equipment to a new one. If 3 to 5
> > seconds is a bit tight to move the link, you can push people's response
> > expectations out in the outage notice (e.g., "between 7 and 8 am, we
> > will be conducting network maintenance. During this period, you may
> > encounter system delays of up to 5 to 10 seconds"). I think asking
> > people to wait any longer than 10 seconds means this is a service
> > impacting outage and should be scheduled out of normal operating
> > hours.
> >
> > Also make sure that anything/any protocols that may cause the new
> > equipment to take longer than 3 to 5 seconds to bring up the link are
> > temporarily or permanently switched off. Traditional STP would be a
> > prime example (make sure there isn't a loop in the network topology at
> > all, or at least during the cut-over window if you're going to switch
> > STP back on later). Bear in mind that your window from
> > "working-to-working" is the 5 to 10 seconds (or 3 to 5 normally), so
> > e.g., BGP sessions might come up within a few seconds, but if
> > downloading the full route table, resolving the routes and putting
> > them into the FIB is going to take more than 10 seconds, you'll have
> > to do a proper service impacting outage at an appropriate time.
> >
> > Finally, remember that UDP and DCCP don't do recovery from packet
> > loss, so if your apps are using them, they'll either have to be
> > tolerant of packet loss of up to 10 (or 3 to 5) seconds, do recovery
> > themselves, or should be rewritten to use TCP or SCTP.
> >
> > <snip>
>
> One last thing, you also need to know the characteristics of your
> reliable protocols and how persistently they attempt to recover from
> packet loss. If your reliable protocol gives up within the 3 to 5 or 5
> to 10 second window, your customers/users will suffer an outage. TCP,
> for example, doesn't give up easily. If I recall correctly, it will
> try for up to around 9 minutes, and tries at doubling intervals up
> until 64 seconds and then every 64 seconds, i.e., attempts at 1, 2, 4,
> 8, 16, 32, 64, 64, 64, ... seconds.
>
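
To make the arithmetic in the quoted retry schedule concrete, here is a small
Python sketch. It assumes exactly the figures recalled in the thread (first
retry after 1 second, doubling up to a 64 second cap, giving up at around 9
minutes); these are not measured from any particular TCP stack, and real
implementations derive their timers from the measured round-trip time and
from stack-specific retry limits. The function names are hypothetical.

```python
def retransmit_schedule(initial=1, cap=64, budget=540):
    """Return (interval, elapsed) pairs for each retry under doubling backoff.

    initial, cap and budget are in seconds and default to the figures
    recalled in the thread (1 s, 64 s, roughly 9 minutes).
    """
    schedule = []
    interval, elapsed = initial, 0
    while elapsed + interval <= budget:
        elapsed += interval
        schedule.append((interval, elapsed))
        # Double the wait each time, but never beyond the cap.
        interval = min(interval * 2, cap)
    return schedule


def attempts_within(window, **kwargs):
    """Count the retries that fire within `window` seconds of the first loss."""
    return sum(1 for _, elapsed in retransmit_schedule(**kwargs)
               if elapsed <= window)


if __name__ == "__main__":
    for interval, elapsed in retransmit_schedule():
        print(f"wait {interval:>2}s -> retry at t={elapsed}s")
    print("retries inside a 5 s window: ", attempts_within(5))
    print("retries inside a 10 s window:", attempts_within(10))
```

Under these assumptions only two retries land inside a 5 second window and
three inside a 10 second window, so a cut-over that overruns slightly can
leave users waiting until the next, much later, retry.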