[AusNOG] Crashes all round on Tuesday

Wed Jul 1 18:31:56 EST 2015

The longest dropout a particular application can withstand without its
session being torn down (which is implementation dependent) and the longest
dropout you can experience without _any_ applications being affected are
different things. If a TCP session is closing and you pull the plug, the
reset may be left wandering the network. If the network returns later than
TIME_WAIT, there may be issues.
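
As a rough illustration of "implementation dependent": how long a session
survives a dropout depends partly on what the application asks of its
socket. The Python sketch below sets an application-level timeout, enables
keepalive, and (on Linux only, where the option exists) TCP_USER_TIMEOUT.
The helper name make_tolerant_socket and the values are made up for
illustration; they are not recommendations.

```python
import socket

def make_tolerant_socket(app_timeout_s=10.0, user_timeout_ms=30000):
    """Create an unconnected TCP socket with explicit dropout tolerances.

    Hypothetical helper; the default values are illustrative only.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Application-level limit: blocking calls raise a timeout after this.
    s.settimeout(app_timeout_s)
    # Ask the kernel to probe idle connections (probe timing is OS dependent).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux only: cap how long sent data may remain unacknowledged.
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        try:
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                         user_timeout_ms)
        except OSError:
            pass  # option not supported here; keep the OS default
    return s
```

Two applications on the same host can therefore see very different
outcomes from the same dropout, depending on the options each one sets.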

Paul Wilkins

On 1 July 2015 at 17:38, Mark Smith <markzzzsmith at gmail.com> wrote:

> On 1 July 2015 at 16:40, Paul Wilkins <paulwilkins369 at gmail.com> wrote:
> > Mark,
> > It's implementation specific (depends what options you pass to
> > setsockopt/sockstream).
> >
> > There are problems resulting from having TIME_WAIT too long, with wandering
> > duplicates. Supposedly RFC1337 says TIME_WAIT should be at least 2 minutes,
> > but on my Linux box, I just timed a dropped socket, and it timed out after
> > one minute.
> >
>
> So I just checked Stevens Volume 1, which is where I read about this
> (back in 1998 or earlier IIRC). The timer that triggers retransmission
> is the Retransmission Timeout or RTO, which is derived from the measured
> round-trip time and updated for the TCP connection (as, for example, the
> topology of the network could change while the TCP connection is active).
> Once the RTO times out, the retransmission intervals I mentioned occur.
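
A minimal sketch of how such a timer can be kept up to date from
round-trip measurements, assuming the smoothed estimator specified in
RFC 6298 (real stacks differ in details such as the minimum RTO, and clock
granularity is ignored here). The function name update_rto is
hypothetical.

```python
def update_rto(srtt, rttvar, sample, alpha=1/8, beta=1/4, min_rto=1.0):
    """Return (srtt, rttvar, rto) in seconds after one RTT sample.

    Pass srtt=None for the first measurement on a connection.
    """
    if srtt is None:
        # First measurement: SRTT = R, RTTVAR = R / 2.
        srtt, rttvar = sample, sample / 2
    else:
        # RTTVAR is updated first because it uses the previous SRTT.
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
        srtt = (1 - alpha) * srtt + alpha * sample
    # RTO = SRTT + 4 * RTTVAR, never below the configured floor.
    rto = max(min_rto, srtt + 4 * rttvar)
    return srtt, rttvar, rto
```

Feeding each new measurement back in, e.g.
srtt, rttvar, rto = update_rto(srtt, rttvar, sample), lets the timeout
follow changes in the path while the connection is active.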
>
> The TIME_WAIT timer you're describing is the one used after the TCP
> connection has closed, and it is there to ensure any TCP segments that
> belong to the closed TCP connection that might still be floating
> around the network expire.
>
>
>
> > Paul Wilkins
> >
> > On 1 July 2015 at 15:17, Mark Smith <markzzzsmith at gmail.com> wrote:
> >>
> >> On 1 July 2015 at 15:11, Mark Smith <markzzzsmith at gmail.com> wrote:
> >> > On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
> >> >> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
> >> >>>
> >> >>>
> >> >>> I had several links go down at 10:00 (give or take a few seconds) -
> >> >>> well, not mine so much as my upstream - and it's been blamed on this
> >> >>> issue.
> >> >>>
> >> >>
> >> >> So from a little bit of Human Computer Interaction (HCI) I studied
> >> >> many years ago, I remember that humans will wait for some sort of
> >> >> response for between 3 and 5 seconds. So if the period of your packet
> >> >> loss and the retransmission to recover from it is short enough, the
> >> >> humans affected may notice a slight delay, but they won't take any
> >> >> remedial actions themselves (i.e., they won't push the submit button
> >> >> again, and won't complain about it.)
> >> >>
> >> >
> >> > This can also be particularly useful to know when cutting a set of
> >> > links over from an old piece of equipment to a new one. If 3 to 5
> >> > seconds is a bit tight to move the link, you can push people's response
> >> > expectations out in the outage notice (e.g., "between 7 and 8 am, we
> >> > will be conducting network maintenance. During this period, you may
> >> > encounter system delays of up to 5 to 10 seconds"). I think asking
> >> > people to wait any longer than 10 seconds means this is a service
> >> > impacting outage and should be scheduled out of normal operating
> >> > hours.
> >> >
> >> > Also make sure that anything/any protocols that may cause the new
> >> > equipment to take longer than 3 to 5 seconds to bring up the link is
> >> > temporarily or permanently switched off. Traditional STP would be a
> >> > prime example (make sure there isn't a loop in the network topology at
> >> > all, or at least during the cut-over window if you're going to switch
> >> > STP back on later). Bear in mind that your window from
> >> > "working-to-working" is the 5 to 10 seconds (or 3 to 5 normally), so
> >> > e.g., BGP sessions might come up within a few seconds, but if
> >> > downloading the full route table, resolving the routes and putting
> >> > them into the FIB is going to take more than 10 seconds, you'll have
> >> > to do a proper service impacting outage at an appropriate time.
> >> >
> >> > Finally, remember that UDP and DCCP don't do recovery from packet
> >> > loss, so if your apps are using them, they'll either have to be
> >> > tolerant of packet loss of up to 10 (or 3 to 5) seconds, do recovery
> >> > themselves, or should be rewritten to use TCP or SCTP.
> >> >
> >> > <snip>
> >>
> >> One last thing: you also need to know the characteristics of your
> >> reliable protocols and how persistently they attempt to recover from
> >> packet loss. If your reliable protocol gives up within the 3 to 5 or 5
> >> to 10 second window, your customers/users will suffer an outage. TCP,
> >> for example, doesn't give up easily. If I recall correctly, it will
> >> try for up to around 9 minutes, retrying at doubling intervals up
> >> to 64 seconds and then every 64 seconds, i.e., attempts at 1, 2, 4,
> >> 8, 16, 32, 64, 64, 64, ... seconds.
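
The doubling-then-capped schedule described above can be sketched in a
few lines of Python. This only models the figures quoted in the thread
(1 second start, 64 second cap, roughly 9 minutes before giving up); the
function name and the 540 second budget are assumptions for illustration,
and actual TCP stacks derive their timers from the measured RTO.

```python
def retransmit_schedule(initial=1, cap=64, give_up_after=540):
    """Return the waits (seconds) between attempts until the budget is spent."""
    waits, elapsed, wait = [], 0, initial
    while elapsed + wait <= give_up_after:
        waits.append(wait)
        elapsed += wait
        # Double the wait each time, but never beyond the cap.
        wait = min(wait * 2, cap)
    return waits

schedule = retransmit_schedule()
print(schedule)       # the individual waits
print(sum(schedule))  # total seconds before giving up
```

Under these assumptions the retransmissions fall at about 1, 3, 7 and 15
seconds after the original send, so a 10 second outage would only be
recovered by the attempt at around the 15 second mark.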
> >> _______________________________________________
> >> AusNOG mailing list
> >> AusNOG at lists.ausnog.net
> >> http://lists.ausnog.net/mailman/listinfo/ausnog
>