[AusNOG] Crashes all round on Tuesday

Paul Wilkins paulwilkins369 at gmail.com
Wed Jul 1 21:49:11 EST 2015


There are a lot of ways you can meddle with socket options and timers,
enabling TCP keepalive for example.
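
As a minimal sketch of the keepalive case, in Python: the helper name
enable_keepalive and the 60/10/5 values are illustrative assumptions, not
recommendations, and the three timer options are only set where the platform
exposes them (they are present on Linux, other stacks vary).

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive and, where available, tune its timers.

    idle, interval and count are hypothetical example values.
    """
    # Portable part: ask the stack to send keepalive probes on an idle socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Platform-specific part: these constants do not exist everywhere,
    # so look them up by name and skip any that are missing.
    tunables = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in tunables:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values, a Linux box would start probing an idle peer after
60 seconds and give up roughly 50 seconds later if nothing comes back.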

But I was thinking more about the worst case: how long can you take a line
down for and not have any sessions fail? The worst case, as far as I can make
out, is where there's a parallel path available: you drop one socket, and
another session, taking a different path, has its TCP sequence number wrap.
Then a packet from the former socket matches the sequence, and either resets
the socket or corrupts the data.
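
To put a rough number on the sequence wrap, here is a back-of-envelope sketch
in Python. It assumes a single connection filling the link with payload and
ignores header overhead, so treat it as a lower bound on the wrap time; the
function name is made up for illustration.

```python
SEQUENCE_SPACE_BYTES = 2 ** 32  # TCP sequence numbers are 32 bits, counted in bytes


def seq_wrap_seconds(bits_per_second):
    """Seconds for one connection to cycle through the whole sequence space.

    Assumes the connection saturates the link with payload bytes only.
    """
    return SEQUENCE_SPACE_BYTES * 8 / bits_per_second


# Example link speeds (illustrative): 100 Mbit/s, 1 Gbit/s, 10 Gbit/s.
for rate in (100e6, 1e9, 10e9):
    print(f"{rate / 1e6:>8.0f} Mbit/s -> wrap after {seq_wrap_seconds(rate):.1f} s")
```

At 1 Gbit/s that works out to about 34 seconds, and at 10 Gbit/s about 3.4
seconds. TCP timestamps with PAWS (RFC 7323) are meant to guard against old
duplicates in this situation, if both ends negotiate them.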

Paul Wilkins

On 1 July 2015 at 19:02, Mark Smith <markzzzsmith at gmail.com> wrote:

> On 1 July 2015 at 18:31, Paul Wilkins <paulwilkins369 at gmail.com> wrote:
> > The longest dropout a particular application can withstand without the
> > session getting torn down (which is implementation dependent), and the
> > longest dropout you can experience without _any_ applications being
> > affected, are different things.
>
> Are you describing a scenario where a specific application has changed
> TCP's default operating parameters e.g., timeouts?
>
> If you are, and the application's TCP parameters have been set so low
> that the application will not tolerate a 3 to 5 second period of
> transient packet loss, then you wouldn't be able to do what I've
> suggested you do, would you? Of course, you're making assurances that
> no possible transient event in your network that could impact that
> particular application's traffic will take any longer than what you've
> lowered the TCP timeout parameters to, and you would know you've made
> those assurances.
>
> > If a TCP session is closing, and you pull
> > the plug, the reset may be left wandering the network.
>
> Not endlessly. Either its TTL/Hop Count will reach zero, or it will be
> dropped because the destination is unreachable.
>
> > If the network
> > returns later than TIME_WAIT, there may be issues.
> >
>
> The reset will be gone by then.
>
> > Paul Wilkins
> >
> > On 1 July 2015 at 17:38, Mark Smith <markzzzsmith at gmail.com> wrote:
> >>
> >> On 1 July 2015 at 16:40, Paul Wilkins <paulwilkins369 at gmail.com> wrote:
> >> > Mark,
> >> > It's implementation specific (depends what options you pass to
> >> > setsockopt/sockstream).
> >> >
> >> > There are problems resulting from having TIME_WAIT too long, with
> >> > wandering duplicates. Supposedly RFC1337 says TIME_WAIT should be at
> >> > least 2 minutes, but on my Linux box, I just timed a dropped socket,
> >> > and it timed out after one minute.
> >> >
> >>
> >> So I just checked Stevens Volume 1, which is where I read about this
> >> (back in 1998 or earlier IIRC). The timer that triggers retransmission
> >> is the Retransmission Timeout or RTO, which is measured and updated for
> >> the TCP connection (as, for example, the topology of the network could
> >> change while the TCP connection is active). Once the RTO times out,
> >> the retransmission intervals I mentioned occur.
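
For reference, the way the RTO is derived from round trip samples can be
sketched roughly as below. This follows the smoothing in RFC 6298 as I
understand it (alpha = 1/8, beta = 1/4, K = 4). The class name, the clock
granularity and the clamping defaults are assumptions for illustration, and
real stacks differ in the details.

```python
class RtoEstimator:
    """Retransmission timeout estimate from RTT samples (RFC 6298 style)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.granularity = granularity  # assumed clock granularity, seconds
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.srtt = None     # smoothed RTT, unset until the first sample
        self.rttvar = None   # RTT variance estimate
        self.rto = 1.0       # initial RTO before any measurement

    def sample(self, rtt):
        """Feed one RTT measurement (seconds) and return the updated RTO."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # The variance is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        raw = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(raw, self.min_rto), self.max_rto)
        return self.rto
```

Because the estimate keeps being updated from new samples, a change in path
during the life of the connection shows up in the RTO after a few round trips.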
> >>
> >> The TIME_WAIT timer you're describing is the one used after the TCP
> >> connection has closed, and it is there to ensure any TCP segments that
> >> belong to the closed TCP connection that might still be floating
> >> around the network expire.
> >>
> >>
> >>
> >> > Paul Wilkins
> >> >
> >> > On 1 July 2015 at 15:17, Mark Smith <markzzzsmith at gmail.com> wrote:
> >> >>
> >> >> On 1 July 2015 at 15:11, Mark Smith <markzzzsmith at gmail.com> wrote:
> >> >> > On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
> >> >> >> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
> >> >> >>>
> >> >> >>>
> >> >> >>> I had several links go down at 10:00 (give or take a few seconds) -
> >> >> >>> well, not mine so much as my upstream - and it's been blamed on
> >> >> >>> this issue.
> >> >> >>>
> >> >> >>
> >> >> >> So from a little bit of Human Computer Interaction (HCI) I studied
> >> >> >> many years ago, I remember that humans will wait for some sort of
> >> >> >> response for between 3 and 5 seconds. So if the period of your
> >> >> >> packet loss and the retransmission to recover from it is short
> >> >> >> enough, the humans affected may notice a slight delay, but they
> >> >> >> won't take any remedial actions themselves (i.e., they won't push
> >> >> >> the submit button again, and won't complain about it).
> >> >> >>
> >> >> >
> >> >> > This can also be particularly useful to know when cutting a set of
> >> >> > links over from an old piece of equipment to a new one. If 3 to 5
> >> >> > seconds is a bit tight to move the link, you can push people's
> >> >> > response expectations out in the outage notice (e.g., "between 7
> >> >> > and 8 am, we will be conducting network maintenance. During this
> >> >> > period, you may encounter system delays of up to 5 to 10 seconds").
> >> >> > I think asking people to wait any longer than 10 seconds means this
> >> >> > is a service impacting outage and should be scheduled out of normal
> >> >> > operating hours.
> >> >> >
> >> >> > Also make sure that anything/any protocols that may cause the new
> >> >> > equipment to take longer than 3 to 5 seconds to bring up the link
> >> >> > is temporarily or permanently switched off. Traditional STP would
> >> >> > be a prime example (make sure there isn't a loop in the network
> >> >> > topology at all, or at least during the cut-over window if you're
> >> >> > going to switch STP back on later). Bear in mind that your window
> >> >> > from "working-to-working" is the 5 to 10 seconds (or 3 to 5
> >> >> > normally), so e.g., BGP sessions might come up within a few
> >> >> > seconds, but if downloading the full route table, resolving the
> >> >> > routes and putting them into the FIB is going to take more than 10
> >> >> > seconds, you'll have to do a proper service impacting outage at an
> >> >> > appropriate time.
> >> >> >
> >> >> > Finally, remember that UDP and DCCP don't do recovery from packet
> >> >> > loss, so if your apps are using them, they'll either have to be
> >> >> > tolerant of packet loss of up to 10 (or 3 to 5) seconds, do
> >> >> > recovery themselves, or should be rewritten to use TCP or SCTP.
> >> >> >
> >> >> > <snip>
> >> >>
> >> >> One last thing, you also need to know the characteristics of your
> >> >> reliable protocols and how persistently they attempt to recover from
> >> >> packet loss. If your reliable protocol gives up within the 3 to 5 or
> >> >> 5 to 10 second window, your customers/users will suffer an outage.
> >> >> TCP, for example, doesn't give up easily. If I recall correctly, it
> >> >> will try for up to around 9 minutes, and tries at doubling intervals
> >> >> up until 64 seconds and then each 64 seconds, i.e., attempts at 1, 2,
> >> >> 4, 8, 16, 32, 64, 64, 64, ... seconds.
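
The schedule quoted above can be sketched as a few lines of Python. The
function name and the 540 second (9 minute) give-up budget are assumptions
taken from the recalled figures, not values read out of any particular TCP
stack.

```python
def backoff_schedule(initial=1.0, cap=64.0, give_up_after=540.0):
    """Return the retry intervals (seconds) that fit inside the give-up budget.

    Intervals double from `initial` until they reach `cap`, then repeat at `cap`.
    """
    intervals = []
    elapsed = 0.0
    wait = initial
    while elapsed + wait <= give_up_after:
        intervals.append(wait)
        elapsed += wait
        wait = min(wait * 2, cap)  # exponential backoff with a ceiling
    return intervals


schedule = backoff_schedule()
print(schedule)
print("attempts:", len(schedule), "total seconds:", sum(schedule))
```

Under those assumptions the doubling phase only covers the first 127 seconds.
A 3 to 5 or 5 to 10 second cut-over therefore costs a connection no more than
its first few retries.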
> >> >> _______________________________________________
> >> >> AusNOG mailing list
> >> >> AusNOG at lists.ausnog.net
> >> >> http://lists.ausnog.net/mailman/listinfo/ausnog