[AusNOG] Crashes all round on Tuesday

Mark Smith markzzzsmith at gmail.com
Wed Jul 1 19:02:23 EST 2015


On 1 July 2015 at 18:31, Paul Wilkins <paulwilkins369 at gmail.com> wrote:
> The maximum timeout a particular application can withstand a dropout without
> the session getting torn down, (which is implementation dependent), and the
> maximum timeout you can experience without _any_ applications being
> affected, are different things.

Are you describing a scenario where a specific application has changed
TCP's default operating parameters e.g., timeouts?

If you are, and the application's TCP parameters have been set so low
that the application will not tolerate a 3 to 5 second period of
transient packet loss, then you wouldn't be able to do what I've
suggested, would you? Of course, in that case you're giving an
assurance that no possible transient event in your network that could
impact that particular application's traffic will take any longer than
the value you've lowered the TCP timeout parameters to, and you would
know you've given that assurance.
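
For context, this is roughly how an application would make that kind of
change. It is a minimal sketch in Python, assuming a Linux host where
the TCP_USER_TIMEOUT socket option is exposed; the helper name
set_user_timeout and the 3000 ms value are purely illustrative and are
not taken from any application discussed in this thread.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Ask the kernel to abort the connection if sent data stays
    unacknowledged for longer than timeout_ms milliseconds.

    Returns the value read back from the socket, or None when the
    platform does not expose TCP_USER_TIMEOUT (assumed Linux-only here).
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return None
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, opt)


# Illustrative use: an application that gives up after 3 seconds of
# unacknowledged data, i.e. one that would not ride out a 3 to 5 second
# period of transient packet loss.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    applied = set_user_timeout(sock, 3000)
finally:
    sock.close()
```

An application configured like this is the case where the operator has
to guarantee that no transient event lasts longer than the chosen value.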

> If a TCP session is closing, and you pull
> the plug, the reset may be left wandering the network.

Not endlessly. Either its TTL/Hop Count will reach zero, or it will be
dropped because the destination is unreachable.

> If the network
> returns later than TIME_WAIT, there may be issues.
>

The reset will be gone by then.

> Paul Wilkins
>
> On 1 July 2015 at 17:38, Mark Smith <markzzzsmith at gmail.com> wrote:
>>
>> On 1 July 2015 at 16:40, Paul Wilkins <paulwilkins369 at gmail.com> wrote:
>> > Mark,
>> > It's implementation specific (depends what options you pass to
>> > setsockopt/sockstream).
>> >
>> > There are problems resulting from having TIME_WAIT too long, with
>> > wandering duplicates. Supposedly RFC1337 says TIME_WAIT should be at
>> > least 2 minutes, but on my Linux box, I just timed a dropped socket,
>> > and it timed out after one minute.
>> >
>>
>> So I just checked Stevens Volume 1, which is where I read about this
>> (back in 1998 or earlier IIRC). The timer that triggers retransmission
>> is the Retransmission Timeout or RTO, which is measured and updated for
>> the TCP connection (as, for example, the topology of the network could
>> change while the TCP connection is active). Once the RTO expires,
>> the retransmission intervals I mentioned occur.
>>
>> The TIME_WAIT timer you're describing is the one used after the TCP
>> connection has closed, and it is there to ensure any TCP segments that
>> belong to the closed TCP connection that might still be floating
>> around the network expire.
>>
>>
>>
>> > Paul Wilkins
>> >
>> > On 1 July 2015 at 15:17, Mark Smith <markzzzsmith at gmail.com> wrote:
>> >>
>> >> On 1 July 2015 at 15:11, Mark Smith <markzzzsmith at gmail.com> wrote:
>> >> > On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
>> >> >> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
>> >> >>>
>> >> >>>
>> >> >>> I had several links go down at 10:00 (give or take a few seconds) -
>> >> >>> well, not mine so much as my upstream - and it's been blamed on this
>> >> >>> issue.
>> >> >>>
>> >> >>
>> >> >> So from a little bit of Human Computer Interaction (HCI) I studied
>> >> >> many years ago, I remember that humans will wait for some sort of
>> >> >> response for between 3 and 5 seconds. So if the period of your packet
>> >> >> loss and the retransmission to recover from it is short enough, the
>> >> >> humans affected may notice a slight delay, but they won't take any
>> >> >> remedial actions themselves (i.e., they won't push the submit button
>> >> >> again, and won't complain about it).
>> >> >>
>> >> >
>> >> > This can also be particularly useful to know when cutting a set of
>> >> > links over from an old piece of equipment to a new one. If 3 to 5
>> >> > seconds is a bit tight to move the link, you can push people's
>> >> > response expectations out in the outage notice (e.g., "between 7 and
>> >> > 8 am, we will be conducting network maintenance. During this period,
>> >> > you may encounter system delays of up to 5 to 10 seconds"). I think
>> >> > asking people to wait any longer than 10 seconds means this is a
>> >> > service impacting outage and should be scheduled out of normal
>> >> > operating hours.
>> >> >
>> >> > Also make sure that anything/any protocols that may cause the new
>> >> > equipment to take longer than 3 to 5 seconds to bring up the link
>> >> > are temporarily or permanently switched off. Traditional STP would
>> >> > be a prime example (make sure there isn't a loop in the network
>> >> > topology at all, or at least during the cut-over window if you're
>> >> > going to switch STP back on later). Bear in mind that your window
>> >> > from "working-to-working" is the 5 to 10 seconds (or 3 to 5
>> >> > normally), so e.g., BGP sessions might come up within a few seconds,
>> >> > but if downloading the full route table, resolving the routes and
>> >> > putting them into the FIB is going to take more than 10 seconds,
>> >> > you'll have to do a proper service impacting outage at an
>> >> > appropriate time.
>> >> >
>> >> > Finally, remember that UDP and DCCP don't do recovery from packet
>> >> > loss, so if your apps are using them, they'll either have to be
>> >> > tolerant of packet loss of up to 10 (or 3 to 5) seconds, do recovery
>> >> > themselves, or should be rewritten to use TCP or SCTP.
>> >> >
>> >> > <snip>
>> >>
>> >> One last thing, you also need to know the characteristics of your
>> >> reliable protocols, and how persistent they are when attempting to
>> >> recover from packet loss. If your reliable protocol gives up within
>> >> the 3 to 5 or 5 to 10 second window, your customers/users will suffer
>> >> an outage. TCP, for example, doesn't give up easily. If I recall
>> >> correctly, it will try for up to around 9 minutes, at doubling
>> >> intervals up until 64 seconds and then every 64 seconds, i.e.,
>> >> attempts at 1, 2, 4, 8, 16, 32, 64, 64, 64, ... seconds.
>> >> _______________________________________________
>> >> AusNOG mailing list
>> >> AusNOG at lists.ausnog.net
>> >> http://lists.ausnog.net/mailman/listinfo/ausnog
>
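
The retransmission pattern recalled earlier in the thread (attempts at
1, 2, 4, ... 64, 64, 64 seconds, giving up after roughly 9 minutes) can
be sketched as follows. This is illustrative Python only: the function
names are made up for the example, the 1 second initial interval, 64
second cap and 540 second limit are simply the figures recalled in the
thread, and a real TCP stack derives its first RTO from the measured
round-trip time rather than starting at a fixed value.

```python
def retransmission_schedule(initial=1, cap=64, give_up_after=540):
    """Return (interval, elapsed) pairs in seconds for each retry.

    The interval doubles after every attempt until it reaches cap and
    then stays there; retries stop once the next one would fall after
    give_up_after seconds (an assumed figure, about 9 minutes).
    """
    schedule = []
    elapsed = 0
    interval = initial
    while elapsed + interval <= give_up_after:
        elapsed += interval
        schedule.append((interval, elapsed))
        interval = min(interval * 2, cap)
    return schedule


def attempts_within(window, schedule=None):
    """Count the retries that happen within window seconds of the loss."""
    if schedule is None:
        schedule = retransmission_schedule()
    return sum(1 for _, elapsed in schedule if elapsed <= window)


for interval, elapsed in retransmission_schedule():
    print(f"retry after waiting {interval:>2}s (t = {elapsed}s)")
```

With these figures, attempts_within(10) is 3: retries at about 1, 3 and
7 seconds after the loss fall inside a 10 second cut-over window, and
the next one does not happen until 15 seconds.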

