[AusNOG] Crashes all round on Tuesday

Mark Smith markzzzsmith at gmail.com
Wed Jul 1 17:38:39 EST 2015


On 1 July 2015 at 16:40, Paul Wilkins <paulwilkins369 at gmail.com> wrote:
> Mark,
> It's implementation specific (depends what options you pass to
> setsockopt/sockstream).
>
> There are problems resulting from having TIME_WAIT too long, with wandering
> duplicates. Supposedly RFC1337 says TIME_WAIT should be at least 2 minutes,
> but on my Linux box, I just timed a dropped socket, and it timed out after
> one minute.
>

So I just checked Stevens Volume 1, which is where I read about this
(back in 1998 or earlier IIRC). The timer that triggers retransmission
is the Retransmission Timeout or RTO, which is derived from the measured
round-trip time and updated over the life of the TCP connection (as, for
example, the topology of the network could change while the TCP
connection is active). Once the RTO expires, the retransmission
intervals I mentioned occur.
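
As a rough illustration of those intervals, here is a minimal Python
sketch of the doubling-then-capped backoff. It is not taken from any
real TCP stack. The function names, the 1 second starting interval, the
64 second cap and the 540 second (about 9 minutes) give-up budget are
all assumptions based on the figures recalled in this thread, and the
numbers are treated as gaps between attempts.

```python
def backoff_schedule(initial=1, cap=64, give_up_after=540):
    """Return the waits (in seconds) between successive retransmissions.

    Assumed values, not read from any real stack: start at `initial`,
    double after each attempt, never exceed `cap`, and stop once the
    next wait would push the total past `give_up_after`.
    """
    waits = []
    elapsed = 0
    wait = initial
    while elapsed + wait <= give_up_after:
        waits.append(wait)
        elapsed += wait
        wait = min(wait * 2, cap)
    return waits


def attempts_within(waits, window):
    """Count retransmissions sent no later than `window` seconds in."""
    elapsed = 0
    count = 0
    for wait in waits:
        elapsed += wait
        if elapsed > window:
            break
        count += 1
    return count


schedule = backoff_schedule()
print(schedule)                       # waits between attempts
print(sum(schedule))                  # total time before giving up
print(attempts_within(schedule, 10))  # attempts inside a 10 s window
```

With these assumed values the sketch produces 13 retransmissions spread
over 511 seconds, and only the first three fall inside a 10 second
window, so a short cut-over gets very few chances to be rescued by a
retransmission before users start to notice.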

The TIME_WAIT timer you're describing is the one used after the TCP
connection has closed, and it is there to ensure any TCP segments that
belong to the closed TCP connection that might still be floating
around the network expire.



> Paul Wilkins
>
> On 1 July 2015 at 15:17, Mark Smith <markzzzsmith at gmail.com> wrote:
>>
>> On 1 July 2015 at 15:11, Mark Smith <markzzzsmith at gmail.com> wrote:
>> > On 1 July 2015 at 14:56, Mark Smith <markzzzsmith at gmail.com> wrote:
>> >> On 1 July 2015 at 12:33, Ross Wheeler <ausnog at rossw.net> wrote:
>> >>>
>> >>>
>> >>> I had several links go down at 10:00 (give or take a few seconds) -
>> >>> well, not mine so much as my upstream - and it's been blamed on this
>> >>> issue.
>> >>>
>> >>
>> >> So from a little bit of Human Computer Interaction (HCI) I studied
>> >> many years ago, I remember that humans will wait for some sort of
>> >> response for between 3 and 5 seconds. So if the period of your packet
>> >> loss and the retransmission to recover from it is short enough, the
>> >> humans affected may notice a slight delay, but they won't take any
>> >> remedial actions themselves (i.e., they won't push the submit button
>> >> again, and won't complain about it).
>> >>
>> >
>> > This can also be particularly useful to know when cutting a set of
>> > links over from an old piece of equipment to a new one. If 3 to 5
>> > seconds is a bit tight to move the link, you can push people's
>> > response expectations out in the outage notice (e.g., "between 7 and
>> > 8 am, we will be conducting network maintenance. During this period,
>> > you may encounter system delays of up to 5 to 10 seconds"). I think asking
>> > people to wait any longer than 10 seconds means this is a service
>> > impacting outage and should be scheduled out of normal operating
>> > hours.
>> >
>> > Also make sure that anything/any protocols that may cause the new
>> > equipment to take longer than 3 to 5 seconds to bring up the link is
>> > temporarily or permanently switched off. Traditional STP would be a
>> > prime example (make sure there isn't a loop in the network topology at
>> > all, or at least during the cut-over window if you're going to switch
>> > STP back on later). Bear in mind that your window from
>> > "working-to-working" is the 5 to 10 seconds (or 3 to 5 normally), so
>> > e.g., BGP sessions might come up within a few seconds, but if
>> > downloading the full route table, resolving the routes and putting
>> > them into the FIB is going to take more than 10 seconds, you'll have
>> > to do a proper service impacting outage at an appropriate time.
>> >
>> > Finally, remember that UDP and DCCP don't do recovery from packet
>> > loss, so if your apps are using them, they'll either have to be
>> > tolerant of packet loss of up to 10 (or 3 to 5) seconds, do recovery
>> > themselves, or should be rewritten to use TCP or SCTP.
>> >
>> > <snip>
>>
>> One last thing, you also need to know the characteristics of your
>> reliable protocols and how persistently they attempt to recover from
>> packet loss. If your reliable protocol gives up within the 3 to 5 or 5
>> to 10 second window, your customers/users will suffer an outage. TCP,
>> for example, doesn't give up easily. If I recall correctly, it will
>> try for up to around 9 minutes, and tries at doubling intervals up
>> until 64 seconds and then every 64 seconds, i.e., attempts at 1, 2, 4,
>> 8, 16, 32, 64, 64, 64, ... seconds.
>> _______________________________________________
>> AusNOG mailing list
>> AusNOG at lists.ausnog.net
>> http://lists.ausnog.net/mailman/listinfo/ausnog
>

