[AusNOG] Detecting "hung" ssh sessions.

Mon Feb 22 14:07:41 EST 2016

Hi Noggers.

Looking for a "bright idea" or a point in the right direction.

I have a bunch of remote devices that live behind NAT and firewalls, in 
uncontrolled environments. It's not always (or even frequently) possible 
to get those in charge of said NAT boxes to do PAT to my devices, so 
instead I have each device ssh to one of my hosts and create a reverse 
tunnel. (The tunnels are bound only to the loopback interface on my 
server, so the end devices are not significantly exposed to the outside 
world.)
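
For anyone unfamiliar with the setup, here is a minimal sketch of the kind of 
reverse tunnel described above. The host name, user name and port number are 
placeholders invented for illustration, not the real values, and the function 
only prints the command line so it can be inspected before use.

```shell
#!/bin/sh
# Sketch only: TUNNEL_HOST, TUNNEL_USER and TUNNEL_PORT are hypothetical
# placeholders, not values from the real deployment.
TUNNEL_HOST="${TUNNEL_HOST:-tunnel.example.net}"
TUNNEL_USER="${TUNNEL_USER:-tunnel}"
TUNNEL_PORT="${TUNNEL_PORT:-20022}"

# Print the ssh command line a remote device would run.
tunnel_cmd() {
    # -N: do not run a remote command, just hold the forwarding open.
    # -R: listen on the server's loopback address only, and forward
    #     connections back to sshd (port 22) on this device.
    echo "ssh -N -R 127.0.0.1:${TUNNEL_PORT}:127.0.0.1:22 ${TUNNEL_USER}@${TUNNEL_HOST}"
}

tunnel_cmd
```

With a tunnel like this up, logging in to the terminating host and running 
ssh against the chosen port on 127.0.0.1 lands on the remote device.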

As and when I need to access a remote box, I ssh to the terminating host, 
then ssh to the appropriate port on its loopback interface and have 
immediate shell access on the remote box.

Each remote box also periodically (cron) checks that the ssh session is 
(still) running (simple ps) and (re)starts it if not.
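
A rough sketch of that sort of check, assuming the tunnel was started with 
"ssh -N -R" (the pattern and the function names are made up for this 
example, not taken from the real script):

```shell
#!/bin/sh
# Sketch only: the pattern and function names are hypothetical.
# The [s] bracket stops grep from matching its own command line.
TUNNEL_PATTERN='[s]sh -N -R'

# Succeeds if any line of the process listing on stdin matches the pattern.
tunnel_listed() {
    grep -q -e "$1"
}

# Report whether a tunnel process currently exists on this device.
tunnel_state() {
    if ps ax -o command | tunnel_listed "$TUNNEL_PATTERN"; then
        echo "running"
    else
        echo "missing"
    fi
}
```

A cron job would call tunnel_state every few minutes and start the tunnel 
again whenever it prints "missing". Note that this only proves the ssh 
process exists, not that the tunnel is passing traffic.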

This generally works well.

Alas, this morning, my provider had a brief oopsee (no explanation 
forthcoming) where 100% of my external connectivity dropped for a few 
minutes.

This resulted in every last one of these tunnels breaking, but they've 
broken in such a way that they didn't restart. The terminating host shows 
no connections from any of the remote devices, yet all of the remote 
devices still have their ssh session running. They simply are not passing 
any traffic. Yes, I have keepalives enabled.

Does anyone know of a simple, effective, reliable way to detect (from the 
client end) the loss of end-to-end function of a tunnel like this without 
going completely overboard? Installing replacement versions of ssh isn't 
going to work for me, and neither is running autossh.

Things I've looked at but had no luck with include adding a static route 
to my server and looking for either byte counters or last-used timers with 
netstat, looking for per-process traffic or TCP counters, and a number of 
other avenues that also failed.

I could add ipfw and pass traffic through a rule to observe if it's passing 
traffic or not, but that has lots of other negative impacts, especially on 
a few machines that are already balls-to-the-wall on their network 
interfaces.

I considered tcpdump to see when a packet was last received, but it too 
has lots of other overheads.

I'm sure I'm not the only person to have ever faced this, and lots of 
people will have overcome it, but I can't find any information on it. 
(There's any amount of help for unresponsive/stuck *interactive* sessions, 
but that doesn't help me!)

Fingers crossed someone here can throw me a line...

RossW