[AusNOG] Upstream PMTUD broken? Packets blackhole
s.s.o.n.i.k+a.u.s.n.o.g at gmail.com
Wed Sep 14 16:51:23 EST 2016
Thanks guys.
Mark: Yes, this is an end-user firewall, and the one on which I’ve done most of my troubleshooting work (big thanks to their ISP for their assistance). This firewall was the first place I checked. The MTU is currently set to 1484 to check whether the firewall itself is at fault; it has also been set much lower (1400), with no effect. Unfortunately, the firewall is sending ICMP type 3 code 4 (fragmentation needed) correctly – those messages are just not making it back to the sending device. Testing through AAPT to this firewall works – just not when the path runs through Vocus.
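(For anyone who wants to reproduce the failure mode, a rough sketch of the kind of test I mean – Linux syntax; the target address and interface below are only placeholders:)

    # Send a full-size packet with DF set, so any hop with a smaller MTU
    # has to return ICMP type 3 code 4 instead of fragmenting:
    ping -M do -s 1472 203.0.113.1    # 1472 + 28 bytes of headers = 1500

    # At the same time, watch whether frag-needed replies ever come back:
    tcpdump -ni eth0 'icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'

(If the pings black-hole and nothing shows up in the capture, PMTUD is broken somewhere on the path.)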
Jon: That’s one of the theories I’ve been looking at. It seems to be incredibly difficult to glean any information from upstream providers about configuration particulars at IXs unless you’re directly responsible for the link, let alone to have corrections made. The knock-on effects of breaking the link during troubleshooting are too great. If you have any ideas on how I can request a check, though, I’m all ears. So far I’ve had to work outwards from the affected devices. I can tell that the changes the helpful ISP made did partially help: another client’s tail with the same ISP can now use a larger MTU than it could before (no idea why that problem existed, but it was collaterally rectified during this round of troubleshooting).
Adam: I’m so glad (though also sorry) to hear that others have seen these issues as well. I’ve been fighting hard against the “check your MTU” pushback. No one seems to listen when I tell them the MTU can be tiny, but if larger packets are being dropped upstream somewhere and the notification of that drop never reaches the sender, lowering it is completely pointless. Additionally, “fixing” the problem this way doesn’t involve actually looking for its source. I can change the MTU and clamp MSS on my side, but I can’t do anything about inbound traffic that never reaches the client devices in the first place, or about devices that deliberately break the standards by sending large packets straight off the bat. Kludges, while partially helpful in the short term, are not the final solution; they should be used as a stop-gap until the root cause is fixed. Another kludge has been to strip the DF bit from all traffic, or simply to ignore it completely. That doesn’t actually help much, as it breaks a wide array of protocols – especially VPNs and remote access. Frustrating!
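(For reference, the MSS-clamping kludge I mean is along these lines on a Linux router – exact syntax varies by platform:)

    # Rewrite the MSS in forwarded TCP SYNs to fit the discovered path MTU:
    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
             -j TCPMSS --clamp-mss-to-pmtu

(And it only helps TCP sessions set up through that box, which is exactly why it’s a stop-gap rather than a fix.)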
Cameron: Thanks for that link. Definitely a resource I can point my managers to when they have trouble understanding my answers to their questions about why this is a big problem.
I’m stuck with what else to do at the moment. I’ve just had a call from an end-user product support company (in Sydney – funny, that) regarding difficulties with remote access to a client’s head office. Inbound on Telstra TWI, the connection keeps resetting. I switched them over to try a Vocus-based connection instead (those are the only two connections available). Same issue. Both run through NSW IXs, and there’s nothing I can do about their ISP’s connections.
Whatever is causing this is the root of so many other varied, intermittent, and “low-priority” problems that no one looks any further:
* Delayed email eventually makes it through – low priority, because the email made it in the end.
* Emails with attachments sometimes go missing, and it goes unnoticed for weeks – mainly because emails without attachments still get through. It’s only when one of those attachments is an invoice that an issue gets raised – they’re the only emails that ever get followed up.
* VPN clients drop out – they reconnect, so no problem. Except it happens time after time.
* Internet is “slow” – it’s not completely offline, so it must be congestion. Anyone think that congestion could be significantly reduced if the same data didn’t have to be retransmitted 100 times before it goes through?
* Webpages time out – refresh the page and it loads properly. Yep, that’s because most of the page is already cached. Banking websites are an exception, and even they’ve started rolling out updates so that MTU issues affect them less than before.
Clients I have overseas only experience problems with Australia, and not just with networks I control. That definitely points to a wider issue, and one I can’t troubleshoot without help from upstream.
If I’ve messed up somewhere, I’ll be the first to admit fault.
Kind Regards,
-Dave
-----------------------------
Ethernet firewall? Why is the MTU so low?
These problems are virtually always caused by a firewall blocking ICMP. Usually it’s a firewall operated by the person who is complaining about PMTUD not working :-)
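(On a Linux firewall the check is whether ICMP fragmentation-needed is allowed in at all – roughly something like the rule below; other platforms spell the same idea differently:)

    # Permit the PMTUD signalling that blanket ICMP filtering tends to catch:
    iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT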
- mark
-----------------------------
-----------------------------
Hope you figure it out! One thought is to do with passing traffic through an IX.
If there's a change in MTU at the IX, you may have trouble. IX IP address space is often not globally routable, so an ICMP response coming from a router using that address space can be lost.
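(One quick way to see where the path MTU drops, and whether anything answers from that address space – tracepath on Linux probes the MTU hop by hop; the target below is just an example:)

    tracepath -n 203.0.113.1

(A hop where the reported pmtu shrinks but no reply appears, or where the reply comes from a non-routable address your filters would drop, is the one to chase.)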
Cheers,
Jon
-----------------------------
-----------------------------
I had the same problem with 2 separate Telstra ADSL connections; they had been working fine for many, many years. All of a sudden, traffic over a VPN to NZ was impacted and some customer loyalty traffic to Europe was broken.
I had to lower the MTU even further and that fixed it. Many other people had the same problem.
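(For reference, on a Linux-based router that sort of change is along the lines of the following – interface name and value are just examples; on an ADSL CPE it's usually a WAN-settings field instead:)

    ip link set dev ppp0 mtu 1400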
- Adam
-----------------------------
-----------------------------
http://shouldiblockicmp.com/ – share it with your friends! Great resource.
- Cameron
-----------------------------