[AusNOG] QoS on Internet traffic

Tony Miles tmiles42 at gmail.com
Tue Aug 15 10:14:46 EST 2017


Hi all,


I'm not sure if anyone else is having this issue, but we are receiving an
increasing number of requests to give priority/preference to specific
Internet traffic.

Apologies in advance for the lengthy post.

The typical example might be a customer that has five sites that we provide
a 20Mbps private WAN tail into (per site) and then we have a centralised
hosted firewall that all sites access the internet via. The speed on the
central firewall might be capped to something like 50M (all abbreviations
using "M" refer to "Mbps" hereafter). The WAN we provide supports QoS so
that if a client has an application that is important to them it can be
tagged and put in an appropriate queue and treated accordingly. Examples of
this might be that they have an RDP server at the head office site or they
have VoIP PBX gear at each location. The central Internet access is
oversubscribed 2:1 in this example (100M of WAN tails on 50M of Internet).
At this point I think this is all fairly standard stuff that a lot of the
people on this list would be familiar with (hopefully?). This is just one
example; it is of course multiplied by the number of clients we have, who
are all broadly similar, but with each one having different specific
details (different speeds, different things they consider important).
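
As a quick check of the arithmetic in the example above, here is a minimal Python sketch. The function name is made up for illustration; the numbers are the ones from the example (five 20M tails behind a 50M firewall cap).

```python
def oversubscription_ratio(tail_mbps, internet_mbps):
    """Return total WAN tail bandwidth divided by the Internet cap."""
    return sum(tail_mbps) / internet_mbps


tails = [20] * 5  # five sites, one 20M tail each
ratio = oversubscription_ratio(tails, 50)
print(f"{sum(tails)}M of WAN tails on 50M of Internet = {ratio:.0f}:1")
```

Swapping in a different client's tail speeds and firewall cap gives the ratio for that client.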

With the move to cloud everything, clients are moving from hosting stuff
themselves (ie. on their own servers/WAN) to things that are hosted
generically on the Internet. This might be their accounting application,
might be video conferencing or voip services or any number of other things
that for whatever reason they have chosen to procure "as a service" rather
than buying the thing and hosting it locally on premises.

When everything is running normally and there is no excess volume of
traffic nobody complains, but the first time $someone_important is on a
video conference call to an interstate office and the quality is crap
because Windows updates are sucking all of the Internet bandwidth the
question then becomes "please fix this, we purchase a WAN with QoS". The VC
one is particularly nasty because the conference bridge is in the cloud and
so a VC session between three locations that are all on the same private
WAN (with potentially plenty of bandwidth) is effectively 3x VC sessions to
the Internet.

Historically our answer has been "it's the Internet, there is no QoS",
which has sufficed for a while, but it's gotten to the stage where
EVERYTHING is now "in the cloud" and that answer is slowly losing traction.
Combine this with the fact that others out there are promising (rightly or
wrongly) that they can solve the problem for the client, and we continue
to ignore it at our peril.

I should probably add that we DO provide on-net VoIP & VC services for
clients that we can (and do) support properly with QoS but clients are free
to use or not use them as they wish and there are any number of reasons why
they might choose a different Internet based provider of these services
(price, features, integration, historical, etc). There is also the whole
range of other hosted applications that a client might want to access that
we don't host internally and can't get some sort of cross connect or other
arrangement in place to bring the traffic in via something other than
Internet transit.

Our Internet topology is like this (arrows indicating inbound/downstream
traffic flow):

[$transit_provider] ---> [border router] ---> [core router] ---> [firewall]
---> {private WAN}


Right now we shape outbound/egress on the core router towards the firewall
to the speed that is purchased by the client (eg. in above example 50M). It
makes no difference what sort of policy we apply; right now it's just a
plain "shape default queue to x". We COULD in theory apply a proper QoS
policy that puts stuff in queues and provides the required bandwidth to
those queues. The only thing preventing this is the classification of the
traffic (ie. how to decide what goes in each queue). To do this effectively
would (I imagine) require something that can do L7 inspection of traffic to
see that something is "https://important_site.com" and apply appropriate
DSCP marking to the packets. This is of course something that our core
routers cannot do (L7 classification).
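
To make the classification step concrete, here is a rough Python sketch of the decision such a device would have to make: map a destination hostname to a DSCP value. It assumes the hostname is already known to the classifier (how it is obtained is exactly the hard L7 part and is not shown). The hostnames, the policy table and the function names are all hypothetical; the DSCP code points are the standard ones.

```python
from fnmatch import fnmatch

# Standard DSCP code points (decimal values).
DSCP = {"EF": 46, "AF41": 34, "AF21": 18, "BE": 0}

# Hypothetical per-client policy: hostname pattern -> DSCP class name.
# First match wins; the patterns are illustrative only.
POLICY = [
    ("*.vc-provider.example", "AF41"),
    ("accounts.important-site.example", "AF21"),
]


def classify(hostname, policy=POLICY):
    """Return the DSCP value for a hostname, defaulting to best effort."""
    host = hostname.lower().rstrip(".")
    for pattern, cls in policy:
        if fnmatch(host, pattern):
            return DSCP[cls]
    return DSCP["BE"]


def tos_byte(dscp):
    """DSCP occupies the upper six bits of the IP ToS/Traffic Class byte."""
    return dscp << 2


for name in ("bridge.vc-provider.example", "updates.other.example"):
    print(name, classify(name), hex(tos_byte(classify(name))))
```

Anything that does not match a pattern stays in the default queue, which is the same as today's "shape default queue to x" behaviour.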

Options that I've considered:

1. Continue with "Internet => no QoS" - the whole point of this post is
that this position is becoming less viable as everything moves to being
"cloud based" or as we like to call it "Internet hosted". We can continue
this stance at our own peril, but we all know that it is 10x easier to
retain existing clients than to find new ones, so we need to do something
to retain existing clients.

2. Increase bandwidth to the firewalls - in the above example the firewall
bandwidth is 50M and the total of the WAN tails is 100M. We could (ignoring
the screams coming from the accountants for now) simply increase the
bandwidth to each firewall so that there is no longer any oversubscription
(eg. 100M in my example). This wouldn't solve the problem however as the
entirety of the bandwidth to the firewall could still be consumed and not
enough left for the "important" things. All we've done is give the clients
more Internet bandwidth, but not actually solved the problem. It also
doesn't help if there is WAN congestion between the sites as all Internet
traffic is still going to be treated equally in the case of congestion.

3. Not shape/police to the firewall - instead use a firewall that can
classify traffic and shape/queue outbound on its LAN interface (ie.
towards the private WAN cloud). This seems attractive in the first
instance, but there are a couple of things going against it. The first is
that a lot of the firewalls are provided as managed firewalls by us and so
we control them, BUT a number of clients (mostly the larger ones with their
own IT resources) have their own firewall (hosted in our racks) that they
manage. Telling clients that they are required to shape their firewall to
<speed> and not shaping it for them (upstream) seems like a very trusting
thing to do and I don't think that would go well (surely nobody would abuse
it ?!). The way of preventing the abuse is simply to police inbound on the
core router the LAN of the firewall is connected to, so that if the client
doesn't shape to (eg.) 50M, then it gets policed to 50M anyway and their
QoS becomes broken by the policer.

4. Find some device to classify traffic - ideally if we could stick a
device of some sort between the border routers and core routers that could
do L7 classification of traffic and tag DSCP appropriately then we could do
what we need without too many other changes. Does such a "thing" exist ?
Can anyone point me in the direction of something that would do this ?
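
To illustrate why the policer in option 3 breaks the client's QoS, here is a small Python sketch of a single-rate token-bucket policer that ignores DSCP. This is a simplified model, not any vendor's implementation. The burst size and the traffic mix are assumptions chosen for illustration; the 50M rate is from the example above.

```python
def police(packets, rate_bps, burst_bytes):
    """Single-rate token-bucket policer that never looks at DSCP.

    packets: iterable of (arrival_time_us, size_bytes, dscp).
    Returns a list of (dscp, forwarded) in arrival order.
    """
    tokens = burst_bytes
    last_us = 0
    results = []
    for t_us, size, dscp in packets:
        # Refill for the elapsed time (integer bytes), capped at the burst.
        refill = (t_us - last_us) * rate_bps // 8_000_000
        tokens = min(burst_bytes, tokens + refill)
        last_us = t_us
        if size <= tokens:
            tokens -= size
            results.append((dscp, True))
        else:
            results.append((dscp, False))  # dropped regardless of marking
    return results


# Assumed load: the client firewall sends 60M (one 1500-byte packet every
# 200 microseconds); every fifth packet is video marked AF41 (DSCP 34).
packets = [(i * 200, 1500, 34 if i % 5 == 0 else 0) for i in range(5000)]
results = police(packets, rate_bps=50_000_000, burst_bytes=15_000)

dropped_vc = sum(1 for dscp, ok in results if dscp == 34 and not ok)
dropped_bulk = sum(1 for dscp, ok in results if dscp == 0 and not ok)
print(f"dropped video: {dropped_vc}, dropped bulk: {dropped_bulk}")
```

Once the firewall sends more than the policed rate, the marked video packets are dropped alongside the bulk traffic, because the policer only counts bytes.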


Having the traffic classified and tagged (DSCP) is the ideal solution as
this then allows the QoS on the WAN portion to work as well. No point
eliminating the firewall/Internet as the problem only to have the VC
session be crappy because there is a file transfer happening between two
sites.
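
As a sketch of what the DSCP tag buys on the WAN side, here is a toy strict-priority scheduler in Python. The DSCP-to-priority mapping and the class name are assumptions for illustration. Real WAN QoS policies normally cap or weight each queue so that lower classes are not starved, which this toy version does not do.

```python
import heapq
from itertools import count

# Assumed mapping from DSCP value to queue priority (lower is served first).
PRIORITY = {46: 0, 34: 1, 18: 2}  # EF, AF41, AF21
BEST_EFFORT = 3                   # everything else


class PriorityScheduler:
    """Strict-priority queue keyed on DSCP, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps arrival order within a class

    def enqueue(self, dscp, payload):
        prio = PRIORITY.get(dscp, BEST_EFFORT)
        heapq.heappush(self._heap, (prio, next(self._seq), payload))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


sched = PriorityScheduler()
for dscp, name in [(0, "file1"), (0, "file2"), (34, "vc1"),
                   (0, "file3"), (34, "vc2")]:
    sched.enqueue(dscp, name)

order = [sched.dequeue() for _ in range(len(sched))]
print(order)  # ['vc1', 'vc2', 'file1', 'file2', 'file3']
```

The VC packets leave ahead of the file transfer even though they arrived later, which is the behaviour the tagging is meant to enable on the tails.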


Talking about firewalls, can anyone recommend a firewall that does what is
required for option #3 above? It needs to classify traffic, tag DSCP on it
and then shape/queue outbound on the LAN interface appropriately. It needs
to be a VM device or something that supports proper virtualisation for
separate individual clients (and can manage clients individually as
well). This possibly seems like it might be the
best option if we can find the appropriate platform to do what we require
that fits all of the other requirements as well.


I think that's all I've got for now. Thanks for your patience in even
reading this far. Happy to discuss privately with people if you don't want
to post something publicly.


Thanks again,
Tony.