Accommodating large-scale SurfProtect use

Right now, we’re getting closer to the launch of SurfProtect Quantum, the latest version of our content-filtering software. Redesigned from the ground up to give users a more effective, customisable and powerful service, it introduces a number of features previously only available through expensive hardware-based filtering options.

While creating the all-new design, our team encountered quite a range of technical challenges, some of which we’ll be covering in subsequent blog posts. Throughout the build, we’ve been running an early-release version of SurfProtect Quantum with a number of customers of various sizes to ensure that the product meets all customer needs.

In order to act as an effective filter, SurfProtect intercepts traffic and only allows connections once it has established that the requested content is allowed for the individual user. As SurfProtect Quantum integrates with Active Directory, administrators can set parameters for specific users and IP addresses. Our proxies, in turn, request usernames and passwords from the browser before letting the traffic through.

However, querying the user’s identity essentially doubles the number of requests passing through a user’s firewall: the initial request has to be refused before the web browser will present credentials to our service. With each browser tending to open at least eight connections to our proxies under normal operation (and more under heavy use), the total number of connections per site adds up quickly, particularly given how many schools now use ultrafast connectivity, like the 300 Mbps speeds provided as standard with DarkLight.
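As a rough back-of-the-envelope sketch of how the numbers grow (the user count and per-browser connection figure below are illustrative assumptions, not measurements from any customer):

```python
# Rough estimate of connection volume from one site to the proxy.
# All figures are illustrative assumptions, not measured values.
BROWSER_CONNECTIONS = 8   # connections a browser tends to open to the proxy
AUTH_MULTIPLIER = 2       # each request is first refused, then retried with credentials
USERS = 2_000             # a large school site, purely for illustration

connections = USERS * BROWSER_CONNECTIONS * AUTH_MULTIPLIER
print(connections)  # 32000 - already half of the ~64,000 ports available per IP
```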

This isn’t a problem for most customers, with firewalls able to handle roughly 64,000 outgoing connections per IP address to a single destination - like our SurfProtect service. However, with thousands of users online at once, some of our biggest customers were, until recently, reporting unusual behaviour at peak times.

Under the hood

Firewalls are designed to handle a large amount of traffic, and a major part of how they do this is forwarding a large number of connections, often using NAT (Network Address Translation) - translating the non-unique IP addresses on the LAN to the public address configured on the firewall itself.
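As a toy illustration of the idea (the addresses and the simple incrementing port scheme are simplified assumptions; a real NAT implementation tracks full connection 4-tuples and reclaims ports):

```python
# Minimal sketch of NAT source-port mapping: many private LAN addresses
# are rewritten to one public IP, distinguished only by the source port
# the firewall assigns. Addresses are documentation-range examples.
PUBLIC_IP = "198.51.100.1"

nat_table = {}      # (lan_ip, lan_port) -> assigned public source port
next_port = 1024

def translate(lan_ip, lan_port):
    """Assign (or reuse) a public source port for a LAN connection."""
    global next_port
    key = (lan_ip, lan_port)
    if key not in nat_table:
        nat_table[key] = next_port
        next_port += 1
    return (PUBLIC_IP, nat_table[key])

print(translate("192.168.0.10", 51000))  # ('198.51.100.1', 1024)
print(translate("192.168.0.11", 51000))  # ('198.51.100.1', 1025)
```

Each mapped connection consumes one public source port until it is released - which is exactly where the TIME-WAIT behaviour below comes in.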

Once a firewall port has been assigned to an individual connection, and communication has finished, the port can be recycled, allowing another connection to use it. The TCP TIME-WAIT state specifies how long the firewall waits before recycling the port, ensuring that no two sessions overlap - imagine it as a system that prevents a previous tenant’s mail from being delivered to you when you move into a house.

As standard, this time is set at 120 seconds: the firewall waits two minutes from the end of a connection before reassigning the port. While this is normally effective, it isn’t optimised for filtered traffic like that passing through SurfProtect.
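This has a direct consequence for throughput: every closed connection parks its source port for the full TIME-WAIT period, so the port pool caps the rate of new connections from one firewall IP to a single destination. A rough sketch, using the ~64,000-port figure above:

```python
# Upper bound on the sustained new-connection rate through one firewall
# IP towards one destination, given that each closed connection holds
# its source port for the whole TIME-WAIT period.
PORTS = 64_000       # approximate usable source ports per IP
TIME_WAIT = 120      # seconds a closed port stays reserved (the default)

max_new_connections_per_second = PORTS / TIME_WAIT
print(round(max_new_connections_per_second))  # ~533 connections/s
```

For a site with thousands of users all funnelling traffic to the same proxy, a ceiling in the low hundreds of new connections per second is easy to hit at peak times.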

With thousands of devices connected to the firewall, each connection requiring a unique port so that SurfProtect filtering can be applied, the ports waiting to be freed up can come to occupy a large proportion of the ~64,000 available. The graph below demonstrates this clearly, depicting the state of connections, with the majority waiting to be released and recycled.

A look at the traffic being handled by SurfProtect over the course of a day.

One of the issues that can occur here is the exhaustion of available ports, meaning new connections cannot be mapped and accepted, making browsing slower than it should be.
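A sketch of how such a state breakdown might be gathered on a Linux host - here run against a fabricated sample of `ss -tan`-style output rather than a live system, so the figures are illustrative only:

```python
from collections import Counter

# Tally TCP connection states from `ss -tan`-style output. The sample
# below is fabricated for illustration; in practice you would capture
# the real output of `ss -tan` on the machine being diagnosed.
sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:51234      203.0.113.10:8080
TIME-WAIT  0      0      10.0.0.5:51235      203.0.113.10:8080
TIME-WAIT  0      0      10.0.0.5:51236      203.0.113.10:8080
TIME-WAIT  0      0      10.0.0.5:51237      203.0.113.10:8080
"""

# Skip the header row, then count the first column (the state).
states = Counter(line.split()[0] for line in sample.splitlines()[1:])
print(dict(states))  # {'ESTAB': 1, 'TIME-WAIT': 3}
```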

Another side of the overall issue comes from the particular setup used by some of the firewalls we provided to customers. Rather than the standard 120 seconds, these firewalls were configured with a TIME-WAIT of just one second.

This isn’t an issue in and of itself; in many cases it significantly reduces the number of blocked ports, which can be a positive for particularly busy firewalls, allowing devices to make effective use of gigabit-level connections.

However, in cases where filtering is applied and all traffic heads to a single destination - the proxy, in this case - the very quick port cycling results in new connections becoming mixed with previous ones, causing mayhem and long delays when trying to browse.
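A toy sketch of why the one-second setting is risky for this traffic pattern. The timings are illustrative, but the underlying rule is standard TCP practice: TIME-WAIT should comfortably exceed the lifetime of stray segments (classically, twice the maximum segment lifetime):

```python
# With a 1-second TIME-WAIT, a source port can be reassigned while
# delayed segments from the previous connection are still in flight
# (segments may survive up to one MSL, commonly 60 seconds).
# All numbers here are illustrative, not taken from a real capture.
MSL = 60              # maximum segment lifetime, seconds (typical value)
TIME_WAIT_SHORT = 1   # the aggressive firewall setting
TIME_WAIT_SAFE = 120  # >= 2 * MSL, the classic recommendation

# Suppose a delayed segment from the old connection arrives 5 seconds
# after that connection closed.
delayed_arrival = 5

# Short timer: the port was recycled 4 seconds earlier, so the stray
# segment lands on the *new* connection sharing the same 4-tuple.
print(delayed_arrival > TIME_WAIT_SHORT)  # True  -> mix-up possible
# Safe timer: the port is still quarantined, so the segment is dropped.
print(delayed_arrival > TIME_WAIT_SAFE)   # False -> old segment discarded
```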

How we’re resolving the issue

To diagnose issues like this, we need to be able to inspect the user’s connection on both sides of the firewall. As such, when we were informed of the problem affecting some customers, and in the absence of an obvious explanation, we sent out Raspberry Pi units set up to let our team monitor connectivity performance. After intensive investigation and thorough analysis of packet traces, we were able to narrow the issue down to the points discussed above.

Having diagnosed the problem, our team has spent the last couple of days working to eliminate the issue for the affected customers, ensuring that Quantum works effectively for users at any scale from now on.

Firstly, we’ve updated our firewall setup procedure for all customers going forward, changing the TIME-WAIT setting from one second to something a little less aggressive. This ensures that the vast majority of traffic can be processed quickly enough, while still avoiding the technical issues the TIME-WAIT state is designed to prevent - in particular, the accidental crossover of delayed segments between connections.

For the largest-scale users, we’ve also introduced another option: assigning additional IP ranges to their firewall to support the high volume of traffic passing between SurfProtect and the user’s firewall, ensuring that all users get excellent performance at all times.

Want to understand this better?

If you’re technically curious about this issue, we wholeheartedly recommend Vincent Bernat’s blog and his excellent post on TCP TIME-WAIT. It was a major help when we first designed Quantum, and made us aware of this particular TCP limitation - one that affected Quantum more than we’d expected.

Vincent also very kindly maintains the ExaBGP package for Debian (a very popular Linux distribution), making our open-source networking Swiss Army knife available to the whole Debian community with a single command.