CDN — As an Operations Tool?

Thomas Cave
BBC Product & Technology
9 min read · Oct 14, 2020


In this article Thomas Cave, Operations Engineer in the BBC’s Online Technology Group, gives an overview of one of the tools he has at his disposal to keep the BBC online in times of high traffic.

What is a CDN?

An example Content Delivery Network spanning the globe.

A CDN (Content Delivery Network) is a group of geographically distributed servers which can overcome some of the challenges that delivering content to millions of audience members across the globe presents! Servers are placed in strategic locations, known as PoPs (Points of Presence), which accelerate the delivery of content through various layers of caching and network optimisation.

CDN Benefits

CDNs have a number of benefits; in this article we’re going to look at:

  1. Response time decreases
  2. Increase to bandwidth and decrease in bandwidth cost
  3. Increase to availability and redundancy
  4. Increase to security

1. Response Time Decreases

As page load time increases, so does the chance that audience members will use another website instead. (Source: Think with Google / SOASTA Research, 2017)

In 2017, a report by Think with Google showed that if a page takes 3 seconds to load, audience members are very likely to give up and use another website instead. This means that, to keep our audience interested, we have to keep our response time down.

Without a CDN, you’ll probably have only one (or, at most, a few) origin servers, typically placed geographically close to your main office. This works fine if your target audience is located in (relatively) the same location! But what if you’re looking to attract audience members from abroad?

Nothing, whether it’s an IP packet or otherwise, can travel faster than the speed of light; that’s just the laws of physics! This means that theoretically (so excluding any delays introduced by routers, switches or processing time at the server’s end), the quickest we could transfer content from the UK to Australia would be around 0.05 seconds each way (based on a distance of 15,196km and the speed of light being 3x10⁸ m/s) via fibre optic. Therefore, a round trip (a request and a response) would take 0.1 seconds: 3% of the 3-second loading budget would be lost purely to the time taken to transfer the data, even with no devices in the chain!
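
To make that arithmetic concrete, here’s a minimal Python sketch reproducing the back-of-the-envelope numbers above; the distance and speed figures are the same ones used in the paragraph.

```python
# Reproducing the back-of-the-envelope numbers above: the theoretical
# minimum transfer time between the UK and Australia for a signal
# travelling at the speed of light (real fibre is slower still).
distance_km = 15_196           # distance used in the paragraph above
speed_of_light_km_s = 300_000  # 3x10^8 m/s expressed in km/s

one_way_s = distance_km / speed_of_light_km_s
round_trip_s = 2 * one_way_s

print(f"one way:    {one_way_s:.3f}s")     # ~0.051s
print(f"round trip: {round_trip_s:.3f}s")  # ~0.101s, ~3% of a 3s budget
```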

CDNs have a number of PoPs placed strategically around the globe, with direct contact with ISPs (Internet Service Providers) to allow for peering. By bringing the edge nodes closer to the audience members, we can greatly reduce the distance (and therefore the time) taken for the data to reach their devices.

2. Increase to bandwidth and decrease in bandwidth cost

Through various means of caching and the vast number of PoPs in a CDN’s network, CDNs can offer an increase in bandwidth by allowing your audience members to connect to any PoP. As the CDNs manage the relationships with ISPs, there’s a decrease in cost (in terms of engineers’ time and money) for this extra bandwidth. Without this, Network Engineers would have to spend time liaising with ISPs at public and private peering locations, Infrastructure Engineers would have to spend time installing equipment at these locations and, finally, Network (and NOC) Engineers would have to configure and maintain that equipment. This all takes time, money and effort!

3. Increase to availability and redundancy

Say you had only one origin server with no redundancy: when you wanted to complete maintenance on this server, you would have to bring down your entire online presence to do so. This clearly isn’t feasible! As such, when designing your architecture, you would build in redundancy right from the beginning.

CDNs can offer exceptionally high availability and redundancy because, should they have an issue or be completing maintenance in one of their PoPs, they can reroute traffic to the next nearest PoP. Typically, this would be done automatically, as rerouting traffic between their PoPs would not cause the CDNs too much overhead.
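
As a purely illustrative sketch (the PoP names and health data below are made up, and real CDNs use far more sophisticated health checks and routing), that automatic failover can be thought of as walking an ordered preference list and skipping anything unhealthy:

```python
# A toy sketch of automatic failover: serve from the preferred PoP
# unless it is unhealthy (e.g. in maintenance), otherwise walk down an
# ordered fallback list. PoP names and health data are made up.
POP_PREFERENCE = ["london", "amsterdam", "new_york"]
HEALTHY = {"london": False, "amsterdam": True, "new_york": True}

def pick_pop():
    """Return the first healthy PoP in preference order."""
    for pop in POP_PREFERENCE:
        if HEALTHY.get(pop, False):
            return pop
    raise RuntimeError("no healthy PoP available")

# London is in maintenance, so traffic reroutes to Amsterdam with no
# manual intervention and no interruption to the audience.
print(pick_pop())  # -> amsterdam
```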

4. Increase to security

More and more services are being enhanced by the internet, and this provides an attacker with the potential to do more damage. If internet security isn’t implemented properly on a website, attackers can exploit this and gain valuable information about you or your audience members.

As a CDN can power a number of different websites from different clients, CDNs build security into their network as a core competency. CDNs can provide protection from DDoS (Distributed Denial of Service) attacks, filter bad traffic from clean traffic with the help of a WAF (Web Application Firewall), manage SSL/TLS certificates, and offer a number of other improvements.

How does a CDN work?

CDNs have various levels of servers — known as tiers. Content needs replication across these tiers to make CDNs efficient. CDNs provide two slightly different types of content replication — passive and proactive. Passive content replication means that once a request is made, the response is stored throughout the servers that request was passed through — the key here is that content is stored after a request is made. Proactive, on the other hand, is where content is replicated across the CDN network before a request is made. The difference in content replication presents a trade-off between cost and performance — ultimately, a business decision.
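
To illustrate the passive case, here is a toy Python sketch of a “pull-through” cache; the function names are hypothetical and a real CDN edge is vastly more complex, but the core idea of storing the response only after the first request passes through is the same.

```python
# A toy illustration of passive ("pull-through") replication: content
# is only stored at the edge after the first request misses and has to
# be fetched from the origin. All names here are hypothetical.
cache = {}

def fetch_from_origin(path):
    # Stand-in for the slow round trip back to the origin server.
    return f"<content of {path}>"

def handle_request(path):
    if path not in cache:               # miss: first request pays the cost
        cache[path] = fetch_from_origin(path)
    return cache[path]                  # hit: served straight from the edge

handle_request("/news")  # miss: fetched from the origin, then cached
handle_request("/news")  # hit: served from the edge cache
```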

Once a request is made to the CDN from an audience member, a PoP is selected by a DNS-based routing protocol. One method of DNS-based routing is geolocational routing.

But first, what is DNS?

If the nearest DNS server doesn’t know the address, it will query the upstream server and cache the result.

DNS (Domain Name System) is the system by which domain names (the nice, human-readable www.bbc.co.uk) are translated to IP addresses (x.x.x.x) which devices can read. This translation is known as “resolving” and is where your device will make a request to a DNS server (typically your home router, but other well-known DNS servers include Google’s (8.8.8.8 / 8.8.4.4) and Cloudflare’s (1.1.1.1)) with the query: “Do you know where I can find www.bbc.com?”. If the DNS server knows, it’ll respond: “Yes, here it is: 212.58.233.247”, but if the DNS server doesn’t know, it will query its upstream DNS server. Once the DNS server receives a response, it will cache the result (speeding up the process for future queries) and provide the answer back to your device.
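
If you want to see resolving in action, the snippet below (a minimal sketch using only Python’s standard library) asks your operating system’s resolver for www.bbc.com; the addresses you get back will differ depending on where you are, for exactly the reasons described next.

```python
# Resolving a name the way a device's stub resolver does, using only
# Python's standard library. Run it yourself: the addresses returned
# will vary with your location, thanks to geolocational DNS routing.
import socket

hostname, aliases, addresses = socket.gethostbyname_ex("www.bbc.com")
print(hostname)   # the canonical name your query ultimately resolved to
print(addresses)  # the IPv4 addresses your resolver handed back
```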

This is the foundation of how the internet works. To give a sense of scale for how many DNS queries might be issued, a StackOverflow answer shows that StackOverflow receives ~1–2MM website visits per day, translating to ~180 DNS queries per second!

Now, how does Geolocational DNS Routing work?

There are several extensions to the DNS protocol, and one of these is the key to providing geographically targeted content to an end-user. RFC 2671 first described this extension mechanism for DNS as EDNS0, which was later updated to EDNS(0) under RFC 6891. Building on this, RFC 7871 defines the mechanism for the use of EDNS Client Subnet (ECS) within DNS queries. This is achieved by recursive resolvers providing part of the client IP address (the first 3 octets, eg: 1.2.3.0/24) to the authoritative DNS nameserver.
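
As a sketch of what an ECS-enabled query looks like, the snippet below uses the third-party dnspython library (an assumption of this example, installable via `pip install dnspython`) to attach a client-subnet option to a query, as per RFC 7871; the 1.2.3.0/24 prefix is the same placeholder as above.

```python
# Sending a DNS query with an EDNS Client Subnet (ECS) option attached,
# as a recursive resolver forwarding on behalf of a client would.
import dns.edns
import dns.message
import dns.query

# Pretend we are forwarding the first 3 octets of a client's address.
ecs = dns.edns.ECSOption("1.2.3.0", 24)

query = dns.message.make_query("www.bbc.com", "A")
query.use_edns(edns=0, options=[ecs])

# An ECS-aware authoritative nameserver can now answer based on the
# client's subnet rather than the resolver's own IP address.
response = dns.query.udp(query, "8.8.8.8", timeout=5)
for rrset in response.answer:
    print(rrset)
```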

When a DNS query is made to an authoritative DNS nameserver, the query is first checked to see if the DNS resolver sent the query with ECS enabled. If ECS is enabled and the record being queried has ECS enabled, the first 3 octets of the end-user’s IP address will be used, meaning that a highly accurate GeoIP lookup can be used to determine the end-user’s geographical location.

If a query has been issued without ECS enabled, the IP address of the server from which the query was received will be used. This typically happens if the end-user is using a public resolver, such as Google’s Public DNS or OpenDNS, and means that this calculation will not be very accurate because this server could be located far away from the end-user.

Once the authoritative DNS nameserver has determined which IP address to use, the IP address will be looked up in real time against a database to determine its latitude and longitude. The authoritative DNS nameserver will then apply logic to answer the query with the IP address of the PoP which is the most efficient to serve the request to the end-user (a simplified sketch of this selection follows the list below). This has a few benefits:

  1. High availability (if one DNS server goes down, any resolver in the network can continue to serve requests)
  2. Fast response times (as there are many DNS servers, networking protocols mean that the request will be routed to the server located closest, in terms of network topology, speeding up response times)
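
Here is that deliberately simplified sketch of nearest-PoP selection; the PoP list, coordinates and plain great-circle distance metric are illustrative assumptions, as a real CDN also weighs network topology, congestion and cost:

```python
# A simplified sketch of nearest-PoP selection. The PoP names and
# coordinates are made-up examples; real selection logic considers
# much more than raw distance.
import math

POPS = {
    "london":   (51.51, -0.13),
    "new_york": (40.71, -74.01),
    "sydney":   (-33.87, 151.21),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in km."""
    r = 6371  # mean Earth radius in km
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_pop(client_lat, client_lon):
    """Pick the PoP with the smallest distance to the client."""
    return min(POPS, key=lambda p: haversine_km(client_lat, client_lon, *POPS[p]))

# A client geolocated to Manchester, UK gets the London PoP's addresses.
print(nearest_pop(53.48, -2.24))  # -> london
```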

Remember those benefits stated earlier (quicker response times due to receiving content from the closest servers)? Well, this is one method of determining the closest server. Geolocational DNS ensures you are given the IP address of the closest edge node and, failing that, the least congested node. A great way of seeing this is to use a free tool such as: https://www.whatsmydns.net/#A/www.bbc.com. Straight away, you can see that the different locations query www.bbc.com and are given different IP addresses: the IP address of the PoP located closest to them.

I understand what a CDN is now and how the audience connects to it! How can you use it as a service availability tool within Operations?

A CDN is essentially a high-availability, global fleet of servers. At times of high traffic (or at times where things might not be going to plan and the need to reduce our visitor traffic becomes apparent!) an Operations Engineer can deploy an emergency change to switch over from our traffic managers to a CDN provider.

Source: Neil Craig, Twitter

Taking the example of the breaking news notification of “Boris Johnson in intensive care”, where the BBC received a sustained period (~20 minutes) of high load (~31.5k requests per second), it soon became clear that this high load was starting to impact other production services. Digital 24/7 Operations were alerted, sought authorisation and deployed a change to push the ~31.5k requests per second to their CDN provider, all without interrupting the traffic flow. This can be seen in the graph below:

Traffic spikes can be unpredictable. This is a prime example of just that, where the BBC’s traffic spiked from ~14k requests/s up to ~31.5k requests/s in under 2 minutes!

Right… I think I understand?

To give a graphical representation of the incident…

Imagine a motorway junction where multiple roads are trying to merge and join onto the motorway. The motorway is only a fixed width, which will have a maximum flow of x cars per second before traffic starts to build up. With BAU (business as usual) traffic levels, cars flow through the junction with no problem at all; in fact, there’s capacity to spare! In other words, there’s enough redundancy built in that if one lane shuts, traffic continues to flow fine.

Let’s say there’s a match day nearby and traffic through this junction increases tenfold. There are now too many cars wanting (requesting) to go through this junction. In other words, the request rate to enter is greater than the capacity of the junction (ie, the ratio between the request rate and the capacity is greater than 1). This presents a problem, as cars are now starting to queue to enter the junction. This relates directly to computer networks: if the request rate is greater than the capacity of the network, IP packets (carrying TCP segments) will start to queue (for a short period of time) before timing out, and the audience member receives a 5xx (server-side error) response, or worse, nothing.
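
To see why a ratio above 1 is so dangerous, here is a toy Python model of that queue; every number in it is illustrative, loosely based on the request rates mentioned in this article:

```python
# A toy model of the ratio above: every second `rate` requests arrive
# and at most `capacity` are served; the rest queue up.
def backlog_over_time(rate, capacity, seconds):
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog = max(0, backlog + rate - capacity)
        history.append(backlog)
    return history

# Ratio below 1: the queue never builds up.
print(backlog_over_time(rate=14_000, capacity=20_000, seconds=5))
# Ratio above 1: the backlog grows every second until requests start
# timing out and audience members see 5xx errors.
print(backlog_over_time(rate=31_500, capacity=20_000, seconds=5))
```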

The only way out of this situation is to increase the capacity (this is where the real-life metaphor of cars at a junction falls apart, but bear with me!). Let’s say we can increase the capacity of the motorway after 5 minutes: essentially, the TTL (time to live). If extra lanes can be added to the motorway in 5 minutes, this will bring the ratio between the request rate and the capacity back down below 1, clearing the backlog (a traffic jam, in this case). In networking terms, we would need Network Engineers to increase the core infrastructure to support the higher load, and Infrastructure Engineers and SysAdmins to install and set up extra servers to serve content at the higher request rate. This is clearly not feasible, and definitely can’t be achieved in the seconds over which the traffic changes!

What if there’s a motorway nearby which has many times the capacity of the motorway that the traffic outgrew? We could change the routing of the cars to use this larger-capacity motorway and the traffic would clear! This is what happens when Digital 24/7 Operations execute a CDN switchover.

The Operations Engineer changes the DNS record that www.bbc.com/www.bbc.co.uk resolves against and so, once the TTL expires, traffic is routed to a CDN edge node, selected by geolocational DNS routing. This means that any new request to the BBC is routed to the CDN (in the sense of the metaphor, the larger motorway), while any connection already established to the BBC directly remains until a new request is made to the DNS server. In this incident, it meant that, while still serving the ~31.5k requests per second, traffic was shifted away to a CDN provider, and the audience members would not notice a thing!
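
As a closing illustration (this is a stand-alone sketch, not the BBC’s actual switchover tooling), you can watch such a switchover propagate from your own machine by polling DNS and printing whenever the answer set changes:

```python
# Poll DNS for www.bbc.com and print a line whenever the answer set
# changes, e.g. as resolvers' TTLs expire after a switchover.
import socket
import time

def resolve_all(hostname):
    """Return the sorted set of addresses currently served for a hostname."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

previous = None
for _ in range(60):                  # poll every 10s for ~10 minutes
    current = resolve_all("www.bbc.com")
    if current != previous:          # the record set changed
        print(time.strftime("%H:%M:%S"), current)
        previous = current
    time.sleep(10)
```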
