I enjoy going to sporting events. I have to admit that I am a NASCAR fan and have attended several races. They are exciting and loud. The race cars seem to go much faster in person than they do on television. What I do not like is arriving at the event and leaving it. When more than 100,000 people attend an event, traffic gets a little messy. Of course, everyone wants to arrive and leave at the same time. The venue can hold everyone, but the roads and transit system were not designed to handle the bursts of traffic that these events produce.

When the event is over, people end up waiting in long lines that move slowly as vehicles reach the highway access points. This burst of traffic delays most vehicles significantly while only a small, continuous trickle makes it onto the highway. If the roads could be adjusted to handle the extra load, traffic would flow without any problems and people would be happy.

In my earlier post, I described the DNS infrastructure within the communications service provider (CSP). Every DNS environment in the CSP must be able to handle large volumes of requests. If subscriber traffic increases, as it does during a major disaster, the number of DNS queries handled by the local DNS servers increases correspondingly. This subscriber traffic also drives traffic in the control plane, so the DNS infrastructure there sees an increase as well. The provider’s own domains can also come under stress when subscribers’ applications and browsers use the CSP’s site as their home page.

When there is an oversubscription problem, the DNS protocol is designed to attempt to recover. First, the DNS client will retry a query if it does not receive a response within a certain amount of time. A typical DNS client waits 2 seconds for each attempt and makes up to 3 attempts. Sometimes the client will send the query to a different DNS server if the first one is not responding or is generally unavailable. This means that a subscriber waits approximately 6 seconds (3 attempts × 2 seconds) before the resource they are trying to reach is determined to be unavailable. It is important to note that the resource itself is still available. The subscriber simply cannot reach it because their application does not know the IP address associated with it.
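The retry arithmetic above can be sketched as a small Python simulation. This is not a real resolver: `resolve_with_retries`, the `query` callback, and the server addresses (taken from documentation ranges) are hypothetical, and the 2-second timeout and 3 attempts are the typical values from the text rather than any particular client's defaults. A successful answer is treated as arriving instantly, so only time lost to timeouts is counted.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical transport hook: (server, name, timeout) -> IP string, or None
# if no reply arrived before the timeout expired.
QueryFn = Callable[[str, str, float], Optional[str]]


def resolve_with_retries(
    name: str,
    servers: Sequence[str],
    query: QueryFn,
    timeout: float = 2.0,  # assumed per-attempt wait, as in the text
    attempts: int = 3,     # assumed total number of attempts
) -> Tuple[Optional[str], float]:
    """Return (answer, simulated seconds spent waiting on timeouts)."""
    waited = 0.0
    for attempt in range(attempts):
        # Move to the next configured server on each attempt.
        server = servers[attempt % len(servers)]
        answer = query(server, name, timeout)
        if answer is not None:
            return answer, waited
        waited += timeout  # the full timeout elapsed with no reply
    return None, waited


# An overloaded server farm that never answers in time.
def never_answers(server: str, name: str, timeout: float) -> Optional[str]:
    return None


print(resolve_with_retries("mail.provider-Y.com", ["192.0.2.53", "198.51.100.53"], never_answers))
# → (None, 6.0)
```

Every attempt that times out adds the full timeout to the subscriber's wait, which is why an overloaded DNS farm costs seconds rather than milliseconds.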

Any delay in application responsiveness affects the subscribers’ Quality of Experience. Two seconds or more is a long time when discussing Internet response times. It is often said that a subscriber will abandon a page and go to a different website if it takes more than seven seconds to load, so a six-second DNS timeout alone consumes most of that budget. People expect instant results on the Internet, and providing scalability in the DNS infrastructure is an important piece of the solution.

Build It in Case They Come

Service providers need to build a DNS infrastructure that can support the tens of millions of users who connect to their network. While not all users will be connected at any given point in time, the DNS solution must not break under the potential load. Individual servers can only support so many DNS requests per second with current technology and hardware; a typical high-end server will support approximately 25,000 to 75,000 DNS queries per second. To scale beyond this, a farm of DNS servers behind an application delivery controller (ADC) is typically used, and multiple farms are strategically built in geographically diverse locations. When the CSP uses an ADC to manage a pool of servers, it gains the scalability to handle high volumes of DNS queries. It also gains local availability, since the ADC can health check the status of the servers and the DNS processes running on them. If the DNS service on a server is not available, DNS requests are not sent to that server but are instead distributed among the other servers available in the pool.
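A minimal Python sketch of the pool behavior described above follows, assuming simple round-robin distribution (real ADCs offer other load-balancing methods). `DnsPool`, `mark_health`, `servers_needed`, the server names, and the 1,000,000 queries-per-second peak are hypothetical; the 25,000 figure is the low end of the per-server range above.

```python
import itertools
import math
from typing import Iterable


class DnsPool:
    """Round-robin pool that skips servers marked down by health checks."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = itertools.cycle(self.servers)

    def mark_health(self, server: str, is_up: bool) -> None:
        # In a real deployment these results would come from the ADC's
        # probes of the DNS process; here the caller supplies them.
        if is_up:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def pick(self) -> str:
        # Try each member at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            server = next(self._next)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy DNS servers in the pool")


def servers_needed(peak_qps: int, per_server_qps: int) -> int:
    """Smallest farm size whose combined capacity covers the peak load."""
    return math.ceil(peak_qps / per_server_qps)


pool = DnsPool(["dns1", "dns2", "dns3"])
pool.mark_health("dns2", False)  # dns2 failed its health check
print([pool.pick() for _ in range(4)])  # → ['dns1', 'dns3', 'dns1', 'dns3']
print(servers_needed(1_000_000, 25_000))  # → 40
```

The sizing function gives only the bare minimum; a real design would add headroom and spare servers so that losing a member to a failed health check does not push the remaining pool over capacity.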

Local DNS diagram

Dependence on server performance can be reduced dramatically by adding a high-performance DNS cache. The DNS cache sits in front of the DNS servers: when a response is sent back to a client, the cache stores a copy of the entry, and when a later query arrives for that same entry, the cache responds directly. The DNS cache can also act as a DNS resolver, reaching out to the Internet DNS infrastructure to find the answer through recursive queries. Because the cache answers from recently accessed entries held in memory rather than running a full software-based lookup, it can respond to a request much faster and more efficiently than a typical DNS server.
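The cache behavior above can be illustrated with a minimal in-memory sketch in Python. It is a simplification under stated assumptions: `DnsCache` and the `upstream` callback are hypothetical, entries are keyed on the name alone (ignoring record type and negative answers), and each entry is kept only for its record's time to live (TTL), after which the query is forwarded to the DNS server again.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical upstream hook: name -> (address, TTL in seconds).
Upstream = Callable[[str], Tuple[str, float]]


class DnsCache:
    """In-memory answer cache that honours each record's TTL."""

    def __init__(self, upstream: Upstream, clock: Callable[[], float] = time.monotonic) -> None:
        self.upstream = upstream
        self.clock = clock
        self.entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)
        self.hits = 0
        self.misses = 0

    def lookup(self, name: str) -> str:
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and entry[1] > now:
            self.hits += 1  # answered from memory
            return entry[0]
        self.misses += 1  # missing or expired: ask the DNS server
        address, ttl = self.upstream(name)
        self.entries[name] = (address, now + ttl)
        return address


fake_now = [0.0]  # manual clock so the example is deterministic
cache = DnsCache(lambda name: ("192.0.2.10", 300.0), clock=lambda: fake_now[0])
cache.lookup("www.example.com")  # miss: forwarded to the server
cache.lookup("www.example.com")  # hit: served from the cache
fake_now[0] = 301.0              # TTL has expired
cache.lookup("www.example.com")  # miss again
print(cache.hits, cache.misses)  # → 1 2
```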

CSP customers use the CSP’s DNS infrastructure as their local DNS server, and the same infrastructure also manages the names that the CSP owns. This means that the DNS servers resolve queries for names the CSP owns (mail.provider-Y.com) as well as queries for any other name on the Internet. If the CSP does not own the name, its DNS servers must send out queries to find out which servers are responsible for it and then obtain an answer from those servers. This sequence can take several recursive queries before the final response is delivered back to the client. The DNS cache eliminates the need to repeat this sequence after the first query: once the DNS server has obtained the answer, the cache retains it for as long as the record’s time to live (TTL) allows and delivers it to later requesters more efficiently. This saves large amounts of processing and network resources.
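The savings can be made concrete with a toy Python simulation that counts outbound queries. The delegation table, server labels, and `CachingResolver` class are hypothetical stand-ins for the root, top-level-domain, and authoritative servers; a real resolver learns referrals from NS records, caches those referrals as well, and expires entries by TTL, all of which this sketch omits.

```python
from typing import Dict, Tuple

# Hypothetical delegation data: (server asked, name) -> ("refer", next server)
# or ("answer", address).
REFERRALS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("root", "www.example.com"): ("refer", "com-tld"),
    ("com-tld", "www.example.com"): ("refer", "example-ns"),
    ("example-ns", "www.example.com"): ("answer", "192.0.2.10"),
}


class CachingResolver:
    def __init__(self, referrals: Dict[Tuple[str, str], Tuple[str, str]]) -> None:
        self.referrals = referrals
        self.cache: Dict[str, str] = {}
        self.upstream_queries = 0  # queries sent out to other DNS servers

    def resolve(self, name: str) -> str:
        if name in self.cache:
            return self.cache[name]  # no outbound queries needed
        server = "root"
        while True:
            self.upstream_queries += 1
            kind, value = self.referrals[(server, name)]
            if kind == "answer":
                self.cache[name] = value
                return value
            server = value  # follow the referral one level down


resolver = CachingResolver(REFERRALS)
resolver.resolve("www.example.com")
resolver.resolve("www.example.com")
print(resolver.upstream_queries)  # → 3
```

The first lookup walks the full chain of three servers; the second is answered from the cache without sending anything out, which is the saving the paragraph describes.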

Given the size of CSP networks and the number of subscribers accessing the Internet through them at any given time, it is essential for the CSP to provide a DNS infrastructure that can handle both the expected and unexpected loads it may be subjected to. As with a NASCAR event, the host must develop a plan and architect a solution that allows everyone to reach their destination. It is the responsibility of the CSP to use technology and solutions that can handle the bursty nature of DNS traffic and still deliver responses in a timely fashion.