I haven’t heard the term “graceful degradation” in a long time, but as we continue to push the limits of data centers and our budgets to provide capacity it’s a concept we need to revisit.

You might have heard that Twitter was down (again) last week. What you might not have heard (or read) are some interesting crunchy bits about how Twitter attempts to maintain availability by degrading capabilities gracefully when services are over capacity.

“Twitter Down, Overwhelmed by Whales” from Data Center Knowledge offered up the juicy details:

The “whales” comment refers to the “Fail Whale” – the downtime mascot that appears whenever Twitter is unavailable. The appearance of the Fail Whale indicates a server error known as a 503, which then triggers a “Whale Watcher” script that prompts a review of the last 100,000 lines of server logs to sort out what has happened.

When at all possible, Twitter tries to adapt by slowing the site performance as an alternative to a 503. In some cases, this means disabling features like custom searches. In recent weeks Twitter.com users have periodically encountered messages that the service was over capacity, but the condition was usually temporary. For more on how Twitter manages its capacity challenges, see “Using Metrics to Vanquish the Fail Whale.”

I found this interesting and refreshing at a time when the answer to capacity problems is to just “go cloud”, primarily because even if (and that’s a big if) “the cloud” were truly capable of “infinite scale” (it is not), it is almost certainly a fact that most organizations’ budgets are not capable of “infinite payments”, and cloud computing isn’t free.

It’s been many years, in fact, since the phrase “graceful degradation” has been uttered within my hearing, but that’s really what the article is describing and it’s something we don’t talk enough about. Perhaps that’s because it’s difficult to admit that there are limitations – whether technical or financial – on the ability to scale and meet demand. But there are, and if organizations are wise they’ll include in their application delivery strategy the means by which applications and services can “degrade gracefully.”

Twitter’s solution, the disabling of specific features, is a particularly easy way to implement such a strategy for Web 2.0 applications; at least it’s particularly easy if you have a network-side scripting capable solution mediating for the applications.



The reason it’s particularly easy to gracefully degrade Web 2.0 applications is that there is generally a 1:1 mapping between “functions” and “URIs.” This is often true for the web-facing interface, almost always true for RESTful APIs, and frequently true for SOAPy endpoints, although a single SOAP endpoint can expose several operations behind one URI.


What you need to do is identify those “premium” URIs, i.e. those that can be disabled without negatively impacting core services, so that they can be “degraded” in the face of an overwhelming volume of requests.
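As a minimal sketch of that identification step, the premium URIs can be kept as a simple list of path prefixes that the intermediary consults for each request. The prefixes below are hypothetical placeholders, not Twitter's actual URIs; a real list would come from your own application.

```python
# Hypothetical "premium" URI prefixes; replace with features your application
# can live without while it is over capacity.
PREMIUM_PREFIXES = ("/search/custom", "/trends", "/recommendations")


def is_premium(path: str) -> bool:
    """Return True if the request path maps to a feature that may be disabled."""
    # Drop the query string so "/trends?woeid=1" still matches "/trends".
    path = path.split("?", 1)[0]
    # Match the prefix itself or anything beneath it, but not "/trendsetters".
    return any(path == p or path.startswith(p + "/") for p in PREMIUM_PREFIXES)
```

Anything that does not match a prefix is treated as a core service and is never degraded.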

You also need an intermediary. This can be a load balancer, assuming it’s capable of providing the flexibility in configuration necessary to enable and disable service to specific URIs, i.e. it must be layer 7 aware. It has to be an intermediary through which all requests are routed because individual servers do not have the visibility required to be able to “see” the total requests and all responses. The fact that a server is throwing back 503 (Service Unavailable) errors indicates it doesn’t have the resources available to respond to a request, which means it won’t be able to respond to any requests, including those to disable services. Only an architecture that includes an intermediary of some kind (a reverse proxy) can achieve this solution.

The network-side script, which is deployed on the application delivery platform (load balancer), should implement logic that triggers degradation based on receiving 503 errors. It should probably not trigger on a single 503 or multiple 503s from the same application instance as such behavior could be indicative of a problem with that one instance as opposed to being produced due to a lack of capacity. That means the scripting solution needs to be able to take action based on a pattern of behavior coming from all application instances in conjunction with the total number of requests being received from users.
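The trigger logic described above might be sketched as a sliding-window monitor that looks at 503s across all instances together with the total request count. The class name, thresholds, and window length are illustrative assumptions, not values from Twitter or any particular load balancer.

```python
import time
from collections import deque


class CapacityMonitor:
    """Tracks recent responses from all instances (hypothetical helper)."""

    def __init__(self, window_seconds=30.0, min_instances=3,
                 min_requests=100, error_ratio=0.2):
        self.window_seconds = window_seconds
        self.min_instances = min_instances  # distinct instances returning 503
        self.min_requests = min_requests    # ignore windows with little traffic
        self.error_ratio = error_ratio      # share of responses that are 503
        self._events = deque()              # (timestamp, instance_id, status)

    def record(self, instance_id, status, now=None):
        """Record one response seen by the intermediary."""
        now = time.monotonic() if now is None else now
        self._events.append((now, instance_id, status))
        self._expire(now)

    def _expire(self, now):
        # Discard events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def over_capacity(self, now=None):
        """True only when 503s are widespread, not confined to one instance."""
        now = time.monotonic() if now is None else now
        self._expire(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        failing = {inst for _, inst, status in self._events if status == 503}
        errors = sum(1 for _, _, status in self._events if status == 503)
        return (len(failing) >= self.min_instances
                and errors / total >= self.error_ratio)
```

Requiring several distinct failing instances is what keeps a single misbehaving server from triggering degradation for everyone.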

Yes, it has to be context-aware.

Once it’s determined that the errors are being generated due to a lack of capacity, the scripting solution needs to disable one or more of the specific URIs determined to be “premium” or ancillary. The intermediary can then respond to subsequent requests for the disabled URIs with custom content based on the expected response type. For example, if it’s an API call it might be appropriate to return a pre-formatted response in the appropriate data format indicating service is currently unavailable. Many network-side scripting solutions are capable of returning pre-formatted responses or they can be customized to provide more detail – it’s really up to the implementer to decide what information is included and how.
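A pre-formatted response of this kind could be built along the following lines. This is a sketch under stated assumptions: the function name, the message text, the 120-second `Retry-After` value, and the choice to answer with a 503 status are all illustrative, and the format is guessed from the `Accept` header or the path suffix.

```python
import html
import json
from xml.sax.saxutils import escape as xml_escape

# Assumed wording; the implementer decides how much detail to include.
MESSAGE = "This feature is temporarily unavailable while the service is over capacity."


def degraded_response(path: str, accept: str = ""):
    """Return (status, headers, body) for a request to a disabled URI."""
    headers = {"Retry-After": "120", "Cache-Control": "no-store"}
    if "application/json" in accept or path.endswith(".json"):
        headers["Content-Type"] = "application/json"
        body = json.dumps({"error": MESSAGE, "path": path})
    elif "xml" in accept or path.endswith(".xml"):
        headers["Content-Type"] = "application/xml"
        body = "<error>" + xml_escape(MESSAGE) + "</error>"
    else:
        # Fall back to a small HTML page for browser requests.
        headers["Content-Type"] = "text/html; charset=utf-8"
        body = "<html><body><p>" + html.escape(MESSAGE) + "</p></body></html>"
    return 503, headers, body
```

Because the intermediary answers these requests itself, the disabled URIs never reach the already overloaded application instances.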

The premise is that as premium or ancillary services are degraded (disabled), application instances will be able to focus on servicing core requests and return service to normal for those pieces of the application. When the volume of requests returns to within normal operating parameters for the capacity available, the intermediary can restore service to the previously degraded services.
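One way to sketch the degrade-and-restore cycle is a small switch with two thresholds, so that premium services are not flipped on and off repeatedly while traffic hovers near the limit. The gap between the two thresholds is a design choice added here for illustration, not something the article prescribes, and the request rates are made-up numbers.

```python
class DegradationSwitch:
    """Decides whether premium URIs are currently disabled (hypothetical helper)."""

    def __init__(self, degrade_above=1000.0, restore_below=700.0):
        if restore_below >= degrade_above:
            raise ValueError("restore_below must be lower than degrade_above")
        self.degrade_above = degrade_above  # requests/second that counts as overload
        self.restore_below = restore_below  # requests/second considered normal again
        self.degraded = False

    def update(self, requests_per_second: float, over_capacity: bool) -> bool:
        """Feed in the latest measurements; returns the new degraded state."""
        if not self.degraded:
            # Degrade only when errors and request volume both point to overload.
            if over_capacity and requests_per_second >= self.degrade_above:
                self.degraded = True
        elif not over_capacity and requests_per_second <= self.restore_below:
            # Restore only after load has dropped well below the trigger point.
            self.degraded = False
        return self.degraded
```

For example, a switch that degraded at 1,200 requests per second stays degraded at 900 and restores premium services only once the rate falls to 700 or below.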



From a technological point of view “infinite scale” is not possible. At some point the volume of requests will reach boundaries that simply cannot be overcome, be they limitations on the load balancer (there is a limit to how many servers can ultimately be load balanced, and bandwidth is not unlimited) or on the application infrastructure itself. After all, you can’t launch a new instance of an application if there are no physical resources left on which to launch it.

It is almost certainly the case, however, that you will hit a financial limitation before reaching the technical limits of an “infinitely scalable” environment. Or it may be the case that you haven’t jumped on the “cloud” bandwagon and what you see is what you get: a limited number of physical resources running a finite number of application instances, and that’s it. In either case, there are limitations on capacity and at some point you may reach them. How you respond to those limitations is an organizational decision, but graceful degradation in a controlled manner is probably more desirable than random, uncontrolled service outages.

Graceful degradation is an acceptable strategy for responding to availability issues and is especially easy to implement for a Web 2.0 application or API. It’s certainly more appealing than the alternative, which leaves every user essentially playing a game of Russian Roulette with availability of your web application.
