Individual servers in a farm may be expected to fail, but the site - that's a different story

Tom's Hardware has an interesting look at an architecture I'm going to call "built to fail". This architecture is focused on building a fault-tolerant site, not necessarily a fault-tolerant web application infrastructure.

While the author of the article implies that this architecture is something new, it really isn't. What is new is the attitude: today's Web 2.0 app providers might not care if a server is lost because it's cheap to replace, while other, more cost-conscious organizations are likely reluctant to casually eat the cost of even a single server in the data center. The basic concept of fault-tolerant web sites and applications has been around a long time; it's the disposable nature of these "built to fail" architectures that's pretty new.

Disposable servers

If you’re a Web 2.0 service, you use the cheapest motherboards you can get, and if something fails, you throw it away and plug in a new one. It’s not that the Website can afford to be offline any more than an ATM network can. It’s that the software running sites like Google is distributed across so many different machines in the data center that losing one or two doesn’t make any difference.

The reason I love (and simultaneously hate) articles discussing fault-tolerant web sites is that they often ignore the most critical piece of the infrastructure: the piece that actually distributes those requests and makes a "built to fail" architecture capable of maintaining the availability of a site, Web 2.0 or otherwise.

It's great that HP and IBM are answering the call for cheap computing power that is essentially as close to disposable as you can get, but cheap servers alone can't maintain the availability of an entire site. For that you need something that understands when one of those cheap servers fails and distributes requests elsewhere to compensate for the failure.

[ Cue Bette Midler's "Wind Beneath My Wings" ]

That "something" is an application delivery controller, the unsung hero of the data center that makes it possible for organizations to implement a "built to fail" application infrastructure without real concern for the availability of the site. Given the availability issues plaguing Twitter of late, it's easy to agree that if you're going to make it big in the Web 2.0 world you not only have to build something people want, but you also have to keep it available for them. That's the role of the application delivery controller: keeping track of all those commodity servers and ensuring that if (when) one fails, requests are distributed appropriately to the others so that the site stays available.
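
To make the idea concrete, here is a minimal sketch of the bookkeeping involved: track which servers are healthy and hand each request to the next healthy one. It is only an illustration, not how any particular application delivery controller is implemented. The class name `HealthAwareBalancer`, its methods, and the server names are all hypothetical, and a real device would discover failures through its own health checks rather than being told about them.

```python
class HealthAwareBalancer:
    """Toy round-robin distributor that skips servers marked as failed.

    Hypothetical illustration only: names and behavior are assumptions,
    not the API of any real product.
    """

    def __init__(self, servers):
        self.servers = list(servers)       # fixed rotation order
        self.healthy = set(self.servers)   # servers currently accepting requests
        self._next = 0                     # index of the next server to try

    def mark_down(self, server):
        # Called when a health check decides a server has failed.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Called when a replacement (or recovered) server is back in service.
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Walk the rotation at most once, returning the first healthy server.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = HealthAwareBalancer(["web1", "web2", "web3"])
    lb.mark_down("web2")                    # one cheap server dies
    print([lb.pick() for _ in range(4)])    # requests keep flowing to the others
```

With one of three servers marked down, requests simply alternate between the remaining two, and the site only goes dark when every server in the pool has failed.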

While you could use clustering technology to distribute requests, if you're putting the cluster controller on one of those cheap servers, well...you can imagine what happens when the server distributing requests stops distributing requests. Yeah, exactly - site down. You could purchase more resilient hardware, but then you're defeating the "cheap and disposable" nature of your low-cost web infrastructure, not to mention the additional cost in software licenses necessary to implement clustering and the lack of flexibility inherent in software clustering solutions. 

While it's perfectly acceptable to build that "built to fail" architecture in order to keep your costs low - sites like Google prove it can work - if you're thinking of going there, you'll need to protect that un-investment and make sure that your site is available even when your "built to fail" data center does.

Imbibing: Coffee