DevCentral > Weblogs > Lori MacVittie - Two Different Socks
 To Boldly Go Where No Production Application Has Gone Before
posted on Wednesday, July 01, 2009 4:14 AM

The importance of stress-testing in production

Everyone is still a-twitter over the problems the web experienced last week right after the news of Michael Jackson’s death. There have been numerous stories on the fact that the Internet nearly fell over itself and died under the strain of trying to support the rush of millions of users as they queried, clicked, watched video, read blogs and news reports on the subject.

The Internet itself, of course, was just fine. The infrastructure comprising our electronic highway was humming along, routing packets happily here and there. What fell down were web applications and servers, overwhelmed with more concurrent connections than they could possibly handle despite their (I assume) high-availability infrastructure.

Obviously this is a difficult scenario to test for – or at least it appears to be. No organization is really able to test its production environment against an onrush of such a magnitude. The cost in software or hardware alone just to generate that much traffic would break most budgets even in normal economic times.

But to avoid the embarrassment of being “unavailable” it’s a necessity. Organizations need to test against worst case scenarios, i.e. an onslaught of visitors of a magnitude most sites can only dream of seeing, not only to ensure that their web applications can handle the load but also to work out any kinks that may be lurking in their infrastructure, waiting for a heavy load to appear.


CONDITIONAL ERRORS

Developers know that sometimes defects only show up under certain conditions. Sometimes it’s extended use (memory leaks) and sometimes it’s high capacity (out of memory, artificial limits on arrays, etc.). Those conditions are hard to simulate, let alone find, in a test environment because all too often organizations don’t have the means to adequately force those conditions. Time and traffic cost money, and anyone who has worked inside an enterprise organization knows that QA and testing often have more limited budgets than most IT departments.

And sometimes, just sometimes, it’s not even the application’s fault. Sometimes a problem in the infrastructure will appear only when an application is under heavy load; a configuration choice or scripting error can surface suddenly at higher loads. It may never have impacted the availability or performance of an application at “normal” load but brings it right to its knees at high load. For example, I was testing some application network infrastructure some years ago that exhibited strange behavior only when the device reached 48,000 concurrent users. I could reproduce the issue at will, and it led to the discovery that the device was not freeing up connections from its session state table. It was full and couldn’t handle any more. Once the engineers knew that, they could fix it, and it was able to handle hundreds of thousands of concurrent users. But it wasn’t a problem with the application, it was the infrastructure. Without the ability to stress that infrastructure beyond its typical usage we would never have found the problem and it would have been discovered, unhappily, by a real customer in a real situation.
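To make that kind of failure concrete, here is a toy Python sketch of a fixed-size session table that never frees closed connections. Everything in it is an assumption made for illustration: the `SessionTable` class, the `free_on_close` flag, and the `connections_until_failure` helper are hypothetical names, and this is not how the device I tested was implemented. The only figure borrowed from the story is the 48,000 capacity.

```python
class SessionTable:
    """Toy fixed-capacity session table (hypothetical, not any real device)."""

    def __init__(self, capacity, free_on_close=True):
        self.capacity = capacity
        self.free_on_close = free_on_close  # False simulates the defect
        self.entries = set()
        self.next_id = 0

    def open(self):
        # Refuse new connections once every slot is taken.
        if len(self.entries) >= self.capacity:
            raise RuntimeError("session table full")
        session_id = self.next_id
        self.next_id += 1
        self.entries.add(session_id)
        return session_id

    def close(self, session_id):
        # With the simulated defect, closed sessions are never removed.
        if self.free_on_close:
            self.entries.discard(session_id)


def connections_until_failure(table, total):
    """Open and immediately close `total` connections, one at a time.

    Returns how many were served before the table refused one,
    or `total` if none were refused.
    """
    for served in range(total):
        try:
            session_id = table.open()
        except RuntimeError:
            return served
        table.close(session_id)
    return total


if __name__ == "__main__":
    capacity = 48_000
    for free_on_close in (True, False):
        light = connections_until_failure(SessionTable(capacity, free_on_close), 10_000)
        heavy = connections_until_failure(SessionTable(capacity, free_on_close), 100_000)
        print(f"free_on_close={free_on_close}: light={light}, heavy={heavy}")
```

At the lighter load both tables behave identically, so a test that stops there passes either way. Only the heavier run separates the healthy table from the leaky one, which is the whole argument for pushing past typical usage.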

I was chatting with an old colleague who is now with a startup, SOASTA, that offers cloud-based stress testing of web applications. They’ve seen the same types of problems in production environments. While spinning up a web application for a very busy seasonal rush, an organization called on SOASTA to perform some stress testing. All was going well until the application hit heavy load – 80,0000 or more transactions a second – and then suddenly things went wonky (that’s official testing terminology there, really, trust me ;-)). The load balancer began sending all the requests to one pool, overwhelming it with connections and causing performance and availability to rapidly degrade. They quickly discovered the source of the problem (a conditional configuration error), addressed it and tried again the next night. Voila! The application and its supporting infrastructure performed as expected: supporting hundreds of thousands of users and transactions per second at a more than acceptable performance rate.
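A lopsided distribution like that is easy to spot if you watch where requests land during the test. The Python sketch below is one way to flag it, under stated assumptions: the input is simply a list of pool names, one per routed request, pulled from whatever logs you have, and the function names and the 50% tolerance are hypothetical choices of mine. It says nothing about how SOASTA or any particular load balancer actually reports this.

```python
from collections import Counter


def pool_shares(routed_pools):
    """Return each pool's fraction of the observed requests."""
    counts = Counter(routed_pools)
    total = sum(counts.values())
    return {pool: count / total for pool, count in counts.items()}


def overloaded_pools(routed_pools, expected_pools, tolerance=0.5):
    """List pools receiving more than their fair share of requests.

    A pool is flagged when its share exceeds an even split across
    `expected_pools` by more than `tolerance` (a relative margin,
    0.5 meaning 50% above fair share). Both are assumed thresholds.
    """
    if not routed_pools:
        return []
    fair_share = 1 / len(expected_pools)
    shares = pool_shares(routed_pools)
    return sorted(
        pool for pool, share in shares.items()
        if share > fair_share * (1 + tolerance)
    )


if __name__ == "__main__":
    pools = ["pool-a", "pool-b", "pool-c"]
    balanced = pools * 100        # requests spread evenly
    skewed = ["pool-a"] * 300     # every request sent to one pool
    print(overloaded_pools(balanced, pools))
    print(overloaded_pools(skewed, pools))
```

Run against samples taken at each load step, a check like this would show the point at which the distribution goes wonky, rather than leaving you to infer it from degraded response times.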


ENGAGE, NO. 1

I can’t stress enough (sorry, pun intended) the importance today of stress-testing web applications in production. If your business, whether because of revenue or presence, requires high availability and well-performing applications, then you can’t afford not to stress test your application before the inevitable occurs. It’s expensive to do it yourself, which is why so many web applications haven’t been tested to their full limits, but an increasing number of cloud-based services are making it not only affordable, but a no-brainer.

There are many other “EVENTS” that can drive usage of your web application to its limits. There are many other “EVENTS” that will drive your infrastructure past its limits, too, and those two limits may not be the same. It behooves any organization that depends on web applications to test both its application and infrastructure capacity before it buckles under the load of some “event” – the timing of which cannot be foreseen.

So get out there, take a deep breath, and boldly go where no production application has gone before: stressed to its limits purposefully just to see if – and when – it will break.

 

P.S. You get +8 geek points if you know the reference for EVENTS being capitalized. Tweet me if you think you know the answer.
