DevCentral > Weblogs > Lori MacVittie - Two Different Socks
 Putting a Price on Uptime
posted on Friday, October 16, 2009 3:15 AM

The cloud’s inability to distinguish illegitimate from legitimate requests could lead to unanticipated costs in the wake of an attack. How do you put a price on uptime and, more importantly, who should pay for it?

A “Perfect Cloud”, in my opinion, would be one in which the cloud provider’s infrastructure intelligently manages availability and performance so that, when necessary, new instances of an application are launched to meet the customer’s defined performance and availability thresholds. You know, on-demand scalability that requires no manual intervention. It just “happens” the way it should.

Several providers have all the components necessary to achieve a “perfect cloud” implementation, though for the nonce it may require that customers specifically subscribe to one or more of the necessary services. For example, if you combine Amazon EC2 with Amazon ELB, CloudWatch, and Auto Scaling, you’ve pretty much got the components necessary for a perfect cloud environment: automated scalability based on real-time performance and availability of your EC2-deployed application.

Cool, right?

Absolutely. Except when something nasty happens and your application automatically scales itself up to serve…no one.


AUTOMATIC REACTIONS CAN BE GOOD – AND BAD

BitBucket’s recent experience with DDoS shows that no security infrastructure is perfect; there’s always a chance that something will sneak by the layers of defense put into place by IT whether that’s in the local data center or in a cloud environment. The difference is in how the infrastructure reacts, and what it costs the customer.

Now, the DDoS that apparently targeted BitBucket was a UDP-based attack, meaning it was designed to flood the network and infrastructure and not the application. It was trying to interrupt service by chewing up bandwidth and resources on the infrastructure. Other types of DDoS, like a Layer 7 DDoS, specifically attack the application. Such an attack could consume the application’s resources, which in turn triggers the automatic scaling processes, which could result in a whole lot of money being thrown out the nearest window.

Consider the scenario:

  1. An application is deployed in the cloud. The cloud is configured to automatically scale up (launch additional instances) based on response time thresholds.
  2. A Layer 7 DDoS is launched against the application. Layer 7 DDoS is difficult to detect and prevent, and without the proper infrastructure in place it is unlikely to be detected by the infrastructure and even less likely to be detected by the application.
  3. The DDoS consumes all the resources on the application instance, degrading response time, so the infrastructure launches a second instance, and requests are load balanced across both application instances.
  4. The DDoS attack now automatically targets two application instances, and continues to consume resources until the infrastructure detects degradation beyond specified thresholds and automatically triggers the launch of another instance.
  5. Wash. Rinse. Repeat.
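
The loop in the scenario above can be sketched as a toy simulation. This is only an illustration under made-up assumptions: the function name, the linear response-time model, the per-minute billing, and every number passed in are hypothetical and do not describe any provider’s actual behavior or pricing.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    minute: int
    instances: int
    response_ms: float
    cumulative_cost: float


def simulate_autoscaling_under_attack(attack_rps, per_instance_rps, base_ms,
                                      threshold_ms, hourly_rate, minutes):
    """Toy model: every request is attack traffic; no legitimate users exist."""
    instances = 1
    cost = 0.0
    history = []
    for minute in range(1, minutes + 1):
        # The load balancer spreads the attack evenly across instances.
        load_per_instance = attack_rps / instances
        # Assumed model: response time grows linearly once an instance saturates.
        response_ms = base_ms * max(1.0, load_per_instance / per_instance_rps)
        # Assumed billing: per instance-minute (real providers may bill differently).
        cost += instances * hourly_rate / 60.0
        history.append(Snapshot(minute, instances, response_ms, cost))
        # Step 3 and 4 of the scenario: degraded response time launches an instance.
        if response_ms > threshold_ms:
            instances += 1
    return history


if __name__ == "__main__":
    run = simulate_autoscaling_under_attack(10000, 100, 100.0, 200.0, 0.10, 60)
    for snap in run[::10]:
        print(snap)
```

With these invented inputs the fleet keeps growing, one instance per minute, until 50 instances are absorbing the attack and the response time falls back to the threshold. Nothing in the loop ever asks whether the traffic is worth serving.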

How many instances would need to be launched before it was noticed by a human being and it was realized that the “users” were really miscreants?

More importantly for the customer, how much would such an attack cost them?


THIS SOUNDS LIKE A JOB FOR CONTEXTUALLY-AWARE INFRASTRUCTURE

The reason the perfect cloud is potentially a danger to the customer’s budget is that it currently lacks the context necessary to distinguish good requests from bad requests. Cloud today, and most environments if we’re honest, lacks the ability to examine requests in the context of the big picture. That is, it doesn’t look at a single request as part of a larger set of requests; it treats each one individually as a unique request requiring service by an application.
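
To make “a single request as part of a larger set of requests” concrete, here is a minimal sketch that remembers recent requests per client and flags a client whose rate inside a sliding window exceeds a limit. The class name, the window length, and the limit are all hypothetical. A real Layer 7 DDoS is usually distributed and may stay under any per-client limit, so treat this as an illustration of context rather than a defense.

```python
from collections import defaultdict, deque


class RequestContext:
    """Tracks recent request timestamps per client (hypothetical helper)."""

    def __init__(self, window_seconds, max_requests):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._seen = defaultdict(deque)

    def is_suspicious(self, client_id, timestamp):
        window = self._seen[client_id]
        window.append(timestamp)
        # Forget requests that have aged out of the sliding window.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        # The verdict depends on the set of requests, not on this one alone.
        return len(window) > self.max_requests


if __name__ == "__main__":
    ctx = RequestContext(window_seconds=10, max_requests=3)
    for t in (0, 1, 2, 3, 20):
        print(t, ctx.is_suspicious("client-a", t))
```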

Without the awareness of the context in which such requests are made, the cloud infrastructure is incapable of detecting and preventing attacks that could potentially lead to customers incurring costs well beyond what they expected to incur. The cost of an attack in the local data center might be a loss of availability; an application might crash and require the poor guy on call to come in and deal with the situation. But in terms of monetary costs it is virtually “free” to the organization, excepting the potential loss of revenue from customers unable to buy widgets who refuse to return later.

But in the cloud, this lack of context could be financially devastating. An attack moves at the speed of the Internet, and a perfect cloud is hopefully designed to react just as quickly. Just how many instances would be launched, incurring costs to the customer, before such an attack was detected? For all the monitoring offered by providers today it’s not clear whether any of them can discern an attack scenario from a seasonal rush of traffic, and it’s further not clear what the infrastructure would do about it if it could.

And once we add in the concept of intercloud, this situation could get downright ugly. The premise is that if an application is unavailable at cloud provider X according to the customer’s defined thresholds, requests would be directed to another instance of the application in another cloud, and maybe even a third cloud. How many cloud-deployed versions of an application could potentially be affected by a single, well-executed attack? The costs and reach of such a scenario boggle the mind.

My definition of a perfect cloud, methinks, needs to be adjusted slightly. A perfect cloud, in addition to automatically scaling an application to meet demand, must also be able to discern between illegitimate and legitimate users. It must provide the means by which illegitimate requests are ignored while legitimate requests are processed, and it must scale only when legitimate volumes of requests require it.
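
A minimal sketch of that adjusted definition: size the application from the requests judged legitimate and leave the rest out of the scaling decision. The function name and the capacity figure are hypothetical, and the sketch assumes some other component has already labeled each request as suspicious or not, which is the hard part.

```python
import math


def desired_instances(suspicious_flags, per_instance_capacity, min_instances=1):
    """Return the instance count needed for legitimate requests only.

    suspicious_flags: one boolean per request in the current interval,
    True when the request was judged illegitimate (assumed to come from
    a separate, context-aware classifier).
    """
    legitimate = sum(1 for suspicious in suspicious_flags if not suspicious)
    # Illegitimate requests are ignored, so they never trigger a launch.
    return max(min_instances, math.ceil(legitimate / per_instance_capacity))


if __name__ == "__main__":
    # 150 legitimate requests buried under 5000 attack requests.
    flags = [False] * 150 + [True] * 5000
    print(desired_instances(flags, per_instance_capacity=100))
```

In this made-up interval the attack traffic outnumbers real users by more than thirty to one, yet the scaling decision is driven by the 150 legitimate requests alone.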


PUTTING A PRICE ON UPTIME

The question I think many people have, and I know I certainly do, is who pays for the resulting cost of such an attack?

It’s often been said that it’s difficult if not impossible to put a price on downtime, but what about uptime? What about the cost incurred by the launch of additional instances of an application in the face of an attack? An attack that cannot be reasonably detected by an application? An attack that is clearly the responsibility of the infrastructure to detect and prevent; the infrastructure over which the customer, by definition and design, has no control?

Who should pay for that? The customer, as a price of deploying applications in the cloud, or the provider, as a penalty for failing to provide a robust enough infrastructure to prevent it?

Feedback


10/16/2009 5:11 AM
The party best equipped to deal with a risk should be the party bearing the costs. The provider is the only one able to do anything about it and worse, externalising it to the customer provides an incentive for the provider to turn a blind eye.

Sam
Sam Johnston

10/16/2009 5:20 AM
Great post, Lori.

I was looking at moving some web sites to a cloud service and while I found a few that were cost effective from a computing and storage point of view, I had a persistent voice in my head "What about a DDoS attack?" Autoscaling is certainly a big deal, but even without autoscaling, paying metered access for network traffic can be costly depending on how long the DDoS goes for. Autoscaling just makes it worse.
Mike Fratto

10/16/2009 7:56 AM
As a Service Provider, I would dread having to explain to the client why instead of their customary $3400 bill for December, they got a $250,000 invoice. Ouch!! Talk about a lump of coal in your stocking ;-)

I think the need for some kind of a governor on how much you are permitted to flex your cloud infrastructure is essential. I frankly rate the probability of a malicious attack as being much lower than a simple coding error in the application or the orchestration software. Many of the orchestration engines have scripting capabilities and event triggers that can be as complex as application code in some circumstances. I have seen far more runaway processes than I have denial of service attacks in my career.

I think an obvious measure to minimize the impact and increase the likelihood of early detection would be for cloud service providers to allow an organization to assign quotas on how many resources can be consumed. For example, you may indicate a hard upper limit of instances that can ever be spawned, or a "not to exceed" charge for services consumed in an hour. Alerting mechanisms could be simple and standardized (SMTP, SNMP, etc.) so that you got alerts at 75%, 90% and a hard fail at 100% of quota.

At the risk of shamelessly pushing CSC's Cloud Orchestration vision, I think this level of vision and transparency are ultimately going to prove essential to widespread cloud computing adoption that preserves economic value for the client. Really, instance sprawl is just a much slower-moving example of the kind of negative financial impact that can be incurred than the attack scenario you outlined. You need visibility to the costs in the cloud, but also the rate of change of the costs if you are to be able to properly manage the IT infrastructure of the future.

A great posting and a great summary of a problem that needs to be confronted and effectively addressed by consumers and service providers alike.

~Randy
Randy Arthur
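
The quota idea in the comment above (alerts at 75% and 90%, hard fail at 100%) could look something like the following sketch. The function names and status strings are hypothetical, and the delivery of alerts over SMTP or SNMP is left out.

```python
def quota_status(used, quota):
    """Classify consumption against a quota using the 75/90/100% levels."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    ratio = used / quota
    if ratio >= 1.0:
        return "hard-fail"
    if ratio >= 0.90:
        return "alert-90"
    if ratio >= 0.75:
        return "alert-75"
    return "ok"


def may_launch_instance(running, max_instances):
    """Hard upper limit: refuse to spawn once the ceiling is reached."""
    return running < max_instances


if __name__ == "__main__":
    for used in (50, 75, 90, 100):
        print(used, quota_status(used, quota=100))
    print(may_launch_instance(running=10, max_instances=10))
```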

10/16/2009 8:08 AM
@Randy

Thanks and you bring up a good point - the service provider isn't going to be happy with the situation *either*.

Maybe this is a place where controls are needed to cap costs/instances as well as alerting based on trending/thresholds? Something that allows customers to say the typical use of this application is about X requests / sec (minute/hour) and if that doubles, do not launch more instances but instead alert (provider, me, twitter, the press, whatever)?

The more I think about it the more I see that this is another one of those "requires manual intervention" situations that, despite all our technological capabilities for automation and codification of intelligence, are still necessary in many different aspects of IT.

Lori
macvittie

10/16/2009 9:50 AM
These concepts marry up real well with my concept for domains in an SOA, where each domain represents a set of attributes, such as a particular security level or quality-of-service, and services in that domain are guaranteed to adhere to those attributes.
JP Morgenthal

10/16/2009 4:32 PM
Randy--

I read your comment and the words "rollover minutes" sprang into my head. I imagine that providers with some flexibility in service provisioning (along the lines of Lori's thoughts) could include some upper-maximum in resources in a given month, with unused "service-minutes" rolling over to add to the ceiling in the next month, with some reasonable expiration policy to prevent people from building up impossible amounts of credit. Plus, there would be headroom in the system so the user could be warned when they started eating into their stash, so they could be prompted to provide more money, should they wish to do so.

~tom
Thomas Maufer

11/23/2009 3:02 AM
Interesting,

Keep up the good work...

Anyway, thanks for the post
Software companies UK

1/11/2010 3:21 AM
When Did Specialized Hardware Become a Dirty Word?
Lori MacVittie