DevCentral > Weblogs > Lori MacVittie - Two Different Socks
 How Sears Could Have Used the Cloud to Stay Available Black Friday
posted on Wednesday, December 03, 2008 3:10 AM

Predictions of the death of online shopping this holiday season were, apparently, greatly exaggerated. As has been reported, Sears, along with several other well-known retailers, fell victim to heavy traffic on Black Friday. One wonders whether reports of a dismal shopping season this year due to economic concerns led retailers to believe there would be no seasonal rush to online sites, and therefore that preparing for sudden spikes in traffic was unnecessary.

[Image: Firebug capture of the sears.com home page]

Most of the 63 objects (375 KB of total data) that make up the sears.com home page are served from sears.com and are images, scripts, or stylesheets. The rest of the site is similar, with static data making up a large portion of the objects.

That's a lot of static data being served, and a lot of connections required on the servers just for one page.

Without knowing Sears' internal architecture, I can't rule out that they are already using application delivery and acceleration solutions to ensure the availability and responsiveness of their site. If they aren't, they should be, because even the simple connection optimizations available in today's application delivery controllers would likely have drastically reduced the burden on servers and increased the capacity of their entire infrastructure.

But let's assume they are already using application delivery to its fullest and simply expended all possible capacity on their servers despite their best efforts due to the unexpected high volume of visitors. It happens. After all, server resources are limited in the data center and when the servers are full up, they're full up.

Assuming that Sears, like most IT shops, isn't willing to purchase additional hardware and incur the associated management, power, and maintenance costs over the entire year simply to handle a seasonal rush, they still could have prepared for the onslaught by taking advantage of cloud computing.

Cloudbursting is an obvious solution: visitors who would have pushed Sears' servers over capacity would instead have been automatically directed, via global load balancing techniques, to a cloud-hosted version of the site. Not only could Sears have stayed available, it could also have improved performance for all visitors, because cloudbursting can use a wide array of variables to determine when requests should be directed to the cloud, including performance-based parameters.
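To make the decision logic concrete, here is a minimal sketch in Python of how a global load balancing layer might choose between the data center and the cloud. It is only an illustration: the `SiteHealth` fields, the `choose_site` function, and both thresholds are hypothetical names and values, not any product's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class SiteHealth:
    """Point-in-time measurements for the primary data center (hypothetical fields)."""
    active_connections: int
    max_connections: int
    avg_response_ms: float


def choose_site(health: SiteHealth,
                max_utilization: float = 0.85,
                max_response_ms: float = 800.0) -> str:
    """Return "datacenter" or "cloud" for the next new visitor.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    utilization = health.active_connections / health.max_connections
    # Burst when the data center is close to full...
    if utilization >= max_utilization:
        return "cloud"
    # ...or when it still has room but is responding too slowly.
    if health.avg_response_ms > max_response_ms:
        return "cloud"
    return "datacenter"
```

The second check is what lets cloudbursting help performance as well as availability: traffic can be redirected on response time alone, before the servers are actually full.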

A second option would have been a hybrid cloud model, where certain files and objects are served from the local data center while others are served from the cloud. Instead of being served from Sears' internal servers, static stylesheets and images could easily have been hosted in the cloud. That would mean fewer requests to sears.com internal servers, which reduces the processing power required per visitor and leaves the servers with more capacity.
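One way to picture the hybrid split is as a rule applied when the page's URLs are generated. The Python sketch below assumes that static objects can be recognized by file extension; the host names and the `host_for` and `asset_url` helpers are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# File types treated as static (an assumption; adjust for the real site).
STATIC_EXTENSIONS = {".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico"}

CLOUD_HOST = "static.example.invalid"   # hypothetical cloud-hosted asset host
ORIGIN_HOST = "www.example.invalid"     # hypothetical data center host


def host_for(url_path: str) -> str:
    """Pick the host that should serve this path."""
    path = urlsplit(url_path).path          # drop any query string
    suffix = PurePosixPath(path).suffix.lower()
    return CLOUD_HOST if suffix in STATIC_EXTENSIONS else ORIGIN_HOST


def asset_url(url_path: str) -> str:
    """Build the absolute URL to emit in the page markup."""
    return f"https://{host_for(url_path)}{url_path}"
```

With this rule, stylesheets and images never touch the origin servers, while dynamic paths such as a cart or checkout page still do.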

I suppose a third option would have been to commit fully to the cloud and move their entire application infrastructure there. But even though adoption appears to be imminent for many enterprises, according to attendees at the Gartner Data Center Conference, 2008 is certainly not "the year of the cloud." There are still quite a few kinks in full adoption plans, such as compliance and integration concerns, that need to be ironed out before folks can commit fully.

Still, there are ways that Sears, and any organization with a web presence, could take advantage of the cloud without committing fully to ensure availability under exceedingly high volume. It just takes some forethought and planning.

Yeah, I'm thinking it too, but I'm not going to say it either.

Feedback


12/3/2008 7:43 AM
An interesting claim considering that, as the author states, he knows nothing of the internal architecture of the Sears website.

A couple of things I do know about large sites, however.

1. Even the home page often requires access all the way back through the system (Webserver-WebChannel-BLS-database) in order to create a persistent session for the user at the start (and to check whether a persistent session already exists for the user). And even stuff that really looks to be static often isn't.

2. F5s still don't understand the need for limiting sessions to a website. They understand simultaneous connections fine, but in the web world that's almost completely useless. It's a waste of time harping on about sites that should 'Use The Cloud' like it's some sort of demi-god able to create capacity out of nothing when we can't get the support from the load balancer to manage the number of sessions coming into a site (which would make it a doddle to throw new sessions a 'hold-on we're busy' page until there was a free session and spare capacity for it).
Hamish

12/3/2008 8:22 AM
@Hamish,

Yes, *she* knows nothing of the internal architecture. Which is why assumptions were made. Because we don't know and even if we did, we wouldn't be sharing that info publicly.

Yes, much of what looks static isn't. However, if you look closely at the Firebug capture you can see that much of the page likely is static. It's rare that CSS and images are dynamically generated.

Persistent sessions. Ah, yes, persistence. Something we've written a lot about in the past.

Persistence vs Persistent

Enabling Session Persistence with iRules

Session persistence based on source IP

Affinity|Session Persistence

BIG-IP doesn't understand session limiting unless you tell it to care about session limiting. Off the top of my head there are two ways to do this:

1. The app server must be able to indicate that it is reaching session capacity so the BIG-IP can use that information to stop sending it requests. This could be done through monitoring using a page that indicates how many more sessions could be handled, and then BIG-IP can act upon that information.

2. Using the persistence tables in iRules or an HTTP derived "session specific" statistics profile, you can track the number of sessions on any given server and use that information in determining which servers (if any) are available.
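The session-counting idea behind both options can be sketched in a few lines of plain Python. This is not an iRule or BIG-IP configuration, only an illustration of the logic; the `SessionLimiter` class and its parameters are hypothetical.

```python
import time
from typing import Dict, Optional


class SessionLimiter:
    """Tracks active sessions and admits new ones only while capacity remains."""

    def __init__(self, max_sessions: int, idle_timeout_s: float) -> None:
        self.max_sessions = max_sessions
        self.idle_timeout_s = idle_timeout_s
        self._last_seen: Dict[str, float] = {}  # session id -> last request time

    def _expire(self, now: float) -> None:
        # Drop sessions that have been idle longer than the timeout.
        stale = [sid for sid, seen in self._last_seen.items()
                 if now - seen > self.idle_timeout_s]
        for sid in stale:
            del self._last_seen[sid]

    def admit(self, session_id: str, now: Optional[float] = None) -> bool:
        """Return True to forward the request, False to serve a "we're busy" page."""
        now = time.monotonic() if now is None else now
        self._expire(now)
        # Existing sessions are always let through; new ones need a free slot.
        if session_id in self._last_seen or len(self._last_seen) < self.max_sessions:
            self._last_seen[session_id] = now
            return True
        return False
```

The two constructor arguments are exactly the environment-specific values in question: how many sessions a server can hold and how long an idle session continues to count against that limit.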

I'm sure many folks would prefer this functionality be "built in," but it's so variable depending on the configured session length, size of data stored in sessions, and so on that it's not something that's easily abstracted and genericized. Better to give customers the ability to configure this specifically for their environment than to pre-bake a solution that only solves the problem for a few customers.

Thanks!
Lori
Lori MacVittie

12/4/2008 3:57 AM
I agree with your approach here. When providing horizontal scalability you need to inspect every facet of a complex n-tiered system. Reducing the number of static objects, or employing advanced caching techniques for them, could help.

It is difficult to say exactly what the root cause of the slowdowns was. However, my experience leads me to believe that a database transaction or some other backend integration point was having difficulty.
John D'Esposito - Techout.com

12/4/2008 5:10 AM
Actually, speaking as a former architect for Walmart.com and knowing several of the guys over at Sears and how smart and experienced they are, I would have to say the following:

1. We never had load balancing problems. We had a sweet little trick that made Cisco or BIG-IP-type devices much less critical to our capacity.
2. Cloud bursting wasn't around but wasn't necessary. These sites are over-provisioned (running at 30% utilization most of the time). And all the static content is being cached or offloaded to servers other than those running the app.

So what is the problem? The app server layer is doing lots of heavy-weight I/O to the DB, JMS, underlying web services and more.

The #1 bottleneck for me back when working on websites was always Oracle. We partitioned the DB (ex: put half your users in each of 2 db instances and you get twice the capacity). We offloaded the DB by creating non-relational storage and coordination engines (sleepycat, NFS servers, in-memory caching). And we cached the DB on a fine-grained basis.
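As a rough Python sketch of the two ideas in this paragraph, partitioning users across database instances and caching reads on a fine-grained basis: the names below (`shard_for`, `ShardedUserStore`, the loader callables standing in for database queries) are illustrative assumptions, not the actual Walmart.com design.

```python
import hashlib
from typing import Callable, Dict, List


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user to a database instance with a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedUserStore:
    """Read-through cache in front of several database partitions."""

    def __init__(self, shards: List[Callable[[str], dict]]) -> None:
        self._shards = shards               # one loader per DB instance (stand-ins)
        self._cache: Dict[str, dict] = {}   # per-user (fine-grained) cache
        self.db_reads = 0                   # counts how often the DB is actually hit

    def get(self, user_id: str) -> dict:
        if user_id in self._cache:
            return self._cache[user_id]     # served without touching the DB
        loader = self._shards[shard_for(user_id, len(self._shards))]
        self.db_reads += 1
        record = loader(user_id)
        self._cache[user_id] = record
        return record
```

A real system also has to invalidate cached records on writes, and simple modulo partitioning remaps most users if the number of instances changes; both are left out of this sketch.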

But you would be amazed how much workload a modern machine can handle if it's not I/O bottlenecked to back-end data stores. In 2002, a 1.2GHz machine running Tomcat could render 2K pages / sec. Running 100 servers gives you 200K requests / sec, which is big enough for the top 100 web sites in the world. But once the full-stack app starts talking to the DB, each server starts generating more like 4 or 10 or 50 requests / sec.
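The arithmetic is worth spelling out. Taking the 200K requests / sec figure above as the target, this short Python calculation shows how many servers are needed at each of the per-server rates mentioned (the `servers_needed` helper is just for illustration):

```python
import math


def servers_needed(target_rps: float, per_server_rps: float) -> int:
    """Servers required to sustain target_rps, rounding up to whole machines."""
    return math.ceil(target_rps / per_server_rps)


TARGET_RPS = 200_000  # the aggregate figure quoted in the text

for per_server in (2000, 50, 10, 4):
    print(f"{per_server:>5} req/s per server -> "
          f"{servers_needed(TARGET_RPS, per_server):>6} servers")
```

At 2,000 requests / sec per server the answer is 100 machines; at 50, 10, and 4 requests / sec it becomes 4,000, 20,000, and 50,000 machines respectively.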

That's the trick. Get per-server requests / sec efficiency as high as you can by alleviating I/O bottlenecks to expensive data stores like Oracle and you can go as fast and scalable as you want. Don't worry about load balancing or cloud bursting as much. At least that's my experience.

Here's the key question: how do I offload the DB without needing to do as much work or be as experienced as Amazon.com, EBay.com, Walmart.com, etc.? That's what I spend my days on now...building a system to offload the DB that everyone can use that really frees the application to scale and simultaneously is very easy to operate / run.

Actually I would assert that just having an HTTP Session clustering / replication layer that works (scales linearly with # of nodes and # of users) would make life very easy for most apps.

Cheers,

--Ari
Ari Zilka

12/4/2008 10:05 AM
From what I'm reading and hearing, there is not much security in the cloud. What I mean by this is that I'm not seeing how you can keep your servers separated from other cloud clients' servers. How does this play into PCI compliance and security of trade secrets? I'm interested in hearing more about this topic.
Tom

12/4/2008 11:48 AM
@Tom

That's an excellent point. You can't keep your servers separated from other cloud client servers, because all the physical hardware is a shared resource, necessarily. The separation occurs at the virtual machine layer, if virtualization is the core technology on which the cloud is built.

The jury is still out regarding applications and PCI/SOX compliance in the cloud. While some PCI requirements, such as application firewall/code scanning, could certainly be addressed in the cloud, other points in the requirements may not be adequately addressed there.

Lori
Lori MacVittie

12/4/2008 11:52 AM
@Ari

Regarding your point #1: aren't you going to share what it was?? Shame on you for piquing our interest and not sharing! :-)

Regarding I/O bottlenecking: Amen. I agree that the JMS/middleware/database layers are the real bottlenecks in most high volume situations. It's always been a bit surprising that database caching solutions never made an impact on the market as they really had an interesting solution to the problem that could certainly alleviate some (not all) of the problems associated with these layers of the app infrastructure.
Lori MacVittie