The Dark Side of the Cloud

Cloud computing is often seen as a silver bullet for traditional IT problems. The flexible nature of this virtual outsourcing model lets your staff focus on writing great applications, making systems management somebody else's problem. And since pretty much everything can be managed in software, it should be easy to recover from even quite severe hardware outages.

Except it doesn't always work that way.

Amazon.com (NASDAQ: AMZN) suffered another long outage in its suite of cloud services yesterday. The event highlights some best practices for using the cloud. We just caught a glimpse of which online businesses are managing their cloud systems the right way -- and which are cutting corners where it might hurt.

What happened?
This day-long outage was a very different beast from the widely covered Amazon Web Services downtime we saw last spring. That time, a massive thunderstorm took Amazon's East Coast AWS data center off the grid, causing downtime for tons of major customers. I'd argue that the weather is a bit beyond Amazon's control.

This one's different -- but not unique.

Amazon's Elastic Block Store service started suffering performance problems around 1:30 p.m. EDT. The problem spread to a larger number of customers over the next few hours, and when these chunks of data storage became unusable, they took down whatever AWS services had been designed to use the bad storage.

That's a big problem, but not necessarily a show-stopper. Properly designed cloud services and applications could mirror their block storage to other data centers in Amazon's fold, or even to another company's storage service. Many programming tool kits make it trivially simple to back up your vital Amazon-hosted apps to alternatives such as Rackspace Hosting (NYSE: RAX). You do end up paying a bit more when storage is backed up to another location, but that's hardly news -- tape drives and racks full of backup hard drives aren't free, either.
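To make the mirroring idea concrete, here is a minimal sketch in Python. It is not Amazon's or Rackspace's API: the InMemoryStore and MirroredStore classes are hypothetical stand-ins I made up for illustration, with a plain dictionary playing the role of each storage location. The point is only the pattern: write every object to more than one place, and read from whichever copy still answers.

```python
class StorageUnavailable(Exception):
    """Raised when a storage location cannot serve requests."""


class InMemoryStore:
    """Hypothetical stand-in for one storage location (a data center or provider)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True  # flip to False to simulate an outage

    def put(self, key, value):
        if not self.healthy:
            raise StorageUnavailable(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise StorageUnavailable(self.name)
        return self.data[key]


class MirroredStore:
    """Writes to every location; reads from the first one that has the object."""

    def __init__(self, stores):
        self.stores = list(stores)

    def put(self, key, value):
        written = 0
        for store in self.stores:
            try:
                store.put(key, value)
                written += 1
            except StorageUnavailable:
                continue  # skip the broken location, keep mirroring elsewhere
        if written == 0:
            raise StorageUnavailable("no storage location accepted the write")
        return written

    def get(self, key):
        for store in self.stores:
            try:
                return store.get(key)
            except (StorageUnavailable, KeyError):
                continue  # location is down or lacks the object; try the next
        raise StorageUnavailable(f"no healthy copy of {key!r}")


if __name__ == "__main__":
    primary = InMemoryStore("primary")
    mirror = InMemoryStore("mirror")
    store = MirroredStore([primary, mirror])

    store.put("orders", b"order data")
    primary.healthy = False       # simulate the primary going bad
    print(store.get("orders"))    # still served, from the mirror
```

A real setup would also have to copy missed writes back to a location once it recovers, which this sketch leaves out. That bookkeeping, plus the second storage bill, is the "bit more" you pay for staying online.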

Now, Amazon's service agreement for AWS and related products promises 99.95% uptime, or no more than about 4.4 hours of downtime per year. The company could probably dodge that commitment here, since the uptime promise is technically broken only when the service is classified as "unavailable." That didn't happen here, as Amazon slapped a "degraded performance" label on the event instead. But I still expect the company to send some service fee refunds this month.
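For anyone who wants to check the downtime math, here is a quick back-of-the-envelope sketch in Python. The function name is my own, and I'm assuming a plain 365-day year; the agreement's actual measurement period and exclusions may differ.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget implied by an uptime percentage."""
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    budget = allowed_downtime_hours(99.95)
    print(f"99.95% uptime allows about {budget:.2f} hours of downtime per year")
```

The budget works out to 8,760 x 0.0005 = 4.38 hours a year, close to the figure quoted above. A day-long outage would blow through that several times over if it counted as "unavailable."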

Why? Because this isn't the first time we've seen pretty much exactly this scenario. The same thing happened in 2011, and that four-hour outage yielded 10 days of service refunds for affected customers. That spot of trouble was caused by a bungled network equipment upgrade; I'm hoping the root cause is different this week. You'd expect Amazon to learn from its mistakes, after all.

Lessons learned
So here's the deal. Amazon promises near-flawless performance, but nobody's perfect. These things happen. And if your business depends on one tool in one place, with no workaround when things go bad, you kind of deserve to suffer. Some corners just can't be cut without suffering the consequences.

Who saw their sites go down in a blaze of cost-saving regret? Slaps on the wrist go out to Reddit, Foursquare, Pinterest, and Imgur, as well as the fantastically popular multiplayer game Minecraft. Boo, hiss, get a clue!

But you know the bright side of that ignominious list? You don't own shares in any of these companies.

That's right -- every service that went down (to the best of my knowledge) belongs to a private company with no responsibility to public shareholders. They may have ticked off their private equity owners and perhaps lost a few loyal users, but the big names that really, really need their stuff to work had already designed their wares the right way.

Netflix (NASDAQ: NFLX) is perhaps Amazon's best-known customer, since the company offloaded most of its IT needs to the AWS platform. Pretty much everything except actual movie streams flows through Amazon's servers, and Netflix ticked along undisturbed on Monday. That's hardly surprising, given that Netflix plans for mishaps and even randomly breaks stuff on purpose -- there's just no substitute for hands-on experience with unlikely problems.
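Breaking stuff on purpose sounds exotic, but the core idea fits in a few lines. The Python sketch below is a toy of my own, not Netflix's tooling: the fleet is just a dictionary of hypothetical server names, and "breaking" a server means flipping its flag to down. The test is whether the service as a whole survives.

```python
import random


def break_one_at_random(instances, rng):
    """Mark one randomly chosen healthy instance as down and return its name."""
    healthy = [name for name, up in instances.items() if up]
    if not healthy:
        return None  # nothing left to break
    victim = rng.choice(healthy)
    instances[victim] = False
    return victim


def service_is_up(instances):
    """The service survives as long as at least one instance is healthy."""
    return any(instances.values())


if __name__ == "__main__":
    rng = random.Random()  # pass a seed for repeatable drills
    fleet = {"web-1": True, "web-2": True, "web-3": True}
    victim = break_one_at_random(fleet, rng)
    print(f"took down {victim}; service still up: {service_is_up(fleet)}")
```

With three servers, losing one is a non-event. Run the same drill against a fleet of one and you have recreated the outage described above.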

When NASDAQ OMX Group (NASDAQ: NDAQ) built its Market Replay service, it needed to store "ten years of historical tick data down to the millisecond." Amazon's platform was a natural tool for the job, combining low cost with simple management -- and massive scale. The service took a licking and just kept on ticking. Nasdaq is now thinking about moving more services into the cloud, encouraged by Market Replay's success. And I don't think this week's events changed its mind about that.

The list of major customers goes on for miles. Spotify runs its music services right off Amazon's cloud storage. Washington Post (NYSE: WPO) uses AWS for data management. Pfizer (NYSE: PFE) runs large-scale research there. None of these customers complained about hiccups on Monday, and they'll continue to buy cloud services with confidence.

The end of cloud computing as we know it?
So if you were getting ready to sell or short Amazon and Rackspace based on the vulnerability you saw yesterday, I'd ask you to relax that trigger finger and back away slowly.

Cloud services are only scary if you don't know how to manage them properly. The big boys are already perfectly capable of handling temporary outages like this one, and the rest will learn from their early mistakes.
