When Amazon.com's (NASDAQ:AMZN) Web Services infrastructure suffered cascading problems on Sunday, Sept. 20, one notable media outlet, techtimes.com, reported it as a "monstrous" outage.
Um, no. The Internet did not fall to pieces.
Shoppers still browsed Amazon, Tinder users still found dates, Redditors... did something else, and my wife and I still managed to binge-stream episodes of Person of Interest via Netflix (NASDAQ:NFLX), the highest-profile site affected by the outage.
How it happened
According to Amazon's account of the outage, AWS's DynamoDB experienced what the company called a "service event" due to problems with how the database handles metadata -- i.e., data that describes data -- and the storage servers used to capture information in tables.
In its postmortem, Amazon said:
On Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by storage servers. As a result, some of the storage servers were unable to obtain their membership data, and removed themselves from taking requests.
Why you should care
Still confused? In the simplest terms, DynamoDB "timed out" over and over and over again, leading to a cascading flow of problems. Think of it as the digital equivalent of clogged pipes. Eventually, nothing gets through.
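For the code-minded, here is a toy sketch of that timeout behavior. It is not Amazon's implementation: the server names, response times, and 100 ms limit are all invented for illustration. It only models the one step the postmortem describes, where a storage server that can't fetch its membership data in time stops taking requests.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    in_service: bool = True


def refresh_membership(servers, response_times_ms, timeout_ms):
    """Pull any server whose metadata lookup exceeded the timeout out of service."""
    for server in servers:
        if response_times_ms[server.name] > timeout_ms:
            # The server could not get its membership data in time,
            # so it removes itself from taking requests.
            server.in_service = False
    return [s.name for s in servers if s.in_service]


# Hypothetical fleet and made-up metadata response times (milliseconds).
servers = [StorageServer(f"storage-{i}") for i in range(4)]
response_times = {"storage-0": 40, "storage-1": 250, "storage-2": 310, "storage-3": 60}

available = refresh_membership(servers, response_times, timeout_ms=100)
print(available)  # ['storage-0', 'storage-3']
```

Half the imaginary fleet drops out after a single slow round of lookups, which is the clogged-pipe effect in miniature.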
Frankly, I think Amazon did about as well as it could have under the circumstances. Outages happen. Only the (small-f) foolish don't plan for them.
In Amazon's case, it took about six hours to recover to full service -- not quite the scale of the outage that plagued Netflix on Christmas Eve 2012. That one cut off viewers for 7.5 hours, making grinches out of millions who hoped to celebrate the holiday spirit watching Netflix.
Why you should love Netflix even more today
Netflix had far fewer problems this go-round. We didn't even notice. To be fair, that's at least partially because of where we live -- Colorado -- as most of the problem was centered in Amazon's East Coast data center operations. Yet that's not the only reason; Netflix has created a whole system to account for problems in delivering streams to its 60 million-plus customers around the world.
According to an insightful account at TechRepublic, the company has built automated troublemakers dubbed its "Simian Army." Their only job: poke, prod, and otherwise try to break the network and expose weaknesses. You know how the government hires hackers? The Simian Army is to Netflix what hackers are to the feds; they help surface and fix issues before they become problems.
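To make the idea concrete, here is a minimal sketch of one round of that kind of deliberate breakage. This is not Netflix's tooling; the instance names, the `run_chaos_round` helper, and the "healthy if anything is still running" rule are all assumptions for illustration.

```python
import random


def service_is_healthy(instances):
    """Assume the service survives as long as at least one instance is running."""
    return any(instances.values())


def run_chaos_round(instances, rng):
    """Terminate one randomly chosen running instance and return its name."""
    running = [name for name, up in instances.items() if up]
    victim = rng.choice(running)
    instances[victim] = False
    return victim


# Hypothetical three-instance fleet; True means the instance is running.
fleet = {"api-1": True, "api-2": True, "api-3": True}
rng = random.Random(42)  # seeded so the experiment is repeatable

victim = run_chaos_round(fleet, rng)
print(f"terminated {victim}; service healthy: {service_is_healthy(fleet)}")
```

The point of running something like this on purpose is that a fleet unable to survive losing one instance fails the test on a quiet afternoon rather than during a real outage.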
It's because of this level of "chaos engineering," as Netflix calls it, that the company was able to easily weather the fallout of the DynamoDB outage. And it's why I still won't sell my shares, despite the clear premium at which they trade.
According to TechRepublic, Netflix was able to quickly redirect traffic to data centers in an unaffected region. "Netflix was able to do this because it practices what it refers to as multi-region, active-active replication - where all of the data needed for its services is replicated between different AWS regions in a way that allows rapid recovery from failures," TechRepublic reported.
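A rough sketch of the active-active idea follows, under heavy simplifying assumptions: the `Region` class, the region labels, and the sample key are made up, and writes are copied to every region instantly, whereas real cross-region replication is asynchronous and far messier. It is meant only to show why a full copy of the data in each region makes redirecting traffic possible.

```python
class Region:
    """A hypothetical region holding its own full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.store = {}


def replicated_write(regions, key, value):
    """Apply the write in every region so each one can serve it later."""
    for region in regions:
        region.store[key] = value


def read_with_failover(regions, preferred, key):
    """Serve from the preferred region if healthy, otherwise the next healthy one."""
    # Stable sort: the preferred region moves to the front, the rest keep their order.
    ordered = sorted(regions, key=lambda r: r.name != preferred)
    for region in ordered:
        if region.healthy:
            return region.name, region.store[key]
    raise RuntimeError("no healthy region available")


# Illustrative region labels only.
regions = [Region("us-east-1"), Region("us-west-2"), Region("eu-west-1")]
replicated_write(regions, "profile:42", "resume at episode 7")

regions[0].healthy = False  # simulate trouble on the East Coast
served_by, value = read_with_failover(regions, "us-east-1", "profile:42")
print(served_by, value)  # us-west-2 resume at episode 7
```

Because every region already holds the data, failing over is a routing decision rather than a restore job.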