The recent Amazon Web Services (AWS) outage was one of the most severe to date. It lasted upwards of 3 days for some companies and applications built on top of AWS.
Some of the prominent services and applications partially or fully affected by it were:
- Quora (www.quora.com)
- Reddit (www.reddit.com)
- Heroku (www.heroku.com)
Surprisingly, services like Netflix (www.netflix.com), which are heavily dependent on AWS, saw little or no reported outage.
Heroku (see http://status.heroku.com/incident/151) maintained a great status page for this incident and provided a lot of insight into both the problems they faced and how they tackled them. Heroku’s escalation procedures and decision matrix are absolutely exemplary.
Some observations
- AWS still remains one of the most robust Cloud Services providers. Their last critical outage was in June 2008.
- Services like EC2 were restored in a few hours.
- Organisations which chose “Multi-Region Redundancy” over “Multi-Zone Availability”, e.g. Netflix, did not face any issues.
- Dependence on specific AWS services like EBS caused a longer outage for some providers.
- Amazon needs to be more transparent about the internal architecture it uses to provide its services, in order to shore up consumer confidence.
- Having your own organisational cloud DR (Disaster Recovery) plan in place is absolutely necessary.
- If you are looking at porting a critical application/service to the cloud, then a “Multi” Cloud approach would be very prudent.
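
The redundancy observations above can be illustrated with a small failover routine. This is only a minimal sketch, not how Netflix or Heroku actually implement it: the endpoint URLs and the `pick_endpoint` helper are hypothetical, and the health check is passed in as a function so that any real probe (an HTTP ping, a database query) could be substituted.

```python
from typing import Callable, Optional, Sequence

# Hypothetical deployment: the same service running in two regions
# and on a second cloud provider, listed in order of preference.
ENDPOINTS = [
    "https://us-east.example.com",
    "https://us-west.example.com",
    "https://other-cloud.example.com",
]


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health probe that raises is treated as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Simulate an outage of the primary region.
    def us_east_is_down(url: str) -> bool:
        return "us-east" not in url

    print(pick_endpoint(ENDPOINTS, us_east_is_down))
```

With the primary region simulated as down, the routine falls through to the next region in the list. If every region of one provider fails, the last entry (a second provider) is what keeps the service reachable, which is the point of the “Multi” Cloud observation.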
References