Amazon’s S3 outage over the weekend did not affect Gnip’s live service. Gnip uses S3 to archive and back up system state, but the live data flowing through Gnip was not affected because we keep it on local instances (memory/local disk). We weren’t able to back up data while S3 was down, but the outage was intermittent, so we ran our backups during the windows when S3 was online. Eventually the S3 outage was “over” and the balance between local storage and S3/remote storage was restored. At some magical point, if S3 simply weren’t coming back online, we’d move our backups to another service.
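To make that policy concrete, here is a minimal sketch in Python. It is not Gnip's actual code: `FlakyStore`, `BackupManager`, and the `failover_after_s` threshold are hypothetical names invented for illustration, and the stores are in-memory stand-ins rather than real S3 clients. It only shows the shape of the idea: hold snapshots locally while the remote store is down, flush them when it comes back, and switch to a second service once the outage has lasted past some chosen limit.

```python
import time
from collections import deque


class StoreUnavailable(Exception):
    """Raised by a store when the remote service cannot be reached."""


class FlakyStore:
    """In-memory stand-in for a remote store such as S3 (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.objects = {}

    def put(self, key, data):
        if not self.online:
            raise StoreUnavailable(self.name)
        self.objects[key] = data


class BackupManager:
    """Queues snapshots locally and fails over after a sustained outage."""

    def __init__(self, primary, secondary, failover_after_s, clock=time.monotonic):
        self.primary = primary
        self.secondary = secondary
        self.failover_after_s = failover_after_s  # assumed threshold, pick your own
        self.clock = clock
        self.pending = deque()   # snapshots held locally until a store accepts them
        self.down_since = None   # time the primary was first seen failing

    def backup(self, key, data):
        """Queue a snapshot and try to flush. Returns the store name used, or None."""
        self.pending.append((key, data))
        return self.flush()

    def flush(self):
        target = self._choose_target()
        try:
            while self.pending:
                key, data = self.pending[0]
                target.put(key, data)
                self.pending.popleft()  # drop only after a successful write
        except StoreUnavailable:
            if target is self.primary and self.down_since is None:
                self.down_since = self.clock()
            return None
        if target is self.primary:
            self.down_since = None  # primary is healthy again
        return target.name

    def _choose_target(self):
        # Move to the secondary only once the primary has been down long enough.
        if (self.down_since is not None
                and self.clock() - self.down_since >= self.failover_after_s):
            return self.secondary
        return self.primary
```

In this sketch the move to the secondary is one-way: once the threshold is crossed, backups stay on the other service until someone resets `down_since`. That is a deliberate simplification, and a real system would also need to bound how much it is willing to queue on local disk.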
Building scalable, redundant, highly available systems is the next big game. It actually has been for decades, but now a larger web application audience is becoming acutely aware of its importance and, consequently, of how to accomplish it. At the end of the day, everything fails. The game becomes isolating the weak points, buttressing the critical points of your service to ensure “instant” recovery from all the failures you can anticipate, and minimizing complete system setup/restart time in case everything craters and you have to scramble to come back online.
I hope Gnip never has its day in the searing outage sun, but we’re not naive.
Brush your teeth before bed, eat right, exercise, and eliminate your Single Points of Failure.