More than a decade ago, the concept of the ‘blameless’ postmortem revolutionized the way tech companies acknowledge failures at scale. Coined by John Allspaw during his time at Etsy, the idea behind postmortems was to control the natural instinct to point fingers when an incident occurs. Instead of blaming individuals, the focus should be on understanding how the accident occurred, treating everyone involved with respect, and learning from the event.
Looking back at some of the most honest and blameless public postmortems from recent years, we can glean valuable insights.
In 2017, GitLab suffered a significant outage that resulted in the loss of 300GB of user data in seconds. The incident was triggered by the failure of the secondary database to sync changes due to increased load. As engineers attempted to manually resync the database, a mistake led to the accidental deletion of a large amount of user data. Despite challenges with backups and slow network disks, the team managed to fully restore the service after 18 hours.
From GitLab’s postmortem, we learned the importance of analyzing root causes with the “Five whys” method, sharing a roadmap of improvements with customers and the public, and assigning ownership to backups.
In a similar vein, Tarsnap, a one-person backup service, experienced a complete outage in the summer of 2023. Despite the catastrophic filesystem damage that required rebuilding the service from scratch, user backups were safe thanks to the use of S3 object storage. After meticulous data restoration efforts, Tarsnap was back up and running.
Lessons from Tarsnap’s outage include regularly testing disaster recovery plans, updating processes to match technological advancements, and incorporating human checks into automated recovery processes.
In another instance, Roblox faced a 73-hour outage in 2021 due to contention issues in the Consul cluster. By identifying and addressing root causes related to contention and downstream bugs, the team was able to restore service after extensive collaboration and troubleshooting.
From Roblox’s postmortem, we learned the importance of avoiding circular telemetry systems and looking beyond immediate issues to identify interconnected root causes.
Overall, these transparent and insightful postmortems offer valuable lessons for tech companies in handling failures with honesty, accountability, and a focus on continuous improvement.