Background
On Feb 1st, Gitlab suffered a irrecoverable data loss for a period of 6 hours.
https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
(In case that link goes stale, here’s a copy: https://gist.github.com/8b9449ec4260583d0e644c7cdc94f3be)
My first thought is that it’s a horrible experience both for the users who lost data and for the engineers involved in the process at Gitlab. The feelings of anger, self doubt and frustration are hard to bear. I wish them all the best in recovering and getting back to work. My heart goes out to them for this experience.
After being floored by the possibility of permanent data loss, my thoughts went next to consider how their experience could inform my team’s decisions with regard to our own processes.
None of this is intended to as backseat driving the situation Gitlab suffered. It is intended as constructive discussion of systems failing to discourage human error, of which we are all susceptible.
Summary of Events
The tl;dr was PG replica got behind. Engineer1 went into debugging after their shift was over. Then the engineer believed they were SSH’d into the replica, but were really SSH’d into the primary. At this point Engineer1 tried to run a command to start replication. They had trouble with command and assumed they needed to wipe out the data directory fully where postgres stores databases. They ran a variant of “rm -rf” and removed the 300GB of data. Engineer1 realized the issue and stopped the deletion when only a few gigabytes remained. The data was unrecoverable from data directory. At this point Engineer1 handed off the baton due to realizing the mistake and already being heavily fatigued.
Their 5 backup systems all failed them. Their latest mostly complete backup was 6 hrs out of sync. Their webhooks data is lost or 24 hrs out of sync.
Repeating that… all 5 backups failed! That is a very very worst case.
That said, their data from 24 hrs ago seemed like valid backups and their backup from 6 hours before was valid. That means backup system 6 and 7 were working decently.
Ways to Limit Risk in Future
My takeaways from their incident:
- Check your backup system works the way you think it does. Ideally this means occasional automated and manual occasions when backups are loaded into system and verified.
- Use buddy system when doing potentially dangerous things on production.
- This would lessen the likely of executing commands while SSH’d into wrong box
- Talk through actions before doing them when on production. Have team mate confirm each step.
- Take an airline pilot checklist approach to these situations to fend off some of the avoidable mistakes.
- Do not make big decisions under time crunch. The engineer was trying to leave at end of shift w/ hard stop timeline. They were rushed and stressed. Having replication lag way longer and handing off to other person could have offset the much worse disaster that they induced. Twelve hours of partially degraded service might be worthwhile trade instead of a complete loss of 6 hr of data.
- Tiredness leads to mistakes. Tap out and hand off the baton.
- Take a backup manually before operating like this on production systems. A 5 minute operate of streaming exporting via pg_dump to AWS S3 would help narrow the window from 6 hr loss to minutes or zero time (assuming app was in full maintenance mode during database replication). I take advantage of this technique before doing potentially destructive database actions. Create a full db snapshot if it’s a db level change or a table level snapshot if limited to single table. Commit your action, validate findings, and then wipe out the snapshots if space is precious.
Conclusions
Humans make mistakes when working with complicated systems. Well designed systems and policies help put safeguards in place to reduce the likelihood of irrecoverable & disasterous events.
I anticipate that the engineering team is working on a clear blameless post-mortem to bring closure to this event.If you’re unfamiliar with blameless post-mortems, check out this article by John Allspaw: https://codeascraft.com/2012/05/22/blameless-postmortems/. During the post-mortem they’ll identify the actions taken and circumstances of the incident along with systems and protocols that can be improved to make these circumstances likely to recur.
PS - I went and checked our various backups for production systems after this event. The hourly, daily, weekly, monthly backups are in good order for Mongo, Postgres and Redis. The automated backups of Redshift look good, as do the manual checkpoints from before major changes. The S3 copies that are permanently stored for varying durations for Mongo are in good shape as well. The realtime replication of Mongo to Postgres is in good shape and has preserved us from data loss when an incident occured. I’ll be ever nervous about data loss, but I think we’re in generally good shape.