I broke production

I made 20,000 API calls a minute and broke production for a day. Here's why the system was to blame, not me.

#1 (about 6 minutes)

A personal story of breaking production at scale

The speaker recounts causing a major production outage by running a backfill script that overwhelmed the Facebook API and halted data updates.

#2 (about 2 minutes)

Judging intentions versus actions during incidents

We tend to judge others by their actions but ourselves by our intentions, so we should assume good intent from colleagues during incidents.

#3 (about 2 minutes)

Why individual blame is a counterproductive response

A production issue is a system failure, not an individual's fault, because responsibility is shared across developers, reviewers, and processes.

#4 (about 3 minutes)

How to build a psychologically safe blameless culture

Shifting to a blameless culture requires fostering trust, understanding intentions, practicing self-awareness, and owning mistakes without displacing frustration.

#5 (about 2 minutes)

Using blameless postmortems for system-level learning

Blameless postmortems, a practice that originated in aviation and healthcare, focus on investigating root causes to strengthen systems rather than assigning individual blame.

#6 (about 3 minutes)

The power of positive feedback in code reviews

Applying the five-to-one ratio of positive to negative interactions can improve team dynamics, especially by adding positive comments during code reviews.

#7 (about 2 minutes)

Using pre-mortems to proactively prevent failures

Pre-mortems are a proactive exercise where teams imagine a project has already failed in order to identify potential risks and edge cases beforehand.

#8 (about 3 minutes)

Incident resolution and key cultural takeaways

The incident took 20 hours to fully resolve but was a valuable learning experience that exposed system flaws and reinforced a healthy team culture.

#9 (about 2 minutes)

Q&A on customer impact and worst production breaks

The speaker answers audience questions about customer reactions to the outage and shares a story about his worst production break involving a failed form.