The Safety Leak | Article 1: The Blame Gradient
I’ve learned that the fastest way to kill my team’s telemetry is to treat a technical failure as a character flaw.
When a production incident happens, my human instinct is to find the “who.” I want to know whose hands were on the keyboard, whose PR was approved, and whose judgment lapsed. I call this The Blame Gradient. It’s the tendency for fault to slide down the org chart until it hits an individual.
In a high-safety environment, the gradient is flat. I look at the system. In a low-safety environment, the gradient is steep. I look for a person to blame.
The “Who” vs. The “How”
When I ask “Who did this?” I am signaling to my team that I believe our systems are perfect and they are the only broken variable. This is almost never the case.
Engineering is about managing complexity. If a single engineer can take down a cluster by running a command, I don’t have an “engineer error”—I have a Design Error.
- The Blame Approach: I discipline the engineer. Result: Everyone else learns to hide their mistakes, to secure their “social safety” before they ship code, and to stop reporting near-misses.
- The Systems Approach: I ask why the system lacked the guardrails to prevent the command (a sketch of one such guardrail follows this list). Result: I fix the infrastructure, and my team feels safe telling me when they almost broke something.
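To make the Systems Approach concrete, here is a minimal sketch of one such guardrail, assuming a hypothetical command wrapper and a DEPLOY_ENV variable that marks production. The destructive-word list and the wrapper interface are illustrative assumptions, not a real tool.

```python
# Hypothetical guardrail sketch: a wrapper that blocks destructive
# commands against production unless the operator explicitly confirms.
# DEPLOY_ENV and the DESTRUCTIVE word list are illustrative assumptions.
import os
import subprocess
import sys

DESTRUCTIVE = {"drop", "delete", "truncate", "terminate"}

def guarded_run(argv: list[str]) -> int:
    env = os.environ.get("DEPLOY_ENV", "dev")
    if env == "prod" and any(word.lower() in DESTRUCTIVE for word in argv):
        answer = input(f"About to run {argv!r} against PROD. Type 'prod' to confirm: ")
        if answer.strip() != "prod":
            print("Aborted: confirmation did not match.")
            return 1
    return subprocess.run(argv).returncode

if __name__ == "__main__":
    sys.exit(guarded_run(sys.argv[1:]))
```

The specifics don’t matter; what matters is that the last line of defense is a property of the system, not of any individual’s attention.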
The “Individual Patch” Fallacy
When I try to “patch” people instead of code, I create a Telemetry Blackout.
Engineers are rational actors. If I make the “cost” of an error a public reprimand or a hit to a performance review, they will stop sending the data. They won’t stop making mistakes, because humans are non-deterministic; they will just stop reporting them.
I effectively trade a recoverable technical error for a permanent cultural leak. By the time I find out there’s a problem, it’s usually too late to fix it.
My Ego Is the Gradient
I’ve realized that the Blame Gradient is usually powered by my own fear. When a project I lead fails, I feel the heat from my own stakeholders. My instinct is to shield myself by pointing to a specific “underperformer.”
But as a leader, I know that blame is a debt I cannot afford. Every time I individualize a failure, I am borrowing against the future transparency of my team. Eventually, the interest on that debt becomes so high that I am the last person to know when reality is diverging from the plan.
How I Patch the Leak
To flatten the gradient, I have to shift from “Accountability” (who gets punished) to “Responsibility” (how we prevent recurrence).
- I start with the mirror: If a junior dev made a catastrophic choice, did I provide the right context? Did I review the design?
- I look past the “First Order” cause: The first-order cause is “The dev pushed bad code.” I set that aside and focus on the Second Order cause: “The CI/CD pipeline didn’t catch the regression.”
- I reward the “Near-Miss”: When an engineer says, “I almost deleted the production DB today,” I don’t cringe. I thank them. That near-miss is the highest-value data I can get to fix a system before it actually breaks (a sketch of that kind of telemetry follows this list).
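To show what rewarding the near-miss can produce, here is a minimal sketch of blame-free telemetry: a hypothetical helper that records what almost broke and which guardrail was missing. The function, fields, and file path are assumptions for illustration; note that there is deliberately no “who” field.

```python
# Hypothetical near-miss log: structured, blame-free telemetry.
# The record captures the system at risk and the missing guardrail,
# and deliberately omits any "who" field.
import datetime
import json

def record_near_miss(system: str, action: str, guardrail_gap: str,
                     log_path: str = "near_misses.jsonl") -> None:
    """Append a blame-free near-miss record to a JSON Lines file."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "system": system,                # what was at risk
        "action": action,                # what almost happened
        "guardrail_gap": guardrail_gap,  # the missing control to build
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_near_miss(
    system="production DB",
    action="ran a destructive query against the wrong host",
    guardrail_gap="no environment check in the query wrapper",
)
```

Because the record has nowhere to put a name, the only thing a report can trigger is a systems fix.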
The Diagnostic
I look at the last Post-Mortem I chaired:
- Did the document contain a person’s name in the “Root Cause” section?
- Did the action items focus on “Training” (fixing the person) or “Automation” (fixing the system)?
If I am fixing people, I am leaking safety. If I am fixing systems, I am building a firewall.
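The diagnostic itself can be automated. Below is a minimal, hypothetical sketch that scans the “Root Cause” section of a markdown post-mortem for team-member names; the heading convention and the roster are assumptions for illustration, not a standard.

```python
# Hypothetical post-mortem linter: flag person names in the "Root Cause"
# section of a markdown post-mortem. The heading format and TEAM_ROSTER
# are illustrative assumptions.
import re
import sys

TEAM_ROSTER = {"alice", "bob", "carol"}  # assumed team-member names

def root_cause_section(text: str) -> str:
    # Everything between a "Root Cause" heading and the next heading.
    match = re.search(r"^#+\s*Root Cause\s*$(.*?)(?=^#+\s|\Z)",
                      text, flags=re.MULTILINE | re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ""

def names_in_root_cause(text: str) -> set[str]:
    words = {w.lower().strip(".,:;") for w in root_cause_section(text).split()}
    return words & TEAM_ROSTER

if __name__ == "__main__":
    offenders = names_in_root_cause(open(sys.argv[1]).read())
    if offenders:
        print(f"Blame leak: {sorted(offenders)} named in Root Cause. Fix the system instead.")
        sys.exit(1)
    print("Root Cause is systems-focused.")
```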