I’ve spent most of my career maintaining web services. That is, code that others depend upon, running on servers somewhere (often owned and maintained by a third party[1]). I’ve had the good fortune to be able to inspect running services and to take action to correct problems[2]. This might look like restarting a “stuck” server, deploying an emergency fix, adding more servers to accommodate increased load, etc.

Being able to take manual action against a running system is a double-edged sword. On one hand, you can fix problems. On the other: you can cause them.

And oh boy, is it easy to cause them. Remember the great S3 outage of 2017[3]? Famously caused by manual action. The Great Facebook Outage of 2021? Also manual action[4]. Examples abound.

A knee-jerk response here is to try to get rid of operators — those human beings entrusted with managing a running service[5]. Ensure the system “can’t go down”. This is a worthy goal. How can we make our system as self-healing as possible and minimize the need for manual intervention?

Attempting to do so, however, comes with a tradeoff. The cost of the level of reliability and testing needed to “completely” eliminate operator intervention in a modern web service (or even critical parts of such a service) is astronomical.

So. No one does that[6].

There are always operators. And those operators are entrusted with the power to take destructive action (either foreseen or arbitrary) against the running service. Now we do our best, of course, to remove the need for foreseeable interventions and to make any remaining intervention as safe as possible. We do this with scripts and tooling, runbooks of blessed and tested sequences of manual action, and roles of elevated but not arbitrary scope.
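To make “elevated but not arbitrary scope” concrete, here’s a minimal sketch of what a blessed, narrowly scoped runbook action might look like as tooling: a restart script that only knows about an explicit allowlist of services. The service names and the systemctl-based restart are stand-ins, not the details of any real deployment.

```python
# Hypothetical sketch of a "blessed" runbook action with a deliberately narrow scope.
# The service names and the restart mechanism are placeholders for illustration.
import subprocess
import sys

# Elevated but not arbitrary: the operator can restart these services and nothing else.
ALLOWED_SERVICES = {"frontend", "checkout", "search"}


def restart_service(name: str) -> None:
    """Restart a single allowlisted service, refusing anything outside the blessed set."""
    if name not in ALLOWED_SERVICES:
        raise SystemExit(f"refusing to touch {name!r}: not in the blessed allowlist")
    # The actual restart mechanism is environment-specific; systemctl stands in here.
    subprocess.run(["systemctl", "restart", name], check=True)
    print(f"restarted {name}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise SystemExit(f"usage: {sys.argv[0]} <service>")
    restart_service(sys.argv[1])
```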

Should an emergent issue requiring manual action lend itself to being solved with prepared tools or runbooks running under an elevated but not arbitrary permission set, we pat ourselves on the back and admire our foresight.

If that’s not the case, and an operator is paged in to resolve an emergent, novel issue, we have an extremely elevated risk profile. Chances are the engineer will need to 1) use elevated permissions and 2) take untested action against the system. You see the problem.

How do we minimize the risk associated with a just-paged-in, sleep-deprived, stressed engineer using God Mode to make arbitrary changes against our system? The answer that was hammered into my skull at AWS: don’t go it alone.

The two-person rule (2PR) — obtain another operator’s explicit approval before running any destructive action against prod.

The principle here is the same as the one behind code reviews — two pairs of eyes are better than one. Your partner may well spot non-obvious issues in the proposed intervention. Together, y’all can figure out a safer approach.
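To make the rule concrete, here’s a minimal sketch of one way 2PR might be enforced in ops tooling: a wrapper that refuses to run a destructive command until a second, different operator signs off at the prompt. Everything here is illustrative; real tooling would tie approval to actual authentication and an audit trail rather than a typed-in username.

```python
# Hypothetical sketch of a two-person rule (2PR) gate around a destructive command.
# The approval mechanism (typing a username at a prompt) is illustrative only;
# real tooling would authenticate the approver and record the decision.
import getpass
import shlex
import subprocess
import sys


def second_operator_approves(command: str) -> bool:
    """Show the exact command and require a different operator to approve it."""
    print(f"About to run against prod:\n    {command}")
    approver = input("Second operator, type your username to approve (blank to reject): ").strip()
    requester = getpass.getuser()
    if not approver or approver == requester:
        print("Rejected: a *different* operator must sign off.")
        return False
    print(f"Approved by {approver} (requested by {requester}).")
    return True


def main() -> int:
    if len(sys.argv) < 2:
        print(f"usage: {sys.argv[0]} <destructive command ...>")
        return 2
    command = " ".join(shlex.quote(arg) for arg in sys.argv[1:])
    if not second_operator_approves(command):
        return 1
    return subprocess.run(sys.argv[1:]).returncode


if __name__ == "__main__":
    sys.exit(main())
```

Saved as, say, two_person_rule.py, the operator would run the destructive command through the wrapper with the second operator on the call; the point is simply that the tooling forces the conversation to happen before the command does.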

What about those foreseen actions (e.g., runbooks) run through tightly controlled ops tooling? Do those need to be run under 2PR? In the general case, I don’t know. What’s the level of automation, the opportunity for operator error, the risk, your risk tolerance, the availability of additional operators, etc., etc.? This is a gray zone where a team discussion of individual cases makes sense.

What about when there’s not a second operator available? Again, in the general case, I don’t have a good answer. Why can’t there be a second operator available? What’s our risk tolerance, and what level of resources are we willing to dedicate to minimizing risk? Regardless, if you’re using the two-person rule to mitigate the risk of operator intervention, it’s worth formalizing the plan around resourcing in a moment of calm rather than attempting to wing it in a moment of stress.

I don’t know of hard data showing that the two-person rule is effective in mitigating the risk of operator intervention. If you do, I’d love to hear about it. The best I can do is cite my personal experience of coworkers having better ideas than mine, and cite the rule’s usage in mature technical organizations. Argument by anecdata and by appeal to authority? Perhaps. But I wouldn’t lightly dismiss the very pragmatic lived experience and hard-earned lessons of operators over the years. That’s something to consider, interrogate, and apply if it makes sense.


  1. A.k.a. “the cloud”. ↩︎

  2. Those environments where you can’t do this are tightly controlled, differ from the sort of web service under study here, and are out of scope for this post. ↩︎

  3. Want to feel old? That was almost a decade ago. If you do indeed remember it, then welcome, old timer. I’m betting you also remember 9/11, dial up, and probably already know the 2PR rule. Thanks for referring to the summary here. ↩︎

  4. https://en.wikipedia.org/wiki/2021_Facebook_outage. “During maintenance, a command was run to assess the global backbone capacity, and that command accidentally disconnected all of Facebook’s data centers.” The keen reader will note the use of passive voice. ↩︎

  5. We can split hairs here. I use “operator” to mean any engineer responsible for the upkeep of a running service. I use “oncall” to refer to that operator currently carrying a pager. For simplicity — and given that we’re concerned about anyone modifying a service — we stick with just “operator” for this post. ↩︎

  6. If you have counterexamples in the context of modern web services, please see the contact page. ↩︎