There you are. You’ve built a service that people love! They love it so much, they’ve started to rely on it. Which puts you in a bind, because the love that folks feel? It’s starting to be put at risk by their dependency on your service.
Frankly: folks are annoyed that your service keeps going down. You’ve gotten buy-in to do something about it. That something: come up with a plan to keep the service up.
We’re going to make some assumptions. You haven’t done this formally before, either because you’ve just found yourself in a position of seniority that demands doing so or because this hasn’t been part of your company’s culture and you’re looking to drive some change.
Either way, congrats! Welcome.
We’re also going to assume that you’ve addressed low-hanging fruit. You’ve fixed known failure modes. But. You still aren’t sleeping easy at night. A shift from reactive to proactive is in order.
Here’s the method I like for proactively identifying latent reliability issues in critical services. There are doubtless other methods. This one has served me well, and I hope you find use or inspiration in it.
A word of caution: know what you’re signing up for. Making systems reliable is work. We’re going to write a lot of tickets here. Be prepared. And the more complex the system, the more tickets we’ll write and the harder it’ll be to make sound. Some of the best reliability work you can do is to start from a simple system in the first place.
Lastly, before we jump in: since this is expensive, target the work. Done in depth, this is a multi-day exercise even for small systems. Only perform it, in its full glory, for your most critical paths.
I’m the map!1
What I recommend: start with an architecture diagram. Look, it doesn’t have to be complicated. Boxes for deployed application code / dependencies, cylinders for databases, lil pipe-looking guys for queues. That’s probably all you need. Components connected with arrows2.
We’re after a way to visualize all system components and their interactions. Ideally, this is a simple picture3.
Study the diagram. Know it. Ensure that it’s a correct model of the actual system. We can’t improve something we don’t understand.
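If it helps keep the diagram honest, the same picture can live next to your code as plain data. Here’s a minimal sketch (the component names besides the web server and database are made up for illustration) that complains when an arrow points at a box that isn’t on the diagram:

```python
# Hypothetical system model: component -> the components it depends on.
# Names are illustrative; use the boxes on your own diagram.
SYSTEM = {
    "control plane web-server": ["backing database", "auth service"],
    "backing database": [],
    "auth service": [],
    "ingest queue": ["control plane web-server"],
}

def check_model(system: dict[str, list[str]]) -> None:
    """Complain when an arrow points at a box that isn't on the diagram."""
    for component, dependencies in system.items():
        for dependency in dependencies:
            if dependency not in system:
                print(f"{component} depends on {dependency}, which isn't in the diagram")

if __name__ == "__main__":
    check_model(SYSTEM)
```

Script or no script, the point stands: the picture has to match reality before any of the analysis below is worth doing.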
We’ll be analyzing each system component separately (a system-wide failure is the result of one or more components failing). The method we’re building up starts like this:
for component in system:
Identify system-impacting entities
Next, for each and every system component (box on the architecture diagram), identify all entities (nouns) that:
- can take an action that impacts the component in any form (e.g., customers)
- or for which a state change will impact the component (e.g., a database that fills up, a dependency that returns 5xxs, a network that all traffic passes over, etc.)
Examples:
- Oncall operators
- Customers
- Deployment infra
- Every single other component that the component under study depends on in any way
- Cloud provider
- Service dependency (S3, internal microservice, etc)
- Network
- Developers
- Upstream data integrations
- Malicious attackers4
- Web bots
- External business partner
- Sibling team with admin privileges
- DNS server
- Cert provider
We’re up to:
for component in system:
    for entity in interacts_with(component):
This list of entities? It’s everything that can cause the component under study to fail.
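In code terms, interacts_with from the pseudocode above doesn’t need to be anything clever. A plain lookup table kept next to the diagram does the job; here’s a sketch with entries pulled from the example list above (illustrative, not exhaustive):

```python
# Hypothetical entity inventory: component -> entities whose actions or state
# changes can impact it. Illustrative, not exhaustive.
ENTITIES = {
    "control plane web-server": [
        "oncall operator",
        "customers",
        "deployment infra",
        "backing database",
        "cloud provider",
        "DNS server",
    ],
    # one list per box on the diagram
}

def interacts_with(component: str) -> list[str]:
    return ENTITIES.get(component, [])
```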
Identify interactions and resultant failure modes
Let’s list out every single way that each entity interacts with the system component. E.g.,
component: control plane web-server
entity: oncall operator
interactions:
- checks logs / metrics (ReadOnly role)
- scales up service via console (Admin role)
- terminates running instances via console (Admin role)
- fat-fingers unintended action via console (Admin role)
- runs emergency deployments
- triggers rollbacks
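If you like keeping this inventory in a file next to the diagram, it’s just data too. Here’s the example above as a lookup keyed by (component, entity), which is all the interactions(component, entity) step in the pseudocode further down needs to be (again, illustrative):

```python
# Hypothetical interaction inventory, keyed by (component, entity).
INTERACTIONS = {
    ("control plane web-server", "oncall operator"): [
        "checks logs / metrics (ReadOnly role)",
        "scales up service via console (Admin role)",
        "terminates running instances via console (Admin role)",
        "fat-fingers unintended action via console (Admin role)",
        "runs emergency deployments",
        "triggers rollbacks",
    ],
}

def interactions(component: str, entity: str) -> list[str]:
    return INTERACTIONS.get((component, entity), [])
```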
Every single interaction between an entity and a system component introduces possible failure modes into the system. Let’s identify the failure modes that can come out of these interactions. We’re brainstorming here, casting a wide net. Continuing our example5:
component: control plane web-server
entity: oncall operator
interactions and associated failure modes:
- checks logs / metrics (ReadOnly role)
    - sets up script that does this in automated way, exhausts observability
      API limits, leaves operators blind during incident response
- scales up service via console (Admin role)
    - scales to zero, compromising availability
    - scales way too high, triggering increased load on dependencies and
      associated collapse
    - scales up wrong deployment, compromising separate deployment, leaving
      web-server deployment in compromised state
    - scales up wrong environment, leaving web-server deployment in compromised state
- terminates running instances via console (Admin role)
    - terminates all running instances, compromising availability
    - terminates wrong instances for unrelated service
    - terminates instances in wrong environment
- fat-fingers unintended action via console (Admin role)
    - deletes entire cluster, other scary scary sev-1 events
- runs emergency deployments
    - deploys wrong version, mayhem ensues
- triggers rollbacks
    - rolls back unrollbackable change, mayhem ensues
A quick, partial example on an entity where a state change affects the system component:
component: control plane web-server
entity: backing database
interactions and associated failure modes:
- web server queries database for results
    - database offline for planned maintenance
    - database offline due to deployment
    - database offline for extended periods due to failed deployment
    - database query slow due to incorrect sizing
    - database query slow due to zombie long-running queries
    - database query slow due to postgres deciding to ignore your carefully chosen
      indexes
    - database query slow due to inefficient indexing
    - database query slow due to unbounded data volume
    - database query fails due to exhausted connection count
    - [...]
Of note: here’s a great chance to check the identified failure modes against those noted in real incident reports from your system in the past and from similar systems at your company. Take liberal inspiration; generalize. What entities, interactions, or failure modes have you missed?
We’re up to the following in our procedure:
for component in system:
    for entity in interacts_with(component):
        for interaction in interactions(component, entity):
            failure_modes = brainstorm_failure_modes(interaction)
Identify steps to prevent or mitigate
We’re making great progress! We have a list of failure modes that can happen to our system. Given that we understand the interaction that engendered the failure mode, we’re in a good place to prevent and to mitigate each of them.
For each and every failure mode, this will be a two-part procedure. We’ll want to:
- Figure out how to make sure that the failure mode never comes to fruition.
- Figure out how to limit the impact of the failure mode when it happens anyway.
Let’s take an example:
component: control plane web-server
entity: oncall operator
interaction: scales up service via console (Admin role)
failure mode: scales to zero, compromising availability
prevent:
- Add dedicated ops tooling (CLI, etc) that enforces reasonable limits, set
  beforehand, on scaling.
- Two person rule (2PR). Forbid usage of _any_ destructive manual action against
  prod without someone looking over the operator's shoulder, giving explicit
  approval of each action.
- Forbid console usage for destructive actions. Only allow oncall tool usage.
mitigate:
- Create alarms on running container count
- Create canaries / synthetic monitors that check liveness properties on hot path
- Create a runbook for this particular disaster recovery (likely too fine-grained
  / heavyweight)
- Have expert operators available to be paged in the case of extreme system
  failure (like having all pods deleted) who we trust to correctly restore
  system state.
Mitigating work often takes two other forms:
- Observability – recognizing that the system component has a problem. You can’t fix something you don’t know is broken.
- Decreasing time to resolution – a generalization of the point above. Generally, finding ways of reducing diagnosis time and quickly (and safely!) taking corrective action.
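To make the observability and canary bullets concrete: a synthetic monitor can be embarrassingly small. Here’s a minimal sketch; the endpoint is hypothetical, and in practice this lives in whatever monitoring or canary framework you already run rather than a standalone script. The point is just how little code “check a liveness property on the hot path and page when it fails” requires:

```python
import urllib.request

# Hypothetical health endpoint on the web-server's hot path.
HEALTH_URL = "https://control-plane.example.internal/healthz"

def hot_path_is_alive(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the hot path answers with a 2xx in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, HTTPError, DNS failures, and timeouts all land here
        return False

if __name__ == "__main__":
    if not hot_path_is_alive(HEALTH_URL):
        # Wire this into whatever actually pages a human.
        print("ALERT: control plane hot path is not answering")
```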
At this point, the work’s starting to take shape! This seems like a lot, but there will be a lot of overlap between the work to address different failure modes. The same failure mode will almost certainly show up in multiple places! This becomes an organizational problem: figure out a system that makes sense for you to avoid duplicated work, both in listing failure modes and in tracking the work to address them. When listing failure modes, we cast a very broad net. Now it’s time to consolidate.
for component in system:
    for entity in interacts_with(component):
        for interaction in interactions(component, entity):
            failure_modes = brainstorm_failure_modes(interaction)
            for failure_mode in failure_modes:
                work_to_prevent, work_to_mitigate = think_for_five_minutes_about(failure_mode)
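Fleshed out just enough to run, and with the consolidation step from the previous paragraph bolted on, the whole procedure can look something like this sketch. The inventories reuse the lookup-table shapes from earlier, abridged to one hypothetical entry each so the snippet runs standalone (yours come out of the exercise itself), and the backlog is keyed by work item so one ticket can cover many failure modes:

```python
from collections import defaultdict

# Abridged, hypothetical inventories built up during the exercise.
SYSTEM = ["control plane web-server"]

ENTITIES = {
    "control plane web-server": ["oncall operator"],
}

INTERACTIONS = {
    ("control plane web-server", "oncall operator"): [
        "scales up service via console (Admin role)",
    ],
}

FAILURE_MODES = {
    "scales up service via console (Admin role)": [
        "scales to zero, compromising availability",
    ],
}

# (prevent, mitigate) ideas per failure mode. In real life this is the output
# of the think-for-five-minutes step, not a lookup table.
WORK_IDEAS = {
    "scales to zero, compromising availability": (
        ["forbid console scaling; require ops tooling with floor limits"],
        ["alarm on running container count"],
    ),
}

def run_exercise() -> dict[str, list[tuple[str, str, str, str]]]:
    # Consolidate: key the backlog by the work item, not the failure mode,
    # since the same piece of work often addresses many failure modes.
    backlog: dict[str, list[tuple[str, str, str, str]]] = defaultdict(list)
    for component in SYSTEM:
        for entity in ENTITIES.get(component, []):
            for interaction in INTERACTIONS.get((component, entity), []):
                for failure_mode in FAILURE_MODES.get(interaction, []):
                    prevent, mitigate = WORK_IDEAS.get(failure_mode, ([], []))
                    for work in prevent + mitigate:
                        backlog[work].append((component, entity, interaction, failure_mode))
    return backlog

if __name__ == "__main__":
    for work, covers in run_exercise().items():
        print(f"ticket: {work} (covers {len(covers)} failure mode(s))")
```

The exact shape doesn’t matter. What matters is that each ticket points back at every failure mode it covers, which is what keeps the backlog from ballooning with duplicates.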
Next steps
I’m assuming you work with a people manager or project manager, someone who’s determining work priority. They need to know 1) how expensive each task is, 2) the failure mode’s impact, and 3) its likelihood. When communicating the work coming out of this exercise, it’s useful to come prepared to discuss each of these three points.
During this phase, you (and your team) may elect to accept certain risks, choosing not to prevent or mitigate them. This makes sense. Some of the work coming out of this exercise will be expensive, will address failure modes of limited impact, or will guard against exceedingly rare events. Mitigating the risk of us-east-1 network traffic being completely disrupted, for instance? Extremely expensive, extremely high impact, very rare6. Perhaps, for your workload, that’s a risk worth accepting. That said, it’s still useful to think for five minutes about mitigation rather than immediately reaching for acceptance. Within the specifics of the system, there might be room for non-obvious solutions and simplifications.
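One hypothetical way to come prepared for that conversation: put rough numbers on each proposed piece of work and let a crude score do the initial sorting, with an explicit flag for risks the team has decided to accept. The 1-to-5 scales, the formula, and the sample entries below are all made up; the useful part is writing cost, impact, and likelihood down next to each other.

```python
from dataclasses import dataclass

@dataclass
class ProposedWork:
    description: str
    cost: int        # rough engineering effort, 1 (cheap) .. 5 (very expensive)
    impact: int      # blast radius of the failure mode, 1 .. 5
    likelihood: int  # how often we expect it, 1 (rare) .. 5 (weekly)
    accepted_risk: bool = False  # True if the team decides not to do the work

    @property
    def score(self) -> float:
        # Crude: value of the work per unit of effort. Tune to taste.
        return (self.impact * self.likelihood) / self.cost

# Hypothetical backlog entries for illustration only.
backlog = [
    ProposedWork("alarm on running container count", cost=1, impact=4, likelihood=3),
    ProposedWork("ops tooling with scaling floor limits", cost=3, impact=4, likelihood=2),
    ProposedWork("survive total us-east-1 network loss", cost=5, impact=5, likelihood=1,
                 accepted_risk=True),
]

for item in sorted(backlog, key=lambda w: w.score, reverse=True):
    flag = " (risk accepted)" if item.accepted_risk else ""
    print(f"{item.score:4.1f}  {item.description}{flag}")
```

Don’t let the number make the decision for you; it’s there to start the conversation, not to end it.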
As a last note: I hope this exercise has made painfully clear that the more “stuff” there is in a system, the harder it is to make reliable. While we’ve talked about performing this exercise post-hoc, on existing systems, it’s also worth doing when designing them. In the design phase, think not just about how the system is constructed (traditionally shown on architecture diagrams), but also about what interacts with the system, how it will fail, and how we’ll limit the impact of and recover from those failures when they happen. A rigorous eye for reliability will also improve the overall system design. Your future self, your operators, and your customers will thank you.
- For those disconnected from reality (but somehow not from this blog [we applaud your choices]), there’s this show, “Dora the Explorer” (“Dora l’Exploratrice” en français, which is 100% the normal second language to watch Dora in), that features a talking map. The map? Absolutely adorable. And its tagline? “I’m the map!” (en français: “J’suis la carte!”). It’s a map! Just like the architecture diagrams we’re about to discuss! ↩︎
- I recommend having arrow direction represent “dependency” (not “data flow”), arguably my hottest take, the weirdest hill I’ll die on, and the subject for another post. ↩︎
- “Oooooh”, says you, “this sounds like a security review!” Yes, yes there is overlap. For treating the interactions with this particular reliability-harming entity, I recommend consulting a professional. That is, find your local friendly security engineer. They LOVE talking about this stuff. They’d be very happy to help you work through and think about addressing the risk associated with threat actors. ↩︎
- You probably see additional interactions and/or risks in even this limited example. And being the astute reader that you are, you’re thinking, “Wow, what about reviews of interactions and risks? We could use another pair of eyes here!”. Yes, agree. ↩︎
- Har har har,
  </ joke about us-east-1 reliability>↩︎