So you’ve got a known bug in your system. It’ll take a week or two to fix, but that hasn’t happened yet. You can be sure that when you go oncall, you’ll be woken up by it 2-3 times. There are a couple of common approaches to this situation.
Option 1: silence the alarm
The worst approach: change your alarms to ignore the bug [1]. Take the batteries out of the smoke alarm. This behavior should land the engineer who does it on the uncomfortable side of some very hard conversations. They are actively degrading system health and confidence.
This is a toxic loop. Initial system instability results in further degraded system state. And so on.
Option 2: keep quiet
Another common approach: do nothing. Wake up at 3am. Wait for the alarm to pass or take the known manual action to get out of alarm. Go back to bed. Sleep fitfully, disrupted by that hit of adrenaline and cortisol. Maybe mention “Oh I got paged” in standup, but don’t make a big deal out of it.
Over time, this approach results in unhappy engineers, fostering oncall burnout. System instability degrades individual health. This is not sustainable. Here lies decreased morale and attrition.
Option 3: complain
The third: complain. Same as above, but in addition, pull your colleague aside and complain. If we’re being honest, that conversation is probably as much about the manager as it is about the bug. As in, “We never work on oncall issues”. System instability degrades team health and morale. Vilifying the manager makes productive conversation harder when this bubbles over into overt conflict, hurting team health even further.
Option 4: close the feedback loop
The problem with all of the above approaches is a broken feedback loop. In each case, the manager who can allocate team resources isn’t aware of the bug or its severity. Your manager cannot read your mind. They cannot know your pain unless you tell them. Here they’re allocating resources elsewhere because they don’t know that anything’s wrong or what it’s costing their team.
This leads us to the “if you see something, say something” approach. Namely, each time something causes you pain, find the person who can do something about it and let them know just how bad it is. It is polite and professional to do so. Keeping silent deprives the manager of the information they need to do their job.
Effective feedback
Let’s talk about how to deliver this feedback effectively. First, don’t be a jerk. I’m generally grumpy after not sleeping, but a raised temper here isn’t productive. Focus on the facts. Especially numbers. Be explicit. Doing so helps make sure that the feedback will be well received. Avoid judgments, which have a tendency to distract and to put the listener on the defensive [2].
As for the content, focus on two things: 1) the information the manager needs to assess severity and 2) the effort needed to fix. For severity, this means customer impact (how many customers, for how long), cost to the company, horribleness of the band-aid hack/workaround, length of oncall engagement, expected frequency, expected future impact, etc. Any badness that you can quantify. For effort, hopefully you have some ideas here. If not, find the tech lead / system owner and talk with them.
Bad: “The alarm fired last night”.
Good: “I got woken up at 3am by the alarm. Waiting for recovery took thirty minutes. One hundred customers were unable to access our endpoint during those thirty minutes. I only slept four hours total. This is the third time this week. We expect the issue to keep happening at that frequency. After talking with Jenny, it looks like fixing will take one engineer about one week”.
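If it helps to report the same numbers every time, the “good” message above can be reduced to a small template. This is only a sketch: the `PageReport` class, its field names, and the “customer-minutes” figure are hypothetical conveniences, not part of any tool, and the values are taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class PageReport:
    """One oncall page, reduced to the numbers a manager needs (hypothetical fields)."""
    woken_at: str                # local time of the page, e.g. "3am"
    recovery_minutes: int        # how long until the alarm cleared
    customers_affected: int      # customers unable to reach the endpoint
    occurrences_this_week: int   # count including this page
    fix_engineer_weeks: float    # effort estimate from the system owner

    def customer_minutes(self) -> int:
        # Rough severity figure (an assumption): customers times minutes of impact.
        return self.customers_affected * self.recovery_minutes

    def summary(self) -> str:
        return (
            f"Paged at {self.woken_at}; recovery took {self.recovery_minutes} minutes. "
            f"{self.customers_affected} customers affected "
            f"({self.customer_minutes()} customer-minutes). "
            f"Occurrence {self.occurrences_this_week} this week. "
            f"Estimated fix: {self.fix_engineer_weeks:g} engineer-weeks."
        )


# Values from the "good" example: 3am page, 30 minutes, 100 customers,
# third time this week, one engineer for about one week.
report = PageReport("3am", 30, 100, 3, 1.0)
print(report.summary())
```

The point is not the code itself but the checklist it encodes: if you can’t fill in one of the fields, that’s the number to go find before talking to your manager.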
Do this each and every time. Let’s say you told your manager about the bug happening last week. They’re busy with quarterly planning and don’t have this particular issue front and center in their mind. Help them have the same view of the system that you do. Without reporting pain each and every time, you give your manager a false sense of the severity of an issue.
The tech lead’s role
You can help engineers on your team do this. If you’re a tech lead, that’s your responsibility. You’re watching oncall tickets, and you also have a sense of the team’s pain points. If you notice what looks like someone having a bad time, check in with them. Depending on the situation, perhaps you coach them on how to surface the issue. Perhaps you advocate for them. Perhaps it lands somewhere in between.
Have your teammate’s back. Full stop. This is an area where you can actively help and support.
The manager’s role
Managers bear responsibility here as well. Namely, for listening and seeking to understand, especially when engineers who are still learning to give feedback fall short. They bear responsibility for taking that feedback seriously, for being transparent around competing priorities, for keeping their team’s trust, and for prioritizing pain points. They bear responsibility for creating an environment where engineers feel comfortable speaking up and speaking frankly.
Generalizations
We’ve focused here on oncall pain. The points here, however, generalize to all pain: security updates shutting down your computer during the work day, CI systems taking longer and longer to build, infra-level issues.
The teams responsible for that pain need to know what’s happening. They can’t know your pain, and they can’t know how bad it is unless you tell them. And keep telling them.
The advice above isn’t rigid. It can be adapted for other circumstances. For slow build times, you might not ping on every single build, but pinging once a week? Totally reasonable. You may not be able to provide a level of effort to fix, but you can absolutely provide estimates of hours of lost productivity per dev per week. Tailor to the circumstance and stay persistent.
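The lost-productivity estimate is simple arithmetic. Here is a minimal sketch, where the function name and every input (builds per day, extra minutes per build, team size) are made-up numbers for illustration; substitute your own measurements.

```python
def lost_hours_per_dev_per_week(builds_per_day: float,
                                extra_minutes_per_build: float,
                                workdays_per_week: int = 5) -> float:
    """Hours each developer loses per week to the build slowdown."""
    return builds_per_day * extra_minutes_per_build * workdays_per_week / 60


# Hypothetical inputs: 8 builds a day, each 6 minutes slower than it used to be.
per_dev = lost_hours_per_dev_per_week(builds_per_day=8, extra_minutes_per_build=6)
team = per_dev * 10  # assumed team of 10 developers
print(f"{per_dev:.1f} hours per dev per week, {team:.0f} hours across the team")
```

With these inputs, that’s 4 hours per developer per week, or 40 hours across a team of ten. A figure like that is much easier for the owning team to prioritize than “builds feel slow”.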
Positive effects
Saying something when you’re in pain helps your organization make intelligent decisions. A manager who hears that “something’s bad” and then acts on that feedback also helps build team confidence and cohesion. Finally, saying something about pain helps reduce the individual’s perceived pain.
That is: making pain visible gives it the chance to be properly addressed, helping your organization, your team, and yourself.
[1] This is a valid approach when the “bug” isn’t in fact a bug. Often, when launching a new system, some of the alarm thresholds are provisional. You aren’t quite sure of them. It’s commonplace in the first month or two of a new system to need to relax a couple of thresholds that were set too tightly.
[2] There are many books written on the skill of having difficult conversations. They treat all of these topics in much greater depth than I could ever hope to achieve in a single paragraph. These are useful to check out.