About

The endpoint is too damn slow, the user’s workflow takes too long, your builds are slowing down your deployments. Congrats! You’ve got yourself an optimization problem. Along with writing simulations, solving optimization problems is my favorite technical work. They’re exciting! When working with an unoptimized system, you can often hit an improvement of 2-3 orders of magnitude. The process[1] is easy and simple.

Instrument

First, you need to know where to optimize. This means numbers.

Optimizing intra-service code? Hook up a profiler. Get it to spit out flame graphs[2]. We love flame graphs. Ideally, you can run this locally using a prod-parity setup. You can also sometimes run profilers remotely against dev machines. Reasonable minds can differ on manual profiling against prod. There’s a lot of “it depends” based on your setup. In general, I’d rather you didn’t. Why can’t you profile against your local machine? Against dev? Is it a lack of prod-parity in those environments? I’d rather we invest time in fixing that lack of parity than introduce untested tooling into a customer-serving environment[3].
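
For concreteness, here’s a minimal sketch of hooking up a profiler in Python, using the standard-library cProfile module; `handle_request` is a hypothetical stand-in for whatever slow path you’re chasing. (For actual flame graphs, a sampling profiler like py-spy can attach to a running process and render one as an SVG.)

```python
# Minimal sketch: profiling a suspect code path with Python's built-in cProfile.
# `handle_request` is a hypothetical stand-in for the slow endpoint's work.
import cProfile
import pstats

def handle_request():
    # Placeholder workload; in real life this is the code you're optimizing.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Save a dump for later analysis, and print the ten most expensive calls.
profiler.dump_stats("profile.out")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```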

Optimizing inter-service code – where you care about a call chain over multiple services? Find the metrics or tracing that will allow you to see which service or “connective tissue” between services (a message queue, TCP, etc.) is taking the longest.
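
If you’re starting from zero on tracing, here’s a sketch of what one hop looks like with OpenTelemetry’s Python API. The span name, attribute, and service URL are all hypothetical, and exporter/SDK setup is assumed to happen elsewhere.

```python
# Minimal sketch: wrapping a downstream call in an OpenTelemetry span so the hop
# shows up in your tracing backend. The URL and span/attribute names are made up;
# exporter configuration is assumed to be done elsewhere.
from opentelemetry import trace
import requests

tracer = trace.get_tracer(__name__)

def fetch_inventory(item_id: str) -> dict:
    # The span's duration is this hop's contribution; the trace as a whole shows
    # which service or piece of connective tissue dominates the call chain.
    with tracer.start_as_current_span("inventory-service.get_item") as span:
        span.set_attribute("item.id", item_id)
        response = requests.get(f"https://inventory.internal/items/{item_id}", timeout=2)
        response.raise_for_status()
        return response.json()
```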

Don’t yet have a profiler or metrics? Do something about it. It’s 2025, at the time of writing. There are lots of good profilers. Heck, even your browser ships with a profiler that produces lovely flame graphs. Add service-level metrics or tracing. For service-level analysis, a nice graph in Grafana/CloudWatch/observability-tool-of-choice that shows aggregate timing breakdowns will work fine. You can also figure out a CloudWatch Logs Insights/Kibana/Loki query that will produce that same breakdown from logs/traces. A little bit of observability here goes a long way.
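
As a sketch of the “little bit of observability” version: emit a structured timing log per operation, then aggregate. The field names and operations below are made up; the idea is that a query along the lines of `stats avg(duration_ms) by operation` (CloudWatch Logs Insights syntax), or its Kibana/Loki equivalent, turns these lines into the breakdown you want.

```python
# Minimal sketch: structured timing logs you can aggregate later.
# Field names ("operation", "duration_ms") and the operations timed are arbitrary.
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timings")

@contextmanager
def timed(operation: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info(json.dumps({"operation": operation, "duration_ms": round(duration_ms, 2)}))

# Usage: wrap the chunks you want an aggregate breakdown of.
with timed("load_user"):
    ...  # fetch the user
with timed("render_response"):
    ...  # build the response
```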

Do not, however, make the mistake of attempting to go further without instrumentation. This is where wasted effort comes from. Without instrumentation, you come up with a Grand Theory of How Our System Is Operating. I can tell you right now: that theory is wrong. Systems are humbling! There are all sorts of hidden configurations and feedback loops that we don’t understand. You’ll spend much less time adding instrumentation and listening to it than you will on the engineering equivalent of throwing darts in the dark. You know there’s a dartboard. You’re pretty sure of where the board and bullseye are! But. You’ll have a much easier time if you just turn on the light.

Identify hot spots

Now’s the easy step. Read the results of your instrumentation! What’s taking the longest? Be single-minded here. Note findings that you think are “interesting”, but stay focused on the hot spot that’s taking the absolute longest.

Study that hot spot. What mechanism is driving the hot spot? Is it an inefficient algorithm or data structure? Is it an inappropriate call pattern of some sort? Or is it just that calling some primitive function is more expensive than you expected?
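
One way to get at that mechanism question, sticking with the earlier cProfile sketch: sort the same dump two ways. Cumulative time points at the call trees that dominate end-to-end, while total time plus call counts tells you whether you’re looking at one expensive primitive or a cheap function being hammered by a bad call pattern. (`profile.out` is the hypothetical dump from the earlier sketch.)

```python
# Minimal sketch: reading a saved profile two ways to characterize the hot spot.
import pstats

stats = pstats.Stats("profile.out")  # hypothetical dump from an earlier run

# Cumulative time: which call trees dominate end-to-end latency.
stats.sort_stats("cumulative").print_stats(5)

# Total time (plus the ncalls column it prints): an expensive primitive, or an
# inefficient call pattern hitting a cheap function millions of times?
stats.sort_stats("tottime").print_stats(5)
```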

Eliminate

Once you understand the mechanism, you have a good idea of how to eliminate it. Use a better algorithm or data structure. Change your call pattern. Avoid that particular primitive function. Here’s where you get to be creative. Implement your good idea.
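
To make “better data structure” concrete, here’s a toy before/after of one of the most common fixes: membership tests against a list versus a set. The function and field names are invented for illustration.

```python
# Minimal sketch of a classic data-structure fix. `events` and `blocked_ids`
# are hypothetical; the point is the complexity change, not the domain.

def filter_events_slow(events, blocked_ids):
    # Membership test against a list is O(len(blocked_ids)) per event,
    # so O(n * m) overall: exactly the kind of thing a flame graph flags.
    return [e for e in events if e["user_id"] not in blocked_ids]

def filter_events_fast(events, blocked_ids):
    blocked = set(blocked_ids)  # build once; lookups are O(1) on average
    return [e for e in events if e["user_id"] not in blocked]
```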

Repeat

Go back to instrumenting. Perhaps your performance is now acceptable! If so, congrats! Take off early, go eat a burrito. If not, go back to reading results, finding the worst hot spot, and eliminating it.
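
And if you want “perhaps your performance is now acceptable” to rest on a number rather than a feeling, a quick harness like this works alongside the real instrumentation; the two functions are hypothetical stand-ins for the code before and after your change.

```python
# Minimal sketch: re-measuring after a fix. The two implementations are
# hypothetical stand-ins for "before the change" and "after the change".
import timeit

def before_fix():
    out = ""
    for i in range(10_000):
        out += str(i)  # repeated string concatenation
    return out

def after_fix():
    return "".join(str(i) for i in range(10_000))

for fn in (before_fix, after_fix):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1000:.2f} ms per call")
```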

Conclusion

A few iterations through this cycle on an unoptimized system will quickly cut through orders of magnitude of inefficiency. It feels good! It feels productive! All of a sudden you’re saying lovely things like, “Single-digit millisecond latencies”, or “We cut our fleet size to a tenth of what it was previously”, or you find yourself eating that burrito, basking in the silence of your contented customers.


  1. revealed here à la “Big Tech hates him: local staff engineer reveals one weird trick to optimize everything” ↩︎

  2. https://www.brendangregg.com/flamegraphs.html ↩︎

  3. Automated profiling against prod, however, I generally support. I’m talking about tools like Amazon’s CodeGuru Profiler, which you wire up once, deploy to every environment, and which then spits out flame graphs all the time. No introducing surprise, untested load to customer-serving environments. By all means, measure its performance overhead, but oh boy, there are stories I can tell you about these tools being useful[4]. Plus, who doesn’t want flame graphs all the time? We love flame graphs. ↩︎

  4. Oh look! We’re in a footnote labyrinth[5]! That’s fun! Okay, here’s one of the stories. There I was. On vacation. I come back. One of our sibling teams had a major performance problem. Things were Slow. As a result, we had scaled out to Too Many Machines. While I was in Paris, watching sunsets from Montmartre, the team was frantically trying to understand where this regression had come from. All the usual suspects had been eliminated – no change in application code, etc. Balmy, relaxed from the time away, I asked, “So we’ve looked at flame graphs, right?” We hadn’t, so with the team, we pulled ’em up. In CodeGuru, you can diff flame graphs from different date ranges. We did so, selecting one range from before the regression and one from after. Looking closely, we found the culprit: InetAddress.getByName. This was all our SysDE friends needed to dive into DNS settings and find a deeply cursed issue with DNS host round-robining. Total investigation time looking at flame graphs: maybe ten minutes. We had the issue resolved later that day[6]. ↩︎

  5. https://xkcd.com/1208/ ↩︎

  6. Okay, good luck getting out of the labyrinth. ↩︎