
The Best Way to Improve Software Quality

And it doesn't involve technology

February 4, 2024

A few years ago I was working on a team that owned a web app for our enterprise customers. It had been forked from our original app specifically to meet the needs of larger clients. Despite both apps' shared heritage, they were not similarly reliable. My team’s app build almost never failed, yet the original app build was red a third of the time. Since a red build prevented new deployments of the app, it slowed a lot of people down.

Both apps were maintained by large numbers of software engineers, and the distribution of talent and experience was similar. I banded together with some other frustrated engineers to try to solve the problem.

First we manually investigated each failed build, and categorized and counted the root causes. But apart from flaky tests, we found no pattern. Next we tried bisecting every new build failure and private-messaging the engineer responsible. This helped a little, but many engineers did not respond quickly, and some resented being pinged outside of their working hours. In response we built a bot that would bisect every broken build and announce the name of the perpetrator in a public Slack channel. Yet the broken builds continued.
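For readers who haven't bisected a build before, here is a minimal sketch of the idea in Python. It is not our bot's code: `first_bad_commit`, `build_passes`, and `format_announcement` are hypothetical names, and the sketch assumes the commits are ordered oldest to newest, that the newest one fails, and that once the build breaks it stays broken.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     build_passes: Callable[[str], bool]) -> str:
    """Binary-search for the first commit whose build fails.

    Assumes `commits` is ordered oldest to newest, the newest commit fails,
    and every commit after the first failing one also fails.
    """
    if not commits:
        raise ValueError("no commits to bisect")
    if build_passes(commits[-1]):
        raise ValueError("newest commit passes; nothing to bisect")

    lo, hi = 0, len(commits) - 1  # commits[hi] is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if build_passes(commits[mid]):
            lo = mid + 1  # the breakage was introduced after mid
        else:
            hi = mid      # mid fails, so the culprit is mid or earlier
    return commits[lo]


def format_announcement(commit: str, author: str) -> str:
    """Build the message a bot might post (hypothetical wording)."""
    return f"Build broken by commit {commit} (author: {author})"


# Usage with a fake build check: everything from "d4" onward fails.
history = ["a1", "b2", "c3", "d4", "e5"]
broken = {"d4", "e5"}
culprit = first_bad_commit(history, lambda c: c not in broken)
print(format_announcement(culprit, "someone@example.com"))
```

Note that bisection only works when a build's result is deterministic. A flaky test can send the search down the wrong half and blame the wrong commit.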

Along the way I encountered many questionable code choices. One that stands out was an email send function that received an email object arg. The object could have provided a should_send method, but instead of polymorphism the authors went with an if/else statement with one branch for each email class. It was hundreds of lines long. Picture the poor engineer creating a new email class - what’s one more branch in email send versus refactoring the whole thing?
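To make the alternative concrete, here is a rough sketch of the polymorphic version in Python. Only `should_send` comes from the story above; the class names, fields, and the `deliver` callback are made up for illustration, and the real app was of course more involved.

```python
from dataclasses import dataclass
from typing import Callable


class Email:
    """Base class: each email type decides for itself whether to send."""

    def should_send(self) -> bool:
        return True


@dataclass
class WelcomeEmail(Email):  # hypothetical email class
    recipient_confirmed: bool

    def should_send(self) -> bool:
        return self.recipient_confirmed


@dataclass
class DigestEmail(Email):  # hypothetical email class
    unread_count: int

    def should_send(self) -> bool:
        return self.unread_count > 0


def send_email(email: Email, deliver: Callable[[Email], None]) -> bool:
    """Send the email if it wants to be sent; return whether it was sent.

    There is no branch per email class here, so adding a new class
    never requires touching this function.
    """
    if not email.should_send():
        return False
    deliver(email)
    return True


# Usage: a list stands in for the real delivery mechanism.
outbox = []
send_email(WelcomeEmail(recipient_confirmed=True), outbox.append)
send_email(DigestEmail(unread_count=0), outbox.append)
print(len(outbox))
```

With this shape, the engineer adding a new email class writes one small method next to the class instead of another branch in a function shared by everyone.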

My favorite though was the controller route /crash which threw an unhandled exception. The docs said it was there to “test error logging”.

What was puzzling was that the app's engineers were also adding high-quality commits every day. There just seemed to be an inherent complexity which turned it into a footgun factory. I sensed that we had a people problem. My suggestion that failed builds be tallied by team and reported in the weekly manager meeting was not taken up (by management). Like a hyperactive traffic light, the app build continued to flip-flop between red and green. The problem seemed intractable, so we gave up.

Months later, an engineer heroically separated the troubled app into smaller apps and the problem solved itself. The difference was that whilst our team alone owned the enterprise app, three teams shared ownership of the original. With no one accountable, a tragedy of the commons had taken hold.¹ ²

Notes

  1. The variations of Conway’s Law are also relevant.
  2. This effect has been observed in published research too.

I ignore CI/CD here because a lack of CI/CD doesn’t explain the difference between the two apps' build reliability.

You might be wondering: if every app has a single owner, how do you share code? We use these conventions:

Tags: software-quality code-review ownership