Performance Excuses Debunked
There are many legitimate arguments to be had about performance, but there are also plain old excuses. It's time to end the excuses.
Whenever I point out that a common software practice is bad for performance, arguments ensue. That’s good! People should argue about these things. It helps illuminate both sides of the issue. It’s productive, and it leads to a better understanding of how software performance fits into the priorities of our industry.
What's not good is that some segments of the developer community don’t even want to have discussions, let alone arguments, about software performance. Among certain developers, there is a pervasive attitude that software simply doesn't have performance concerns anymore. They believe we are past the point in software development history where anyone should still be thinking about performance.
These excuses tend to fall into five basic categories:
No need. “There’s no reason to care about software performance because hardware is very fast, and compilers are very good. Whatever you do, it will always be fast enough. Even if you prefer the slowest languages, the slowest libraries, and the least performant architectural styles, the end result will still perform well because computers are just that fast.”
Too small. “If there is a difference in performance between programming choices, it will always be too small to care about. Optimal code will only have 5 or 10% better performance at best, and we can always live with 10% more resource usage, whatever resource it happens to be.”
Not worth it. “Sure, you could spend time improving the performance of a product. That improvement might even be substantial. But financially, it’s never worth it. It’s always better for the bottom line to ignore performance and focus on something else, like adding new features or creating new products.”
Niche. “Performance only matters in small, isolated sectors of the software industry. If you don’t work on game engines or embedded systems, you don’t have to care about performance, because it doesn’t matter in your industry.”
Hotspot. “Performance does matter, but the vast majority of programmers don’t need to know or care about it. The performance problems of a product will inevitably be concentrated in a few small hotspots, and performance experts can just fix those hotspots to make the software perform well on whatever metrics we need to improve.”
These are all ridiculous. If you look at readily available, easy-to-interpret evidence, you can see that they are completely invalid excuses, and cannot possibly be good reasons to shut down an argument about performance.
Of course, in order to make such a strong claim, I do have to be specific.
First, when I say performance, I mean the amount of resource consumption a program uses to do its job. CPU time, wall-clock time, battery life, network traffic, storage footprint — all the metrics that do not change the correctness of a program, but which affect how long a user waits for the program to complete, how much of their storage it occupies, how much of their battery life it uses, etc.
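To make those metrics concrete, here is a minimal sketch (in Python, with a made-up workload) showing how two of them, wall-clock time and CPU time, can be measured for the same program, and how they can differ:

```python
# A toy workload measured two ways: wall-clock time (how long the user
# waits) and CPU time (how much processor work was actually done).
import time

def workload():
    total = sum(i * i for i in range(1_000_000))  # pure computation
    time.sleep(0.5)  # waiting: costs wall-clock time, almost no CPU time
    return total

wall_start = time.perf_counter()
cpu_start = time.process_time()
workload()
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall-clock time: {wall_elapsed:.3f} s")
print(f"CPU time:        {cpu_elapsed:.3f} s")
```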
Second, when I say these are completely invalid excuses, I mean just that: they are obviously false when used as an excuse to justify ignoring software performance and dismissing arguments or data.
Importantly, that does not mean you can’t find examples where the basis for the excuse might be true. It is clearly possible to find a codebase that does have its performance concentrated into hotspots. It is also presumably possible to find a company somewhere where performance doesn’t affect their bottom line.
But a situation that sometimes happens does not support the use of a statement as a blanket excuse. For these to be valid excuses that relegate performance to an esoteric concern, they must be true in the common case. They must be true a priori, as things you can know about software in general before you have actually investigated the performance of a particular product or practice.
And the available evidence clearly demonstrates that these excuses are not true in general. To see this, all you have to do is look at the track record of successful software companies. If you do, it immediately becomes clear that none of these things could have been accurate statements about their projects.
For example, take Facebook. It's a huge company. It employs tens of thousands of software developers. It's one of the most valuable corporations on planet earth. And importantly, for our purposes, they are fairly open about what they're doing and how their software development is going. We can easily look back and see what happened to their software projects over the past decade.
Facebook
In 2009, Facebook announced the rollout of a new storage system. The entire rationale for this system was a performance improvement:
It took a “couple of years” for them to develop this system. The reason they gave for spending all this time and effort was that it allowed them to “have 50% less hardware”:
"In terms of cost, if it's twice as efficient, we can have 50% less hardware," said Johnson. "With 50 billion files on disk, the cost adds up. It's essentially giving us some [financial] headroom."
The following year, in 2010, they announced they were “making Facebook 2x faster”:
Why were they doing this? They said they had run experiments — corroborated by Google and Microsoft — that proved users viewed more pages and got more value out of their site when it ran faster:
At Facebook, we strive to make our site as responsive as possible; we’ve run experiments that prove users view more pages and get more value out of the site when it runs faster. Google and Microsoft presented similar conclusions for their properties at the 2009 O’Reilly Velocity Conference.
Was it easy to make Facebook twice as fast? Was it just a few engineers working on some “hotspots”?
Nope. It was an organization-wide effort that took “six months and counting”, and it followed a year and a half of prior performance work:
From early 2008 to mid 2009, we spent a lot of time following the best practices laid out by pioneers in the web performance field to try and improve TTI … By June of 2009 we had made significant improvements … After looking at the data, we set an ambitious goal to cut this measurement in half by 2010; we had about six months to make Facebook twice as fast.
The effort involved the creation of completely new libraries and systems, as well as total rewrites of several components:
Cutting back on cookies required a few engineering tricks but was pretty straightforward; over six months we reduced the average cookie bytes per request by 42% (before gzip). To reduce HTML and CSS, our engineers developed a new library of reusable components (built on top of XHP) that would form the building blocks of all our pages.
…
We set out to rewrite our core interactions on top of this new library, called Primer, and saw a massive 40% decrease (after gzip) in average JavaScript bytes per page.
…
We call the whole system BigPipe and it allows us to break our web pages up in to logical blocks of content, called Pagelets, and pipeline the generation and render of these Pagelets.
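To give a rough sense of what pipelining page generation means, here is a toy sketch in Python. To be clear, this is not Facebook's BigPipe; it only illustrates the general idea of flushing each block of content to the browser as soon as it is ready instead of waiting for the whole page:

```python
# Toy illustration of pipelined page generation (not Facebook's BigPipe).
# Instead of building the whole page and then sending it, the server
# yields each "pagelet" as soon as it is ready, so the browser can start
# downloading and rendering earlier.
import time

def render_pagelet(name, seconds):
    time.sleep(seconds)  # stand-in for a slow data fetch
    return f'<div id="{name}">{name} content</div>\n'

def generate_page():
    yield "<html><body>\n"  # the page skeleton goes out immediately
    for name, cost in [("chat", 0.1), ("feed", 0.5), ("ads", 0.2)]:
        yield render_pagelet(name, cost)  # flushed as soon as it's done
    yield "</body></html>\n"

if __name__ == "__main__":
    # A real server would stream these chunks over the connection.
    for chunk in generate_page():
        print(chunk, end="")
```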
In 2012, Facebook announced they had abandoned HTML5 and had rewritten their entire mobile app to be iOS native:
This was a six-month “ground up” rewrite using the Apple iOS SDK, even though the result “looked nearly identical to the old app”:
Facebook today announced the culmination of more than six months of work, a native version of the Facebook app for iOS that's twice as fast. “Up until now we've looked at scale,” iOS Product Manager Mick Johnson says, “but we've become aware that while we have a great mobile website, embedding HTML 5 inside an app isn't what people expect.” Facebook for iOS 5.0 was built from the ground up using Apple's iOS SDK, and looks nearly identical to the old app…
Why did they take six months to rewrite an entire application without adding any new features? To fix what they called the “app's largest pain points”, all of which were performance problems:
In building a native Facebook app for iOS, the company looked at improving three key places, “the app's largest pain points” all relating to speed: launching the app, scrolling through the News Feed, and tapping photos inside the News Feed.
Were they willing to make sacrifices to get these performance improvements? They absolutely were:
While Facebook for iOS is much faster than it was before, the speed comes with one compromise: the company can no longer roll out daily updates to one of its most popular apps.
In December of the same year, Facebook announced they did the exact same thing for Android, rewriting the application to be native for exactly the same reasons:
Facebook today announced the launch of its new Android app, which ditches HTML 5 “webviews” in favor of native code to speed up loading photos, browsing your Timeline, and flipping through your News Feed.
In 2017, Facebook announced a new version of React called “React Fiber”:
This was a complete rewrite of their React framework. It was meant to be API compatible, so why was this necessary? According to Facebook, the main focus was to make it “as responsive as possible” so that apps would “perform very well”:
The main focus here was to make React as responsive as possible, Facebook engineer — and member of the React core team — Ben Alpert told me in an interview earlier this week. “When we develop React, we’re always looking to see how we can help developers build high-quality apps quicker,” he noted. “We want to make it easier to make apps that perform very well and make them responsive.”
In 2018, Facebook published a paper describing how improving the performance of PHP and Hack became a priority for them, and they had to create increasingly more complicated compilers to get their code to run faster:
The paper describes a number of techniques employed in the compiler to work around the inherent limitations of these languages that make it difficult for compilers to generate fast code.
How much of a performance increase did they get? 21.7%, a percentage which took a “huge engineering effort” to achieve.
In 2020, Facebook announced that it had done another major engineering effort to reduce the footprint of Facebook Messenger by 75%:
How did they do this? By rewriting the entire application from scratch:
But now Facebook has put the iOS version of Messenger on an extreme weight-reduction plan. By rewriting it from scratch, it’s shrunk Messenger’s footprint on your iPhone down to an eminently manageable 30MB, less than a quarter of its peak size. According to the company, the new version loads twice as fast as the one it’s replacing.
How much work did this take? It was apparently a multi-year effort, and was “an even more vast undertaking than Facebook had anticipated”:
Code-named “LightSpeed” and announced at Facebook’s F8 conference in April 2019, the new version was originally supposed to ship last year; completing it was an even more vast undertaking than Facebook had anticipated. VP of Messenger Stan Chudnovsky compares the effort to remodeling a house and discovering new problems when contractors open up the walls: “You can only find stuff that is worse than you originally anticipated,” he says.
Why undergo this massive engineering effort to reproduce the same application in a smaller footprint? Because it was “good business” to do so:
Tweaking an app for sprightly performance isn’t just courteous to the folks who use it; it’s also good business, since it tends to increase usage. “We know that every time we make Messenger faster and simpler, it’s easier for people to communicate and they use it more,” says VP of engineering Raymond Endres.
Just two months later, Facebook announced it was rebuilding the entire tech stack for facebook.com:
Why were they doing this? Because they realized that their existing tech stack wasn't able to support the “app-like feel and performance” that they needed:
When we thought about how we would build a new web app — one designed for today’s browsers, with the features people expect from Facebook — we realized that our existing tech stack wasn’t able to support the app-like feel and performance we needed.
How extensive was the work necessary to rebuild facebook.com? According to Facebook, it required “a complete rewrite”:
A complete rewrite is extremely rare, but in this case, since so much has changed on the web over the course of the past decade, we knew it was the only way we’d be able to achieve our goals for performance and sustainable future growth.
Why Facebook thought rewrites were “extremely rare” is an interesting question, since as we’ve already seen, they appear to rewrite things all the time. But regardless, this rewrite touched a huge cross-section of their technology stack, and they concluded by saying that the work done to improve performance was “extensive” and that “performance and accessibility can't be viewed as a tax on shipping features”:
Engineering experience improvements and user experience improvements must go hand in hand, and performance and accessibility cannot be viewed as a tax on shipping features. With great APIs, tools, and automation, we can help engineers move faster and ship better, more performant code at the same time. The work done to improve performance for the new Facebook.com was extensive and we expect to share more on this work soon.
Finally, we have one of my favorite Facebook announcements regarding performance. This post from 2021 announces a new release of the Relay compiler:
This was a complete rewrite of the compiler, in a completely different language. Why was this rewrite necessary? Because their “ability to incrementally eke out performance gains could not keep up with the growth in the number of queries” in their codebase:
But we haven't discussed why we decided to rewrite the compiler in 2020: performance.
Prior to the decision to rewrite the compiler, the time it took to compile all of the queries in our codebase was gradually, but unrelentingly, slowing as our codebase grew. Our ability to eke out performance gains could not keep up with the growth in the number of queries in our codebase, and we saw no incremental way out of this predicament.
…
The rollout was smooth, with no interruptions to application development. Initial internal benchmarks indicated that the compiler performed nearly 5x better on average, and nearly 7x better at P95. We've further improved the performance of the compiler since then.
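If the “P95” jargon is unfamiliar: it is the 95th percentile, the value that 95% of measurements fall at or below, so it is dominated by the slowest runs rather than the typical one. A quick sketch with made-up numbers:

```python
# Average vs. P95 for a set of made-up compile times, in seconds.
import statistics

samples_s = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 6.5, 1.2, 1.5, 7.9]

average = statistics.mean(samples_s)
p95 = statistics.quantiles(samples_s, n=100)[94]  # 95th percentile

print(f"average: {average:.2f} s")
print(f"P95:     {p95:.2f} s  (95% of runs finish at or below this)")
```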
What's so interesting about this announcement is that it's about a performance rewrite for a compiler, but one of the main reasons[1] the compiler exists in the first place is that it’s needed to improve the performance of apps written with Relay. Relay without the compiler would be too slow, but the compiler itself was also too slow, so they had to rewrite the compiler.
It’s the “nesting doll” of performance rewrite announcements.
Excuses Revisited
Facebook all but tells us directly — over and over and over again — that none of the five excuses apply to their typical product:
If there really was “no need” to worry about software performance — it's always fast enough, no matter what language you pick, no matter what libraries you use — why did they have to do things like rewrite an entire compiler from JavaScript to Rust? They should have been able to use JavaScript and had the compiler just be fast enough. Why did they have to rewrite their entire iOS app using the native SDK? HTML5 should have just been fast enough. Why did they have to undertake a “huge engineering effort” to create new compiler technology to speed up their PHP and Hack code? Hack and PHP should have already been fast enough, right?
If performance improvements were always “too small” to care about, how did they get 2x the performance on their entire site? How did they shrink their executable by 75%? How did they get a 5x performance increase when they rewrote their compiler in Rust? How are they getting these massive, across-the-board performance improvements, if optimization can only ever make an insignificant difference?
If performance wasn’t “worth it” to their bottom line, why is Facebook — a publicly traded company — assigning entire divisions of their organization to rewrite things for performance? Why is a for-profit corporation devoting so much time and energy to something if it doesn’t affect their financial success? Why are they referring to customer research — apparently corroborated by Google and Microsoft — that customers engage more with their product if the performance of the product is higher? Why are they calling it “good business” to rewrite the exact same application from scratch just to get a 75% footprint reduction?
If performance was a “niche” concern, what is the niche? How is Facebook seeing the need for performance optimization and complete rewrites across every single product category? What kind of “niche” encompasses iOS apps, Android apps, desktop web apps, server back-ends, and internal development tools?
If Facebook’s performance problems were concentrated into “hotspots”, why did they have to completely rewrite entire codebases? Why would they have to do a “ground up” rewrite of something if only a few hotspots were causing the problem? Why didn’t they just rewrite the hotspots? Why did they have to rewrite an entire compiler in a new language, instead of just rewriting the hotspots in that language? Why did they have to make their own compiler to speed up PHP and Hack, instead of just identifying the hotspots in those codebases and rewriting them in C for performance?
How are people still taking these excuses seriously? There is no way to explain the behavior of even just this one company, let alone the rest of the industry, if you somehow believe one of these excuses.
Well, I suppose one way to keep believing one of these excuses is to believe that Facebook is unique. That they alone are so unwise, untalented, or unlucky as to have these performance problems, but no one else would.
In other words, you would have to believe that Facebook’s 20,000+ software engineers were a stark departure from the common case, and their codebases were very different from everyone else’s.
What does the evidence say about that excuse?
The Industry At Large
We could instead look at Twitter, who in 2011 announced that they had rewritten their entire search engine architecture because of increased search traffic:
They changed their backend from MySQL to a real-time version of Lucene and replaced Ruby-on-Rails with a custom-built Java server called Blender, all for the stated reason of improving search performance.
The following year they announced they had made an entire system for performance profiling so they could optimize their distributed systems:
In the same year, they also announced extensive optimizations to their front-end, which required undoing a bunch of architecture decisions they had made two years prior which proved to be bad for performance:
In 2015, they announced they completely replaced their analytics platform with a brand new system they wrote from scratch called “Heron”:
Unsatisfied with Heron’s performance, in 2017 they announced they’d done additional low-level optimizations on it:
Apparently those optimizations weren’t enough, because in 2021 they decided to replace Heron completely, along with several other pieces of their core infrastructure, to improve their back-end performance:
Of course we don’t have to stick with Twitter. If you’d prefer Uber, in 2016 they posted an article talking about how they had moved to “Schemaless”, a custom-written datastore:
They claimed this was necessary because if they continued to use their existing solution (Postgres), “Uber’s infrastructure would fail to function by the end of the year”. The move required a complete rewrite of the entire infrastructure, took “much of the year”, and involved “many engineers” from their engineering offices “all around the world”.
Also in 2016, they announced they had written PyFlame, a custom “Ptracing Profiler for Python”:
The first reason they cited for writing their own profiler was that — and I’m not making this up — the existing Python profiler was too slow to use accurately:
The first drawback is its extremely high overhead: we commonly see it slowing down programs by 2x. Worse, we found this overhead to cause inaccurate profiling numbers in many cases.
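That kind of overhead is easy to observe for yourself. Here is a rough sketch (the exact numbers will vary by machine and workload) that times the same call-heavy function with and without a deterministic profiler such as Python's built-in cProfile; a deterministic profiler pays a cost on every function call, which is where this sort of slowdown comes from:

```python
# Compare wall-clock time of a call-heavy workload run bare vs. run
# under cProfile, Python's built-in deterministic (tracing) profiler.
import cProfile
import time

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def workload():
    return sum(fib(n) for n in range(27))

start = time.perf_counter()
workload()
bare = time.perf_counter() - start

profiler = cProfile.Profile()
start = time.perf_counter()
profiler.enable()   # every function call is now intercepted
workload()
profiler.disable()
profiled = time.perf_counter() - start

print(f"bare:     {bare:.3f} s")
print(f"profiled: {profiled:.3f} s ({profiled / bare:.1f}x slower)")
```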
Why did they need a profiler in the first place? Because they wanted to keep their compute costs low:
At Uber, we make an effort to write efficient backend services to keep our compute costs low. This becomes increasingly important as our business grows; seemingly small inefficiencies are greatly magnified at Uber’s scale.
If you'd like an example of what kind of back-end services they had to profile and then rewrite, you need look no further than that new Schemaless datastore they’d announced the previous year:
Apparently they had written the entire thing in Python only to find that Python was too slow. They then had to completely rewrite all the worker nodes in Go for no reason other than to increase performance.
During that same time period, Uber was apparently rewriting their entire iOS application in Swift. This harrowing thread from December 2020 details the series of development disasters caused by that decision:
The entire thread is an amazing read and details some of the heroic efforts required to ship a Swift app at all. Even so, Uber ended up having to take “an eight figure hit” to their bottom line because there was no way to get their Swift app size small enough to allow the inclusion of an iOS 8 binary for backwards-compatibility.
In 2020, Uber announced they were rewriting their Uber Eats app from the ground up in a complete rewrite that took an entire year:
Why was a complete rewrite necessary? They only gave two reasons, and one of them was performance:
The UberEats.com team spent the last year re-writing the web app from the ground up to make it more performant and easier to use.
In 2021, Uber announced another complete rewrite, this time of their fulfillment platform:
This process took two years and was necessary because, according to Uber, “the architecture built in 2014 would not scale”.
I can keep going like this as long as you want. Evidence that performance matters, and that companies are constantly taking measures to improve it, is easy to find at nearly every tech company that shares public information about their development processes. You can find it at Slack…
… at Netflix…
… at Yelp…
… at Shopify…
… LinkedIn…
… eBay…
… HubSpot…
… PayPal…
… SalesForce…
… and of course, Microsoft…
Conclusion
With so much evidence refuting the five excuses, hopefully it is clear that they are ridiculous. They are completely invalid reasons for the average developer, in the common case, to dismiss concerns about software performance. They should not be taken seriously in a professional software development context.
Crucially, the evidence against these excuses comes from some of the largest and most financially valuable companies in the world — companies that software developers actively try to work for, because they offer the industry’s most prestigious and highest-paying jobs. Unless your goal is to be an unsuccessful software developer at an unsuccessful software company, there is simply no support for an expectation that your project won’t be critically affected by performance concerns.
In fact, when considered as a whole, the last two decades would seem to show exactly the opposite of what excuse-makers typically claim. Software performance appears to be central to long-term business interests. Companies are claiming their own data shows that performance directly affects the financial success of their products. Entire roadmaps are being upended by ground-up performance rewrites. Far from what the excuses imply, the logical conclusion would be that programmers need to take performance more seriously than they have been, not less!
That said, as I mentioned at the outset, there are still plenty of arguments to be had. I don’t want to stop the arguments — just the excuses.
For example, one argument would be that the evidence I’ve presented here is consistent with a strategy of quickly shipping “version one” with poor performance, then starting work on a high-performance “version two” to replace it. That would be completely consistent with the evidence we see.
But even if that turns out to be true, it still means programmers have to care about performance! It just means they need to learn two modes of programming: “throw-away” and “performant”. There would still be no excuse for dismissing performance as a critical skill, because you always know the throw-away version has to be replaced with a more performant version in short order.
That kind of argument is great. We should have it. What we should not have are excuses — claims there is no argument to be had, and that performance somehow won’t matter anywhere in a product lifecycle, so developers simply don’t have to learn about it.
That said, if the prospect of learning about software performance sounds like bad news to you, let me leave you with some good news.
Although the five excuses aren’t true about software performance in general, times have changed. If you think achieving good software performance requires hand-rolling assembly language, rest assured: almost nobody does that anymore. That is an incredibly niche, incredibly hotspot thing that there’s almost no need for, where the difference is small, and where it would be very unlikely to be financially worth it!
All five excuses actually are true about hand-rolled assembly today! And that wasn’t true about hand-rolled assembly in, say, the 1980s.
So the good news is that software performance today is not about learning to hand-write assembly language. It’s more about learning to read things like assembly language, so you can understand how much actual work you are generating for the hardware when you make each programming decision in a higher-level language. It’s about knowing how and why language A will be less efficient than language B for a particular type of program, so you can make the right decision about which to use. It’s about understanding that different architectural choices have significant, sometimes severe consequences for the resulting work the CPU, network, or storage subsystem will have to do, and carefully avoiding the worst pitfalls of each.
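As a small illustration of that skill (using Python bytecode as a stand-in for assembly, since the principle of reading what your code turns into is the same), the standard dis module can show the extra per-iteration work one source-level decision generates compared to another:

```python
# Use Python bytecode as a stand-in for assembly: the dis module shows
# the work the interpreter will do for each source-level decision.
import dis

def index_sum(values):
    total = 0
    for i in range(len(values)):  # index-based loop
        total += values[i]        # extra subscript lookup every iteration
    return total

def direct_sum(values):
    total = 0
    for v in values:              # direct iteration over the values
        total += v
    return total

# The indexed version's loop body contains additional instructions
# (loading the list, loading the index, subscripting) on every pass.
dis.dis(index_sum)
print("-" * 40)
dis.dis(direct_sum)
```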
Although it does take some time to learn the skills necessary to make good performance decisions, nowadays it is a very achievable goal. It does not take several years of hand-writing assembly code like it used to. Learning basic performance-aware programming skills is something a developer can do in months rather than years.
And as the evidence shows, those skills are desperately needed at some of the largest and most important software companies in the world.
[1] The other reason Facebook gives for needing a compiler is “stability”. According to their blog:
Relay has a compiler in order to provide stability guarantees and achieve great runtime performance.