Performance-Aware Programming Series Begins February 1st
This will be the first paid-subscriber series on Computer, Enhance!
How slow is modern software?
In the modern software era, it’s common to have a literal dollar value attached to the speed of the programs you write. If your code runs on a server “in the cloud”, your monthly bill is directly proportional to your software’s performance. The math is simple: the slower your software, the more server time (or servers in general) you have to buy.
Yet despite more economic incentive than ever to write reasonably fast code, software performance has gotten steadily and dramatically worse. A “very slow” program in the 1990s might take two times, five times, or maybe (in the worst offenders) ten times longer to do something than the hardware was fundamentally capable of. Today you routinely see software running thousands or even tens of thousands of times slower than it should.
These multiples are not accidents. They stem directly from underlying characteristics of modern hardware, each of which creates an order-of-magnitude decrease in performance when ignored. Ignore them all, and a 1000x-slower program is all but guaranteed.
Performance-Aware Programming is my new series designed to teach you why modern software is so slow.
Not in the philosophical sense, since that’s open to debate, but in the technical sense, which is well defined and open to direct measurement.
Beginning February 1st here on this Substack, Performance-Aware Programming is a multi-month course designed to teach you everything you need to know to understand why programs are slow, and to give you the knowledge you need to improve their performance substantially.
The course is based on the idea that most programmers incorrectly assume it to be extremely difficult and time-consuming to write programs faster than the 1000x-slower versions they’re currently writing. This is an understandable thing to think: in the old days, “optimizing” a program meant writing meticulous machine-level code in assembly language, leveraging detailed knowledge of precise hardware behavior applicable only to a specific make and model of CPU. Few people were skilled enough to do it, and it took a long time.
Even today, that sort of programming would be necessary to get 100% of the performance out of a specific (modern) CPU. But we don’t need 100% of the performance to make dramatic improvements to a program running 1000x slower than it should! Even settling for a full 10x slower than the hardware would still be a massive 100x performance boost!
Every programmer can learn how hardware works, and use that knowledge to write faster programs, even if they never write a single line of assembly language.
Building a mental model of how a CPU performs operations, and how much time it takes to perform those operations, does necessitate understanding assembly language to a certain extent. Unlike higher-level languages, assembly language provides a rough transcription of what work the CPU will actually do. But unless you are doing extremely specific optimizations, you rarely have to actually write any assembly language.
Rather, the important skill is learning to read assembly language, and understanding the mapping between the high-level code you write and the assembly language your compiler automatically generates from it. Gaining this understanding takes some concentrated effort, but you only need to do that hard work once. After you do, it becomes straightforward to assess the basic performance characteristics of all your programs, as well as to look into unexpected performance issues when they arise.
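To make that mapping concrete, here is a minimal sketch of the kind of correspondence we’ll study: a trivial C loop, followed by one plausible x64 translation written out as a comment. The assembly is illustrative, not actual compiler output; real output varies with the compiler and optimization settings, and this listing assumes the Windows x64 calling convention (first argument in rcx, second in edx, return value in eax).

```c
// A trivial C loop, to illustrate how high-level code maps to x64 assembly.
int SumArray(int *Values, int Count)
{
    int Result = 0;
    for(int Index = 0; Index < Count; ++Index)
    {
        Result += Values[Index];
    }
    return Result;
}

/* One plausible x64 translation (illustrative, not real compiler output).
   Assumes the Windows x64 convention: rcx = Values, edx = Count.

       xor   eax, eax                    ; Result = 0
       xor   r8d, r8d                    ; Index = 0
       test  edx, edx                    ; is Count <= 0?
       jle   done                        ;   if so, skip the loop entirely
   top:
       add   eax, dword ptr [rcx+r8*4]   ; Result += Values[Index]
       inc   r8d                         ; ++Index
       cmp   r8d, edx                    ; Index < Count?
       jl    top                         ;   if so, loop again
   done:
       ret                               ; Result is returned in eax
*/
```

Even in a listing this small, you can see the anatomy we’ll be working with: a load, an add, and the compare-and-branch that controls the loop.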
Again, it’s never simple or straightforward to maximize the performance of a program. But you will be pleasantly surprised by how simple it is to be performance-aware while you’re programming in any language. Once you know how modern CPUs work, avoiding the pitfalls that lead to 1000x-slower code becomes straightforward, as does analyzing basic performance problems when they crop up.
The entire series will be based around a simple real-world example.
Our approach is going to be very straightforward: we’re going to take a Python snippet I based on a real Stack Overflow question, from an actual programmer who ran into a real-world performance problem. We’re going to measure its performance as-is. We’re going to break down each individual piece, and estimate the maximum performance our CPUs should be able to achieve on that workload. We’re going to purpose-build a C program designed to reach that maximum, and see how close we get.
Then we’re going to look at the difference between the C program and the Python program, and assess each piece to see how much of the performance gap it accounts for. We’ll look at alternatives in Python that could get us closer to the C performance, as well as how much performance we might lose from the C version by making it as flexible as the Python version.
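To give a flavor of that comparison, here is a hypothetical stand-in (not the actual Stack Overflow snippet from the course): a loop that sums the squares of a list of numbers, shown as Python in the comment and as equivalent C below it. The gap between the two comes from everything the interpreter does per element that the C version doesn’t.

```c
/* A hypothetical Python original (not the course's real snippet):

       total = 0
       for x in values:
           total += x * x

   Every pass through that loop pays for interpreter dispatch, dynamic type
   checks, and boxed integer objects. The C equivalent below does the same
   arithmetic in a handful of machine instructions per element. */
#include <stddef.h>

long long SumOfSquares(int *Values, size_t Count)
{
    long long Total = 0;
    for(size_t Index = 0; Index < Count; ++Index)
    {
        long long Value = Values[Index];
        Total += Value * Value;
    }
    return Total;
}
```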
Along the way, we’ll continually focus on learning a mental model of modern CPU performance. We’ll look at lots of assembly language snippets derived from the problem, learn how to quickly read them, and learn how to map them to micro-operations (the work the CPU actually does) whose performance characteristics we can look up directly. We’ll learn how the memory hierarchy works, and how caches can affect how micro-operations behave.
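As a small preview of the cache side of that model, here is a hedged sketch: two C functions that do identical arithmetic but touch memory in different orders. On typical hardware the row-order version is several times faster once the matrix outgrows the cache, because it walks memory sequentially and uses every byte of each cache line it fetches; the exact ratio depends on the CPU.

```c
#include <stddef.h>

#define DIM 4096  // large enough that the matrix far exceeds typical caches

// Row-order traversal touches consecutive addresses, so each 64-byte cache
// line fetched from memory is fully consumed before it is evicted.
long long SumRowOrder(int Matrix[DIM][DIM])
{
    long long Total = 0;
    for(size_t Row = 0; Row < DIM; ++Row)
    {
        for(size_t Col = 0; Col < DIM; ++Col)
        {
            Total += Matrix[Row][Col];
        }
    }
    return Total;
}

// Column-order traversal jumps DIM*4 bytes between accesses, so it pays for
// a fresh cache line on nearly every element despite doing the same math.
long long SumColumnOrder(int Matrix[DIM][DIM])
{
    long long Total = 0;
    for(size_t Col = 0; Col < DIM; ++Col)
    {
        for(size_t Row = 0; Row < DIM; ++Row)
        {
            Total += Matrix[Row][Col];
        }
    }
    return Total;
}
```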
When we’re done, you’ll be left with a simple set of reliable techniques you can use to estimate the performance of your critical loops, measure their actual performance under different conditions, and determine how your code should be modified to avoid all the “1000x slower” traps.
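To show what the measurement side can look like, here is a minimal timing sketch using the Windows QueryPerformanceCounter API. SumValues is a hypothetical workload standing in for whatever loop you want to measure; the course’s actual measurement tooling may differ, and on Linux you would reach for clock_gettime instead.

```c
#include <windows.h>
#include <stdio.h>
#include <stddef.h>

// Hypothetical workload to measure; substitute your own critical loop.
extern long long SumValues(long long *Values, size_t Count);

// Time one call to the workload, in seconds. A minimal sketch: a serious
// harness would repeat the measurement many times and account for warm-up,
// cache state, and run-to-run variance.
double TimeWorkload(long long *Values, size_t Count)
{
    LARGE_INTEGER Freq, Start, End;
    QueryPerformanceFrequency(&Freq);

    QueryPerformanceCounter(&Start);
    volatile long long Result = SumValues(Values, Count);
    QueryPerformanceCounter(&End);

    (void)Result; // keep the compiler from discarding the call
    return (double)(End.QuadPart - Start.QuadPart) / (double)Freq.QuadPart;
}
```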
Should you want to become a more optimization-focused programmer, one who actually does spend a lot of their time reading assembly language and meticulously crafting programs for high performance, this course will also leave you with an excellent starting point for that kind of work: you’ll know everything that’s involved in hard-core optimization, because you’ll have seen each piece at least once during the course.
We will be using x64 on Windows as the primary platform, although following along on Linux should be possible.
To learn the most from the series, I will be encouraging everyone to code along at home to the greatest extent possible. We will be using only the most basic features of the languages involved (simple Python, simple C), so even if you are not too familiar with either, my hope is that you’ll still be able to follow along. We’ll be focusing on reading ASM and how it translates to modern CPU operations, so we only really need C as a convenient way to generate that ASM!
To code along with the course, you will need a computer with an x64 CPU. Following along on an ARM CPU probably won’t be practical for most people. The concepts behind x64 performance are similar to those of high-performance ARM cores, but because the ARM instruction set is completely different, it would be very difficult for anyone new to assembly language to follow along on anything other than an x64 CPU.
I will be using Windows in most of the course videos and source code. Following along on a platform other than Windows shouldn’t be particularly difficult, but some operations (like creating threads or performing asynchronous IO) vary in nature between operating systems. Since this is a course about learning performance by example, we will not be using any abstraction layers, because we want to see precisely how things work at the lowest level we can.
Everything other than thread and file operations is close enough between operating systems that run on x64 CPUs that there shouldn’t be many other issues. The bulk of the course should translate directly across x64 OSes.
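As a concrete example of the kind of OS difference meant here, thread creation goes through CreateThread on Windows and pthread_create on Linux: the concept is identical, but the entry-point signatures and wait calls differ. Here is a minimal sketch of the translation a non-Windows reader would make.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

// Windows thread entry points take an LPVOID and return a DWORD.
static DWORD WINAPI ThreadMain(LPVOID Param)
{
    printf("hello from thread %d\n", *(int *)Param);
    return 0;
}

static void RunThread(int *Param)
{
    HANDLE Thread = CreateThread(0, 0, ThreadMain, Param, 0, 0);
    WaitForSingleObject(Thread, INFINITE);
    CloseHandle(Thread);
}

#else
#include <pthread.h>

// POSIX thread entry points take a void * and return a void *.
static void *ThreadMain(void *Param)
{
    printf("hello from thread %d\n", *(int *)Param);
    return 0;
}

static void RunThread(int *Param)
{
    pthread_t Thread;
    pthread_create(&Thread, 0, ThreadMain, Param);
    pthread_join(Thread, 0);
}
#endif

int main(void)
{
    int ThreadId = 1;
    RunThread(&ThreadId);
    return 0;
}
```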
See you February 1st!
I hope you’ll join me for the Performance-Aware Programming series. While hard-core optimization may remain a difficult and time-consuming endeavor, the basic discipline of writing reasonably performant programs is well within every programmer’s time budget and ability. It takes some work to learn how modern CPUs see your high-level programs, but once you do, it is not that difficult to make (many) better decisions and avoid the catastrophic slowdowns pervasive in modern software.