This is the final video in the Prologue of the Performance-Aware Programming series. It summarizes the five performance multipliers from the previous videos, then explores how fast our original Python program can go now that we know what is necessary to make the summation loop run quickly. Please see the Table of Contents to quickly navigate through the rest of the course as it is updated weekly. A lightly-edited transcript of the video appears below.
In the past five videos, I showed a practical example of how to take some code from 0.006 adds per cycle to 52 adds per cycle. That's over 8,000 times faster (52 ÷ 0.006 ≈ 8,700). So when I described 1,000x or 10,000x as typical, I wasn't exaggerating. This loop barely does anything at all, and this massive multiple was there nonetheless!
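To make the adds-per-cycle figure concrete, here is a minimal sketch, not the course's actual test harness, of the kind of baseline Python summation loop being measured. The function name single_scalar_sum and the CPU_GHZ constant are illustrative assumptions; you would substitute your own machine's clock speed to get a rough adds-per-cycle estimate.

```python
import time

CPU_GHZ = 4.0  # assumed clock speed in GHz; replace with your own CPU's


def single_scalar_sum(values):
    """Naive interpreted loop: one add per element, plus per-iteration overhead."""
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    values = list(range(10_000_000))

    start = time.perf_counter()
    total = single_scalar_sum(values)
    seconds = time.perf_counter() - start

    adds = len(values)
    cycles = seconds * CPU_GHZ * 1e9  # rough cycle count at the assumed clock
    print(f"sum = {total}")
    print(f"~{adds / cycles:.4f} adds per cycle")
```

On a machine in this class, a loop like this lands in the neighborhood of the 0.006 adds per cycle quoted above, which is why the interpreted baseline leaves so much headroom.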
Plus, I’m not a professional code optimizer. I do some optimization work, but it’s rare. So although I can’t say for certain, I suspect that if somebody who did optimization for a living looked at our fastest C code, they would notice things we could have done better. We haven’t done the difficult work of really scrutinizing the performance here, so we don’t even know whether this is the absolute maximum we could push this chip to.
We do know it’s not the biggest multiple we might see for this loop more broadly, though. This chip is old. It has only four cores and limited instructions per clock (IPC) compared to more modern chips. So the 8,000x we’re observing here is likely a modest gap compared to what we would see on some crazy server CPU!
So hopefully this has demonstrated clearly that I’m not making up these orders of magnitude. They’re real, they exist, and they exist everywhere, no matter how simple the code appears to be.
But my goal with this course is not to train you to go after the full 8,000x, or however large that gap is on your target platform. My goal is to show you, in detail, how that gulf is created by these multipliers, so you understand where the performance gap comes from. Usually, a small amount of extra work, without intense profiling and study or maxing everything out, will get you to some reasonable multiplier in the gap between nothing and 8,000x.
That gap is so massive, you simply don’t need to spend the time to cross it entirely. It may not be worth your time to do all the work necessary to get the full speedup, and the good news is, you don’t have to. Once you know what your options are, you can choose to go for 1,000x, or 100x, and still get massive speedups. That’s the silver lining of the pervasive, massive underperformance of modern software!
So that's our goal. And remember, at the opening I said there were two different things we can do to increase performance: A) we can reduce the number of instructions, and B) we can increase the speed at which those instructions move through the CPU. That's all we've got: