This is the second video in Part 3 of the Performance-Aware Programming series. Please see the Table of Contents to quickly navigate through the rest of the course as it is updated weekly. The listings referenced in the video (listings 102, 103, 104, and 105) are available on GitHub.
We've reached the point in the course where things start getting intricate. To make progress, we will have to learn more about how modern CPUs, operating systems, and libraries work.
As we get into these details, it will be easy to get discouraged, or feel daunted by how much it seems like you have to know. You may think, “How could I ever possibly remember all of these things and take them into account?”
Don't get discouraged. You don’t have to know all these things cold. In fact, very few — if any — people do! Even performance-minded system architects rarely know all the details of the hardware and software on which they rely. I certainly don't know all these things, and it’s not necessary for performance-aware programming.
Instead, what we're trying to do here is experience all of these details for a particular system, at a particular point in time, so that we understand the kinds of things that can be involved. We will build a framework in our brains for understanding these sorts of complex systems so that, when we do need to make design decisions, we know what is likely to be at play. We can then quickly search for and incorporate the necessary new information about that part of the system we’re working with, and make informed decisions.
So really what we want to focus on is this ability to learn the details when we need to. By going through the nitty-gritty once, we gain the ability to more quickly acquire those details later when we need to. And, most importantly, we know that the details exist, how to search for them, and how to quickly understand the material we find.
In fact, that’s what I do myself. I simply don’t have time to keep up to date with all the ways in which CPUs, GPUs, and operating systems work. It would be a full-time job just to keep on top of all that! Instead, I know enough about each to understand when I need more detail about something, and when it comes time to make decisions about that thing, I know how to quickly acquire and understand the information I need. That’s performance-aware programming.
With that said, it’s time to start looking at the fread performance we observed in the last post. We measured the throughput of that call in our actual application, and we were wondering, is that as fast as it should be? Because it will set an upper limit on our performance, after all… our application can never process data faster than it receives it!
To begin our investigation, we’re going to look at a concept I call repetition testing. I don’t know if there is an official term for this kind of testing, but it’s a common kind of performance test people run.
The idea behind repetition testing is that a normal block profiler has too much run-to-run variability. It’s great for letting us know how our time is being spent in a real run of our application. But it doesn’t let us zero in on the asymptotic highest speed for a particular part of the program.
The reason is that there is simply way too much variability in modern hardware and software. There is all sorts of hidden state that we cannot observe directly — like what is in the CPU’s caches, or the operating system’s file cache, etc. — so when we run a program, we really have no idea how much of its behavior is being dictated by the random state of the computer or OS when it started.
We know that it is being affected by that, of course! We know this because each time we run the program, we get different profile results. So we can clearly see the variability. We just don’t know how to deal with it.
Repetition testing is a “solution” to this problem. I use quotes around the word because we’re not really solving the problem, we’re just bypassing the problem in order to get some idea of the (probably unachievable) best performance. It’s a way of testing what happens to a particular code path if all the stars align, and everything in the machine happened to be in its favor.
We're already familiar with the process of taking RDTSCs around some block, and using that to determine how long it takes. In this case, it’s an fread:
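Something like the following sketch captures the shape of that measurement. This isn't the actual course listing: it assumes an x64 compiler that exposes __rdtsc via <intrin.h> (on GCC/Clang it would come from <x86intrin.h> instead), and the file name and error handling are placeholder choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <intrin.h> // __rdtsc on MSVC-style compilers; <x86intrin.h> on GCC/Clang

int main(void)
{
    // Placeholder file name; in the course this is the large haversine input file.
    FILE *File = fopen("input.json", "rb");
    if(!File) return 1;

    // Find the file size so we can allocate a buffer big enough to hold all of it.
    fseek(File, 0, SEEK_END);
    size_t Size = (size_t)ftell(File);
    fseek(File, 0, SEEK_SET);

    void *Buffer = malloc(Size);
    if(!Buffer) return 1;

    // Time just the fread with RDTSC, exactly like a block in our profiler.
    uint64_t Start = __rdtsc();
    size_t Read = fread(Buffer, 1, Size, File);
    uint64_t Elapsed = __rdtsc() - Start;

    printf("fread of %zu bytes took %llu TSC ticks\n",
           Read, (unsigned long long)Elapsed);

    free(Buffer);
    fclose(File);
    return 0;
}
```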
That's a reasonable way to time an fread, even though we haven't yet learned all the nuances of RDTSC. We’ll have to learn more about it if we want to time smaller pieces of code. But fread’ing a gigabyte file, as we are in this case, takes a huge amount of time, so this will be a perfectly reasonable way to measure it.
In our block profiler, we just took these time deltas and summed them up. That was great for finding out the total time taken in each part of our program.
In repetition testing, we make a small change to this process: instead of summing, we run a small piece of code repeatedly, and take the minimum amount of elapsed time we see.
In other words, we see what the absolute fastest speed is for a particular snippet of code if we run it over and over and over and over again:
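A minimal sketch of that idea, again not the actual course listing, might look like this. The fixed repetition count and the TimeOneRead helper are placeholder choices.

```c
#include <stdio.h>
#include <stdint.h>
#include <intrin.h> // __rdtsc on MSVC-style compilers; <x86intrin.h> on GCC/Clang

// Hypothetical helper: re-reads the whole file into Buffer and returns the
// elapsed TSC ticks for just the fread call itself.
uint64_t TimeOneRead(void *Buffer, size_t Size, FILE *File)
{
    fseek(File, 0, SEEK_SET); // rewind so every repetition reads the same bytes
    uint64_t Start = __rdtsc();
    fread(Buffer, 1, Size, File);
    return __rdtsc() - Start;
}

// Repetition test: run the same read over and over, keeping only the minimum.
void RepetitionTest(void *Buffer, size_t Size, FILE *File)
{
    uint64_t MinElapsed = ~0ull;
    for(int Rep = 0; Rep < 100; ++Rep) // placeholder repetition count
    {
        uint64_t Elapsed = TimeOneRead(Buffer, Size, File);
        if(Elapsed < MinElapsed)
        {
            MinElapsed = Elapsed;
            printf("New minimum: %llu TSC ticks\n", (unsigned long long)MinElapsed);
        }
    }
}
```

A fixed repetition count is the simplest version of this; a more robust tester might instead keep running until some amount of time passes without finding a new minimum.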
By running the snippet repeatedly, we allow all the components of the underlying system — caches, branch predictors, etc. — to “train up” on this exact piece of code, to the exclusion of everything else in our program. By doing this, we hope to remove much of the variability that comes from having these components in unknown states when we run the program.
It’s not perfect, but it’s often the best we can do in user mode! Often, it’s actually pretty good at removing variability. In this particular case, it will be a little worse than usual, because the thing we’re trying to time involves a heavyweight operating system operation. There will be more variability in our timings as a result. We would generally expect better results when we work with blocks that remain entirely in our own code.
But nonetheless, we should be able to converge on a less variable profile this way. Let’s see what this looks like in practice.