Announcing the 2024 Halloween Spooktacular Challenge!
Are you brave enough to endure 15 days with one of Microsoft's most terrifying APIs?
In contrast to Linux, Windows makes it extremely difficult to access CPU Performance-Monitoring Counters (PMCs) from your own code. How difficult? So difficult that most Windows programmers, if they use PMCs at all, prefer to use Microsoft-supplied command line tools or third-party profiling software instead of trying to use the Windows API for this purpose.
But it is possible to access PMCs directly from the Windows API — or perhaps I should say, a vendor-selected subset of the PMCs. Specifically, in “recent” editions of Windows, the Event Tracing for Windows API does allow you to request PMC samples from the kernel.
In theory, this is a really big deal: PMCs provide direct introspection of the behavior of the CPU. They can tell us how many L3 cache misses have happened, or how many branch mispredictions have occurred, so we can know, rather than guess, why the CPU is performing the way it is. You would therefore expect Windows programmers to embed PMC sampling into their profiling builds so they could collect this information continuously, the way they do with network latencies or graphics frame rates.
In practice, however, nobody does this[1]. Why not?
RDTSC vs. RDPMC
On Windows, it is trivial to continuously monitor timings for anything you care about. If you want to integrate time measurement into your project, you can do it with literally two lines of code:
uint64_t StartTime = __rdtsc();
// ... whatever you want to measure goes here ...
uint64_t ElapsedTSC = __rdtsc() - StartTime;
The __rdtsc intrinsic causes the compiler to place an rdtsc instruction in your executable, which is extremely cheap to execute, and provides a 64-bit number that (on most platforms) increments on every CPU base frequency clock cycle. Subtracting the results at two points in the program gives you an elapsed time measurement whose resolution is set by the CPU’s base frequency. Since we live in the invariant TSC era, this provides reliable profiling even in heavily multithreaded code where the start and end points might be executed on different cores.
So, two lines of code, which are effectively free to execute, and you get elapsed time that’s accurate to a few billionths of a second[2]: this is fantastic, and everyone uses it. You get high-accuracy built-in profiling for near-zero effort.
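To put "a few billionths of a second" in concrete terms, here is a minimal sketch of converting a TSC delta such as ElapsedTSC into nanoseconds. The 3 GHz frequency and the names AssumedTSCFrequencyHz, TSCToNanoseconds, and TickNanoseconds are assumptions made for illustration. Real code would have to measure or query the TSC frequency of the machine it runs on.

```cpp
#include <cstdint>

// Hypothetical TSC frequency, for illustration only. Real code has to
// measure or query this value for the machine it is running on.
constexpr uint64_t AssumedTSCFrequencyHz = 3000000000ull; // 3 GHz

// Convert a TSC delta (such as ElapsedTSC) into nanoseconds.
constexpr double TSCToNanoseconds(uint64_t ElapsedTSC, uint64_t TSCFrequencyHz)
{
    return 1e9 * (double)ElapsedTSC / (double)TSCFrequencyHz;
}

// Duration of a single tick, which is the finest step the counter can resolve.
constexpr double TickNanoseconds(uint64_t TSCFrequencyHz)
{
    return TSCToNanoseconds(1, TSCFrequencyHz);
}
```

Under these assumptions, one tick lasts about a third of a nanosecond, and an ElapsedTSC of 3000 corresponds to 1000 nanoseconds.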
Now, at the CPU level, it is equally cheap and easy to measure everything else, too. Want to measure L3 cache misses the same way? While you’ll have to add a tiny bit of setup[3], the measurement code is essentially the same:
// ... assume CacheMissCounterIndex is set to a counter index that
// we selected at startup to measure L3 cache misses ...
uint64_t StartMissCount = __readpmc(CacheMissCounterIndex);
// ... whatever you want to measure goes here ...
uint64_t MissCount = __readpmc(CacheMissCounterIndex) - StartMissCount;
As with rdtsc, it couldn’t be simpler: the __readpmc intrinsic causes the compiler to emit the rdpmc instruction, which reads the value of a PMC. Assuming you selected the counters you wanted at startup, you can then use rdpmc to measure the value of a counter at two points in time, just like you would read the TSC. In some circumstances, you may have to do a little more work to avoid mismeasurement due to wrapping. Otherwise, it’s really no harder than using rdtsc.
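The "little more work" for wrapping can be sketched as follows. Hardware counters are typically narrower than 64 bits, so a plain 64-bit subtraction produces a huge bogus value if the counter wraps between the two reads. Masking the difference to the counter width handles a single wrap. The 48-bit width and the names AssumedPMCBits and PMCDelta are assumptions for illustration, and the actual width should be queried for the CPU in question.

```cpp
#include <cstdint>

// Assumed counter width. 48 bits is common on x86, but this is an
// assumption: query the real width for the CPU you are running on.
constexpr unsigned AssumedPMCBits = 48;

// Elapsed count between two raw PMC reads, tolerating a single wrap of a
// counter that is CounterBits wide. CounterBits must be between 1 and 63.
constexpr uint64_t PMCDelta(uint64_t Start, uint64_t End, unsigned CounterBits)
{
    // The subtraction wraps modulo 2^64; the mask reduces it modulo 2^CounterBits.
    return (End - Start) & ((1ull << CounterBits) - 1);
}
```

If the counter wraps more than once between the two reads, the extra wraps are lost, so this only works when the bracketed region is short relative to the counter's range.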
But sadly, we aren’t at the CPU level. On Windows, we are typically running as a lowly ring-3 process, and for security reasons (at least ostensibly), we are explicitly prohibited from executing the rdpmc instruction. If we tried, our application would fault.
So what can we do?
ReadThreadProfilingData
Since we can’t execute rdpmc directly, if we want to integrate PMC profiling into an application, we must ask Windows to do it for us. It’s either that, or we have to ask our users to approve the installation of a third-party kernel driver that could act on our behalf.
Amusingly, someone at Microsoft did actually write a sane API for PMC collection on Windows. It’s simple, readable, and provides exactly the operations you need. You can see how it works in this sample code I posted a long time ago. Minus the setup part (which you would have to do for rdpmc as well), the API is just a simple function call that retrieves the values of the counters you’ve enabled. In place of rdpmc, you just call ReadThreadProfilingData.
Perfect! Or at least, it would have been perfect.
Unfortunately, Microsoft never actually wired up this API to an implementation. If you try to call it, it will return nothing. When I inquired about this, I was told that these functions were provided as an interface in case third-party drivers wanted to expose performance counters. For some reason, they were (I guess?) never intended to be used as an “available by default” API.
Since ReadThreadProfilingData doesn’t work, there’s unfortunately only one other option available if you want to read PMCs on a stock installation of Windows, and it’s rather horrific.
Event Tracing for Windows
Lovecraft famously described a first encounter with Event Tracing for Windows in the following passage:
Then the men, having reached a spot where the trees were thinner, came suddenly in sight of the spectacle itself. Four of them reeled, one fainted, and two were shaken into a frantic cry which the mad cacophony of the orgy fortunately deadened. Legrasse dashed swamp water on the face of the fainting man, and all stood trembling and nearly hypnotised with horror.
This is par for the course with the ETW API. No matter what you’re trying to do with it, you will be exposed to unspeakable horrors. It’s not an API anyone would ever use unless they had to.
Unfortunately for us, if we want to do PMC collection on Windows without a third-party driver, we have to. Like Legrasse’s party in the tales of H. P. Lovecraft, we must be willing to confront the primal terrors few programmers dare to face.
This, ladies and gentlemen, is what The Computer Enhance 2024 International Event Tracing for Windows Halloween Spooktacular Challenge is all about. And yes, I added as many words to that title as I could, because I thought it was funnier that way.
For now, I’ll say only the following:
I have never seen anyone publish example code anywhere that demonstrates collecting PMCs through ETW in a way that is thread-safe, continuous, and works on a running program, in the same manner as rdtsc[4] would. That means you can take a “begin” and “end” sample at arbitrary points in a running program, and get back elapsed PMCs for the bracketed code with minimal interference to the running program.
It can be done, because after several days of torment, I was able to write a library to do it.
Having gone through it once, I consider it very difficult to figure out how to accomplish this without hints.
The ETW Spooktacular Challenge Begins Tomorrow
Tomorrow, October 16th, I will post a snippet of test code I made for my library that collects PMCs at arbitrary points in a running program using nothing but a vanilla install of Windows.
The challenge — should you choose to accept it — will be to implement this API yourself. Each subsequent day of October I will post a hint designed to help you make progress toward completing the challenge. Finally, on Halloween, I will post a video walking through my own implementation, so we can all compare notes.
Hint days will be broken into two groups. For the first group, the hints will be about getting PMC collection working through the ETW API at all. This alone is quite difficult, but technically there is some sample code out there that shows how to do it — if you can manage to find it. After posting all of these hints, I will link directly to the sample code, so everyone can “catch up” if they weren’t able to implement this part themselves.
The second group will be about the novel part of the implementation. I’ve never seen anyone describe how to do this part, nor even suggest that it was possible. But I can assure you, it is possible, and the hints will help guide the way to a working implementation.
In both sets, the hints will be ordered so that the least helpful hints come at the beginning, and the most helpful come toward the end.
So brace your souls, all those who wish to brave the horrors of the Event Tracing for Windows API. I’ll see you tomorrow for the kickoff of The Computer Enhance 2024 International Event Tracing for Windows Halloween Spooktacular Challenge!
[1] Well, except me now, and possibly all of you after you complete this spooktacular!
[2] I say “a few billionths” because the rdtsc instruction (or even the rdtscp instruction, its partially serializing younger brother) will report a value that is roughly when it “occurred” in the instruction stream, but subtracting two of them won’t tell you exactly how many base-frequency clock cycles it took to complete all the instructions in between.
This inaccuracy stems from the fact that modern CPUs are highly pipelined, and out-of-order, so at an arbitrary point in the instruction stream, you can’t reliably predict which prior instructions have completed and which are still pending, or which subsequent instructions may have been executed in advance. You can always insert additional fussing to serialize the instruction stream at a particular point, but if you do, you’ll be measuring a different instruction stream than what you would have had if you hadn’t serialized, so it’s still not accurate to the cycle time the CPU would have taken had you not done the rdtsc in the first place.
So typically, when you’re doing inline profiling, you tend to just use rdtsc (or rdtscp on more modern chips) without additional attempts to serialize things like memory accesses, and you time larger sections of code so that the out-of-order-induced inaccuracy is negligible.
[3] CPUs can only record a limited number of PMCs at once, so typically you have to choose which ones you will be recording ahead of time. However, setup is trivial: you just “arm” the PMCs you want by writing a model-specific register, and from then on you can sample the PMC with rdpmc anywhere you want.
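As a rough illustration of what "arming" involves, the sketch below builds the value that would be written to an event-select register to count last-level cache misses. It assumes Intel's architectural performance monitoring layout (the IA32_PERFEVTSELx registers and the architectural "LLC Misses" event). Other CPUs use different encodings, and the names MakeEventSelect and LLCMissSelect are made up for this example. Writing the register itself requires ring 0, so only the value is computed here.

```cpp
#include <cstdint>

// Flag bits of an Intel architectural event-select register
// (IA32_PERFEVTSELx). Assumption: the target CPU uses this layout.
constexpr uint64_t EventSelectUSR = 1ull << 16; // count while in user mode
constexpr uint64_t EventSelectOS  = 1ull << 17; // count while in kernel mode
constexpr uint64_t EventSelectEN  = 1ull << 22; // enable the counter

// Pack an event number (bits 0-7), unit mask (bits 8-15), and flag bits
// into a single event-select value.
constexpr uint64_t MakeEventSelect(uint8_t Event, uint8_t UnitMask, uint64_t Flags)
{
    return (uint64_t)Event | ((uint64_t)UnitMask << 8) | Flags;
}

// Architectural "LLC Misses" event: event 0x2E, unit mask 0x41.
constexpr uint64_t LLCMissSelect =
    MakeEventSelect(0x2E, 0x41, EventSelectUSR | EventSelectOS | EventSelectEN);
```

A kernel-mode component would write a value like LLCMissSelect to the event-select register paired with the counter it wants to use, after which that counter index can be read with rdpmc.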
[4] More than that, I’ve never seen anyone even suggest that it can be done. In fact, until I managed to figure out a way to do it myself, I didn’t even think it was possible, because I’d never seen or heard of anyone doing it!