How does QueryPerformanceCounter measure time?

The way QueryPerformanceCounter measures time has changed across revisions of the Windows operating system. What method does it use today?

Jun 20, 2023

In the previous post I introduced the RDTSC instruction. This instruction has been with us since the Pentium.

As its name implies, it reads the value of the time-stamp counter, which counts “cycles”. However, as we discussed in the previous post, when CPUs moved to the “invariant TSC” era, the “cycles” counted by the time-stamp counter no longer corresponded directly to the instruction cycles of a particular core. When cores boost up to higher frequencies, or when they throttle down to lower frequencies, their cycles tick by faster or slower than those of the now-shared time-stamp counter.

This creates an obstacle for benchmarking and performance analysis, which is what we're mostly concerned about in this class. In fact, one thing we'll learn as we go forward is — fortunately — there are other ways we can interrogate an x64 CPU to measure things like elapsed cycles. But before we do that, I’d like to underscore the importance of understanding RDTSC and what it does.

RDTSC is still very useful. Unlike many of the other methods we will explore, RDTSC is almost always available. You can issue an RDTSC instruction from user mode on most x64 processors, on most consumer operating systems, and get back results you can use. It's one of the few high frequency timing methods you can use that you know will always be available. All the other methods of gathering performance metrics are less dependable. They may require a specific operating system, a specific driver, or a specific type of processor in order to function.

But there’s another big reason to understand RDTSC: nowadays, it is actually used as the high precision timer for common operating system functions. This is not widely understood due to the transition in behavior that happened over the past several years.

Previously, when you called things like QueryPerformanceCounter — which we also discussed in the previous video — the operating system may have used some other time source, like an HPET. Now, though many people still believe they're getting something like an HPET source when they call these OS functions, they are often just being served a modified version of the TSC!

In this post, we’re going to look at how this happens in practice. Because you learned to read ASM in part one of this course, we have the opportunity to directly investigate the behavior of operating system calls like QueryPerformanceCounter. We no longer have to take anyone's word for it to find out what’s going on in simple OS function calls1.

Let's take a look at what happens if we use a debugger to step into QueryPerformanceCounter. The bulk of the implementation is actually quite brief:

00007FFE6EB901A0  mov             qword ptr [rsp+0x8], rbx
00007FFE6EB901A5  push            rdi
00007FFE6EB901A6  sub             rsp, 0x20
00007FFE6EB901AA  mov             r8b, byte ptr [0x7ffe03c6]
00007FFE6EB901B2  mov             rbx, rcx
00007FFE6EB901B5  test            r8b, 0x1
00007FFE6EB901B9  jz              0x7ffe6ec06daa
00007FFE6EB901BF  mov             r11, qword ptr [0x7ffe03b8]
00007FFE6EB901C7  test            r8b, 0x2
00007FFE6EB901CB  jz              0x7ffe6ec06d7e
00007FFE6EB901D1  mov             r9, qword ptr [rip+0x12adf0]
00007FFE6EB901D8  test            r9, r9
00007FFE6EB901DB  jz              0x7ffe6ec06daa
00007FFE6EB901E1  mov             r10d, dword ptr [r9]
00007FFE6EB901E4  test            r10d, r10d
00007FFE6EB901E7  jz              0x7ffe6ec06daa
00007FFE6EB901ED  test            r8b, r8b
00007FFE6EB901F0  jns             0x7ffe6eb90233
00007FFE6EB901F2  rdtscp          
00007FFE6EB901F5  shl             rdx, 0x20
00007FFE6EB901F9  or              rdx, rax
00007FFE6EB901FC  mov             rax, qword ptr [r9+0x8]
00007FFE6EB90200  mov             rcx, qword ptr [r9+0x10]
00007FFE6EB90204  mul             rdx
00007FFE6EB90207  mov             eax, dword ptr [r9]
00007FFE6EB9020A  add             rdx, rcx
00007FFE6EB9020D  cmp             eax, r10d
00007FFE6EB90210  jnz             0x7ffe6eb901e1
00007FFE6EB90212  mov             cl, byte ptr [0x7ffe03c7]
00007FFE6EB90219  add             rdx, r11
00007FFE6EB9021C  shr             rdx, cl
00007FFE6EB9021F  mov             qword ptr [rbx], rdx
00007FFE6EB90222  mov             rbx, qword ptr [rsp+0x30]
00007FFE6EB90227  mov             eax, 0x1
00007FFE6EB9022C  add             rsp, 0x20
00007FFE6EB90230  pop             rdi
00007FFE6EB90231  ret

As a brief aside, remember when I said that even though we studied just a few basic x86 instructions, you would be able to read full x64 ASM listings? Well, here is a very good example! This is literally a commercial x64 Windows operating system function, and what do we find but the exact same instructions we already know! mov, push, pop, cmp, jnz, add, test, etc. But enough about that.

If we look at this listing, we can see that the beginning of the function does basic things we would expect, like stack setup:

00007FFE6EB901A0  mov             qword ptr [rsp+0x8], rbx
00007FFE6EB901A5  push            rdi
00007FFE6EB901A6  sub             rsp, 0x20

Then, it tests a bunch of things, none of which seem to trigger in the common case:

00007FFE6EB901AA  mov             r8b, byte ptr [0x7ffe03c6]
00007FFE6EB901B2  mov             rbx, rcx
00007FFE6EB901B5  test            r8b, 0x1
00007FFE6EB901B9  jz              0x7ffe6ec06daa
00007FFE6EB901BF  mov             r11, qword ptr [0x7ffe03b8]
00007FFE6EB901C7  test            r8b, 0x2
00007FFE6EB901CB  jz              0x7ffe6ec06d7e
00007FFE6EB901D1  mov             r9, qword ptr [rip+0x12adf0]
00007FFE6EB901D8  test            r9, r9
00007FFE6EB901DB  jz              0x7ffe6ec06daa
00007FFE6EB901E1  mov             r10d, dword ptr [r9]
00007FFE6EB901E4  test            r10d, r10d
00007FFE6EB901E7  jz              0x7ffe6ec06daa
00007FFE6EB901ED  test            r8b, r8b
00007FFE6EB901F0  jns             0x7ffe6eb90233

This is very lucky for us, because if these tests did typically fire randomly, we would have to do a lot more work to reverse-engineer the behavior of this function. Since they don’t, we can safely ignore them since we’re only interested in seeing what QueryPerformanceCounter does in the common case on a particular machine2.

Which brings us directly to the actual source of time for QueryPerformanceCounter:

00007FFE6EB901F2  rdtscp          
00007FFE6EB901F5  shl             rdx, 0x20
00007FFE6EB901F9  or              rdx, rax

Right away, we arrive at a very simple sequence that tells us where the time is coming from: RDTSCP, followed by SHL, followed by OR.

So, without having to ask any questions or post on SlackOverflow or do any other kind of guesswork, you can already see for yourself just by stepping into QueryPerformanceCounter that it gets its time from RDTSCP. RDTSCP is exactly the same as the RDTSC instruction we looked at in the last post, but with two slight caveats.

RDTSCP writes the 64-bit time-stamp counter value in two 32-bit parts, just like RDTSC. Just like RDTSC, it puts the low 32 bits into EAX, and the high 32 bits into EDX. It does exactly the same thing in that respect.

However, the P part of RDTSCP refers to another register modification it will perform that RDTSC does not do. RDTSCP will also write the “processor ID”, or “PID”, into the ECX register.

The “processor ID” — which would be more appropriately called the “core ID”, if we’re honest about it — is a core-specific value which can be used to differentiate cores of the processor from each other. There’s actually an entirely separate instruction to read it by itself: RDPID. But because performance monitoring operations often want to know both the TSC value and the PID — so they can tell which core they’re running on — RDTSCP reads both values in a single instruction.

Of course, QueryPerformanceCounter doesn’t actually use the PID, so in this case, that behavior is irrelevant. But there is one more difference between RDTSC and RDTSCP. RDTSCP also has different behavior with respect to the core’s out-of-order instruction execution — something we do care about in benchmarking, and which we will be talking about more later in the course when we discuss out-of-order execution in more detail. But, much like the PID, it’s not that relevant to QueryPerformanceCounter.

So really, for QueryPerformanceCounter, this might just as well have been an RDTSC. We don’t really have to know anything about RDTSCP to understand what’s going on here.

Now, what happened right after the RDTSCP?

00007FFE6EB901F5  shl             rdx, 0x20
00007FFE6EB901F9  or              rdx, rax

What does that do? Well, 0x20 is 32 in hexadecimal. Since we know RDTSCP put the high 32-bits of the TSC in RDX, and the low 32 bits in RAX, this sequence is just doing a shift-and-or to reassemble the full 64-bit TSC in RDX. If written in higher-level notation, it’s just:

RDX = (RDX << 32) | RAX;

That’s really all QueryPerformanceCounter does to determine the time! It’s just a thunk down to RDTSCP.

But, of course, once it gets the time, we know it must do something else to it. How do we know that? Because in the previous post, we called QueryPeformanceFrequency and it reported 10 MHz. We also timed RDTSC, and the TSC frequency was essentially the base frequency of the processor: 4.2 GHz.

So, if QueryPerformanceCounter is returning a 10 MHz timer, but it’s getting its time from a 4.2 GHz timer, it must be reducing the precision somehow, right? It can’t just return us the value directly, or the timer would run much faster than QueryPerformanceFrequency claims it will.

If we look at the remainder of the function, we can trace what happens to RDX:

00007FFE6EB901FC  mov             rax, qword ptr [r9+0x8]
00007FFE6EB90200  mov             rcx, qword ptr [r9+0x10]
00007FFE6EB90204  mul             rdx
00007FFE6EB90207  mov             eax, dword ptr [r9]
00007FFE6EB9020A  add             rdx, rcx
00007FFE6EB9020D  cmp             eax, r10d
00007FFE6EB90210  jnz             0x7ffe6eb901e1
00007FFE6EB90212  mov             cl, byte ptr [0x7ffe03c7]
00007FFE6EB90219  add             rdx, r11
00007FFE6EB9021C  shr             rdx, cl
00007FFE6EB9021F  mov             qword ptr [rbx], rdx
00007FFE6EB90222  mov             rbx, qword ptr [rsp+0x30]
00007FFE6EB90227  mov             eax, 0x1
00007FFE6EB9022C  add             rsp, 0x20
00007FFE6EB90230  pop             rdi
00007FFE6EB90231  ret

Just before setting up the return value and cleaning up the stack, RDX is written back to the passed-in pointer by this mov:

00007FFE6EB9021F  mov             qword ptr [rbx], rdx

So, after reassembling the 64-bit TSC value with SHL and OR, the only modifications before the write are a MUL, two ADDs, and a SHR (right-shift). How is this sequence of instructions reducing a timer with a 4.2 GHz frequency to one with a 10 MHz frequency?

Well, this sequence is quite a bit more challenging to reverse-engineer if you’re new to assembly language. I don’t expect anyone would be able to guess what was happening here if their only experience with ASM is part one of this course!

But, when broken down, it’s not that hard to understand. It is, in fact, a common trick — and one that is well-documented in the only book I actually keep within reach at all times when programming, Hacker’s Delight3. If what I describe next is the kind of thing that you enjoy learning, order yourself a copy immediately. It’s an entire book filled with nothing but this kind of thing.

The instruction sequence we see is effectively MUL, ADD, SHR. We can get a big clue to what’s going on here by focusing on the MUL instruction first.

MUL in x86/x64 is defined as doing a full width unsigned multiply. MUL RDX will take the entire 64-bit value in RDX, multiply it by the entire (implicit) second 64-bit operand RAX, and produce the entire 128-bit result. If you don’t believe me, you can double-check the documentation: it really does produce the entire 128-bit result. The high 64 bits go in RDX, and the low go in RAX.

Our first big clue is that the RAX result never gets used. In fact, immediately after the MUL, the instruction sequence overwrites RAX by doing a mov to EAX (the low 32 bits of RAX) without ever having read from RAX first:

00007FFE6EB90204  mul             rdx
00007FFE6EB90207  mov             eax, dword ptr [r9]

So, we’re not just multiplying the TSC (in RDX) by whatever RAX is loaded with. We’re doing that multiply and then throwing away the bottom 64 bits. We keep only the high 64 bits of the result!

Keeping only the high 64 bits of a 128-bit value is the same as dividing that value by 2^64. That means that in this case, this MUL instruction isn’t just computing TSC*RAX. It’s computing TSC*RAX / 2^64.

Why would we want to do that? Well, recall that what we are trying to do is turn a 4.2 GHz counter into a 10 MHz counter. If we were operating in floating point, this would be very simple, right? We could just compute a single floating point ratio — 10 MHz / 4.2 GHz = 1/420 — and we’d just multiply our 4.2 GHz counter values by this ratio to produce 10 MHz counter values.

Of course, in integer, we can’t represent a fractional value like that. 1/420 is 0.00238095238095238095 — not really something we can store in a 64-bit integer! But if we knew we were going to be dividing by something anyway — say, 2^64 for example — well, then we certainly could store an approximation of 1/420, couldn’t we? We’d just solve for the numerator that, when divided by 2^64, is closest to 1/420:

x / 2^64 = 1 / 420
x = 2^64 / 420

Dividing and rounding, that comes out to 43920819223117980.

So, if we were to MUL our time-stamp counter value by 43920819223117980, and keep only the high 64 bits, we would expect this to be approximately equivalent to dividing by 420.

And there you have it: what this technique is doing is performing unsigned integer division with an unsigned integer multiply instead. It may seem like a strange thing to do, of course, because x64 has a DIV instruction! Why not just use that?

Well, you’d have to ask whomever wrote QueryPerformanceCounter! I don’t know what their thought process was. If I had to guess, I would say they may have been worried about the performance of DIV. Although it’s typically under twenty cycles today, depending on when this code was written, DIV might have been on the order of ninety cycles. So my best guess is that this was simply an optimization designed to ensure that QueryPerformanceCounter would be relatively fast to call.

Now, what about the other instructions? Why are there also two adds and a right-shift? Again, it’s hard to know for sure, because as we will see later, one of the adds and the shift seem to be mostly ineffectual. But, technically, if you want to get a precise substitute for DIV using a MUL — not merely an answer that could be off-by-one — you do need to use an additional ADD and SHR depending on what the divisor happens to be. The proof of this is fairly involved, and if you’d like to see it, I refer you again to Hacker’s Delight. It’s got full proofs for both signed and unsigned integers, as well as ready-to-use code that finds the necessary constants for you!

The remaining ADD, on the other hand — the one with r11 — appears to be something else entirely. Because it doesn’t affect the time base at all (since it is a constant added to the value, which means it cancels out when you subtract two values you get from QueryPerformanceCounter), I did not try to decipher what it actually is. But it is clearly more of an offset value that is presumably used to control what the value of QueryPerformanceCounter is relative to.

Since in this case we don’t really care about the absolute values we get from QueryPerformanceCounter, just the relative speed at which they increase, we don’t have to investigate further. But if we cared about this, we would presumably have to do a bunch of experiments to see what kinds of different values might be loaded into r11, and if we could detect a pattern to them. But that doesn’t seem like a particularly good use of time, since it wouldn’t really affect anything we do with QueryPerformanceCounter.

What would be a good use of time is to verify that our guess about the MUL/ADD/SHR is correct. And we have a straightforward way to do just that: if we step back through the code and look at the values loaded into RAX (for the multiply) and CL (for the shift), we should be able to evaluate these as a ratio and have it result in the 10 MHz over 4.2 GHz ratio we expect.

Specifically, we expect:

RAX / 2^(64+CL) = 1/420

So what are RAX and CL?

Stepping back through the code and inspecting the registers in the debugger, we get:

RAX = 43920862140074699
CL = 0

Plugging that into our ratio, we get:

43920862140074699 / 2^64 = 0.00238095470748526651

Whereas our expected value is:

1 / 420 = 0.00238095238095238095

A match to six significant figures!

So, we can be fairly confident now that QueryPerformanceCounter is, in fact, doing what we think it’s doing: it gets the time-stamp counter from RDTSCP, then uses a divide-by-multiplying trick to reduce the frequency to 10 MHz.

Of course, all of these values will differ per machine. The reason our ratio was 10 MHz / 4.2 GHz was because this happens to be a machine where QueryPerformanceFrequency returns 10 MHz, and the TSC runs at 4.2 GHz. So, to repeat this experiment on a different machine, you of course first have to determine what that machines’ TSC frequency is. But otherwise, the entire process can be replicated on any modern x64 Windows machine.

So there you have it. Now you know exactly what QueryPerformanceCounter uses for a clock under the hood. And, more importantly, you now know how to reason about when to use QueryPerformanceCounter vs. RDTSC.

For example, you may not want to use something like RDTSC in a game or application executable you ship to an end user, because as you’ve seen, you don’t really know when Intel or AMD might change its behavior in the future. If they change how it works, you would have to ship a patch to account for that. Calling QueryPerformanceCounter instead might be a safer bet, because hopefully Microsoft would itself patch the code for QueryPerformanceCounter on those new CPUs, thereby insulating your code from that change.

On the other hand, if what you are primarily doing is timing code on your own machine for benchmarking purposes, it makes no sense to call QueryPerformanceCounter. All it is is a much slower way to issue an RDTSCP instruction, and what’s worse, it throws away at least two orders of magnitude of the precision! So there’s absolutely no reason to ever use QueryPerformanceCounter for this purpose.

Now if you're reading this because you’re already taking my Performance-Aware Programming course, your homework is very straightforward this week: I would like you to do the exact same process I did in this article, but on your machine. Find a debugger that can step through assembly, and step into the QueryPerformanceCounter call. See if Windows uses the same code for it on your machine as it does on mine! And, if it does, see if you can verify that the RAX and CL constants work out to the same ratio as your QueryPeformanceFrequency and TSC frequency.

And remember, you already have listings for determining the QueryPerformanceFrequency and TSC frequency! We went over those last time.

If, on the other hand, you're not taking the course, and you like the sort of in-depth investigation we did today, then perhaps you would like to take the course! You can look at the table of contents here, and see subscription options using the little e-mail box below the next paragraph.

That's it for QueryPerformanceCounter. Next week we'll be looking at some more of the subtleties with RDTSC as we begin to look at timing smaller pieces of code. The smaller the thing we want to time, the harder it is to get the timing right, so there’s a lot to talk about! I hope to see everyone back here on Monday for that, as well as the Q&A. Until then, have fun programming!

I say “simple” because, if the OS call transitions to the kernel via a syscall, you’d have to know more about how to set up a kernel debugger to see the assembly, and that’s outside the scope of this course!

If we wanted to completely reverse-engineer QueryPerformanceCounter, you can see how this would be a lot more work, because we’d have to do experiments — potentially on many different machines — to try to figure out when these branches get taken, if ever, and what they do.

Thanks go to Jonathan Blow for originally recommending this excellent book!

Mikhail Gusarov

Jun 21, 2023Edited

Interesting findings for Mac OS / M1. There is `_mach_absolute_time` in `libsystem_kernel.dylib`.

It starts by reading a byte from absolute address (!) 0xfffffc088 + 8, and then based on the value of the byte it:

- makes a syscall `mov w16, #-3; svc #0x80`, or

- reads value from `CNTVCT_EL0`, or

- reads value from msr `S3_3_C14_C0_6`, or

- reads value from msr `S3_4_C15_C10_6`.

Looks like a dispatch by machine type. One of the latter registers appears in Asahi Linux msr dumps, I guess it is Apple Silicon-specific.

Does anyone know what is that address with a magic byte, where does the data come from and how it is mapped into the userspace process' address space?

Expand full comment

17 replies

Aaron

What a coincidence! My copy of Hacker's Delight just arrived this morning.

71 more comments...

Computer, Enhance!

73 Comments