75 Comments
Mikhail Gusarov:

Interesting findings for macOS / M1. There is `_mach_absolute_time` in `libsystem_kernel.dylib`.

It starts by reading a byte from absolute address (!) 0xfffffc088 + 8, and then based on the value of the byte it:

- makes a syscall `mov w16, #-3; svc #0x80`, or

- reads value from `CNTVCT_EL0`, or

- reads value from msr `S3_3_C14_C0_6`, or

- reads value from msr `S3_4_C15_C10_6`.

Looks like a dispatch by machine type. One of the latter registers appears in Asahi Linux msr dumps, I guess it is Apple Silicon-specific.

Does anyone know what that address with the magic byte is, where the data comes from, and how it is mapped into the userspace process's address space?

Mikhail Gusarov:

Answering it myself (with the help of Asahi folks):

0xfffffc000 is `commpage` with a bunch of useful data, Darwin kernel maps it into userspace: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/arm/cpu_capabilities.h#L203

Also:

[12:09:43] <j`ey_> S3_3_C14_C0_6 is CNTVCTSS_EL0

[12:10:01] <j`ey_> it's like CNTVCT_EL0 but you don't need an ISB
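
To make the commpage dispatch concrete, here is a minimal sketch of reading that byte yourself. Assumptions: arm64 macOS, the commpage mapped readable at the fixed address from the disassembly above; the address is taken from the disassembly, not from any public header.

```
// Hedged sketch: read the commpage byte that _mach_absolute_time
// dispatches on (arm64 macOS only; 0xfffffc088 + 8 comes from the
// disassembly above, not from a documented API).
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const volatile uint8_t *timebase_kind =
        (const volatile uint8_t *)(uintptr_t)(0xfffffc088ULL + 8);
    printf("commpage timebase byte: %u\n", (unsigned)*timebase_kind);
    return 0;
}
```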

Mikhail Gusarov:

Thanks to Marcan, the last MSR is also demystified:

<@marcan> dottedmag: and S3_4_C15_C10_6 is ACNTVCT_EL0 which is a pre-standard version of CNTVCTSS_EL0

quazirfan:

What's the difference between mach_absolute_time and _mach_absolute_time? Couldn't you have done something like the following to get the current CPU ticks?

```
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int main() {
    uint64_t start_time;
    start_time = mach_absolute_time();
    printf("mach abs time %llu\n", start_time);
    return 0;
}
```

Mikhail Gusarov:

mach_absolute_time is _mach_absolute_time (the leading _ is the C symbol prefix at the binary level, IIRC).

mach_absolute_time gives you a 24 MHz timer, not CPU ticks, unfortunately.
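
For anyone wondering what the tick unit actually is, here is a minimal sketch using the standard mach_timebase_info API to convert ticks to nanoseconds; on Apple Silicon the numer/denom ratio corresponds to the 24 MHz timebase mentioned above.

```
// Sketch: convert mach_absolute_time ticks to nanoseconds via the
// documented mach_timebase_info numer/denom ratio.
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int main(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    uint64_t t = mach_absolute_time();
    printf("ticks=%llu ns=%llu (timebase %u/%u)\n",
           (unsigned long long)t,
           (unsigned long long)(t * tb.numer / tb.denom),
           tb.numer, tb.denom);
    return 0;
}
```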

quazirfan:

Not sure if I am following you. What do you mean by timer?

The official documentation[1] says this function returns the current value of a clock that increments monotonically in tick units.

[1] https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time

Mikhail Gusarov:

The absolute time clock is not an instruction count, it's just time, so it does not give any good idea of how many instructions were actually executed, only how much time has passed.

Ed:

Is 1 GHz high enough resolution for the homework we'll be doing later in the course? That's the frequency I'm getting from mach_absolute_time in last week's homework, and I'm tempted to use it on my Mac, but I don't want to invest the time if it won't be enough.

Aaron:

What a coincidence! My copy of Hacker's Delight just arrived this morning.

Julian:

Here's what mach_absolute_time does on my Intel MacBook Pro running macOS 12.6:

```
; Symbol: mach_absolute_time
; Source: unknown
7FF80322BE20: 55                        push rbp
7FF80322BE21: 48 89 E5                  mov rbp, rsp
7FF80322BE24: 48 BE 50 00 E0 FF FF 7F > movabs rsi, 0x7fffffe00050
7FF80322BE2E: 44 8B 46 18               mov r8d, dword ptr [rsi + 0x18]
7FF80322BE32: 45 85 C0                  test r8d, r8d
7FF80322BE35: 74 F7                     je 0x7ff80322be2e ; <+14>
7FF80322BE37: 0F AE E8                  lfence
7FF80322BE3A: 0F 31                     rdtsc
7FF80322BE3C: 0F AE E8                  lfence
7FF80322BE3F: 48 C1 E2 20               shl rdx, 0x20
7FF80322BE43: 48 09 D0                  or rax, rdx
7FF80322BE46: 8B 4E 0C                  mov ecx, dword ptr [rsi + 0xc]
7FF80322BE49: 83 E1 1F                  and ecx, 0x1f
7FF80322BE4C: 48 2B 06                  sub rax, qword ptr [rsi]
7FF80322BE4F: 48 D3 E0                  shl rax, cl
7FF80322BE52: 8B 4E 08                  mov ecx, dword ptr [rsi + 0x8]
7FF80322BE55: 48 F7 E1                  mul rcx
7FF80322BE58: 48 0F AC D0 20            shrd rax, rdx, 0x20
7FF80322BE5D: 48 03 46 10               add rax, qword ptr [rsi + 0x10]
7FF80322BE61: 44 3B 46 18               cmp r8d, dword ptr [rsi + 0x18]
7FF80322BE65: 75 C7                     jne 0x7ff80322be2e ; <+14>
7FF80322BE67: 5D                        pop rbp
7FF80322BE68: C3                        ret
7FF80322BE69: 90                        nop
7FF80322BE6A: 90                        nop
7FF80322BE6B: 90                        nop
```

It brackets `rdtsc` (not `rdtscp`!) with `lfence` instructions and combines the two 32-bit halves returned in rdx and rax into a single 64-bit value. After that, it does a bunch of transformations I don't semantically understand yet.
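
The same fencing pattern can be reproduced in C with intrinsics; a minimal sketch (header name is for gcc/clang; MSVC uses <intrin.h> instead):

```
// Sketch of the lfence; rdtsc; lfence pattern from the disassembly,
// using compiler intrinsics instead of inline assembly.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h> // gcc/clang; on MSVC include <intrin.h>

static uint64_t serialized_rdtsc(void)
{
    _mm_lfence();           // don't let earlier instructions overlap the read
    uint64_t t = __rdtsc(); // the intrinsic already merges edx:eax into 64 bits
    _mm_lfence();           // don't let later instructions start early
    return t;
}

int main(void)
{
    printf("tsc: %llu\n", (unsigned long long)serialized_rdtsc());
    return 0;
}
```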

Julian:

I found Apple's annotated assembly here, which explains the algorithm used: https://opensource.apple.com/source/xnu/xnu-4570.1.46/libsyscall/wrappers/mach_absolute_time.s.auto.html

```
/*
 * 64-bit version _mach_absolute_time. We return the 64-bit nanotime in %rax.
 *
 * The algorithm we use is:
 *
 *   ns = ((((rdtsc - rnt_tsc_base) << rnt_shift) * rnt_tsc_scale) / 2**32) + rnt_ns_base;
 *
 * rnt_shift, a constant computed during initialization, is the smallest value for which:
 *
 *   (tscFreq << rnt_shift) > SLOW_TSC_THRESHOLD
 *
 * Where SLOW_TSC_THRESHOLD is about 10e9. Since most processor's tscFreqs are greater
 * than 1GHz, rnt_shift is usually 0. rnt_tsc_scale is also a 32-bit constant:
 *
 *   rnt_tsc_scale = (10e9 * 2**32) / (tscFreq << rnt_shift);
 */
```

On my machine, `rnt_tsc_scale` seems to always be `0x6aaaaaaa`, while `rnt_tsc_base` and `rnt_ns_base` change on reboot.

I used a debugger to manually modify `rax` and `rdx` after the call to `rdtsc` to pretend that the full 64-bit value was `0x1` in one run and `0x2` in another. The transformed results in `rax` were just `0x1` apart, so I think there is no loss of precision from using `mach_absolute_time`.
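
For readers who prefer C to assembly, here is a sketch of that formula. Assumptions: gcc/clang's unsigned __int128 for the wide multiply; the rnt_* values are the per-boot commpage constants, passed as parameters here rather than read from the commpage (0x6aaaaaaa is the scale observed above).

```
// C translation of the formula in Apple's comment (not Apple's code).
#include <stdio.h>
#include <stdint.h>

static uint64_t nanotime(uint64_t tsc, uint64_t rnt_tsc_base, uint32_t rnt_shift,
                         uint32_t rnt_tsc_scale, uint64_t rnt_ns_base)
{
    uint64_t shifted = (tsc - rnt_tsc_base) << rnt_shift;      // shl rax, cl
    unsigned __int128 scaled = (unsigned __int128)shifted * rnt_tsc_scale; // mul rcx
    return (uint64_t)(scaled >> 32) + rnt_ns_base;             // shrd; add
}

int main(void)
{
    // Placeholder constants: one second of a 2.4 GHz TSC comes out near 1e9 ns.
    printf("%llu ns\n",
           (unsigned long long)nanotime(2400000000ull, 0, 0, 0x6aaaaaaau, 0));
    return 0;
}
```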

Albert Xu:

When I single-step through using lldb, I seem to get into an infinite loop in the outer function __commpage_gettimeofday_internal, which repeatedly calls mach_absolute_time. It's strange, since I only make a single call to the clock_gettime library function. Has anyone else observed similar behavior?

My test program:

```
#include <time.h>
#include <stdio.h>
#include <assert.h>

int main()
{
    struct timespec tp;
    int rc;

    rc = clock_gettime(CLOCK_MONOTONIC, &tp);
    assert(rc == 0);
    return 0;
}
```

Cory Duplantis:

For folks on Linux who haven't done a ton with GDB, I highly recommend a GDB wrapper to make it a bit more friendly and visual. My two favorites are below:

Pwndbg: https://github.com/pwndbg/pwndbg/blob/dev/FEATURES.md

GEF: https://github.com/hugsy/gef

Gaurav Gautam:

I haven't used it much, but there is also this GUI for gdb: https://www.gdbgui.com/installation/

Cory Duplantis:

Ah, for sure! I'm mostly interested in terminal things, but gdbgui looks good for those who like getting out of the terminal.

Gaurav Gautam:

It has a terminal at the bottom left too: https://pasteboard.co/hqvyVTfWrxQ0.png

And this way I don't have to remember all the commands. It's also supposed to be able to visualize the data, but I haven't figured out how to make it do that.

I do like how it merges the assembly with the source. I like it more than how Visual Studio does it. I don't like how RemedyBG does it at all. I can't understand assembly when there is only assembly there; it takes me way too long.

Aaron:

Agreed. I really love RemedyBG but I do wish it had the option to interleave the disassembly with program symbols.

Max:

If you can always make division faster with the mult/add/shift trick, why isn't division implemented that way at the CPU level?

Bruno Lourenço:

After the sad chuckle promised on the previous video, I have two questions:

Why go through all the extra work to reduce the precision from GHz to MHz? If API users are supposed to query the frequency through QueryPerformanceFrequency and take that value into account, why not just provide the best frequency available?

To accomplish the frequency reduction, the OS has to somehow find out the frequency of rdtsc. Any hint as to how this is done, and whether it can be done at user level?

Casey Muratori:

A) I don't know why they reduce the frequency. I assume it is for some kind of compatibility reason: they don't want an API that might return 4 GHz or might return 10 MHz, so they kept it at 10 MHz to maximize compatibility, knowing that apps wouldn't test on both kinds of counters, etc.

B) AFAIK the OS times the TSC against a known clock, but the known clock depends on the hardware. There is a way to get the TSC frequency from the chip on modern Intel chips (there's CPUID info for it), but not on AMD. So AFAIK, AMD chips are always timed, whereas Intel ones _might_ be read out of the CPUID, or might not be (you'd have to ask Microsoft which they do, I don't know).

- Casey
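
For the curious, a hedged sketch of the CPUID route Casey mentions: Intel leaf 0x15 reports the TSC/core-crystal ratio, though many parts leave the crystal frequency field at zero, in which case the OS still has to time the TSC against another clock.

```
// Sketch: query CPUID leaf 0x15 for the TSC frequency (gcc/clang;
// MSVC has __cpuid in <intrin.h>). TSC Hz = crystal_hz * ebx / eax.
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || eax == 0 || ebx == 0) {
        printf("CPUID leaf 0x15 not available\n");
        return 1;
    }
    if (ecx)
        printf("TSC frequency: %llu Hz\n", (unsigned long long)ecx * ebx / eax);
    else
        printf("ratio %u/%u reported, but crystal frequency not enumerated\n",
               ebx, eax);
    return 0;
}
```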

Datamite:

The multiply-to-divide trick requires accessing the top 64 bits that the multiply instruction puts in RDX. Is there a way to access those top 64 bits in C? Are there intrinsics for this?

Casey Muratori:

Yes, there is. It's __mulh.

- Casey

Gaurav Gautam:

A 64-bit value will be of type long long, so I would assume you can just read it directly. I don't know if you can say "read this register into this variable" in C, though. But that trick is for assembly, and I imagine it's supposed to be written in assembly directly.

Although I have never seen it used in code I have read, and I have certainly never used it myself, there is a feature that supposedly allows you to write assembly in a C file: https://en.cppreference.com/w/c/language/asm

Martins:

You can write the same multiply, or even a 128→64-bit division, with the help of compiler intrinsics/types. No need for assembly. In MSVC you can do a full 64×64→128 multiply with _umul128 (or __umulh if you need just the top 64 bits). In gcc/clang you do the same with the __int128 type (it represents a pair of 64-bit registers).

See examples on how to do that here - PCG64_MUL macro: https://gist.github.com/mmozeiko/1561361cd4105749f80bb0b9223e9db8#file-pcg64-h
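
For reference, a minimal portable sketch of the multiply-high in both compiler families (the helper name mul_high_u64 is mine, not from the gist):

```
// Return the top 64 bits of the 128-bit product a*b (what RDX holds
// after a 64-bit mul on x64).
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>
static uint64_t mul_high_u64(uint64_t a, uint64_t b)
{
    return __umulh(a, b); // MSVC intrinsic: unsigned multiply high
}
#else
static uint64_t mul_high_u64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64); // gcc/clang
}
#endif
```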

sumeet:

Is the homework applicable on Linux?

In Listing 74, platform_metrics, a Linux implementation of ReadOSTimer is provided. The Windows version calls QueryPerformanceCounter, but the Linux version calls gettimeofday(), which returns the time since 1970: https://github.com/cmuratori/computer_enhance/blob/main/perfaware/part2/listing_0074_platform_metrics.cpp#L32-L51

I'm not sure where to continue from here.
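
A minimal sketch of a Linux OS-timer read in the same spirit, using clock_gettime(CLOCK_MONOTONIC) (which Anton recommends further down) instead of gettimeofday; the helper name is mine, not the listing's:

```
// Hedged sketch: monotonic OS timer in nanoseconds on Linux.
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t read_os_timer_ns(void)
{
    struct timespec tp;
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return (uint64_t)tp.tv_sec * 1000000000ull + (uint64_t)tp.tv_nsec;
}

int main(void)
{
    printf("%llu ns\n", (unsigned long long)read_os_timer_ns());
    return 0;
}
```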

ruinedme:

So, I've been doing the homework in Rust. You have to use the winapi crate to get access to QPF. However, as best I can tell, there is no call to rdtsc/p. When the exe runs, somehow there is a value in memory set to 10,000,000, and it just loads that into RAX, then copies it to a location held by RCX.

I ran through the assembly a handful of times via RemedyBG, and the only call to rdtsc is when I call it directly later in the program.

Link to a Git repo outlining the steps in RemedyBG: https://github.com/ruinedme/cpu-timer/blob/master/remedybg-steps.txt

Has anyone else run into something like this?

Casey Muratori:

I could be misreading your TXT file, but that looks like QueryPerformanceFrequency, not QueryPerformanceCounter. It makes sense that QueryPerformanceFrequency would load a constant 10,000,000, because that is what the Windows timer is specified to be, so it just returns you that...

- Casey

ruinedme:

You're right, I'm not sure what I was thinking. I found the QPC call, and it looks similar to your example.

Anton:

Did a little bit of digging on the Linux side (AMD Zen 2, Linux 6.3.9), looking at clock_gettime(CLOCK_MONOTONIC, ...), which seems to be the recommended call for monotonic OS time nowadays (gettimeofday(...) maps to clock_gettime(CLOCK_REALTIME, ...), but this depends on which libc you run).

On my machine, clock_gettime(CLOCK_MONOTONIC, ...) ends up performing a handful of jumps and 2 calls before ending up at an rdtscp in the kernel vDSO library. Whether or not you end up with an rdtscp here depends on which clocksource your kernel chooses (see dmesg | grep -i clocksource); it's probably tsc, but it might fall back on hpet or similar, which maps to a syscall instead. In doing this digging I found out that my system often considers tsc unstable and falls back on hpet due to a BIOS bug :(

Here's some benchmarking of various methods of getting a monotonic timestamp (average over 1000 samples, measured using rdtsc):

- clock_gettime (hpet) 1262 ns

- clock_gettime (tsc) 33 ns

- rdtsc 15 ns

A word of caution if you're stepping through the asm of clock_gettime in a debugger: I eventually ended up in the vDSO, which actually calls rdtscp (assuming the tsc clocksource) in a loop, see the source here: https://github.com/torvalds/linux/blob/master/lib/vdso/gettimeofday.c#L144. My guess is that it's a spin lock waiting for the sequence number `seq` to change before returning (not 100% on this), and when single-stepping in a debugger `seq` never updates, so you'll be stuck in an infinite (?) loop. Running at full speed with gdb still attached, it seems to pass through the loop twice. Maybe it would only perform a single rdtscp when running without gdb attached, no idea. Maybe that's the reason clock_gettime is ~2x rdtsc in the timing above.
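
A rough sketch of the kind of measurement above (the loop structure and sample count are my assumptions, not Anton's exact harness; gcc/clang on x86-64):

```
// Sketch: average cost of clock_gettime(CLOCK_MONOTONIC) in TSC ticks.
#include <stdio.h>
#include <time.h>
#include <x86intrin.h> // __rdtsc

int main(void)
{
    enum { SAMPLES = 1000 };
    struct timespec tp;
    unsigned long long start = __rdtsc();
    for (int i = 0; i < SAMPLES; i++)
        clock_gettime(CLOCK_MONOTONIC, &tp);
    unsigned long long ticks = __rdtsc() - start;
    printf("avg: %llu TSC ticks per clock_gettime call\n", ticks / SAMPLES);
    return 0;
}
```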

Gaurav Gautam:

So here we are again. The evening before the new episode. The worst is when it gets on my mind right before I hit the bed (like now) and then the anticipation starts building and building... Can't wait but have to.

Mathieu Soula:

For those on Linux who use gdb, here is the command that sets the disassembly style to Intel (the default style is AT&T):

set disassembly-flavor intel

If you want to make it the default, just write the command into ~/.gdbinit
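
A one-liner for that (assuming a POSIX shell):

```
echo 'set disassembly-flavor intel' >> ~/.gdbinit
```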

shikaan:

Documenting here my journey on an Apple Silicon M3; hopefully it will be less of a treasure hunt for somebody else, since it seems to differ from the other M? chips.

On Apple Silicon, the precise, monotonic, non-speculative timing is stored in the `CNTVCTSS_EL0` register, which is not reachable from user code. If you want to read that information, you need to use the `mach_continuous_time` function, which ultimately calls this code: https://github.com/apple-oss-distributions/xnu/blob/8d741a5de7ff4191bf97d57b9f54c2f6d4a15585/osfmk/arm64/machine_routines.c#L2366-L2385

The `CNTVCT_EL0` register, while it can be accessed by user code, holds a less secure, possibly speculative version of the same information. It's probably fine for these exercises, but maybe shouldn't be relied upon in general.

On the frequency part, Apple Silicon offers something similar to what (I think) Windows offers with `QueryPerformanceFrequency`, via the userland-readable `CNTFRQ_EL0` register. You can make your code look more like Casey's (and the Windows version of this part in general) by using those registers instead of relying on `gettimeofday`.
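
A minimal sketch of that approach (assumptions: clang on Apple Silicon; the helper names are mine):

```
// Sketch: read the userland-accessible counter and its frequency
// directly on arm64, instead of going through gettimeofday.
#include <stdint.h>
#include <stdio.h>

static uint64_t read_cntvct(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, CNTVCT_EL0" : "=r"(v)); // counter value
    return v;
}

static uint64_t read_cntfrq(void)
{
    uint64_t f;
    __asm__ volatile("mrs %0, CNTFRQ_EL0" : "=r"(f)); // counter frequency, Hz
    return f;
}

int main(void)
{
    printf("counter %llu at %llu Hz\n",
           (unsigned long long)read_cntvct(),
           (unsigned long long)read_cntfrq());
    return 0;
}
```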

waffle.lord:

If you are following this course in C# and not using Visual Studio (like I was), the easiest way to debug the disassembly still seems to be Visual Studio. You just need to disable 'Just My Code' in the debug options and set 'enable native code debugging' to true in your debug properties.

Then you can set a breakpoint, hit it like normal, and drop to the disassembly by right clicking in the source window and choosing 'go to disassembly'.

Maybe someone will find this helpful, or maybe not, but I figured I'd say it since I didn't see it mentioned here and I was losing my mind trying all kinds of weird tools that weren't really working for me...

Daniel:

I've tested this on Linux Mint, and I can see the rdtscp instruction followed by the others, but it's buried behind a couple of calls, and it does a lot of checks before and after. One of those checks makes the whole thing loop indefinitely; my guess is that it checks some time-sensitive thing for accuracy and fails because I'm debugging.

Trey:

XD Elon Musk would be proud haha. Thanks for making all of this information digestible! As I start to understand more and more of this stuff, programming gets EVEN MORE enjoyable.
