Such an interesting read! Thank you for sharing it with us!
Stupid me, I forgot I have a laptop with a 12700H. I'm not sure what the difference is between mobile and desktop CPUs, but I could test things on Windows 11 and Linux.
Another Casey banger :)! Thank you!
With discoveries like this, my jesting claim that Assembly is really a declarative language rings more and more true with every passing day. Thank you for these deep dives!
I definitely initially misread the title as “The Case of the Missing Excrement” and was *really* confused
Since I had to use Event Tracing for Windows, I can assure you that not only was the excrement not missing, it was present in abundance.
- Casey
I have an Alder Lake CPU, and observed this when doing the homework for part 3.
I have also tested this on a Raptor Lake CPU, which follows the exact same pattern.
I'm using Google Benchmark, since it makes it really easy to read the performance counters you want, without recompiling.
I was able to create a few other benchmark programs that probe a bit at how the CPU executes these things. I've observed that the front-end will fuse these instructions with a jump when they occur contiguously, at which point each cycle can only execute a single immediate addition or subtraction. Adding a nop prevents the fusion, and the optimization happens again. So one question is how often CPUs get to use this optimization in the wild.
I also found something in Agner Fog's microarchitecture manual, in the section about Alder Lake, where it says: "Integer addition with a small immediate constant has zero latency in some cases."
I've created a gist with my benchmark program, and the output from running these on Alder Lake and Raptor Lake CPUs, with a few relevant performance counters: https://gist.github.com/danielbendix/a377a976e62b6e8a8ea9c93636f0ff1e
Anyone let me know if you have something you'd really like tried on these, and I'll see what I can do.
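For readers who want to poke at the fusion observation above without opening the gist, here is a minimal sketch of the kind of loop pair one could compare. It is not the code from the gist: the function names, the choice of two dependent `add` instructions per iteration, and the wall-clock timing helper are all assumptions made for illustration. It assumes GCC or Clang on x86-64 with the default AT&T assembler syntax; on other targets it falls back to plain arithmetic and measures nothing useful. Wall-clock time is also a much cruder signal than the performance counters used in the thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_64_INLINE_ASM 1
#else
#define HAVE_X86_64_INLINE_ASM 0
#endif

// Hypothetical variant A: `sub` sits directly before `jnz`, so the pair is
// adjacent and eligible for fusion. Returns the counter (2 per iteration).
std::uint64_t loop_sub_adjacent_to_jump(std::uint64_t iterations) {
    assert(iterations > 0);  // the asm loop body always runs at least once
    std::uint64_t counter = 0;
#if HAVE_X86_64_INLINE_ASM
    asm volatile(
        "1:\n\t"
        "add $1, %[c]\n\t"   // dependent chain of immediate additions
        "add $1, %[c]\n\t"
        "sub $1, %[n]\n\t"   // loop counter, immediately followed by the jump
        "jnz 1b\n\t"
        : [c] "+r"(counter), [n] "+r"(iterations)
        :
        : "cc");
#else
    counter = 2 * iterations;  // fallback: same result, not a real probe
#endif
    return counter;
}

// Hypothetical variant B: identical, except a `nop` separates `sub` and `jnz`.
// `nop` does not modify flags, so the loop still terminates correctly.
std::uint64_t loop_nop_before_jump(std::uint64_t iterations) {
    assert(iterations > 0);
    std::uint64_t counter = 0;
#if HAVE_X86_64_INLINE_ASM
    asm volatile(
        "1:\n\t"
        "add $1, %[c]\n\t"
        "add $1, %[c]\n\t"
        "sub $1, %[n]\n\t"
        "nop\n\t"            // breaks the adjacency of `sub` and `jnz`
        "jnz 1b\n\t"
        : [c] "+r"(counter), [n] "+r"(iterations)
        :
        : "cc");
#else
    counter = 2 * iterations;
#endif
    return counter;
}

// Times one call of `fn` and returns average wall-clock nanoseconds per
// loop iteration. This is a rough proxy, not a cycle count.
template <typename Fn>
double nanoseconds_per_iteration(Fn fn, std::uint64_t iterations) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t result = fn(iterations);
    auto stop = std::chrono::steady_clock::now();
    assert(result == 2 * iterations);  // sanity check: all additions ran
    (void)result;
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `nanoseconds_per_iteration` on each variant with a large iteration count (say, a billion) and comparing the two numbers gives a first impression of whether the `nop` changes anything on a given machine. To get actual cycle counts, the same two loops could be wrapped in a benchmark harness that reads performance counters, as done in the thread.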
Great article, thanks for sharing this research! And, sorry you had to go through the Ultimate Sadness...
On another note, I wonder why they didn't stick with this for newer processors? Maybe it was only something they experimented with in Golden Cove that turned out not to be as beneficial?
I am not certain which processors have this, since I was only able to test Golden Cove. It's possible that it also happens in some other Intel processors!
- Casey