Q&A #72 (2025-02-17)

The full episode is only available to paid subscribers of Computer, Enhance!

Answers to questions from the last Q&A thread.

Feb 18, 2025

∙ Paid

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.

Questions addressed in this video:

[00:00:02] “Column-major or Row-major for matrices, which one do you prefer and why?”
[00:05:07] “Hey Casey, if you were going to write a high-performance 2D graphical application from scratch (like an emulator or a level editor) for Windows today, how would you do it? Is Vulkan (or maybe DirectX) just the best way to go at the moment for efficiently rendering pixels?”
[00:13:31] “In the past episodes you mention that CPUs are often optimized for read efficiency, rather than write, so expanding processes like the decompression in your example might be adversely affected.
I'd like to understand how you would approach something like this toy AoC problem:
https://adventofcode.com/2024/day/21
not from the dynamic programming point of view, but rather as a performance problem that truly expands the control paths making essentially a huge multi-step state machine that needs to walk through the required sequences to calculate the total length.”
[00:17:14] “A lot of your examples are tailored to C programming or ways of embedding C code into your high-level language codebase, which is good for educational purposes and personal projects.
What would be your advice when your existing codebase is in a relatively close-to-binary language like Java or Go and C extensions cannot be used, either due to portability or compliance?”
[00:22:04] “What's your opinion on using atomic CAS (compare-and-swap) operations for building lock-free containers? Like a lock-free stack, for example, that needs only a single pointer which could be easily CASed in a spin-loop.
With the view of modern CPU architectures and OS advancements, would you say this is a beneficial avenue for exploration, or that a simple mutex/futex would be a better and more reliable solution, already optimized by the OS maintainers?”
[00:30:10] “Hey there, it's a bit off topic, but while using godbolt, sometimes I don't understand the partial optimisation applied, like here (I know you don't speak Rust fluently, but I think this example is simple enough). Let's say we have two functions … In both cases the compiler seemed smart enough to always return the result of the computation (here 576460751766552576) without doing any actual addition. But it seems there is an unnecessary loop in the heap-allocated version, that to me seems to be doing nothing. Is there a reason for this generated asm to exist?”
[00:37:20] “Would you consider offering a course on DX12 optimization? It's said this API is very hard to learn.”
[00:37:35] “In the function approximation video recently you showed how GCC, Clang, and MSVC generated different assembly for the square root.
But to me that seemed like a strange test case. In real use the function will be called in a context, and likely inlined. So the actual optimized code in real use would likely look different, and might know that it does not need to clear.
In two tests I did, MSVC inlined the function and made it just the sqrt call.
I think this might showcase an important angle on how to read and use the output in Compiler Explorer. Or what are your thoughts on it?”
[00:39:47] “I'm wondering if it's better to work with a higher spec machine, to deal with the requirements of development (running unoptimized builds, running many art tools etc), and rely on rigorous profiling to track any performance related issues, or, if it's better to force myself into a corner, working with a lower spec machine (within the range of what I intend to support). Agner Fog recommends doing so, but what would you recommend?”
[00:48:14] “I recently read your article about planes in the Witness. It got me thinking about how to approach preconditioning of data in general, because this is something I have some issues with in my codebase.
For instance, the younger version of me decided to normalize the ABC part of the plane in the Plane3 struct when it is initialized, so that this is only done once and downstream code can rely on it. However, the issue I have with this is that now I can't treat the plane as plain data, e.g. were I to load a lot of planes off of disk, I'd also have to run the initialization routine on them instead of just reinterpreting memory.
I am therefore thinking about trying a more lazy preconditioning strategy for future code. Have your own thoughts on this matter changed since writing the article?”
[00:52:21] “If you can look at this godbolt, https://godbolt.org/z/K6EPn9Wx5, it's a simple example of a regular vector add.
I don’t understand why gcc and msvc are so confused.
Even when I do what you taught us a couple videos back, both gcc and msvc still can’t reach clang’s output:
https://godbolt.org/z/vT1WcoeYd
But even when I look at clang’s output, I don’t understand it. Vector4 is 4 floats, exactly 128 bits. It should fit into xmm register.
But clang produced
xmm0 += xmm2
xmm1 += xmm3
Meaning two of the floats were in xmm0 and two in xmm1?
Why are the 4 floats separated into two registers instead of fitting in one xmm?
Why is there both _mm_set_ps1 and _mm_set1_ps? They have the same description in the Intel Intrinsics Guide, and they seemingly produce the same asm. What’s going on there?”
[00:58:08] “With the trend of newer processors having more cores with complicated interconnects,
and communication between processors with separated memory and potentially heterogeneous architectures (e.g. CPU and GPU) becoming increasingly important, do you think it will be an essential skill for programmers to program with the topology of the target system in mind?
Additionally, do you think we will see hardware where cache coherence will be completely software managed?”
[01:06:00] “Is there a worry of local minimums when thinking about optimizations? Ideally, we would architect code in a way that is most performant from the beginning. Any insights on when to stop optimizing individual hot spots identified with a profiler and start thinking about whether a larger architecture change needs to occur to get more performance?”
[01:09:10] “I'm trying to get the pmctrace code to work on my machine from the Spooktacular challenge, however, after running your code, I seem to get all zeros for the results for example. Any idea what could be going on? Could it be my version of Windows 10 is not supported (it is from 2019: Version 10.0.18363 Build 18363)? Thanks!”
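
For readers unfamiliar with the structure described in the [00:22:04] question, here is a minimal sketch of a stack whose only shared state is a single head pointer updated with compare-and-swap in a retry loop (often called a Treiber stack). It is an illustration only and is not taken from the video; the name `LockFreeStack` and the memory orderings are assumptions chosen for the example. Note that `pop` as written is not safe when several threads pop concurrently: it has no protection against the ABA problem or against a node being freed while another thread still reads it, which is a large part of why such containers are harder to get right than the question suggests.

```cpp
#include <atomic>
#include <optional>
#include <utility>

// Hypothetical illustration: a Treiber-style stack with one atomic head pointer.
template <typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

public:
    void push(T value) {
        Node* node = new Node{std::move(value), head.load(std::memory_order_relaxed)};
        // If the CAS fails, compare_exchange_weak writes the current head
        // into node->next, so the loop simply retries with the fresh value.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    std::optional<T> pop() {
        Node* old_head = head.load(std::memory_order_acquire);
        // Assumption: no other thread frees old_head while we read old_head->next.
        // A production version needs hazard pointers, epochs, or tagged pointers.
        while (old_head &&
               !head.compare_exchange_weak(old_head, old_head->next,
                                           std::memory_order_acquire,
                                           std::memory_order_acquire)) {
        }
        if (!old_head) {
            return std::nullopt;
        }
        std::optional<T> result(std::move(old_head->value));
        delete old_head;
        return result;
    }

    ~LockFreeStack() {
        // Drain remaining nodes; the destructor is assumed to run single-threaded.
        while (pop()) {
        }
    }
};
```

Pushing 1, 2, 3 and then popping returns the values in last-in, first-out order, and popping an empty stack returns an empty `std::optional`.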
