In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.
Questions addressed in this video:
[00:02] “You've talked a lot about the sorry state of modern software caused by a lot of unnecessary bloat at lower levels. If one were to pursue the goal of moving it in a better direction (either directly or by acquiring the necessary knowledge and skills to use later), which software areas would be best to go into? OS dev? Embedded? Gamedev? Server infrastructure? Developer tools (e.g. compilers, IDEs, frameworks, etc.)? Anything else?”
[02:52] “Since multithreading is such a vast topic and won't be covered here, do you have any good books or other resources you can recommend? How did you learn to do effective multithreading?”
[05:28] “Out of curiosity, because you mentioned that you have a Linux machine, do you have any resources for learning low-level Linux? I can't seem to find anything good on Linux multithreading that doesn't involve pthreads, or is that just the best way to go?”
[07:39] “The results for the performance of reading a file with MapViewOfFile on my computer look different than the ones on yours. The speed is better than large pages and comparable to a reused buffer. All the page faults are still there (indeed, further tests show that when using MapViewOfFile the page fault count goes up *every page*, not even every 16 pages as with VirtualAlloc), but somehow we seem to not be paying the associated price in terms of performance. Any idea as to what might be causing this discrepancy?”
[11:06] “Say you have a function that is embarrassingly parallel (like matrix multiplication), and an ideal candidate for multithreading. However, other higher-level components/systems in the codebase are already multithreaded (e.g., one thread for physics, another for rendering, etc.). How do you determine whether a multithreaded implementation of the function would be prudent? Would you be concerned about contention for threads? Should you be careful not to do this too many times for low-level functions?”
[13:45] “When I'm running the listing_0168 with the fread, the sum and the overlapped sum, I'm not seeing that big drop when the buffer size gets too large for the cache. First I ran my own implementation, but even when running your code, the result is the same. The regular OpenAllocateAndSum starts out with ~1.42 GB/s at 256 KB buffer size and slowly drops to ~1.28 GB/s at 256 MB buffer size. The overlapped one has a similar curve, but starts higher and drops a bit faster. I'm also on Windows, with an Intel i7-10850H. Do you have a clue what's going on here?”
[15:36] “I'm probably missing something on the memory mapped file tests, but, couldn't we have two buffers on the struct memory_mapped_file, and map to one buffer while we sum the previous one?”
[17:59] “I'm currently at the very beginning of part 3, on the repetition tester. I was able to reproduce the page fault issue; however, for the worst-case bandwidth I'm getting very low results at seemingly random times (without doing any mallocs, so no page faults are involved here). To give an idea, here are the results produced by listing 104 running on my PC after just a few minutes (and the Max keeps going down over time, which seems totally unexpected, because with everything cached I would expect speed to go up). In perfmon I don't see any anomalies. This is run on a file with 1 million points (on a file with 1 thousand points the difference between min and max is even much worse). Any ideas what I can check? Thanks.”
[20:10] “I wrote the simplest possible memory clear function and the compiler converted it into a memset. Going through the assembly code for memset I discovered the "movntdq" instruction. You stated in the non-temporal stores episode that it is quite rare that you need this instruction, but memset is a pretty common thing to use. Maybe I misunderstood something, but can you explain how this instruction is rare if something as common as memset generates it?”
[24:16] “Next I went ahead and wrote my own SIMD version of clearing memory. I wrote two versions, one with a regular store and the second with a non-temporal store. The non-temporal one was three times as fast. Maybe it was three times as fast because of the non-temporal store, but I wanted to be certain by getting your opinion on these routines. Maybe I wrote the temporal version in a bad way, so that it is slower than it needs to be, making the non-temporal version seem faster. Would you be able to do an analysis of this code?”
[26:45] “I multithreaded the non-temporal store, but couldn't gain much performance improvement. The throughput in the non-multithreaded version was 24 GB/s, whereas the multithreaded one peaked at 30 GB/s. I have spent two days trying to find a reason, and my hunch is that the memory bandwidth is the bottleneck here. What sort of tests should be done to investigate this further?”
[28:36] “Regarding the wonderful Halloween Spooktacular Challenge - As far as I understood, it would be possible to create a far simpler implementation with fewer limitations if we could use a custom kernel driver to retrieve the PMCs, correct? If we assume that we are fine with installing a kernel driver for collecting PMCs, are there some such kernel drivers available that you would recommend? If not, do you perchance have any tips on how one should approach writing a driver? (I know you're not a driver developer, but maybe you know a thing or two about it anyways)”
[30:02] “What is the reason that Linux struggles to make so many applications work on it? Why is basically everything made for Windows? Even macOS has better application support than Linux. I'm talking about stuff like video games and apps like Photoshop, for instance.
While it is possible to run those applications on Linux (although sometimes it isn't), why is it such a struggle, and why is there always something that doesn't work? Is it because macOS/Windows are relatively stable operating systems where programmers know what they have available, while Linux comes in so many variations that it's hard to know what's included? Or maybe is it because Windows is just more popular among regular users and the market for Linux is not worth it? Also endemic to this is the problem of support for any kind of specific hardware. You bought a WiFi adapter that plugs into USB, brings internet to your laptop, and requires a custom driver from the vendor? Forget about it being supported on Linux.”