Each Monday I answer questions from the comments on the prior week’s videos. Transcripts are not available for Q&A videos due to length. I do produce closed captions for them, but Substack still has not enabled closed captions on videos :(
Questions addressed in this video:
[0:00:25] “I'm curious where you'd put hyperthreads in your taxonomy of factors and how you think about HT vs core count in performance analysis? My understanding is limited, but since they blend both ~~ILP~~ instructions-per-clock and multithreading concerns I would be happy to hear your thoughts on this.” / “I was curious if one of the multipliers would end up being branch prediction. Did you exclude it because it has much less of an impact than the others, or because it’s harder to take advantage of, or some other reason?”
[0:22:00] “What is a "minimal" working mental model for working with multithreaded code (e.g. x86-64, Windows, MSVC)?”
[0:27:15] “When the data does not fit in any of the caches, I would assume that the gained multiplier for multithreading would be greater if the operation would be more complicated than a simple add, since the threads would be less starved for memory bandwidth. Is that correct?
If so, what would be the optimal cycles spent per operation in relation to the available memory bandwidth?”
[0:32:11] “Logical processors vs cores”
[0:33:32] “What happens if we start more threads than there are L1 caches?”
[0:38:49] “Green threads / fibers”
[0:41:58] “Are we also going to look at memory optimization?”
[0:42:36] “Can multithreading help with reading data from disc? Or will it be stopped by the similar barrier as memory and would just be faster on a single thread?”
[0:50:02] “Can an application query the size of caches, core count, memory bandwidth of the machine on-the-fly to be able to optimize the algorithm parameters?”
[0:53:14] “Why does the first example get to 35 adds/clock while the second to 52 adds/clock, even though each of the 4 cores/threads only uses L1 cache? Does it have something to do with multithreading/SIMD overhead being a smaller percentage of the whole execution?”
[0:56:09] “I've got a similar question for the other side, why does the speedup break down so large when going to main memory? Is it just the amount of data that needs to be transferred that is the bottleneck (is that the memory bandwidth?) My first instinct would be to split the large chunk up in the size of the fastest smaller chunks, but that does not seem to help apparently.” / “When we have a workload that fits into L1, that data still has to be loaded from main memory when we first access it, and thus the first time using that data should be limited by memory bandwidth, not L1 bandwidth right? So, if all our program does is look at each number once and add them, why isn't every single-threaded version limited to the 1.x adds/clock that is imposed by the core's bandwidth? How did the data get into L1 without hitting that limit?”
[1:05:08] “Question related to caching: When does the hardware pre-fetcher put something in the L1 vs L2 vs L3 vs not, and if this depends on the data size, how is the size specified in the code? A follow up question would be: Where would it put data at a random pointer address, that could potentially point to a small or large array?”
[1:09:25] “What exactly happens between L1-cache and ALU. I mean cache lines are 64 bytes wide, but registers are narrower than that. Yet everyone tells to make as much use out of those 64 bytes as possible. So is there some kind of "most-recently-queried" cache line? Or maybe there are more than one of them? How does it work?”
[1:21:35] “There are many numbers beyond bandwidth, that memory manufacturers slap on their products. Are we going to look at them more closely? I mean there are things like CAS-latency and other kinds of latencies, then there is memory clock rate, whether it is DDR3, DDR4 or DDR5 etc?”
[1:23:32] “I'm wondering if this also means that these cores can execute at different clock rates from each other or if they all execute at the same clock rate at all times?”
[1:26:24] “It seems like all this work to speed up the execution of my instructions takes no time. How is that possible?”
[1:33:44] “I don't know if you published that testing harness yet, but FYI numpy has their own array type as well. I actually had no idea the array existed so I don't know if it's faster or slower. But I suspect if you stay within numpy arrays, the numpy.sum gets faster” / “Great lecture! For python perf, I would recommend trying out numba as well.”
[1:42:02] “What I’m wondering is to what extent the technique you showed (especially using Cython) could be used for non numerical computations (such as sparse graph search for example), and what gain it could bring.”
[1:45:35] “If I'm running with 350+ other processes that are all squeezing CPU/memory - threading becomes slower, to a point where the context switching and overhead may be a net loss, no? So are there techniques to apply optimizations such as threading in some dynamic fashion that observes the PC's current workload?”
[1:47:41] “In your opinion (or from your experience), large Python, JS, PHP,... applications that are slow, where do you think the majority of performance loss is happening? Is it in specific isolated pieces of code (such as a computationally heavy loop) which can be replaced by a fast and carefully optimized version? or is it spread throughout the entire code and cannot be fixed as shown in the video because you would have to rewrite the majority of the application?”
[1:53:34] “Are there any plans to do "Data-Oriented Design and C++"-style analysis with generated assembly? I think everyone would want to see your take on these topics (probably in much more detail) on practically reasoning about cache usage and information density, and extending the analysis to more involved examples.”
[1:55:52] “Will there be any coverage of AZDO or other api-independent means of performance aware graphics programming? Coverage of shader programming would also be an interesting topic.”
[1:58:36] “If we are deploying a web app in a container (e.g. docker) to a platform like Heroku or to a virtual server like Linode are there ways to understand the underlying architecture enough to use these tools as you show here, or do we just extract more generic concepts around efficiency?”
[2:02:21] “Will you be going over other languages at all e.g. Java? Would be very useful to see at least a small example in one of these languages.”