In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.
Questions addressed in this video:
[00:02] “How effective is manual cache management on modern CPUs? I once tried writing a loop that would go over a chunk of memory, then reset back to the beginning, and I tried using a prefetchw instruction to "tell" the cache that I was restarting and where it had to look next. But that ended up being slightly slower than the version without the prefetch instruction. I would think this is a good strategy, but maybe not. Or maybe the memory I was looping over was too small.”
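For context, here is a minimal sketch of the kind of loop the question describes, using GCC's __builtin_prefetch; the prefetch distance and buffer layout are made up for illustration, and the write-intent form __builtin_prefetch(ptr, 1) is what can lower to prefetchw on CPUs that support it. This is not the code discussed in the video, just an illustrative example:

    #include <stddef.h>
    #include <stdint.h>

    // Sum a buffer repeatedly, hinting upcoming cache lines and, near the end
    // of each pass, hinting the start of the buffer before wrapping around.
    #define PREFETCH_DISTANCE 512 // bytes ahead; needs per-CPU tuning

    uint64_t SumWithPrefetch(uint8_t *Buffer, size_t Size, int PassCount)
    {
        uint64_t Sum = 0;
        for(int Pass = 0; Pass < PassCount; ++Pass)
        {
            for(size_t I = 0; I < Size; I += 64)
            {
                if(I + PREFETCH_DISTANCE < Size)
                {
                    __builtin_prefetch(Buffer + I + PREFETCH_DISTANCE, 0, 3);
                }
                else
                {
                    // About to wrap: hint the beginning of the buffer instead.
                    __builtin_prefetch(Buffer + (I + PREFETCH_DISTANCE - Size), 0, 3);
                }
                Sum += Buffer[I];
            }
        }
        return Sum;
    }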
[08:39] “When people talk about performance, they rightfully say you have to know how the code will be used in order to decide the correct structure, and there is no way around that. I totally get it. But when you design libraries for others, you have to do this based on incomplete information. How do you design those so you are doing your best to get good performance, given that the user doesn't do stupid things with your API?”
[13:48] “Regarding the Powerful Page Mapping Techniques talk, have you ever combined GetWriteWatch with Sparse Memory? Would it be fast enough? Why?”
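For context, this is the basic Windows write-watch setup the question refers to; a minimal sketch that only shows GetWriteWatch itself, not the sparse-memory side, and not code from the talk:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SIZE_T Size = 64*4096;

        // Write watching must be requested when the region is reserved.
        void *Base = VirtualAlloc(0, Size, MEM_RESERVE|MEM_COMMIT|MEM_WRITE_WATCH, PAGE_READWRITE);
        if(!Base) return 1;

        // Touch a couple of pages.
        ((char *)Base)[0] = 1;
        ((char *)Base)[5*4096] = 1;

        // Ask the OS which pages were written, resetting the watch state.
        void *Addresses[64];
        ULONG_PTR Count = 64;
        DWORD Granularity = 0;
        GetWriteWatch(WRITE_WATCH_FLAG_RESET, Base, Size, Addresses, &Count, &Granularity);

        printf("%llu dirty pages (granularity %lu)\n", (unsigned long long)Count, Granularity);
        return 0;
    }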
[14:41] “I'm confused, and maybe this was already discussed and answered before, but how exactly does porting an application from 32-bit to 64-bit actually improve its performance? Does the compiler automagically compile the code so that it reads data from memory into wider registers (packs it somehow)?”
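One concrete illustration of where a gain can come from, offered only as an example: on a 32-bit x86 target, 64-bit integer arithmetic has to be split across register pairs, whereas an x64 build does it in one instruction and also gives the compiler twice as many general-purpose registers to work with.

    #include <stdint.h>

    // On a 32-bit x86 build this typically compiles to an add/adc pair plus
    // extra register shuffling; on x64 it is a single 64-bit add.
    uint64_t Add64(uint64_t A, uint64_t B)
    {
        return A + B;
    }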
[20:30] “Can you give a high-level explanation of Transparent Huge Pages, and how an OS would detect and automatically switch a program to use huge pages at run time?”
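For reference, on Linux a process can also opt a specific range into transparent huge pages explicitly; a minimal sketch using madvise, separate from the kernel's automatic promotion via khugepaged, with the region size chosen arbitrarily:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE (64u*1024*1024)

    int main(void)
    {
        // Map an anonymous region; 2 MB-aligned, 2 MB-sized chunks are what
        // the kernel can back with huge pages.
        void *Base = mmap(0, SIZE, PROT_READ|PROT_WRITE,
                          MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if(Base == MAP_FAILED) return 1;

        // Hint that this range is a good candidate for transparent huge pages.
        madvise(Base, SIZE, MADV_HUGEPAGE);

        // Touching the memory faults it in (and lets khugepaged collapse it
        // into 2 MB pages later if it wasn't huge-mapped up front).
        for(unsigned I = 0; I < SIZE; I += 4096)
        {
            ((volatile char *)Base)[I] = 1;
        }

        munmap(Base, SIZE);
        return 0;
    }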
[25:44] “Regarding the M2 security issue talk: do the writers of libraries like SSL have to rewrite the routine you showed to basically do even more redundant work and always write to both arrays anyway somehow?”
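For context, this is the general masking style such constant-time routines use; a generic illustration, not the specific routine from the talk:

    #include <stdint.h>
    #include <stddef.h>

    // Constant-time conditional copy: if Condition is 1, copy Src into Dst;
    // if 0, leave Dst unchanged. Either way every byte of both buffers is
    // read and Dst is written, so the access pattern is independent of the
    // secret bit.
    void ConditionalCopy(uint8_t *Dst, const uint8_t *Src, size_t Count, uint8_t Condition)
    {
        uint8_t Mask = (uint8_t)(0 - Condition); // 0x00 or 0xFF
        for(size_t I = 0; I < Count; ++I)
        {
            Dst[I] = (uint8_t)((Src[I] & Mask) | (Dst[I] & (uint8_t)~Mask));
        }
    }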
[29:39] “Is the measurement of the L3 size going to be hazier because the L3 is shared amongst all cores/processes in the system? So it's not just us using the L3 during our testing?
Related, wouldn't it be true that a process that is churning through some large amount of memory pollutes the L3 cache for all other processes running on the machine? I guess that is related to the M1 GoFetch attack but with the L2. But it's wild to think a process could hurt the performance of others so easily.”
[33:57] “I'm catching up, just watched the video where we reached the achievable limit of memory reads using SIMD. From this, I guess that a good version of memcpy would be implemented using SIMD. Is that usually the case? Are there other subtleties or gotchas related to memcpy?
I remember reading that, sometimes, good results were achieved using a movsb instruction. I didn't understand what it was at the time, and you haven't talked about it so far. What is it, and is it useful in practice?”
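For reference, "movsb" here almost certainly refers to rep movsb, the x86 string-copy instruction that modern CPUs accelerate (the enhanced/fast rep movsb feature). A minimal GCC-style inline-assembly sketch, not a drop-in memcpy replacement:

    #include <stddef.h>

    // Copy Count bytes from Src to Dst using rep movsb. On CPUs with enhanced
    // rep movsb support, the microcode uses wide internal moves on its own.
    void CopyRepMovsb(void *Dst, const void *Src, size_t Count)
    {
        __asm__ volatile("rep movsb"
                         : "+D"(Dst), "+S"(Src), "+c"(Count)
                         :
                         : "memory");
    }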
[40:34] “Let's assume that my work has a repository that is deeply object oriented with multiple inheritance. Let's also assume that a complete rewrite is off the table, but I'm free to start "de-objectifying" it one thing at a time. What's the best way to approach this incrementally? Should I start with the slowest parts, and flatten the "leaf/child" objects into the parents first, and so forth? And is it even worth the effort to de-objectify the code if it remains in something like Python? Or is all that a fool's errand?”
[42:50] “Can you help describe the difference between throughput and bandwidth? Is bandwidth just throughput at the limit?”
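As a rough illustrative calculation, with numbers picked only as an example: dual-channel DDR4-3200 has a theoretical peak of 2 channels x 8 bytes x 3.2 GT/s, about 51.2 GB/s, which is its bandwidth; a copy loop that actually moves 30 GB each second achieves a throughput of 30 GB/s against that ceiling.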