Table of Contents
Every entry in every series, listed for quick navigation.
Performance-Aware Programming Series
This series is designed for programmers who know how to write programs, but don’t know how hardware runs those programs. It’s designed to bring you up to speed on how modern CPUs work, how to estimate the expected speed of performance-critical code, and the basic optimization techniques every programmer should know.
The course is broken into parts, with the first part (the “prologue”) being strictly a demonstration with no associated homework. Later parts feature weekly homework.
Q&A videos are posted regularly. If you have a question you’d like answered, please put it in the comments of the most recent Q&A video. Homework listings are available from github.
Prologue: The Five Multipliers (3 1/2 hours, no homework)
This part of the course gives simple demonstrations of how seemingly minor code changes can produce dramatically different software performance, even for very simple operations.
Welcome to the Performance-Aware Programming Series! (22:05)
Waste (32:56)
Instructions Per Clock (25:05)
Caching (22:55)
Multithreading (32:11)
Python Revisited (36:22)
Interlude (1 hour, no homework)
The Haversine Distance Problem (30:28)
Part 1: Reading ASM (7 hours, plus homework)
This part of the course is designed to ensure that everyone taking the course has a solid understanding of how a CPU works at the assembly-language level.
Instruction Decoding on the 8086 (28:28)
8086 Decoder Code Review (1:17:49)
Simulating Non-memory MOVs (18:00)
Simulating ADD, SUB, and CMP (25:56)
Simulating Conditional Jumps (19:41)
Simulating Memory (26:32)
Simulating Real Programs (16:02)
Other Common Instructions (19:43)
The Stack (26:58)
Estimating Cycles (23:56)
From 8086 to x64 (26:21)
8086 Simulation Code Review (33:05)
Part 2: Basic Profiling (4 hours, plus homework)
In this part of the course, we learn about how to measure time, and instrument programs to automatically determine where time is being spent.
Generating Haversine Input JSON (15:40)
Introduction to RDTSC (48:05)
Instrumentation-Based Profiling (18:01)
Profiling Nested Blocks (26:12)
Profiling Recursive Blocks (30:44)
Comparing the Overhead of RDTSC and QueryPerformanceCounter (13:00)
Part 3: Moving Data (13 hours, plus homework, and 3 hours of bonus content)
Using our knowledge from parts 1 and 2, in Part 3 we look at how data moves into the CPU, and how to estimate the upper performance limits of our software imposed by the need to move data.
Measuring Data Throughput (21:54)
Repetition Testing (27:57)
Page Faults (38:52)
Probing OS Page Fault Behavior* (33:05)
Four-Level Paging* (31:23)
Analyzing Page Fault Anomalies* (31:44)
Powerful Page Mapping Techniques* (39:20)
Memory-Mapped Files* (20:46)
Inspecting Loop Assembly (32:31)
Intuiting Latency and Throughput (22:57)
Analyzing Dependency Chains (29:06)
CPU Front End Basics (31:09)
Branch Prediction (42:03)
Code Alignment (32:03)
The RAT and the Register File (45:21)
Cache Size and Bandwidth Testing (34:00)
Latency and Throughput, Again (37:55)
Unaligned Load Penalties (27:25)
Cache Sets and Indexing (30:34)
Non-Temporal Stores (28:11)
Prefetching (21:31)
Prefetching Wrap-up (19:40)
2x Faster File Reads (31:01)
Testing Memory-Mapped Files* (18:03)
* Entries with an asterisk were “bonus” entries that can be skipped.
Part 4: Computation
Part 4 extends the techniques learned in Part 3 from memory operations to computation more broadly. We look at what kinds of operations CPUs perform natively, how more complicated operations are broken down into native ones, and how to reason about the peak performance of computational processes.
Reference Haversine Code (19:49)
Determining Input Ranges (14:37)
Introduction to SSE Intrinsics (48:51)
Function Approximation (56:28)
Part 4 is still in progress. This listing will be updated as new videos are posted.
I became a paid subscriber solely for this course. I am SUPER stoked!!
Can’t wait for this!