Zen is one of the most important microarchitectures in the history of the x86 ecosystem. Not only is it the reigning champion in many x64 benchmarks, but it is also the architecture that enabled AMD’s dramatic rise in CPU market share over the past eight years: from 10% when the first Zen processor launched, to 25% at the introduction of Zen 5.
I recently had the honor of interviewing none other than Zen’s chief architect, Mike Clark. I only had 30 minutes, but I tried to fit in as many of our microarchitecture questions as I could! Subscribers to Computer Enhance will recognize many of them as ones we’ve collectively wondered about during Q&As in the Performance-Aware Programming series - and I’m delighted to report that, as you’ll see, Mike gave detailed answers to all of them.
Below is the edited transcript of our conversation. I’ve tried to keep it as accurate as possible to the original audio, while reworking the phrasing to be appropriate for reading rather than listening. I have also had AMD approve the transcript to ensure accuracy, and I will be working with them to release an extended video version as well.
Now, without further ado, my interview with Mike Clark:
CASEY: You will often hear “people on the internet” say that ARM as an ISA is better for low power than x64. People like me who study ISAs tend to be skeptical of this claim. As a hardware designer, are there any specific things about the x64 ISA that you find difficult to deal with for low-power designs?
MIKE: Having spent my career working on x86, I might have a bias here! I do think each ISA has its own quirks that influence some of the microarchitecture. But at the base level, we can build low-power x86 designs as well as ARM can, and ARM can build high frequency, high performance designs as well as x86 can. None of the quirks are really limiting you on the microarchitecture. The reality is that the markets we've been targeting have been different, so they've driven the architectures to optimize for different design points. ARM is in much lower power markets where x86 hasn't had the market share to chase.
On the x86 side, the higher performance / higher frequency devices are the established market that our devices have to compete in, so that's where our design focus is. We could build the same Zen microarchitecture with an ARM ISA on top instead. We could deliver the same performance per watt. We don't view the ISA as a fundamental input to the design as far as power or performance.
CASEY: So the memory model, whether the instructions are variable length, those sorts of things don’t factor in? None of the differences are big enough to matter?
MIKE: No. It may take a little bit more microarchitectural work for us to account for the stronger memory ordering on the x86 side, but of course the software has to account for the weaker memory ordering on the ARM side. So there are tradeoffs.
Variable length instructions are harder than fixed length, but we've created techniques like the uop cache, and it also gives us better density overall by having smaller instructions. x86 can put more work into each instruction byte, so we can have denser binaries and increase performance that way.
So these are all just tradeoffs in the microarchitecture. They’re not fundamental issues for delivering performance per watt at the end of the day.
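A quick aside for readers who want to see that ordering tradeoff in code: below is a minimal C++ sketch of my own (not something from the interview) showing the usual release/acquire handoff. On x86’s stronger model, the release store typically compiles to an ordinary store, whereas a weakly ordered ISA has to emit an explicit store-release or barrier - which is exactly the “software has to account for it” cost Mike mentions.

```cpp
#include <atomic>
#include <thread>

// Producer/consumer handoff: the C++ source is identical on every ISA,
// but the cost of enforcing the ordering lands in different places.
std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                   // plain store
    ready.store(true, std::memory_order_release);   // x86 (TSO): an ordinary store suffices
                                                    // weakly ordered ISA: needs a store-release/barrier
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // spin until the flag is visible
    int value = payload;   // guaranteed to observe 42 under acquire/release semantics
    (void)value;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```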
CASEY: Similar question, but moving to the OS side of things: does the 4k page size on x64 create problems for you architecturally by limiting the L1 cache size due to how tagging works? Would architectures like Zen benefit if x64 operating systems moved to 2mb pages as the smallest page size, or perhaps a 16k or 64k page size if you were to introduce that in a future architecture?
MIKE: Definitely. We always encourage developers to use larger page sizes if they can, because it gives us a lot more capacity in our TLBs and therefore less TLB pressure overall. But we have the ability to combine 4k pages into larger pages in our TLB if the OS allocates them sequentially. We can turn four 4k pages into a 16k page if they are virtually and physically sequential. That's been a technique we've used even since the original Zen to help software get the benefits of larger page sizes without moving away from 4k pages.
However, 4k to 2mb is a big jump. We're always looking for ways to allow our software partners to have larger page sizes, but maybe something in between is more appropriate.
CASEY: Just to poke a little further at that, for the L1 cache specifically, you're hitting up against the limit of the address bits. Have you ever wanted to put in bigger L1 caches, but found that you couldn't because the 4k page size means you can't do that without going to a larger-way cache?
MIKE: No. In the past we have built L1 caches that don't follow the “ways times 4k page size is the largest index you can have” property. There are ways to do that. We've solved those problems. It is a little bit more logic, but it's a solvable problem. It doesn’t limit us in what we design.
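Another aside: if you want to take Mike up on the larger-page advice today, Linux already exposes a couple of knobs for it. Here is a minimal sketch of my own (sizes illustrative, error handling trimmed) that first asks for an explicit 2mb huge page and then falls back to transparent huge pages. Even plain 4k allocations benefit from being contiguous, since that is what gives the TLB coalescing Mike described a chance to kick in.

```cpp
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t size = 1u << 21;  // 2mb, illustrative

    // Option 1: explicitly request a 2mb huge page
    // (requires hugetlbfs pages to be reserved by the system).
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    // Option 2: fall back to a normal mapping and ask the kernel's
    // transparent huge page machinery to back it with large pages.
    if (p == MAP_FAILED) {
        p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, size, MADV_HUGEPAGE);
    }

    // Touch the memory so it is actually backed by pages.
    static_cast<char*>(p)[0] = 1;
    munmap(p, size);
    return 0;
}
```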
CASEY: Moving on to the sizes of registers and cache lines, I have two questions about how CPUs seem to do things differently than GPUs.
First, CPUs seem to be settling into a natural size of 64 bytes. The L1 cache lines are 64 bytes. The registers are 64 bytes. It doesn't look like anyone's trying to go beyond that. But GPUs seem to prefer at least 128 bytes for both. Is this because of the difference in clock rates? Does it have to do with CPU versus GPU workloads? In general, do you see 64 bytes as a natural settling point for CPUs, and if so, why does it seem to be different from GPUs?
MIKE: We do look at increasing the line size. We're always going to a clean sheet of paper and making sure we're rethinking things and not missing anything as workloads evolve and things change. We don't want to be locked into a mindset where we think we've proven 64 bytes to be the correct size for everything on a CPU.
But the reality is that CPUs are targeted at low latency, smaller datatype, integer workloads as their fundamental value proposition. We've grown that capability with all our out-of-order engines, trying to expose ILP. So far, it’s allowed us to build vector units as wide as 64 bytes.
But it's been a journey to even get that wide. If you look at, say, the move from Zen 4 to Zen 5 - we supported 512-bit vectors on Zen 4 via a 256-bit data path. For Zen 5, we went full bore and supported the full 512-bit data path. That required a fundamental replumbing of the microarchitecture. We had to grow the width of the delivery of data from the L2 to the L1, and we had to double the delivery from the L1 to really take advantage of the wider vector units.
The integer workloads that are still primarily reading data out of the cache and branching, they're not getting any benefit from that sort of fundamental change. We have to do it in a very cautious and meticulous manner, so that those highways of delivery can exist while still ensuring that if there's only one car on the highway, we’re not burning power as if all the lanes were full. It’s tricky.
When you look at the GPU side, the workloads where they excel are throughput based. Not having to excel at the lowest-latency, small-datatype workloads frees them up to leverage all that extra investment. You need to have workloads that are really focused on using that much data in a wide vector to get the return on that investment.
So that's always the trick. If we try to go too big, too wide, we lose our value proposition in performance per watt for the mainstream workloads people buy our new generations for.
Does that make sense?
CASEY: It makes perfect sense, and it leads right into my next question.
Underlying what you said is the implication that, if we as software developers were taking better advantage of wider workloads, it would be worth your while to widen them. One of the problems people often have when trying to widen a data path in software is that CPUs seem to be a lot worse at scatter/gather. It’s an important feature for taking data that isn’t naturally wide and putting it through a wide datapath with some level of efficiency. For example, if I want to widen something that does an array lookup, historically it’s been hard to port that code directly because of poor gather performance.
Could you give us some insight on why this is?
MIKE: That's a good question. It does tie back to the previous question in the sense that it’s really not the fundamental scatter/gather concept that’s the problem. It's the amount of bandwidth needed to pull all those different elements inside the CPU to put them together to feed to the vector unit.
Again, we're focused on latency, not throughput. That has permeated all the way out to the interface to what we call our “data fabric”. The memory system isn't wide enough to be able to pull all the data in so it can be assembled into lanes and operated on. If we wanted to attack that, we’d have to widen the interface, and that would come with a large power cost.
So again, that's the trick. You're trying to avoid the power cost when you're running workloads that don't require scatter/gather. If you widen these paths, you’ve overbuilt the design for the baseline workloads that you normally run. We are always trying to grow and pull more applications in, but we have to balance that against the power requirements of widening the bandwidth into the CPU.
CASEY: So in other words, it's a chicken and egg problem? If software developers were giving you software that ran fantastically with scatter/gather, you’d do it. But they’re not, so it’s hard to argue for it?
MIKE: Right, yes.
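For readers who haven’t used it, the “array lookup” widening we were talking about maps onto gather instructions. Here is a minimal AVX2 sketch of my own showing the shape of it; note that even though one instruction fetches all eight elements, each element is still its own trip into the memory system, which is where the bandwidth cost Mike describes comes from.

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    // A lookup table and eight indices into it.
    alignas(32) float table[16] = {0, 10, 20, 30, 40, 50, 60, 70,
                                   80, 90, 100, 110, 120, 130, 140, 150};
    alignas(32) int indices[8]  = {3, 0, 7, 2, 5, 1, 6, 4};

    // Scalar version would be: result[i] = table[indices[i]]
    // Vector version: one AVX2 gather pulls all eight elements at once,
    // but each lane is still a separate access into the memory system.
    __m256i vidx = _mm256_load_si256(reinterpret_cast<const __m256i*>(indices));
    __m256  vres = _mm256_i32gather_ps(table, vidx, 4);  // scale of 4 bytes per element

    alignas(32) float result[8];
    _mm256_store_ps(result, vres);
    for (int i = 0; i < 8; ++i) printf("%.0f ", result[i]);
    printf("\n");
    return 0;
}
```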
CASEY: The rest of my questions don’t group together into any particular theme, so I’ll just go through them randomly.
Random question number one: previously, on the software side, we thought nontemporal stores were solely there to prevent pollution of caches. That was our mental model. But lately we have noticed that nontemporal stores seem to perform better than regular stores, separate from cache pollution - as if the memory subsystem doesn't have to do as much work, or something similar. Is there more about nontemporal stores that we need to understand, or are we mistaken?
MIKE: If you were just doing nontemporal stores to data that is in the caches, obviously that would not be a good thing. So you still have to apply good judgment on when to use nontemporal stores. But tying it back to the ARM-ISA-weakly-ordered discussion, nontemporal stores, while not exactly being weakly ordered, are in some ways easier to deal with in the base case. We can process them efficiently as long as they really are nontemporal. So I think your intuition is right - we can do well with them as long as the software side ensures that the data isn’t finding itself in caches along the way.
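For context, here is a minimal sketch of my own of the kind of nontemporal store usage we had in mind: filling a large buffer we don’t expect to read back soon. The streaming stores skip the read-for-ownership of the destination lines and write-combine on the way out to memory, which lines up with Mike’s point that they are easier for the memory system to handle as long as the data really is nontemporal.

```cpp
#include <immintrin.h>
#include <cstdlib>

// Fill a large destination buffer we don't expect to read again soon.
// Nontemporal stores avoid pulling the destination lines into the caches
// (no read-for-ownership) and let the stores write-combine on the way out.
void fill_nontemporal(int* dst, size_t count, int value) {
    __m256i v = _mm256_set1_epi32(value);
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        // dst must be 32-byte aligned for the streaming store.
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), v);
    }
    for (; i < count; ++i) dst[i] = value;  // scalar tail
    _mm_sfence();  // make the streaming stores visible before anyone reads dst
}

int main() {
    const size_t count = size_t{1} << 24;  // 64mb of ints, illustrative
    int* dst = static_cast<int*>(aligned_alloc(32, count * sizeof(int)));
    if (!dst) return 1;
    fill_nontemporal(dst, count, 7);
    free(dst);
    return 0;
}
```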
CASEY: Random question number two: for educational purposes, does anyone publish modern CPU pipeline diagrams that would be reasonably accurate? AMD and Intel, for example, both publish flow charts for new microarchitectures, but not pipeline diagrams.
MIKE: It might surprise people, but if you go back to when we did publish pipeline diagrams, those are still fine for learning how a modern CPU works. We do have more complicated pipelines today, and we don't publish them because they reveal proprietary techniques we're using, or give hints that we don't want to give to the competition. But at the end of the day, it's still a fetch block, a decode block, an execute block, a retire block... there's more stages within those blocks, and you can break it down even more than that, but the fundamental pipelining is still similar.
CASEY: So, for example, I think the Bulldozer pipeline diagram was the last one I saw from AMD. It’s not woefully out of date? If someone learned that pipeline, they would be able to understand what you actually do now if they were given an updated diagram?
MIKE: Roughly speaking, yes.
CASEY: Random question number three: if you look at a single-uop instruction like sqrtpd that has a latency longer than the pipeline depth of an execution unit, can you give a cursory explanation of how this works for those of us on the software side who don't understand hardware very well?
MIKE: One way to conceptualize it is that you could have taken sqrtpd and split it up into a bunch of different uops that can operate in parallel with dependencies along the way. It can be very expensive to keep all those operations in flight, to build the pipeline to pass the data forward so you can let something new in behind it that's working on an earlier stage. The hardware cost would be too high to create a pipeline to get the execution done in a way that allows another sqrtpd to start on an earlier stage - especially if it's going to be, say, 16 stages of execution until you have achieved your answer.
It's really just that cost. Is the amount of hardware worth it to make something like sqrtpd a pipelineable instruction, or can we save a lot of power and hardware by just doing one of them at a time?
CASEY: Just to make sure I understand: does that mean inside an execution unit that can do one of these, the uop gets issued and it knows it’s got something special that it has to work on for a while, so it asks not to be given anything else for several cycles while some special control part inside it takes over?
MIKE: Correct. The scheduler that feeds it understands that it's not a pipelined execution unit that can take another uop every cycle. But it has a known quantity where, if it has sent one in, after some number of cycles, it knows it can send another one in and it should be safe.
CASEY: So the system upstream of the execution unit - the thing that's feeding it - knows not to send more?
MIKE: It knows, yes.
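To connect this to what we can observe from the software side: that “one at a time” behavior shows up as the gap between a dependent chain and independent work. Here is a minimal sketch of my own; the actual cycle counts are whatever your particular part delivers, not numbers from this interview.

```cpp
#include <immintrin.h>
#include <cstdio>

// Latency-bound: each sqrt depends on the previous result, so the core
// must wait out the full instruction latency every iteration.
double chained_sqrts(double x, int n) {
    __m128d v = _mm_set_sd(x);
    for (int i = 0; i < n; ++i) {
        v = _mm_sqrt_pd(v);          // next iteration needs this result
    }
    return _mm_cvtsd_f64(v);
}

// Throughput-bound: four independent chains give the execution unit
// (and the scheduler feeding it) the chance to overlap work, to whatever
// degree the unit is actually pipelined.
double independent_sqrts(double x, int n) {
    __m128d a = _mm_set_sd(x),     b = _mm_set_sd(x + 1);
    __m128d c = _mm_set_sd(x + 2), d = _mm_set_sd(x + 3);
    for (int i = 0; i < n; ++i) {
        a = _mm_sqrt_pd(a); b = _mm_sqrt_pd(b);
        c = _mm_sqrt_pd(c); d = _mm_sqrt_pd(d);
    }
    return _mm_cvtsd_f64(a) + _mm_cvtsd_f64(b) + _mm_cvtsd_f64(c) + _mm_cvtsd_f64(d);
}

int main() {
    // Time these two with your favorite timer: if sqrt were fully pipelined,
    // the second loop would take roughly as long per iteration as the first.
    printf("%f %f\n", chained_sqrts(2.0, 1000), independent_sqrts(2.0, 1000));
    return 0;
}
```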
CASEY: Last question: are there things you wish we as software developers would start - or stop - doing that would help take advantage of the hardware you design, or that would make it easier for you to design new hardware in the future?
MIKE: We already hit on one, which is the feedback loop when we add new ISA components - larger vectors, for example. We need software to use them to get the return on investment that we're putting in.
Of course we also understand that, as a new feature comes out, it's only on the new hardware. You want your software to run well on our old hardware as well as on our new hardware. We totally understand that problem. But still, if software developers could embrace the new features more aggressively, that would definitely help.
It would be great if the software could find ways to leverage wider vectors, AI, and so on - all the areas we've invested a lot of hardware in. And of course we would also like to get feedback from you guys - “if we just had this instruction or this concept, we could really leverage that in our software” and so on. We're constantly open to that, too. We want to know how to make your lives easier.
And finally, one other thing I would add is that larger basic blocks are better. Taking branches versus not taking branches can have a big effect on code flow. Try to put conditional operations in the right places. I’m sure you guys probably focus on this already.
CASEY: Yes, but it’s always good to hear it from you. We only ever know that something runs faster when we time it - we can't always guess what the designers are thinking on the hardware side.
MIKE: Gotcha.
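To make the basic-block advice concrete, here is a minimal sketch of my own: the branchless version turns an unpredictable branch into a data dependency (typically a conditional move or masked add), which keeps the loop as one larger straight-line block.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: an unpredictable condition splits the basic block
// and costs a misprediction whenever the pattern is hard to guess.
int64_t sum_positive_branchy(const int32_t* v, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] > 0) sum += v[i];
    }
    return sum;
}

// Branchless version: the condition becomes a select, so the loop body
// stays one straight-line block with no conditional control flow.
int64_t sum_positive_branchless(const int32_t* v, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (v[i] > 0) ? v[i] : 0;   // typically compiles to cmov or a masked add
    }
    return sum;
}

int main() {
    int32_t data[8] = {3, -1, 4, -1, 5, -9, 2, 6};
    // Both return 20; returns 0 if they agree.
    return static_cast<int>(sum_positive_branchy(data, 8) - sum_positive_branchless(data, 8));
}
```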
CASEY: Well, I think we are out of time. Thank you very much! This has been fantastic. Thank you for answering all of my questions, and please keep in touch. We always have questions like this on the software side, so anytime you want to talk, or if there is anything new you want to tell us about, please let us know.
MIKE: Okay, cool. It was a great conversation. And yeah, any time you're wondering what's going on in the hardware, we want to close that gap as best we can!
CASEY: We all appreciate it. And we love Zen as well! I’m conducting this interview from a Zen processor as we speak. So thank you for all your hard work over the years.
MIKE: Alright, thanks! Talk to you later.