AMD Demos 3D Stacked Ryzen 9 5900X: 192MB of L3 Cache at 2TB/s

The Computex trade show has kicked off in Taiwan and AMD opened the show with a bang. Last week, we discussed rumors that AMD was preparing a Milan-X SKU for launch later this year. The Zen 3-based CPU would supposedly offer onboard HBM and a 3D-stacked architecture.

We don’t know if AMD will bring Milan-X to market in 2021, but the company has now shown off 3D die stacking in another way. During her Computex keynote, Lisa Su showed a 5900X with 64MB of SRAM integrated on top of the chiplet die. This is in addition to the L3 cache already built into the chiplet itself, for a total of 96MB of L3 per chiplet, or 192MB on a two-chiplet 5900X. The dies are connected with through-silicon vias (TSVs). AMD claims bandwidth of over 2TB/s. That’s higher than Zen 3’s L1 bandwidth, though access latencies are much higher: L3 latency is typically 45-50 clock cycles, compared with four cycles for L1.

The new “V-Cache” die isn’t exactly the same size as the chiplet below it, so AMD uses additional structural silicon to ensure equal pressure across the compute die and cache die. The 64MB cache die is said to be a bit less than half the size of a typical Zen 3 chiplet (80.7mm²).

This much L3 on a CPU is rather nutty. We can’t compare against desktop chips, because Intel and AMD have never shipped a CPU with this much cache dedicated to so few cores. The closest analog among shipping CPUs would be something like IBM’s POWER9, which offers up to 120MB of L3 per chip, but again, not nearly this much per core. 192MB of L3 for just 12 cores works out to 16MB of L3 per core, or 8MB per thread with SMT. There are also enough differences between POWER9 and Zen 3 that we can’t really look to the IBM CPU for guidance on how the additional cache would boost performance, though if you’re curious about the x86-versus-non-x86 question in general, Phoronix did a review with some benchmarks back in 2019.
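The per-core arithmetic above is easy to sanity-check. Here’s a quick sketch using the cache figures from the announcement and the 5900X’s published core and thread counts:

```python
# Sanity check of the cache-per-core math above.
# Figures from the article: two chiplets, each with 32MB of native L3
# plus 64MB of stacked V-Cache; the 5900X has 12 cores / 24 threads.
native_l3_mb, stacked_l3_mb, chiplets = 32, 64, 2
cores, threads_per_core = 12, 2

total_l3_mb = chiplets * (native_l3_mb + stacked_l3_mb)
per_core_mb = total_l3_mb / cores
per_thread_mb = total_l3_mb / (cores * threads_per_core)

print(total_l3_mb, per_core_mb, per_thread_mb)  # 192 16.0 8.0
```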

Absent an applicable CPU to refer to, we’ll have to take AMD’s word on some of these numbers. The company compared a standard 5900X (32MB of L3 cache per chiplet, 64MB total) to a modified 5900X (96MB of L3 cache per chiplet, 192MB total) in Gears of War 5 (+12 percent, DX12), DOTA2 (+18 percent, Vulkan), Monster Hunter World (+25 percent, DX11), League of Legends (+4 percent, DX11), and Fortnite (+17 percent, DX12). If we set LoL aside as an outlier, that’s an 18 percent average increase. If we include it, it’s a 15.2 percent average uplift. Both CPUs were locked at 4GHz for this comparison. The GPU was not disclosed.
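For what it’s worth, the averages quoted above are straight arithmetic means of AMD’s per-title numbers:

```python
# AMD's claimed per-title uplifts from the Computex demo (percent).
uplifts = {
    "Gears of War 5": 12,        # DX12
    "DOTA2": 18,                 # Vulkan
    "Monster Hunter World": 25,  # DX11
    "League of Legends": 4,      # DX11
    "Fortnite": 17,              # DX12
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

all_titles = mean(uplifts.values())
sans_lol = mean(v for title, v in uplifts.items() if title != "League of Legends")
print(f"All five titles: {all_titles:.1f}%")  # 15.2%
print(f"Excluding LoL:   {sans_lol:.1f}%")    # 18.0%
```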

That uplift is almost as large as the median generational improvement AMD has turned in over the past few years. The more interesting question, however, is what kind of impact this approach has on power consumption.

AMD Has Huge Caches on the Brain

It’s obvious that AMD has been doing some work around the idea of slapping huge caches on chips. The large “Infinity Cache” on RDNA2 GPUs is a central component of the design. We’ve heard about a Milan-X that could theoretically deploy this kind of approach and on-package HBM.

One way to look at news of a 15 percent performance improvement is that it would allow AMD to pull CPU clocks from a top clock of, say, 4.5GHz down to around 4GHz at equivalent performance. CPU power consumption increases more quickly than frequency does, especially as clocks approach 5GHz, because higher frequencies generally demand higher voltage and dynamic power scales with the square of voltage. Improvements that allow AMD (or Intel) to hit the same performance at a lower frequency can be useful for improving x86’s power consumption at a given performance level.
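As a rough illustration of why this matters, assume dynamic power scales as C·V²·f and that voltage must rise roughly linearly with frequency near the top of the voltage/frequency curve, so power goes approximately as f³. These are textbook simplifications, not AMD-published figures:

```python
# Back-of-the-envelope dynamic power estimate under the (simplified)
# assumption that P scales with f^3 near the top of the V/f curve.
def relative_dynamic_power(f_ghz: float, f_base_ghz: float = 4.5) -> float:
    """Dynamic power relative to f_base_ghz, assuming P proportional to f^3."""
    return (f_ghz / f_base_ghz) ** 3

savings = 1 - relative_dynamic_power(4.0)
print(f"4.5GHz -> 4.0GHz: ~{savings:.0%} less dynamic power")  # ~30%
```

Real chips deviate from this model (static leakage, non-linear voltage curves), but it illustrates why trading cache for clocks can be a net power win.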

About six weeks ago, we covered a roadmap leak/rumor. At the time, I speculated that AMD’s rumored Ryzen 7000 family might integrate an RDNA2 compute unit into each chiplet, and that this chiplet-level integration might be the reason why RDNA2 is listed in green for Raphael but orange for the hypothetical Phoenix.

What I am about to say is speculation stacked on top of speculation and should be treated as such:

For years, we’ve waited and hoped that AMD would bring an HBM-equipped APU to desktop or mobile. Thus far, we’ve been disappointed. A chiplet with a 3D-mounted L3 stack tied to both the CPU and GPU could offer a nifty alternative to this concept. While we still have no idea how large the GPU core would be, boosting the performance of an integrated GPU with onboard cache is a tried-and-true way of doing things; Intel has used on-package eDRAM (Crystal Well) to boost graphics performance on various SKUs since Haswell.

The bit above, as I said, is pure speculation, but AMD has now acknowledged working extensively with large L3 caches on both CPUs (via 3D stacking) and GPUs (via Infinity Cache). It’s not crazy to think the company’s future APUs will continue the trend in one form or another.

Now Read:

  • AMD Prioritizes High-End CPUs During Shortages, Just Like Intel
  • AMD’s Market Share Surges on Steam and in Servers, Shrinks Overall
  • RISC vs. CISC Is the Wrong Lens for Comparing Modern x86, ARM CPUs