
With Granite Rapids, Intel is back to trading blows with AMD


Over the past few years, we’ve grown accustomed to Xeon processors that, generation after generation, come up short of the competition on some combination of core count, clock speeds, memory bandwidth, or PCIe connectivity.

With the launch of its Granite Rapids Xeons on Tuesday, Intel is finally closing the gap, and it may just be a turning point for a product line that has gained a reputation for too little, too late.

The 6900P processor family represents the chipmaker’s top tier of datacenter chips with up to 128 full-fat performance cores (P-cores), 256 threads, and clock speeds peaking at 3.9 GHz.

That not only puts Granite Rapids at core-count parity with AMD's now year-old Bergamo platform, it also makes the chip a direct competitor to its rival's upcoming Turin Epycs and their 128 Zen 5 cores.

To be clear, Turin will actually top out at 192 cores, as CEO Lisa Su was keen to point out during her Computex keynote this spring. However, that part will use a more compact Zen 5C core which trades clocks and presumably per-core cache for compute density.

It’s also worth noting that Intel already delivered its Bergamo competitor this spring with Sierra Forest. That chip features 144 miniaturized efficiency cores, and a 288-core variant is due out sometime in early 2025. How comparable those cores actually are to AMD’s is up for debate, as they lack both simultaneous multithreading and support for AVX-512. Granite Rapids, on the other hand, doesn’t suffer those same limitations.

Nearly every element of Intel's 6900P processors gets a sizable spec bump over last gen

But it’s not just cores. Granite Rapids has also surpassed AMD on memory bandwidth, and while it still comes up short on I/O, the gap is shrinking.

Suffice it to say, across its entire Xeon 6 portfolio, Intel is once again trading blows with AMD, something that makes comparing socket-to-socket performance between the two a far less lopsided affair than it has been in years.

More cores, more memory, more compute

At the heart of Granite Rapids is Intel’s Redwood Cove core, which you may recall from its Meteor Lake client CPUs launched last December.

On their own, these cores don’t deliver much of an instructions-per-cycle (IPC) uplift compared to the Raptor Cove cores found in last year’s Emerald Rapids Xeons, amounting to less than 10 percent, Intel Fellow Ronak Singhal told The Register. However, with twice as many of them and 500 watts of socket power at its disposal, Granite Rapids still manages to deliver more than twice the performance of its prior-gen Xeons, at least according to Intel’s benchmarks.
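As a quick sanity check on that claim, here's the back-of-envelope arithmetic – a minimal sketch that assumes perfect core scaling and comparable clocks, which real workloads won't deliver:

```python
# Naive gen-on-gen estimate: IPC uplift x core-count ratio.
# Assumes perfect scaling and similar clocks - real results will vary.
ipc_uplift = 1.10        # "less than 10 percent" IPC gain over Raptor Cove
core_ratio = 128 / 64    # 128-core Granite Rapids vs 64-core Emerald Rapids
print(f"~{ipc_uplift * core_ratio:.1f}x")  # ~2.2x, near Intel's claimed average
```

The remaining gap to Intel's measured average is plausibly down to the memory bandwidth gains discussed below.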

Compared to Emerald Rapids, which launched less than a year ago, Granite Rapids promises roughly 2.3x higher performance on average

As always, we recommend taking any vendor-supplied benchmarks with a grain of salt. In this case, Intel is pitting 96- and 128-core Granite Rapids SKUs, equipped with either standard DDR5 or high-performance MRDIMMs, against a 64-core Emerald Rapids part. So, we're mostly looking at gen-on-gen gains at the socket level rather than per-core performance here.

Granite Rapids sees the largest gains in HPC and AI applications, where the platform delivers between 2.31x and 3.08x higher performance than its predecessor.

This isn't surprising considering these workloads generally benefit from larger caches and faster memory. With a substantially larger L3 cache of up to 504 MB, a move to 12 memory channels, and support for both 6,400 MT/s DDR5 and 8,800 MT/s MRDIMMs, Granite Rapids now boasts between 614 GBps and 844 GBps of memory bandwidth. For reference, AMD's fourth-gen Epyc platform topped out at roughly 460 GBps using 4,800 MT/s DDR5.
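Those figures fall out of simple arithmetic: peak bandwidth is channels multiplied by transfer rate multiplied by eight bytes per 64-bit channel. A minimal sketch (these are theoretical ceilings; sustained bandwidth will be lower):

```python
def peak_gbps(channels: int, mtps: int) -> float:
    """Theoretical peak bandwidth: channels x MT/s x 8 bytes per transfer."""
    return channels * mtps * 8 / 1000

print(peak_gbps(12, 6400))  # 614.4 GBps - Granite Rapids with DDR5-6400
print(peak_gbps(12, 8800))  # 844.8 GBps - Granite Rapids with MRDIMM-8800
print(peak_gbps(12, 4800))  # 460.8 GBps - gen 4 Epyc with DDR5-4800
```

By the same arithmetic, the eight-channel 6700P parts due next year would top out around 410 GBps with the same DDR5-6400.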

As we've previously discussed, the higher memory bandwidth afforded by MRDIMMs in particular opens the door to running small to mid-sized large language models (LLMs) on CPUs at much higher performance than was possible on prior generations.
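To see why bandwidth is the lever here: LLM token generation typically streams the entire weight set from memory for every token generated, so bandwidth, not compute, sets the ceiling. A rough sketch – the model size and quantization below are our own illustrative assumptions, not Intel figures:

```python
def tokens_per_sec_ceiling(mem_gbps: float, weight_gb: float) -> float:
    """Upper bound on decode throughput for a memory-bandwidth-bound LLM."""
    return mem_gbps / weight_gb

# Hypothetical 8B-parameter model quantized to 8 bits = ~8 GB of weights
print(tokens_per_sec_ceiling(614.4, 8.0))  # ~77 tokens/s on DDR5-6400
print(tokens_per_sec_ceiling(844.8, 8.0))  # ~106 tokens/s on MRDIMMs
```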

The one trade-off that comes with these higher memory speeds, other than price, of course, is that Intel only supports them in one-DIMM-per-channel configurations.

Despite pulling 50-150 watts more power at the socket, Intel claims a 1.9x improvement in performance-per-watt, at least at 40 percent utilization

Achieving this performance comes at the expense of higher power consumption. Compared to Emerald Rapids, Intel’s 6900P-series Xeon 6 processors are sucking up an extra 50-150 watts. Despite this, Intel insists its top-specced component delivers 1.9x higher performance per watt than Emerald at 40 percent utilization.

As strange as that might sound, Ryan Tabrah, who heads up Intel's Xeon division, argues that comparing power efficiency at 100 percent utilization simply isn't realistic outside of very specific scenarios.

“Perf-per-watt really matters where our customers actually target for real-world deployments,” he said. “As we talk to customers, most of them care about what is the perf-per-watt at 20 percent, 50 percent, and 80 percent… and, I’d argue, if you look at some of the competitive solutions out there, this is where Xeon shines.”

Compared to AMD’s gen 4 Epyc Genoa platform, Granite Rapids’ performance advantage depends heavily on the workload in question.

In a core-for-core battle with gen 4 Epyc, Granite Rapids ranges from performance parity to outright lead

In a head-to-head between Intel- and AMD-powered VMs with 16 vCPUs apiece (eight cores/16 threads), Granite Rapids only manages to match Genoa in GCC integer throughput. In floating point, LAMMPS, and NGINX comparisons, however, it pulls ahead by anywhere from 34 to 82 percent.

Meanwhile, in AI inference-centric workloads like BERT-Large and ResNet-50, Intel's 6900P Xeons pull well ahead, no doubt thanks to their AMX accelerator blocks and memory bandwidth advantage.

If you're wondering why Intel opted to compare VM performance this way, it likely comes down to how cores are distributed across AMD's Epyc platform. Each of the core-complex dies (CCDs) on AMD's Epyc processors features eight cores and 32 MB of L3 cache. By sizing the VM so it fits entirely within a single die, Intel is arguably presenting a best-case scenario for its competitor, as it avoids the kind of cross-die latency you can run into when running larger VMs on Epyc.
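The geometry is easy to verify. A toy sketch of the sizing logic, using the CCD layout described above (eight cores, two threads per core per die):

```python
CCD_CORES, THREADS_PER_CORE = 8, 2  # gen 4 Epyc CCD, as described above

def fits_in_one_ccd(vcpus: int) -> bool:
    """Can a VM of this size be scheduled entirely within a single CCD?"""
    return vcpus <= CCD_CORES * THREADS_PER_CORE

for size in (8, 16, 32):
    print(size, "vCPUs:", "one CCD" if fits_in_one_ccd(size) else "crosses dies")
```

A 16-vCPU VM is exactly the largest size that avoids cross-die traffic on Epyc; bigger VMs would have exposed that latency.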

Speaking of chiplets, let’s take a closer look at how Granite is stitched together.

Intel finds its chip groove

By now, Intel is no stranger to chiplet architectures, having shipped its first multi-die Xeons with Sapphire Rapids back in early 2023. But until recently, there’s been little consistency in how those chiplets have been arranged.

Sapphire Rapids used either one or four dies, while Emerald featured up to two. With the launch of Sierra Forest in spring, we saw Intel transition to a heterogeneous architecture with distinct I/O and compute dies more akin to what AMD has done since the launch of Epyc Rome in 2019.

Intel has carried this formula forward with its Granite Rapids P-core Xeons. But while similar in spirit to AMD’s chiplet architecture, it’s by no means a clone.

At least with the 6900P-series, the chips feature a pair of I/O dies (IOD) based on Intel 7 process tech located at the top and bottom edges of the package. These dies are responsible for PCIe, CXL, and UPI connectivity and also house several accelerators – DSA, IAA, QAT, and DLB to name a few – previously found on the compute die in Sapphire and Emerald.

In terms of connectivity, these chips offer up to 96 lanes of PCIe 5.0 per socket as well as support for CXL 2.0. The latter presents the opportunity to inexpensively expand the memory footprint of servers well beyond what's supported by the CPU. For more on CXL, check out The Register's full breakdown here.

Just like AMD's Epyc, Intel's Xeon now utilizes a heterogeneous chiplet architecture with compute and I/O dies

Sandwiched between the IODs are a trio of compute dies built on the Intel 3 process node. Each of these dies features at least 43 cores – Intel wouldn't say how many are physically on the die – and, depending on the SKU, one or more cores are fused off to achieve the desired core count. For example, on the 128-core parts, two of the dies have 43 active cores, while the third has 42. For the 72-core part, all three compute dies have 24 cores enabled.
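The arithmetic of that fusing works out as follows – only the 128- and 72-core layouts below come from Intel's description; per-die splits for the other SKUs weren't disclosed:

```python
# Per-die active core counts across the three compute dies, per Intel
layouts = {
    "128-core 6980P": (43, 43, 42),
    "72-core 6960P":  (24, 24, 24),
}
for sku, dies in layouts.items():
    assert sum(dies) == int(sku.split("-")[0])  # sanity check: dies add up
    print(sku, "->", dies)
```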

Beyond fewer, denser compute dies, the other thing that sets Intel's chiplet strategy apart from AMD's is that the memory controllers are integrated directly into the compute dies rather than into a single IOD as we see on Epyc. Each of Granite's compute dies features four DDR5/MRDIMM memory channels.

In theory, this approach should mean less latency between the memory and compute, but it also means that memory bandwidth scales in proportion to the number of compute dies on board. This isn’t something you’ll actually need to worry about on the 6900P-series parts as they all feature the same number of dies.

This won’t be true of every Granite Rapids part on Intel’s Xeon 6 roadmap. Its 6700P-series parts, due out early next year, will feature up to two compute dies on board sporting up to 86 cores and a maximum of eight memory channels.

One thing that may come as a surprise to those who haven't deployed high-core-count Sapphire or Emerald Rapids parts before is that, out of the box, the chip runs in SNC3 mode and each compute die appears as its own non-uniform memory access (NUMA) domain. In other words, while you see one socket, the operating system effectively sees three.

Out of the box, each of Granite Rapids' compute dies shows up as its own NUMA domain

Just like in a traditional multi-socket system, this is done intentionally to avoid applications accidentally getting split between NUMA domains and suffering interconnect penalties as a result.
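On Linux you can see, and work with, those domains directly. A minimal sketch – Linux-only, and it handles CPU affinity but not memory binding, which needs numactl or libnuma; the cpulist ranges shown in the comment are illustrative:

```python
import os, re

# Enumerate NUMA nodes via sysfs; in SNC3 mode expect three per socket
nodes = sorted(n for n in os.listdir("/sys/devices/system/node")
               if re.fullmatch(r"node\d+", n))
print("NUMA nodes:", nodes)

def cpus_of(node: str) -> set[int]:
    """Expand a sysfs cpulist like '0-42,129-171' into a set of CPU ids."""
    with open(f"/sys/devices/system/node/{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

os.sched_setaffinity(0, cpus_of(nodes[0]))  # pin this process to node 0
```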

However, if you'd prefer the chip to behave like one big NUMA domain, Granite also supports what Intel is calling HEX mode, which does just that – at the cost of both cache and memory latency penalties.

More to come

As we alluded to earlier, Intel's 6900P-series chips are just the latest in a broader portfolio of Xeon 6 processors set to trickle out over the next few quarters.

For the moment, Intel’s Granite Rapids lineup spans just five SKUs ranging from a frequency-tuned 6960P with 72 cores to the flagship 6980P with 128.

The two platforms will roll out over the next few quarters.

If you’re curious about Intel’s current crop of E-core Xeons, which made their debut in spring, you can find our deep dive here.

The remainder of Intel's Xeon 6 roadmap, including its monster 288 E-core 6900E processors and four- and eight-socket-capable 6700P parts, won't arrive until early next year.

Intel's 6700P series will no doubt be of interest to those running large, memory-hungry databases like SAP HANA, as it'll be the first generation of high-socket-count Xeons since Sapphire Rapids made its debut in early 2023.

But with core counts growing by leaps and bounds generation after generation, and CXL memory offering an alternative means for achieving the memory density required for these applications, it may well be Intel’s last generation to support more than two sockets.

While Intel is still months away from finalizing its Xeon 6 lineup, the chipmaker is already talking up its next generation of datacenter chips.

Dubbed Clearwater Forest, the part is Intel’s follow-on to Sierra Forest. While we don’t know much about the chip just yet, we do know that it’ll be the first Intel processor built on its state-of-the-art 18A process tech.

We've also learned the chip will share a design similar to Granite Rapids, with three compute dies flanked by a pair of I/O dies, only smaller.

Where do the core wars go from here?

Although Intel has caught up and even surpassed AMD on core count for the moment, the core wars are far from over.

As we mentioned earlier, AMD is due to launch its Turin Epycs later this year with 128 Zen 5 or 192 Zen 5C cores, which have already demonstrated a 16 percent IPC uplift in a variety of workloads. What’s more, unlike Intel’s E-core Xeons, all of AMD’s gen 5 Epycs support AVX-512.

And then, of course, there’s Amazon, Microsoft, and Google, which have all announced custom Arm-based silicon optimized for their workloads with up to 128 cores. Not to be outdone, Arm chip designer Ampere Computing is already working on chips with 256 and even 512 cores.

These higher-core-count parts offer a number of advantages ranging from the ability to move from dual to single-socket configurations to enabling large-scale consolidation of aging nodes. However, there also remain headwinds to the adoption of these chips.

For one, many software licenses are still tied to core count, which may drive customers towards mid-tier parts. Another factor is blast radius: the higher the core count, the bigger the potential impact of a failure. Lose a 32- or 64-core server and it might take down a few workloads; lose a 512-core system and the impact will be far larger.

Whether software will evolve to overcome these challenges or if chipmakers will be forced to shift focus back to scaling frequency or driving IPC gains, we’ll have to wait and see. ®

Now read: Granite Rapids Xeon 6 pricing and more analysis
