Intel launches Gaudi 3 accelerator for AI: Slower than Nvidia’s H100 AI GPU, but also cheaper

Intel formally introduced its Gaudi 3 accelerator for AI workloads today. The new processors are slower than Nvidia’s popular H100 and H200 GPUs for AI and HPC, so Intel is betting the success of its Gaudi 3 on its lower price and lower total cost of ownership (TCO).

Intel’s Gaudi 3 processor uses two chiplets that pack 64 tensor processor cores (TPCs, 256-bit wide vector processors), eight matrix multiplication engines (MMEs, each a 256×256 MAC structure with FP32 accumulators), and 96MB of on-die SRAM cache with 19.2 TB/s of bandwidth. Gaudi 3 also integrates 24 × 200 GbE networking interfaces and 14 media engines, the latter capable of handling H.265, H.264, JPEG, and VP9 to support vision processing. The processor is accompanied by 128GB of HBM2E memory in eight stacks, offering a massive 3.67 TB/s of bandwidth.
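As a back-of-the-envelope check, the per-stack memory bandwidth and aggregate network bandwidth follow directly from the figures above. A rough sketch (the even split across the eight HBM stacks is an inferred average, not an Intel-published per-stack spec):

```python
# Rough bandwidth arithmetic from the published Gaudi 3 specs.
# Assumption: the 3.67 TB/s HBM figure is split evenly across the 8 stacks.

hbm_total_tbps = 3.67          # TB/s, aggregate HBM2E bandwidth
hbm_stacks = 8
per_stack_gbps = hbm_total_tbps * 1000 / hbm_stacks
print(f"Per-stack HBM bandwidth: ~{per_stack_gbps:.0f} GB/s")      # ~459 GB/s

nic_count = 24
nic_speed_gbit = 200           # each interface is 200 GbE
total_net_gbyte = nic_count * nic_speed_gbit / 8   # bits -> bytes
print(f"Aggregate network bandwidth: {total_net_gbyte:.0f} GB/s")  # 600 GB/s
```

For comparison, the combined 600 GB/s of on-package Ethernet is what lets Gaudi 3 scale out without a separate proprietary interconnect.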

Intel’s Gaudi 3 represents a massive improvement over Gaudi 2, which has 24 TPCs, two MMEs, and 96GB of HBM2E memory. However, Intel appears to have simplified both the TPCs and MMEs: the Gaudi 3 processor supports only FP8 matrix operations and BFloat16 matrix and vector operations (i.e., no more FP32, TF32, or FP16).

When it comes to performance, Intel says Gaudi 3 can offer up to 1,856 BF16/FP8 matrix TFLOPS and up to 28.7 BF16 vector TFLOPS at around 600W TDP. Compared to Nvidia’s H100, at least on paper, Gaudi 3 offers slightly lower BF16 matrix performance (1,856 vs 1,979 TFLOPS), half the FP8 matrix performance (1,856 vs 3,958 TFLOPS), and significantly lower BF16 vector performance (28.7 vs 1,979 TFLOPS).
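The 1,856 TFLOPS matrix figure is consistent with eight 256×256 MME arrays running at a clock in the high-1-GHz range. A minimal sanity check (the ~1.77 GHz clock is an illustrative assumption; Intel has not disclosed the exact MME clock in this announcement):

```python
# Sanity-check Gaudi 3's peak matrix throughput from first principles.
# Assumption: each of the 8 MMEs is a 256x256 MAC array, and the MME
# clock is ~1.77 GHz (illustrative guess, not an Intel-published number).

mmes = 8
macs_per_mme = 256 * 256                 # 65,536 MACs per engine
ops_per_cycle = mmes * macs_per_mme * 2  # 1 MAC = 2 FLOPs (multiply + add)
clock_hz = 1.77e9                        # assumed clock

peak_tflops = ops_per_cycle * clock_hz / 1e12
print(f"Peak matrix throughput: ~{peak_tflops:.0f} TFLOPS")  # ~1856
```

The same arithmetic explains why BF16 and FP8 matrix throughput are identical here: both run through the same MAC arrays at one element per lane per cycle, unlike Nvidia's H100, which doubles throughput at FP8.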
