Ponte Vecchio 3D supercomputer processor uses five process nodes

Technology News |
By Nick Flaherty

Intel has revealed more details of its largest ever device, the Ponte Vecchio (PVC) high performance computing processor.

The details revealed at the ISSCC conference this week show that the 3D Ponte Vecchio processor combines 47 functional tiles built on five process nodes, connected with two different chiplet technologies. This highlights the range of chiplet, processor, networking and high-speed interconnect designs that will be part of supercomputer chip design in the coming years.

The Ponte Vecchio design contains over 100bn transistors. It comprises sixteen compute tiles built in TSMC’s N5 5nm process and eight random-access, bandwidth-optimized SRAM (RAMBO) memory tiles built in the Intel 7 process. These are stacked on two Foveros base dies built in the 17-metal-layer Intel 7 process, with each base die measuring 646mm².

Eight HBM2E memory tiles and two TSMC N7 7nm SerDes connectivity tiles are connected to the base dies with 11 dense embedded multi-die interconnect bridges (EMIB). The SerDes tiles provide a high-speed coherent unified fabric for scale-out connectivity between PVC SoCs.

Each tile includes an 8-port switch, enabling up to 8-way fully connected configurations over 90G SerDes links. The SerDes tile supports the load/store, bulk data transfer and synchronization semantics that are critical for scaling up HPC and AI applications.

A 24-layer (11-2-11) substrate package houses the 3D stacked Foveros dies and EMIBs. To handle warping of the substrate at these die and package sizes, low-temperature solder (LTS) was used for the flip-chip ball grid array (FCBGA) design.

The foundational processing units of PVC are the compute tiles. The tiles are organized as two clusters of 8 high-performance cores with distributed caches. Each core contains 8 vector engines processing 512b floating-point/integer operands and 8 matrix engines with an 8-deep systolic array executing 4096b vector operations/engine/clock.

The compute datapath is fed by a wide load/store unit that fetches 512B/clock from a 512KB L1 data cache that is software configurable as a scratchpad memory. Each vector engine achieves throughput of 512/256/256 operations/clock for FP16/FP32/FP64 data formats respectively, while the matrix engine delivers 2048/4096/4096/8192 ops/clock for TF32/FP16/BF16/INT8 operands.
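Taken together, the tile, core and engine counts above imply a very large per-device peak throughput. The sketch below simply multiplies out the figures quoted in the article; the aggregation itself is my own arithmetic, not an Intel-quoted peak, and it ignores clock frequency, which the article does not state:

```python
# Aggregate matrix-engine throughput per clock for Ponte Vecchio,
# derived from the per-engine figures quoted in the article.

COMPUTE_TILES = 16            # TSMC N5 compute tiles
CORES_PER_TILE = 16           # two clusters of 8 cores per tile
MATRIX_ENGINES_PER_CORE = 8

# ops/clock per matrix engine for each operand format (from the article)
MATRIX_OPS_PER_CLOCK = {"TF32": 2048, "FP16": 4096, "BF16": 4096, "INT8": 8192}

engines = COMPUTE_TILES * CORES_PER_TILE * MATRIX_ENGINES_PER_CORE  # 2048 engines
for fmt, ops in MATRIX_OPS_PER_CLOCK.items():
    total = engines * ops
    print(f"{fmt}: {total:,} ops/clock across {engines} matrix engines")
```

For FP16, for example, this works out to 2048 engines × 4096 ops/clock, or roughly 8.4 million operations per clock across the device.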

The two base dies provide a communication network for the stacked tiles and include SoC infrastructure modules such as memory controllers, fully integrated voltage regulators (FIVR), power management and 16 PCIe Gen5/CXL host interface lanes.

The compute and memory tiles are stacked face-to-face on top of the base dies using a dense array of 36μm-pitch micro bumps. This dense pitch provides high assembly yield, high power bump density and current capacity, and twice the signal density compared to the 50μm bump pitch used in the latest Intel Lakefield CPU.
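The roughly 2× signal density figure follows from area scaling: bump density goes as the inverse square of the pitch, and (50/36)² ≈ 1.93. A quick check:

```python
# Bump density scales with the inverse square of the bump pitch.
LAKEFIELD_PITCH_UM = 50   # bump pitch in the Intel Lakefield CPU
PVC_PITCH_UM = 36         # micro-bump pitch in Ponte Vecchio

density_ratio = (LAKEFIELD_PITCH_UM / PVC_PITCH_UM) ** 2
print(f"Density improvement: {density_ratio:.2f}x")  # prints "Density improvement: 1.93x"
```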

Power delivery through-silicon vias (TSVs) passing through the base die are built as 1×2, 2×1, 2×2, 2×3 and 2×4 arrays within a single C4 bump shadow. Die-to-die routing and power delivery use two top-level thick copper metal layers at 1μm and 4μm pitch.

Each base die connects to four HBM2E tiles and a SerDes tile using 55μm-pitch EMIBs. The base tile also contains a 144MB L3 cache, called the Memory Fabric (MF), with a complex geometric topology. This operates at 4096B/cycle to support the distributed caches located under the shadow of the compute tile cores.

The L3 cache is a large store that backs the various L1 caches inside the cores. It is organized as multiple independent banks, each of which can perform one 64B read or write operation per clock.
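The quoted numbers allow a sanity check on the banking: at 64B per bank access, sustaining 4096B/cycle requires at least 64 banks active per cycle. This is an inference from the article's figures, not a bank count Intel has stated:

```python
# Infer the minimum number of independent L3 (Memory Fabric) banks
# needed to sustain the quoted fabric bandwidth.

FABRIC_BYTES_PER_CYCLE = 4096   # Memory Fabric operates at 4096B/cycle
BANK_BYTES_PER_CYCLE = 64       # each bank does one 64B read/write per clock

min_banks = FABRIC_BYTES_PER_CYCLE // BANK_BYTES_PER_CYCLE
print(f"At least {min_banks} banks must be active per cycle")  # prints "At least 64 banks must be active per cycle"
```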

The base tile connects the compute tiles and RAMBO tiles using a 3D stacked die-to-die link called Foveros Die Interconnect (FDI), organized as eight groups, with each group consisting of 800 lanes per compute tile.

The transmitter (Tx) and receiver (Rx) circuits of this interface are powered by the compute tile rail, and level-shifters on the base tile convert signals to the base tile supply voltage within the asynchronous interface. After traversing the FDI link, signals are decompressed back to full width in the destination Rx domain.

Each group uses common clocking with phase compensation on the base die to correct for variation between base and compute tiles. This necessitates a base-to-compute tile clock and a return clock going back to the base die to enable clock compensation.

All of this enables 2Tbit/s of aggregate memory and scale-out bandwidth for accelerating machine learning applications. Intel’s benchmarking shows the device achieving inference throughput of over 43,000 images/s on the ResNet-50 neural network, with training throughput reaching 3400 images/s.
