14,336 ARM cores in chiplet-based waferscale AI engine

October 15, 2021 // By Nick Flaherty
14,336 ARM cores in chiplet-based waferscale AI engine
Researchers in the US have used 2048 chiplets in the world’s largest waferscale machine learning engine with 14,336 ARM Cortex M3 cores.

Researchers at UCLA and the University of Illinois in the US integrated pre-tested known-good un-packaged bare dies, or chiplets, on a waferscale interconnect substrate. The system comprises an array of 1024 tiles, where each tile is composed of two chiplets, for a total of 2048 chiplets and about 15,000 mm2 of total area.

“To the best of our knowledge, this is the largest chiplet assembly based system ever attempted,” said the team in a recent paper. “In terms of active area, our prototype system is about 10x larger than a single chiplet-based system from Nvidia/AMD and about 100x larger than the 64-chiplet Simba research system from Nvidia.”

In contrast, the second generation AI system from Cerebras has 850,000 optimised tensor cores across 46,225m2 on a single wafer.

The chiplet-based waferscale system developed by UCLA uses silicon interconnect Fabric (Si-IF) for tight integration of many chiplets on a high-density interconnect wafer on a fine-pitch copper pillar based (10µm pitch) I/Os which are at least 16x denser than conventional µ-bumps used in an interposer based system, as well as ∼100µm inter-chiplet spacing.

The chiplets can be manufactured in heterogeneous technologies and can potentially provide better cost-performance trade-offs with terabytes of memory at 100s of Tbit/s alongside PFLOPs of compute throughput for high performance computing and AI applications.

Related articles

“The scale of this prototype system forced us to rethink several aspects of the design flow. Because this is the first attempt at building such a system, there were several unknowns around the manufacturing and assembly process,” said the team in the paper. “As a result, fault tolerance and resiliency, was one of the primary drivers behind the design decisions we took. We also ensured that the design decisions were not too complex, such that they could be reliably implemented by a small team,” they said.

Each tile is comprised of two chiplets: a Compute chiplet and a Memory chiplet. Each 40nm  compute chiplet contains 14 independently programmable ARM Cortex-M3 processor cores with 64kbits of local SRAM while the memory chiplet provides 512KB of globally shared memory. The system is architected as a unified memory system where any core on any tile can directly access the globally shared memory across the entire waferscale system using the interconnect.

The chiplets are designed and fabricated in the TSMC 40nm-LP process and terminated at the top copper metal layer where the fine-pitch I/O pads were built. The waferscale substrate is a passive substrate containing the interconnect wiring between the chiplets and copper pillars to connect to the chiplet I/Os. The chiplets are flip-chip bonded on to the waferscale substrate and the power is delivered via edge connections.

As the size of the wafer substrate is much larger than the maximum size of a reticle, the Si-IF substrate had to be designed such that it is step-and-repeatable. The entire wafer is divided into smaller identical reticles and is fabricated by stitching these reticles, each consisting of 72 tiles (12x6). The inter-chiplet links within each reticle have width of 2 µm and spacing of 3 µm, but at the edge of each reticle the links escaping are made fatter (width increases to 3 µm and spacing re[1]duces to 2 µm), while keeping the pitch constant, in order to reduce the impact of reticle stitching error.

A number of I/Os from each of the tiles at the edge of the mesh needs to fan-out to the edge of the wafer and connect to the external connectors so the fan-out wiring and the edge I/O pads are designed into each reticle. The chiplet slots on the Si-IF substrate from the edge reticles would remain un-populated and the external connectors would connect to the pads in these reticles.

To ensure that these I/O pads don’t cause an issue where chiplets are bonded, the team used a custom block etch process to remove the pads wherever they are not needed. If a foundry supports multiple reticles per wafer, the edge of the wafer can also be printed using a separate mask.

Other chiplet articles

Other articles on eeNews Europe


Vous êtes certain ?

Si vous désactivez les cookies, vous ne pouvez plus naviguer sur le site.

Vous allez être rediriger vers Google.