Second-generation at-memory compute architecture unveiled
At the Hot Chips 2022 conference, Untether AI™ announced its next-generation at-memory compute architecture for accelerating AI inference workloads: the speedAI family of devices, internally codenamed “Boqueria.” At 30 TFlops/W and 2 PetaFlops of performance, the speedAI architecture sets a new standard for energy efficiency and compute density.
Challenges of AI inference acceleration
AI is increasingly being deployed across a variety of markets, from financial technology, smart cities, and retail to natural language processing, autonomous vehicles, and scientific applications. There has been an explosion in both the types of neural network architectures and the demand for compute, resulting in increased energy consumption for AI workloads. These demanding applications also require increasing levels of accuracy to ensure safety and quality of results. Together, the requirements of flexibility, performance combined with energy efficiency, and accuracy necessitate a new approach to AI acceleration, which Untether AI delivers with its speedAI devices.
“The merits of at-memory compute have been proven with the first generation runAI device, and the second generation speedAI architecture enhances the energy efficiency, throughput, accuracy, and scalability of our offering,” said Arun Iyengar, CEO of Untether AI. “The speedAI devices offer an ability that is unmatched by any other inference offering in the marketplace.”
Energy efficiency drives performance
Because at-memory compute is significantly more energy efficient than traditional von Neumann architectures, more operations can be performed within a given power envelope. With the introduction of the runAI devices in 2020, Untether AI set a new energy efficiency level of 8 TOPs/W for the INT8 datatype. The speedAI architecture dramatically improves upon that, delivering 30 TFlops/W. This energy efficiency is the product of the second-generation at-memory compute architecture, over 1,400 optimized RISC-V processors with custom instructions, energy-efficient dataflow, and the adoption of a new FP8 datatype, all of which help roughly quadruple efficiency compared to the previous-generation runAI device. The first member of the family, the speedAI240 device, provides 2 PetaFlops of FP8 performance and 1 PetaFlop of BF16 performance. This translates into industry-leading performance and efficiency on neural networks like BERT-base, which speedAI240 can run at over 750 queries per second per watt (qps/W), 15x greater than the current state of the art from leading GPUs.
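The headline figures above can be cross-checked with simple arithmetic. The Python sketch below uses only numbers quoted in this announcement; the helper names are illustrative. The implied power figure is a rough estimate rather than a published specification, since peak throughput and peak efficiency may not occur at the same operating point, and the generational comparison mixes INT8 TOPs with FP8 TFlops.

```python
# Figures quoted in the announcement.
SPEEDAI_TFLOPS_PER_W = 30.0    # speedAI efficiency (FP8)
RUNAI_TOPS_PER_W = 8.0         # runAI efficiency (INT8)
SPEEDAI240_FP8_PFLOPS = 2.0    # speedAI240 peak FP8 throughput


def efficiency_gain(new_per_watt: float, old_per_watt: float) -> float:
    """Ratio of new to old operations per watt."""
    return new_per_watt / old_per_watt


def implied_power_watts(pflops: float, tflops_per_watt: float) -> float:
    """Power implied if peak throughput were delivered at the quoted efficiency."""
    return pflops * 1000.0 / tflops_per_watt


gain = efficiency_gain(SPEEDAI_TFLOPS_PER_W, RUNAI_TOPS_PER_W)
power = implied_power_watts(SPEEDAI240_FP8_PFLOPS, SPEEDAI_TFLOPS_PER_W)

print(f"Generational efficiency gain: {gain:.2f}x")
print(f"Implied power at peak FP8 throughput: {power:.1f} W")
```

Under these assumptions the gain works out to 3.75x, consistent with the "quadruple" claim, and the implied power is on the order of 67 W.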
Second-generation memory bank
Each memory bank of the speedAI architecture has 512 processing elements directly attached to dedicated SRAM. These processing elements support the INT4, FP8, INT8, and BF16 datatypes, along with zero-detect circuitry for energy conservation and support for 2:1 structured sparsity. The processing elements are arranged in 8 rows of 64, and each row has its own dedicated row controller and hardwired reduce functionality, allowing flexibility in programming and efficient computation of transformer network functions such as Softmax and LayerNorm. The rows are managed by two RISC-V processors with over 20 custom instructions designed for inference acceleration. The flexibility of the memory bank allows it to adapt to a variety of neural network architectures, including convolutional, transformer, and recommendation networks, as well as linear algebra models.
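To make the bank organization concrete, the following Python sketch models one bank as 8 rows of 64 multiply-accumulate elements, with zero operands skipped and a per-row reduction. It is a conceptual illustration only: the function name, the data layout, and the way work is mapped onto rows are assumptions, not a description of the actual hardware or its programming model.

```python
ROWS = 8            # rows per memory bank (from the text)
PES_PER_ROW = 64    # processing elements per row (from the text)


def bank_dot(weights, activations):
    """Hypothetical model of one bank computing 8 dot products.

    weights:     ROWS lists of PES_PER_ROW values, one weight per element
    activations: PES_PER_ROW values shared by every row
    Returns (row_sums, macs_skipped).
    """
    assert len(weights) == ROWS and len(activations) == PES_PER_ROW
    row_sums = []
    macs_skipped = 0
    for row in weights:
        assert len(row) == PES_PER_ROW
        acc = 0.0
        for w, a in zip(row, activations):
            if w == 0 or a == 0:
                # Zero-detect: the product is known to be zero, so the
                # multiply is skipped (modeled here as a simple counter).
                macs_skipped += 1
                continue
            acc += w * a
        # Row-level reduce: one sum per row of 64 elements.
        row_sums.append(acc)
    return row_sums, macs_skipped


# Example: the first weight in every row is zero, so 8 multiplies are skipped.
example_weights = [[0.0] + [1.0] * (PES_PER_ROW - 1) for _ in range(ROWS)]
example_activations = [2.0] * PES_PER_ROW
sums, skipped = bank_dot(example_weights, example_activations)
print(sums[0], skipped)
```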
FP8 — the new datatype for accurate inference acceleration
In its search for energy efficiency, Untether AI’s research determined that two different FP8 formats provided the best mix of precision, range, and efficiency. A version with a 4-bit mantissa (FP8p, for “precision”) and a version with a 3-bit mantissa (FP8r, for “range”) provided the best accuracy and throughput for inference across a variety of networks. For both convolutional networks like ResNet-50 and transformer networks like BERT-Base, Untether AI’s implementation of FP8 results in less than 1/10th of 1 percent accuracy loss compared to using BF16 datatypes, with a fourfold increase in throughput and energy efficiency.
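The trade-off between the two formats can be illustrated with a small rounding routine. The sketch below is not Untether AI's implementation. It assumes one sign bit, so that a 4-bit mantissa leaves 3 exponent bits and a 3-bit mantissa leaves 4, and it further assumes an IEEE-style exponent bias, subnormals, a reserved top exponent code, and saturation on overflow. None of those details are stated in the announcement, and the function name is hypothetical.

```python
import math


def quantize_fp8(x: float, exp_bits: int, man_bits: int) -> float:
    """Round x to the nearest value of a small float format (assumed layout)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1          # assumption: IEEE-style bias
    e_min = 1 - bias                        # smallest normal exponent
    e_max = (2 ** exp_bits - 2) - bias      # assumption: top code reserved
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max

    a = abs(x)
    _, e = math.frexp(a)                    # a = m * 2**e with 0.5 <= m < 1
    e -= 1                                  # exponent of the leading 1 bit
    e = max(e, e_min)                       # subnormals share e_min
    step = 2.0 ** (e - man_bits)            # spacing of representable values
    q = round(a / step) * step              # round half to even
    q = min(q, max_val)                     # assumption: saturate on overflow
    return math.copysign(q, x)


# Assumed layouts: FP8p = 3 exponent bits + 4 mantissa bits,
#                  FP8r = 4 exponent bits + 3 mantissa bits.
for value in (0.3, 100.0):
    p = quantize_fp8(value, exp_bits=3, man_bits=4)
    r = quantize_fp8(value, exp_bits=4, man_bits=3)
    print(f"{value}: FP8p -> {p}, FP8r -> {r}")
```

Under these assumptions, the 4-bit mantissa format rounds 0.3 more closely, while the 3-bit mantissa format can still represent a value near 100 where the other saturates. That is the precision-versus-range trade-off the two names describe.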
Scalability for large language models
The speedAI240 device is designed to scale to large models. The memory architecture is multi-leveled, with 238MB of SRAM dedicated to the processing elements offering 1 petabyte/s of memory bandwidth, four 1MB scratchpads, and two 64-bit wide ports of LPDDR5, providing up to 32GB of external DRAM. Host and chip-to-chip connectivity is provided by high-speed PCI-Express Gen5 interfaces.
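As a rough illustration of how the memory tiers relate to model size, the sketch below checks where a model's weights alone would fit. It is a back-of-the-envelope estimate, not the SDK's allocation logic: the function name is hypothetical, MB and GB are treated as decimal units, and activations, scratch space, and any runtime overhead are ignored. The parameter counts in the example are approximate and are not taken from the announcement.

```python
SRAM_BYTES = 238 * 10**6    # on-chip SRAM (from the text, decimal MB assumed)
DRAM_BYTES = 32 * 10**9     # maximum external LPDDR5 (from the text)


def weight_placement(num_params: int, bytes_per_param: int) -> str:
    """Return the smallest tier that could hold the weights (weights only)."""
    size = num_params * bytes_per_param
    if size <= SRAM_BYTES:
        return "on-chip SRAM"
    if size <= DRAM_BYTES:
        return "external LPDDR5"
    return "multiple devices"


# Approximate parameter counts, for illustration only.
print(weight_placement(110_000_000, 1))      # roughly BERT-base at FP8
print(weight_placement(7_000_000_000, 1))    # a 7-billion-parameter model at FP8
print(weight_placement(70_000_000_000, 1))   # a 70-billion-parameter model at FP8
```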
The Untether AI imAIgine™ Software Development Kit (SDK) provides a path to running networks at high performance, with push-button quantization, optimization, physical allocation, and multi-chip partitioning. The imAIgine SDK also provides an extensive visualization toolkit, a cycle-accurate simulator, and an easily integrated runtime API. It is available now.
The speedAI devices will be offered as standalone chips as well as on a variety of M.2 and PCI-Express form factor cards. Sampling of speedAI240 devices and cards to early access customers is expected to begin in the first half of 2023.