The startup has also discussed a PCIe card, called tsunAImi, that integrates four such processors to provide up to two peta-operations per second (2 POPS), an efficiency of 8 TOPS/W. The announcement was made by way of a presentation at the Fall Linley Processor Conference.
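The two headline figures imply a power budget. As a back-of-envelope check (assuming the 2 POPS throughput and the 8 TOPS/W efficiency refer to the same int8 workload, which the announcement does not state explicitly):

```python
# Back-of-envelope check: power draw implied by the tsunAImi card's
# quoted throughput (2 POPS = 2,000 TOPS) and efficiency (8 TOPS/W).
# Assumes both figures describe the same workload -- an assumption,
# not a stated spec.

def implied_power_watts(throughput_tops: float, efficiency_tops_per_watt: float) -> float:
    """Power (W) implied by a throughput figure and an efficiency figure."""
    return throughput_tops / efficiency_tops_per_watt

CARD_TOPS = 2_000.0   # 2 peta-operations/s
EFFICIENCY = 8.0      # TOPS per watt

print(implied_power_watts(CARD_TOPS, EFFICIENCY))      # → 250.0 (W per card)
print(implied_power_watts(CARD_TOPS / 4, EFFICIENCY))  # → 62.5 (W per chip)
```

On these assumptions the four-chip card would draw on the order of 250 W, or roughly 62.5 W per chip.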
Untether AI was founded in 2017 to develop a high-performance neural network inference engine based on the idea of moving processing to the data sets rather than moving data to the processor, as in the von Neumann architecture at the heart of most uniprocessor designs. Untether AI calls this "at-memory" computing, an approach that relies on a rich compiler and software stack capable of pre-placing data and optimizing resource allocation.
Bob Beachler, vice president of products at Untether AI, said that the chips and PCIe board are optimized for inference and would be used across a range of applications, from data centers through AI service providers down to edge servers.
Untether's architectural decision is supported by its own analysis showing that, in von Neumann architectures, 91 percent of the energy is consumed moving data and only 8 percent in the multiply-accumulate logic.
The inference optimization means that runAI200 is designed to hold complete neural networks and their coefficients on a single chip, or across four chips in the case of the PCIe card. The native batch size is one, to support the lowest latency. The chip is implemented in a 16nm CMOS process from TSMC and contains 200Mbytes of SRAM with 260,000 processing elements dispersed among the SRAM. The design supports int8 and int16 data types and runs at a 720MHz clock frequency in an efficiency mode, with a 960MHz mode optimized for performance.
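These per-chip figures line up with the card-level claim. A quick sanity check, assuming (the article does not say this) that each processing element completes one int8 multiply-accumulate, counted as two operations, per clock cycle:

```python
# Sanity check of the headline numbers from the published specs.
# Assumption (not stated in the article): each processing element
# performs one int8 multiply-accumulate (= 2 ops) per clock cycle.

PES_PER_CHIP = 260_000
OPS_PER_MAC = 2           # one multiply + one accumulate
PERF_CLOCK_HZ = 960e6     # performance-optimized mode

chip_tops = PES_PER_CHIP * OPS_PER_MAC * PERF_CLOCK_HZ / 1e12
card_tops = 4 * chip_tops   # four runAI200 chips on the tsunAImi card

print(f"{chip_tops:.1f} TOPS per chip")   # → 499.2 TOPS per chip
print(f"{card_tops:.1f} TOPS per card")   # → 1996.8 TOPS ≈ 2 POPS
```

Under that assumption, 260,000 elements at 960MHz work out to roughly 500 TOPS per chip, and four chips land within rounding of the 2 POPS quoted for the card.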