Flex revamps NN fabric, announces edge AI processor
Six months ago Flex Logix flagged up a move into neural network acceleration with plans for an nnMAX 512 licensable tile (see Flex Logix tips move into NN processing). However, prior to tape-out and in response to conversations with potential customers, Flex Logix has decided to increase the size of the tile to 1,024 DSP multiply-accumulate (MAC) units.
The result is an nnMAX tile of 1,024 MACs with local SRAM, which in a 16nm FinFET process has a peak performance of approximately 2.1 TOPS. nnMAX tiles can be arrayed into NxN arrays of any size, without any GDS change, with varying amounts of SRAM as needed to optimize for the target neural network model, up to more than 100 TOPS peak performance.
The reason for the change is some mathematics known as the Winograd transform and how it applies to convolutional neural networks (CNNs), according to Geoff Tate, CEO of Flex Logix. It turned out that the Winograd transform, when applied to CNNs, can provide superior efficiency and speed up calculations, but it also requires clusters of 16 MACs close together.
There are implications for loss of resolution, so to preserve INT8 accuracy the Winograd calculations are done with 12-bit resolution.
The result was that a slightly larger tile was desirable. For 3×3 convolutions with a stride of one, which can represent 75 percent of CNN operations, the transform provides a speed-up of about 2.25 times.
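The 2.25 figure and the clusters of 16 MACs are consistent with the widely used Winograd F(2×2, 3×3) formulation, in which a 2×2 block of outputs needs 16 element-wise multiplies instead of the 36 of a direct 3×3 convolution. The sketch below illustrates that arithmetic in plain Python. It assumes the standard published transform matrices and is not a description of how Flex Logix implements the transform in hardware; the function names are illustrative.

```python
# Winograd F(2x2, 3x3): standard transform matrices (assumed, not vendor-specific).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1, 0, 0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0, 0, 1]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]

DIRECT_MULTIPLIES = 2 * 2 * 3 * 3   # 4 outputs x 9 taps = 36
WINOGRAD_MULTIPLIES = 4 * 4         # one multiply per transformed element = 16


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def winograd_2x2_3x3(tile, kernel):
    """Compute a 2x2 output block from a 4x4 input tile and a 3x3 kernel."""
    u = matmul(matmul(G, kernel), transpose(G))    # transformed kernel, 4x4
    v = matmul(matmul(BT, tile), transpose(BT))    # transformed input, 4x4
    # The only data-dependent multiplies: 16 element-wise products.
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, m), transpose(AT))    # inverse transform, 2x2


def direct_2x2_3x3(tile, kernel):
    """Reference result: stride-1 3x3 sliding-window sum over the same tile."""
    return [[sum(tile[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(2)]
            for r in range(2)]


if __name__ == "__main__":
    tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(winograd_2x2_3x3(tile, kernel))
    print(direct_2x2_3x3(tile, kernel))
    print(DIRECT_MULTIPLIES / WINOGRAD_MULTIPLIES)
```

The input and output transforms here go through a generic matrix multiply for brevity. In practice they reduce to additions and halvings, and the kernel transform can be precomputed once per layer, which is why only the 16 element-wise products are counted. The intermediate values have a wider range than the inputs, which is in line with the article's point about using more than 8 bits internally.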
InferX X1 edge inference coprocessor: 1.067GHz clock frequency on TSMC16FFC. Source: Flex Logix Technologies Inc.
The InferX X1 edge coprocessor will have four such tiles and a single 32-bit-wide LPDDR4 interface to DRAM.
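As a rough cross-check of the figures quoted so far, the sketch below estimates peak throughput from the MAC count and clock frequency, and finds the smallest NxN array of tiles that would exceed 100 TOPS. It assumes each MAC counts as two operations per cycle (one multiply and one add), which is the usual convention but is not stated in the article. The result for one tile, about 2.19 TOPS, is in the same range as the quoted figure of approximately 2.1 TOPS. The four-tile number is an estimate, not a vendor specification.

```python
import math

MACS_PER_TILE = 1024       # from the article
CLOCK_HZ = 1.067e9         # from the figure caption (TSMC16FFC)
OPS_PER_MAC = 2            # assumption: one multiply plus one add per cycle
QUOTED_TILE_TOPS = 2.1     # per-tile peak performance quoted in the article


def peak_tops(tiles, macs=MACS_PER_TILE, clock_hz=CLOCK_HZ):
    """Estimated peak tera-operations per second for a number of tiles."""
    return tiles * macs * OPS_PER_MAC * clock_hz / 1e12


def min_square_array(target_tops, tile_tops=QUOTED_TILE_TOPS):
    """Smallest N such that an NxN array of tiles reaches target_tops."""
    return math.ceil(math.sqrt(target_tops / tile_tops))


if __name__ == "__main__":
    print(f"one tile:        {peak_tops(1):.2f} TOPS")
    print(f"four tiles (X1): {peak_tops(4):.2f} TOPS")
    print(f"N for 100 TOPS:  {min_square_array(100)}")
```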
The efficiency of computation and the reduced movement of data mean a much higher throughput per watt than existing solutions, with a performance advantage that is especially strong at low batch sizes. These are required for edge applications, where there is typically one camera or sensor.
For YOLOv3 real-time object recognition, InferX X1 processes 12.7 frames per second of 2-megapixel images at a batch size of 1. Performance is roughly linear with image size: the frame rate approximately doubles for a 1-megapixel image.
The nnMAX 1K tile and InferX X1 coprocessor support 8-bit, 16-bit and bfloat16 numerics with the ability to mix them across layers. InferX is programmed using TensorFlow Lite and ONNX, two of the most popular inference ecosystems.
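The scaling claim can be turned into a quick estimate. The sketch below assumes that frame time is strictly proportional to pixel count, which the article describes only as roughly true, so the outputs are approximations rather than measured results. The function names are illustrative.

```python
BASE_FPS = 12.7           # YOLOv3 at batch size 1, from the article
BASE_MEGAPIXELS = 2.0     # image size for the quoted frame rate


def estimated_fps(megapixels):
    """Estimated frame rate, assuming frame time scales linearly with pixels."""
    if megapixels <= 0:
        raise ValueError("image size must be positive")
    return BASE_FPS * BASE_MEGAPIXELS / megapixels


def frame_latency_ms(megapixels):
    """Estimated time per frame in milliseconds at batch size 1."""
    return 1000.0 / estimated_fps(megapixels)


if __name__ == "__main__":
    for mp in (2.0, 1.0, 0.5):
        print(f"{mp} MP: {estimated_fps(mp):.1f} fps, "
              f"{frame_latency_ms(mp):.1f} ms per frame")
```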
“The difficult challenge in neural network inference is minimizing data movement and energy consumption, which is something our interconnect technology can do amazingly well,” said Geoff Tate, CEO of Flex Logix. “While processing a layer, the datapath is configured for the entire stage using our reconfigurable interconnect, enabling InferX to operate like an ASIC, then reconfigure rapidly for the next layer. Because most of our bandwidth comes from local SRAM, InferX requires just a single DRAM, simplifying die and package, and cutting cost and power.”
InferX X1 will be available as chips for edge devices and on half-height, half-length PCIe cards for edge servers and gateways. It is programmed using the nnMAX Compiler, which takes TensorFlow Lite or ONNX models. The internal architecture of the inference engine is hidden from the user.
The nnMAX 1K is in development and will be available for integration in SoCs by 3Q19. The InferX X1 is due to tape out in 3Q19, and samples of chips and PCIe boards will be available shortly after.
Related links and articles:
News articles:
Flex Logix tips move into NN processing
Eta adds spiking neural network support to MCU
Microsoft, Alexa, Bosch join Intel by investing in Syntiant
NovuMind benchmarks tensor processor