That will be the NNP-L1000, codenamed Spring Crest. In addition, Intel plans to support a novel data type called bfloat16 on the NNP-L1000 and, over time, to extend bfloat16 support across Xeon processors and FPGAs.
At the time of its acquisition in 2016, Nervana had a processor called Engine, a multi-chip module built on a silicon interposer. It had terabytes of 3D memory surrounding a 3D torus fabric of connected neurons that used low-precision floating-point math. The use of low precision, now a mainstream idea in machine learning, was one of the things that gave it an advantage over the general-purpose GPUs then in use.
Intel pulled the development back in house and the result was the Nervana Lake Crest chip. General Matrix to Matrix Multiplication (GEMM) operations using A(1536, 2048) and B(2048, 1536) matrix sizes have achieved more than 96.4 percent compute utilization on a single chip. This represents around 38TOP/s of performance within a power budget of 210 watts.
“And this is just the prototype of our Intel Nervana NNP Lake Crest, from which we are gathering feedback from our early partners,” said Naveen Rao, founder of Nervana and now vice president and general manager of the artificial intelligence products group at Intel.
The Lake Crest prototype scales by forming arrays of chips. Multichip distributed GEMM operations that support model-parallel training are realizing nearly linear scaling, with 96.2 percent scaling efficiency for A(6144, 2048) and B(2048, 1536) matrix sizes. Chip-to-chip communication is achieving 89.4 percent of the theoretical bandwidth of 2.4Tbps with less than 790ns of latency.
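To put the quoted figures in context, the short Python sketch below works through the arithmetic behind them. It is a back-of-envelope illustration, not Intel's methodology: it assumes the usual convention of counting one multiply and one add per term of a GEMM, and it assumes that 38TOP/s is the achieved rate at 96.4 percent utilization. The function and variable names are made up for this example.

```python
# Back-of-envelope arithmetic for the Lake Crest figures quoted above.
# Assumptions (not from Intel): 2 operations per multiply-accumulate, and
# 38 TOP/s taken as the achieved rate at 96.4 percent utilization.

def gemm_ops(m, k, n):
    """Operations in C = A(m, k) x B(k, n): one multiply and one add per term."""
    return 2 * m * k * n

single_chip_ops = gemm_ops(1536, 2048, 1536)  # A(1536, 2048) x B(2048, 1536)
multi_chip_ops = gemm_ops(6144, 2048, 1536)   # A(6144, 2048) x B(2048, 1536)

achieved_tops = 38.0
implied_peak_tops = achieved_tops / 0.964     # peak implied by the utilization
tops_per_watt = achieved_tops / 210           # within the 210 watt budget

effective_link_tbps = 0.894 * 2.4             # 89.4 percent of 2.4 Tbps

print(f"single-chip GEMM: {single_chip_ops / 1e9:.2f} GOP")
print(f"multichip GEMM:   {multi_chip_ops / 1e9:.2f} GOP")
print(f"implied peak:     {implied_peak_tops:.1f} TOP/s")
print(f"efficiency:       {tops_per_watt:.2f} TOP/s per watt")
print(f"effective link:   {effective_link_tbps:.2f} Tbps")
```

Under these assumptions the multichip problem is four times the size of the single-chip one, and the implied peak comes out a little over 39TOP/s.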
Rao said that he anticipates that the NNP-L1000 Spring Crest processor will achieve 3 to 4 times the training performance of the first-generation, limited-distribution Lake Crest product.
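The bfloat16 format is a truncation of float32 to its first 16 bits: 1 sign bit, 8 bits for the exponent and 7 bits for the mantissa. Because it keeps the full float32 exponent, it covers the same dynamic range at lower precision. This makes for easy conversion to and from float32 while minimizing the risk of hard-to-compute or impossible-to-compute artefacts when switching from float32.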
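The truncation can be sketched in a few lines of Python using only the standard library. This is a minimal illustration of the bit layout, not Intel's implementation: the helper names are invented for this example, and it simply drops the low 16 bits, whereas real hardware or libraries may round to nearest instead.

```python
import struct

def float32_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern: the top 16 bits of x as float32."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # plain truncation; the low 16 mantissa bits are lost

def bfloat16_bits_to_float32(bits16):
    """Widen a bfloat16 pattern to float32 by appending 16 zero bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value

def split_fields(bits16):
    """Split a bfloat16 pattern into (sign, exponent, mantissa) fields."""
    sign = bits16 >> 15               # 1 bit
    exponent = (bits16 >> 7) & 0xFF   # 8 bits, same width as float32
    mantissa = bits16 & 0x7F          # 7 bits
    return sign, exponent, mantissa

bits = float32_to_bfloat16_bits(3.14159)
print(hex(bits))                       # 0x4049
print(split_fields(bits))              # (0, 128, 73)
print(bfloat16_bits_to_float32(bits))  # 3.140625
```

The round trip shows the trade-off: the exponent survives intact, so the magnitude is preserved, but only about two to three significant decimal digits remain.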
Intel said it is also working to integrate popular deep learning frameworks such as TensorFlow, MXNet, PaddlePaddle, CNTK and ONNX onto nGraph, Intel’s open-source library that lets development frameworks run deep learning computations efficiently on a variety of processors.