Blaize details architecture of GSP edge AI chip

By Nick Flaherty

Blaize has developed a graph streaming processor (GSP), codenamed El Cano, for edge AI applications as well as real-time video processing and sensor fusion. The fully programmable engine consists of 16 cores and can handle up to 16TOPS of 8bit integer operations. It has been used in a range of boards costing from $300 to $1000.

“We are a chip company and there will be a chip business for us, for example in automotive,” said Richard Terrill, VP of strategic business development at Blaize, following the announcement of its first system-on-module (SoM) and PCI Express boards last week. “The board companies are likely to participate but with the module with the Samtec connector for the SoM where there is no host.”

The chip is built on Samsung’s 14nm process technology and has a power consumption of 7W. This is achieved by using a streaming approach that analyses the framework during compilation and allocates the resources on multiple chips via a scheduler and 4Mbits of optimised on-chip memory.
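As a rough back-of-envelope check on the quoted figures, the sketch below derives peak efficiency, per-core throughput and on-chip memory in bytes. It assumes the 16TOPS and 7W figures describe the same operating point and that 1Mbit means 1024 x 1024 bits; neither assumption is confirmed in the article, and the function names are illustrative only.

```python
# Figures quoted in the article.
PEAK_TOPS_INT8 = 16        # peak 8bit integer throughput, tera-operations per second
CORES = 16                 # number of GSP cores
POWER_W = 7                # quoted power consumption in watts
ON_CHIP_MEMORY_MBIT = 4    # quoted on-chip memory in megabits


def tops_per_watt(tops, watts):
    """Peak efficiency, assuming both figures apply to the same operating point."""
    return tops / watts


def mbit_to_kib(mbit):
    """Convert megabits to kibibytes, assuming 1 Mbit = 1024 * 1024 bits."""
    bits = mbit * 1024 * 1024
    return bits / 8 / 1024


efficiency = tops_per_watt(PEAK_TOPS_INT8, POWER_W)
tops_per_core = PEAK_TOPS_INT8 / CORES
memory_kib = mbit_to_kib(ON_CHIP_MEMORY_MBIT)

print(f"{efficiency:.2f} TOPS/W, {tops_per_core:.0f} TOPS per core, {memory_kib:.0f} KiB on chip")
```

On these assumptions the chip delivers a little under 2.3TOPS per watt at peak, with 512KiB of on-chip memory.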

The hardware scheduler is a frame around the cores that takes a metamap of the framework created at compile time and feeds multiple instances of the scheduler in each core. Each scheduler knows the data complexities and has full autonomy to allocate the thread slots in each core. Any forks or deadlocks are resolved by the NetDeploy compiler tool ahead of time to prevent the scheduler being the bottleneck.
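The idea of a scheduler that issues work as soon as its dependencies are met can be illustrated with a small software model. This is only a sketch of generic dependency-driven scheduling, not Blaize's hardware design: the `schedule` function, the graph format and the fixed number of thread slots per step are all assumptions made for illustration.

```python
from collections import deque


def schedule(graph, slots):
    """Issue graph nodes in dependency order, at most `slots` nodes per step.

    `graph` maps each node name to the list of nodes it depends on.
    Returns a list of steps, each a list of the nodes issued together.
    """
    remaining = {node: len(deps) for node, deps in graph.items()}
    consumers = {node: [] for node in graph}
    for node, deps in graph.items():
        for dep in deps:
            consumers[dep].append(node)

    # Nodes with no outstanding dependencies are ready to issue.
    ready = deque(sorted(node for node, count in remaining.items() if count == 0))
    steps = []
    while ready:
        issued = [ready.popleft() for _ in range(min(slots, len(ready)))]
        steps.append(issued)
        # Completing a node may make its consumers ready.
        for node in issued:
            for consumer in consumers[node]:
                remaining[consumer] -= 1
                if remaining[consumer] == 0:
                    ready.append(consumer)

    if sum(len(step) for step in steps) != len(graph):
        # Mirrors the article's point that deadlocks must be resolved ahead of time.
        raise ValueError("graph contains a cycle and cannot be scheduled")
    return steps


# Hypothetical four-node graph: two convolutions feed an add, then a ReLU.
example_graph = {"conv1": [], "conv2": [], "add": ["conv1", "conv2"], "relu": ["add"]}
print(schedule(example_graph, slots=2))
# [['conv1', 'conv2'], ['add'], ['relu']]
```

Because the graph is known at compile time, a cycle is detected before anything runs, which is the software analogue of resolving deadlocks in the compiler rather than in the scheduler.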

Each core is an array of small 4bit load-store execution units that can be combined in real time to handle larger operations from 8bit integer up to 16bit floating point. These are connected via a two-dimensional register file that is contiguous with a single unified address space. The schedulers allocate the resources depending on the incoming data and the register file using the AI framework and algorithms from the compilation tool.
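The article does not say how the 4bit units are combined, but the general principle of building a wider operation from narrower ones can be shown with an unsigned 8bit multiply assembled from four 4bit partial products. The functions `mul4` and `mul8_from_4bit` are hypothetical and say nothing about the actual El Cano datapath.

```python
def mul4(a, b):
    """Model of a 4bit x 4bit unsigned multiplier."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must fit in 4 bits")
    return a * b


def mul8_from_4bit(x, y):
    """Unsigned 8bit x 8bit multiply built from four 4bit partial products."""
    if not (0 <= x < 256 and 0 <= y < 256):
        raise ValueError("operands must fit in 8 bits")
    x_hi, x_lo = x >> 4, x & 0xF
    y_hi, y_lo = y >> 4, y & 0xF
    # (16*x_hi + x_lo) * (16*y_hi + y_lo), expanded into shifted partial products.
    high = mul4(x_hi, y_hi) << 8
    middle = (mul4(x_hi, y_lo) + mul4(x_lo, y_hi)) << 4
    low = mul4(x_lo, y_lo)
    return high + middle + low


print(mul8_from_4bit(200, 123) == 200 * 123)
```

The same decomposition extends to wider operands by splitting them into more 4bit slices, at the cost of more partial products.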

“In the interest of latency we fracture the work into small elements and maintain the dependencies, e.g. running CNN in 8x8x8 cells, and each thread in each core works on this and understands the scheduling and dependencies,” said Cook. “It’s not a VLIW machine, but we do have task level parallelism with instruction pickers in hardware.”
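Fracturing work into 8x8x8 cells can be pictured as tiling a three-dimensional tensor into fixed-size blocks that threads then process independently. The sketch below only computes the index ranges of each cell; the `cells` helper, the tensor shape and the handling of partial edge cells are assumptions for illustration, not details from Blaize.

```python
from itertools import product


def tile_ranges(extent, tile):
    """Split one axis of length `extent` into (start, stop) ranges of at most `tile`."""
    return [(start, min(start + tile, extent)) for start in range(0, extent, tile)]


def cells(shape, tile=8):
    """Return every cell of a tensor as a tuple of per-axis (start, stop) ranges."""
    per_axis = [tile_ranges(extent, tile) for extent in shape]
    return list(product(*per_axis))


# Hypothetical 16 x 16 x 32 feature map split into 8x8x8 cells.
feature_map_cells = cells((16, 16, 32))
print(len(feature_map_cells))   # 16
print(feature_map_cells[0])     # ((0, 8), (0, 8), (0, 8))
```

Each cell is small enough to be handled by one thread, and changing the `tile` argument changes the granularity without touching the calling code, which loosely mirrors the point about growing the hardware without changing the software.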

“We don’t expose the microarchitecture to the software,” he said. “The software does not need to know what the granularity of the SIMD vector is – we have done that through a number of techniques. This means we can grow the compute resources, the SIMD pipelines without the software changing.”

“We have a 2D register file and we can pull blocks of data out of that data file and deliver that data to an execution pipeline. In the future we could run a 64 x 64 cell and that software wouldn’t change as we can tailor the hardware to match that. We have multi-issue instructions in the architecture, and a context switch in one clock.”

NetDeploy is a quantisation and graph optimisation tool and compiler that is target aware so that it can be used across multiple devices, says Cook. “We can cascade a model across multiple devices with seven scheduling mechanisms from a single chip, multi-device, multi-card,” he said. The tool understands the optimal place (or node) to partition a large model across multiple devices for more complex edge AI operation.
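How NetDeploy chooses its partition points is not described, but the underlying problem can be sketched for the simplest case: a chain of layers with known costs, cut into contiguous groups so that the busiest device carries as little work as possible. The `partition` function and the layer costs below are hypothetical, and a real tool would also have to account for memory and interconnect traffic.

```python
def partition(costs, devices):
    """Split a chain of integer layer costs into at most `devices` contiguous groups,
    minimising the cost of the most heavily loaded group.

    Returns a list of groups, each a list of layer indices.
    """
    if not costs or devices < 1:
        raise ValueError("need at least one layer and one device")

    def groups_for(limit):
        # Greedily fill each group up to `limit` before starting the next one.
        groups, current, total = [], [], 0
        for index, cost in enumerate(costs):
            if current and total + cost > limit:
                groups.append(current)
                current, total = [], 0
            current.append(index)
            total += cost
        groups.append(current)
        return groups

    # Binary search for the smallest per-device limit that fits on `devices` devices.
    low, high = max(costs), sum(costs)
    while low < high:
        mid = (low + high) // 2
        if len(groups_for(mid)) <= devices:
            high = mid
        else:
            low = mid + 1
    return groups_for(low)


# Hypothetical per-layer costs for a six-layer model spread over three devices.
layer_costs = [4, 2, 7, 3, 5, 1]
print(partition(layer_costs, devices=3))
# [[0, 1], [2], [3, 4, 5]]
```

In this example the heaviest device carries a cost of 9, and no three-way contiguous split of these layers does better.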

For the intercore connections within the chip there’s an architectural cliff, says Cook. “As you increase the number of compute devices that share information there’s a point at which you can handle that with a multi port shared memory but then you exceed that, you end up building a hierarchical memory, and then you need cache coherent protocols. Instead we have chosen simplicity to scale out at the device level and keep the low cost interconnect on chip,” he said.

The memory required on the chip is another key area for the balance of power consumption and performance. Feedback from a test chip developed last year showed how much memory would be required both on and off the chip. As a result there is 4Mbits in a hierarchical structure. “The schedulers need full access to the memory – it’s not the cores that are in charge, it’s the schedulers,” said Cook. “We use a single unified address space and even though we do lots of hardware address aliasing the software isn’t playing those games.”

This allows multiple AI frameworks or algorithms to run simultaneously across the chip. This is of particular relevance to sensor fusion algorithms that process and combine the flows from multiple sensors.

“If you are running TensorFlow in a 32bit flow then NetDeploy optimises, quantises and compresses it and that becomes a new model, and I can have 100 of those converted models and issue those to the hardware each with a different context,” said Cook. “All of that is happening dynamically, there is not a program. It’s very fine grained and the schedulers are pre-fetching any data they need for when an instruction issues, and this means we can run different networks on the same clock across the chip.”
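The quantisation step mentioned here can be illustrated with a common textbook scheme, symmetric per-tensor quantisation of floating-point weights to 8bit integers. The article does not say which scheme NetDeploy uses, so the `quantise_int8` and `dequantise` functions and the sample weights are assumptions for illustration only.

```python
def quantise_int8(values):
    """Symmetric per-tensor quantisation of floats to integers in [-127, 127].

    Returns the quantised integers and the scale needed to recover the floats.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Map quantised integers back to approximate floating-point values."""
    return [q * scale for q in quantised]


# Hypothetical weights from a 32bit model.
weights = [0.5, -1.0, 0.25, 0.0]
q_weights, q_scale = quantise_int8(weights)
restored = dequantise(q_weights, q_scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, q_scale, worst_error)
```

Each restored value differs from the original by at most half the scale, which is the price paid for storing every weight in 8 bits rather than 32.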

The design, partly developed at design centres in Leeds and London, UK, is aimed at industrial edge AI and automotive sensor fusion applications. “The chips are built in industrial grade as we deploy in harsh environments, and that process gives us an option to the autograde part that will follow,” said Terrill. “Our OEM and Tier1 customers can prototype in commercial grade and aftermarket and it is a very cost effective die as there’s not lots of on-chip memory.”
