
The AI architecture in the Imagination E-series GPU

Technology News | By Nick Flaherty

Imagination Technologies is integrating simplified AI processing cores into its latest E-Series graphics processing unit (GPU), rather than adding an AI accelerator or co-processor.

The E-series IP has been developed for chips for smartphones, automotive and AI PCs, where AI can be used alongside graphics rendering.

The distributed processing allows Imagination to take its memory-efficient tile-based rendering and turn it into ‘tile-based compute’ for more efficient memory management and processing. This provides 13 TFLOPS of FP32 (32-bit floating point) performance and over 200 TOPS, 3.6x the TOPS/mm2 of the previous D-series GPUs.
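
Tile-based compute follows the same logic as tile-based rendering: split the work into blocks small enough to sit in on-chip memory so external memory is only touched once per tile. As a generic illustration of that idea, and not Imagination’s own code, the CUDA sketch below stages a blocked matrix multiply through on-chip shared memory; the tile size, names and structure are assumptions for the example.

// A minimal sketch of tile-based compute: each thread block loads a small
// tile of each operand into on-chip shared memory, reuses it TILE times,
// and only then moves on, so external bandwidth is spent once per tile.
// Hypothetical sizes; this is not Imagination's implementation.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16     // tile edge, sized to fit comfortably in on-chip memory
#define N    256    // square matrix dimension for the demo (multiple of TILE)

__global__ void tiledMatMul(const float* A, const float* B, float* C, int n) {
    __shared__ float tA[TILE][TILE];   // on-chip tile of A
    __shared__ float tB[TILE][TILE];   // on-chip tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // march matching tiles of A and B across the shared dimension
    for (int t = 0; t < n / TILE; ++t) {
        tA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        tB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                       // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)         // all reuse happens on-chip
            acc += tA[threadIdx.y][k] * tB[k][threadIdx.x];
        __syncthreads();                       // tile fully consumed before reload
    }
    C[row * n + col] = acc;
}

int main() {
    size_t bytes = N * N * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);
    tiledMatMul<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}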

Imagination had developed a neural processing unit (NPU) co-processor to sit alongside its CPUs, but has recently killed both product lines.

“The main thing we realised previously with a dedicated NPU was that the evolution is in the algorithms, such as transformer models. The more we keep looking at this space, a lot of the enhancements are software enhancements, and that continues to be a parallel compute problem,” Kristof Beets, vice president of product management at Imagination, tells eeNews Europe.

“Looking at the balance points, the CPU is essentially a sequential engine so it struggles with efficiency. As long as you stay on the fast path, the thing it is designed to do, the NPU is stunning, but where it fails is with scalability and flexibility; a lot of NPUs don’t have scheduling capabilities.”

“That’s why we keep coming back to the GPU. The more we keep looking at it, it’s the universal best solution with scalability. So we are taking the tricks that work for the NPU and bringing them into the GPU framework.”

“We realised most of the optimisations for AI are tile-based algorithms, so our tile-based rendering becomes tile-based compute. In the same way our deferred rendering becomes deferred compute, as pruning and sparse support is similar. A lot of those key concepts we can map into the hardware we already have.”
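
Deferred compute carries the same intent as deferred rendering: skip work that cannot contribute to the result, which for AI means weights that pruning has already zeroed. One generic way to express that on a GPU is sketched below, where a per-block sparsity mask lets whole runs of pruned weights be skipped; the mask layout, names and kernel are assumptions for illustration, not the E-Series mechanism.

// A minimal sketch of skipping pruned (all-zero) weight blocks, in the same
// spirit as deferring work until it is known to matter. The mask format is
// a hypothetical per-block flag, not Imagination's sparse-weight encoding.
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 32   // weights are assumed pruned in runs of 32 for the demo

__global__ void sparseMulAdd(const float* w, const float* x, float* y,
                             const unsigned char* blockNonZero, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // one flag per BLOCK-wide run of weights; all-zero runs are skipped
    if (!blockNonZero[i / BLOCK]) return;      // deferred: no math, no writeback
    y[i] += w[i] * x[i];
}

int main() {
    const int n = 1 << 16;
    float *w, *x, *y;
    unsigned char* mask;
    cudaMallocManaged(&w, n * sizeof(float));
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&mask, n / BLOCK);

    // first half of the weights kept, second half fully pruned to zero
    for (int i = 0; i < n; ++i) { w[i] = (i < n / 2) ? 0.5f : 0.0f; x[i] = 1.0f; y[i] = 0.0f; }

    // a run counts as zero only if every weight in it was pruned away
    for (int b = 0; b < n / BLOCK; ++b) {
        mask[b] = 0;
        for (int k = 0; k < BLOCK; ++k)
            if (w[b * BLOCK + k] != 0.0f) { mask[b] = 1; break; }
    }

    sparseMulAdd<<<(n + 255) / 256, 256>>>(w, x, y, mask, n);
    cudaDeviceSynchronize();
    printf("y[0] = %.1f, y[%d] = %.1f\n", y[0], n - 1, y[n - 1]);  // 0.5 and 0.0
    cudaFree(w); cudaFree(x); cudaFree(y); cudaFree(mask);
    return 0;
}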

We are taking the tricks that work for the NPU and bringing them into the GPU framework – Kristof Beets

The architecture scales in performance from 2 to 200 TOPS with 8-bit integer operations.
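
The gap between the 13 TFLOPS FP32 figure and the 200 TOPS figure largely comes down to narrower arithmetic: an 8-bit integer datapath can pack several multiply-accumulates into one operation. As a generic illustration, and not the E-Series instruction set, the sketch below uses the standard CUDA __dp4a intrinsic, which folds four int8 multiply-accumulates into a single instruction (it needs an sm_61 or newer build target).

// A minimal sketch of why 8-bit integer throughput is counted differently
// from FP32: each __dp4a instruction performs four int8 multiply-accumulates.
// __dp4a is a standard CUDA intrinsic; compile with nvcc -arch=sm_61 or newer.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void int8Dot(const int* a, const int* b, int n4, int* out) {
    // a and b each pack four signed 8-bit values per 32-bit word
    int acc = 0;
    for (int i = threadIdx.x; i < n4; i += blockDim.x)
        acc = __dp4a(a[i], b[i], acc);        // 4 MACs in one instruction
    atomicAdd(out, acc);                      // combine the per-thread partials
}

int main() {
    const int n4 = 1024;                      // 1024 words = 4096 int8 elements
    int *a, *b, *out;
    cudaMallocManaged(&a, n4 * sizeof(int));
    cudaMallocManaged(&b, n4 * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n4; ++i) { a[i] = 0x01010101; b[i] = 0x02020202; }  // bytes of 1s and 2s
    *out = 0;

    int8Dot<<<1, 256>>>(a, b, n4, out);
    cudaDeviceSynchronize();
    printf("dot product = %d (expected %d)\n", *out, 4096 * 2);
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}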

“A lot of the focus is continuing on AI and the edge, and edge for us is pretty much anything that isn’t the cloud, datacentre training and inference. Edge is everything else, where you need responsiveness and privacy,” said Beets.

“Fundamentally there are a few use cases where we don’t play as we are still fundamentally a GPU design. If it’s a simple AI with a microcontroller in wearables or IoT, that’s not for us. It’s the graphical, higher-end smartphones all the way from entry level to premium. Automotive is also a solid market for us and continues to grow, with derivatives into robotics.”

E-series GPU architecture

This uses a new architecture Imagination calls burst processors. These use a simplified two-stage processing pipeline that operates on data in local memory without the need for a load/store architecture.

“There is a lot of direct addressing rather than load/store, with direct access to the local memory,” said Beets. “The shorter pipeline helps with the instruction execution and the power efficiency. This doesn’t matter whether it’s graphics, compute or AI.”

This allows for a lot more data reuse. “AI very often builds on the data of the previous instructions,” said Beets.

“The other thing is that you exchange data with neighbours, so by shuffling between all the pipelines we can keep all of the data active in smaller local memory. We are now touching the register store 60% less by exchanging data locally. The only way is deep integration as a pipeline in the universal shader cluster (USC), where we can reuse the SRAM,” he said. There is 500 Kbytes per USC and 128 Kbytes of local memory per USC for streaming.
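
The neighbour-exchange idea has a rough analogue in existing GPU programming, where lanes can pass register values directly to each other instead of round-tripping through a larger store. The CUDA warp-shuffle reduction below shows that generic idiom purely as an analogy for keeping data exchange local; it is not the E-Series burst-processor mechanism.

// A minimal sketch of exchanging data directly between neighbouring lanes:
// __shfl_down_sync moves a register value lane-to-lane, so a warp can sum 32
// values without touching shared memory. An analogy only, not Imagination's design.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];                        // one value per lane
    // each step pulls a neighbour's register from 'offset' lanes away
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0) *out = v;                   // lane 0 holds the warp total
}

int main() {
    float *in, *out;
    cudaMallocManaged(&in, 32 * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < 32; ++i) in[i] = 1.0f;

    warpSum<<<1, 32>>>(in, out);                      // a single warp of 32 lanes
    cudaDeviceSynchronize();
    printf("warp sum = %.1f (expected 32.0)\n", *out);
    cudaFree(in); cudaFree(out);
    return 0;
}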

The distributed AI architecture has an impact on the interconnect that links the processing cores. This determines how many processing blocks can be implemented, keeping the wires at a length that allows single-cycle operation, or the ‘fast pipeline’.

“We will be delivering the library and writing the algorithm to stay within the fast pipeline,” said Beets. “We solve the synthesis challenge with this fast window with the limited distance, and fall back to a secondary, slower mechanism if we need to.”

“We used to support 8 virtual machines, now it’s 16 per core, so it’s 64 in a four-core implementation. We support all the different mechanisms with a hypervisor or sandbox, and we have built that up since the B-Series.”

E-series Implementation

There are eight USCs per core, and a quad-core GPU implementation runs at 1.6 GHz.

The EXT version will be optimised for smartphones, while the EXS will add functional safety for driver monitoring and AI cockpit chips in automotive, and the EXD will be used for desktops running Windows AI models with local AI, privacy and voice interfaces.

“We do recognise automotive customers have custom NPUs, but they all ask for optimisations for co-working, which means we support shared on-chip SRAM but also a mailbox communication system so you gain the flexibility.”

There is a lead customer using the architecture, says Beets, with the first configuration in autumn 2025. “In the past we have delivered mobile and desktop solutions in the second half; automotive takes a little longer as you take the implementation and add the functional safety,” he said.

www.imgtec.com

 
