In videogames, GPUs perform a huge amount of work in parallel, mostly drawing lots of pixels at the same time. Different pixels are drawn in different ways: water is drawn differently from grass, for example. A videogame developer uses the CPU (central processing unit) to batch similar pixels together and then tells the GPU to draw each batch differently. Game developers call each of these batches of GPU work a “draw call”, and a modern videogame typically issues hundreds of thousands of draw calls per second. Each batch of pixels has a complicated set of dependencies and operates on large amounts of off-chip memory. For example, to draw shadows, a videogame calculates what is called a “shadow map” for each light source and then, once the shadow maps are calculated, the GPU draws pixels with shadows. The game engine organizes all of this work on the CPU, then very rapidly creates many batches of graphics work for the GPU to perform.
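The batching described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `batch_by_material` and `build_draw_calls` are hypothetical, a "draw call" is modelled as a plain tuple, and the only dependency shown is that shadow maps are scheduled before the batches that use them.

```python
from collections import defaultdict

def batch_by_material(pixels):
    """Group (x, y, material) tuples into one batch of coordinates per material."""
    batches = defaultdict(list)
    for x, y, material in pixels:
        batches[material].append((x, y))
    return dict(batches)

def build_draw_calls(pixels, lights):
    """Return an ordered list of draw calls (hypothetical tuple format)."""
    # Shadow maps come first because the shaded batches depend on them.
    calls = [("shadow_map", light) for light in lights]
    # Then one draw call per material, so similar pixels are drawn together.
    for material, coords in batch_by_material(pixels).items():
        calls.append(("draw", material, coords))
    return calls

pixels = [(0, 0, "water"), (1, 0, "grass"), (2, 0, "water")]
draw_calls = build_draw_calls(pixels, lights=["sun"])
for call in draw_calls:
    print(call)
```

In this sketch the CPU-side code decides both the grouping and the order, while the work inside each batch is what a GPU would run in parallel.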
Digital signal processing has also reached very high performance levels, but its operations are usually much more independent of the CPU than videogame graphics are. A communications device such as a smartphone keeps the signal processing on one or more DSPs (digital signal processors) without much CPU interaction. DSPs were originally optimized to do just one job. As communications standards have become more complex, devices have come to use more and more DSPs, each optimized for a different part of the standard, with the interaction between them controlled mostly by the DSPs themselves. These systems are usually designed to operate almost entirely independently of the CPU for the best power efficiency. To maximize performance and minimize power consumption, the DSPs mostly operate as a pipeline, using on-chip memory both to hold data and to pass it between the DSPs that handle different stages of the communications stack.
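As a rough sketch of that pipeline style, the Python below streams samples through a chain of stages one at a time. The stage functions (`remove_offset`, `amplify`, `to_bit`) are hypothetical stand-ins, not parts of any real communications standard, and the lazy `map` chain only loosely models DSPs handing small pieces of data to each other through on-chip memory.

```python
def run_pipeline(samples, stages):
    """Stream samples through each stage in order, one sample at a time."""
    stream = iter(samples)
    for stage in stages:
        # map() is lazy, so no stage builds a full intermediate buffer.
        stream = map(stage, stream)
    return list(stream)

# Hypothetical stages standing in for parts of a communications stack.
def remove_offset(sample):
    return sample - 1

def amplify(sample):
    return sample * 2

def to_bit(sample):
    return 1 if sample > 0 else 0

bits = run_pipeline([3, 0, 2, 1], [remove_offset, amplify, to_bit])
print(bits)
```

Once the stages are wired together, no central controller is involved: each sample simply flows from one stage to the next.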
Which of these two approaches best matches today’s AI algorithms? The most advanced AI algorithms today are deep-learning neural networks, built from layers of tensor operations. Each layer processes the output of the previous layer using a set of “weights” that have been learned from real data during deep-learning “training”. At first glance, the DSP approach seems to be the closest match: put each layer of a neural network on a different DSP, ideally one optimized specifically for the needs of that layer, and let each DSP pass its results to the DSP for the next layer via on-chip memory. But the layers and weights are too large to fit in on-chip memory, and each layer must complete before the next layer can process its output. So the use of memory is much closer to that of a videogame than to that of a communications standard.
If one layer needs to wait for the previous one to complete, then there is no point in putting different layers on different DSPs. The best performance is achieved by executing each layer on all processor cores and then switching those cores very rapidly to run the next layer. So the execution of a neural network is much more like a videogame than signal processing.
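A minimal sketch of this layer-at-a-time execution is shown below, assuming a toy fully connected network with ReLU activations. The names `neuron` and `run_network` are hypothetical, and Python threads stand in for processor cores purely to show the structure; they are not a performance model.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def neuron(row, inputs):
    """One output value: ReLU of the dot product of a weight row and the inputs."""
    return max(0.0, sum(w * x for w, x in zip(row, inputs)))

def run_network(layers, inputs, num_cores=4):
    """Run each layer across all workers, finishing it before starting the next."""
    activations = inputs
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        for weights in layers:
            # All workers share the rows of the same layer. list() waits for
            # every row to finish, acting as the barrier between layers.
            activations = list(pool.map(partial(neuron, inputs=activations), weights))
    return activations

layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # layer 1: 2 inputs -> 3 outputs
    [[1.0, 1.0, 1.0]],                      # layer 2: 3 inputs -> 1 output
]
output = run_network(layers, [2.0, 3.0])
print(output)
```

The key point is the loop order: the outer loop walks the layers, and all of the parallelism sits inside a single layer.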
You might think that a neural network doesn’t need the CPU at all, leaving all execution on a set of DSPs. This is partly true, but a real AI software application uses multiple neural networks as well as other algorithms, such as sensor fusion or classical machine vision. This means that, for the whole application, a large number of different algorithms, operators and layers need to execute on an AI processor, switching rapidly from one algorithm to the next and varying according to decisions made by previous algorithms. For example, in a self-driving car, one AI operation may guess which areas of an image might contain a pedestrian, then forward those guesses to a deep-learning neural network that makes a much more accurate assessment of whether each guess really is a pedestrian. It is the CPU, not the AI processor, that packages up those guesses to run on the actual AI processor. So a whole AI application looks far more like a videogame than it does a communications standard.
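The pedestrian example can be sketched as follows. Everything here is an illustrative assumption: `propose_regions` and `classify_batch` are hypothetical stand-ins for the cheap first stage and the accurate neural network, the thresholds are arbitrary, and the "image" is just a small grid of scores.

```python
def propose_regions(image):
    """Hypothetical cheap first stage: return (x, y, score) guesses."""
    return [(x, y, v) for y, row in enumerate(image)
            for x, v in enumerate(row) if v > 0.5]

def package_candidates(candidates, batch_size):
    """CPU-side step: split the guesses into batches for the AI processor."""
    return [candidates[i:i + batch_size]
            for i in range(0, len(candidates), batch_size)]

def classify_batch(batch):
    """Stand-in for the accurate network that would run on the AI processor."""
    return [score > 0.8 for _, _, score in batch]

def detect_pedestrians(image, batch_size=2):
    candidates = propose_regions(image)
    results = []
    # The number of batches depends on what the first stage found.
    for batch in package_candidates(candidates, batch_size):
        results.extend(classify_batch(batch))
    return [(x, y) for (x, y, _), keep in zip(candidates, results) if keep]

image = [[0.1, 0.9, 0.6],
         [0.7, 0.2, 0.95]]
pedestrians = detect_pedestrians(image)
print(pedestrians)
```

How much work is sent to the second stage is only known after the first stage has run, which is why a general-purpose controller sits between the two.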