From DSPs to artificial intelligence
Huge claims are being made about what AI will deliver to the world, from self-driving cars to virtual assistants. But delivering on these promises is more work than the hype might suggest.
One of the biggest challenges is delivering the huge amount of processing power AI requires, but that challenge is also an opportunity for chip companies to create the AI processors of the future. Get it right and a chip company can build the x86 of the AI future; get it wrong and it builds the Itanium. Software has the largest influence on AI system performance: a chip company that can accelerate the world's AI software with its processor wins big, but one that accelerates only a few AI demos ends up with an interesting curiosity that never succeeds in the real world.
Right now, the incumbent AI processor is made by NVIDIA. Its GPUs (graphics processing units) power almost all the world's AI software today. Currently, the only significant competitors to NVIDIA are closed, proprietary systems that provide both the AI software and the AI processor, such as Google's TPU or Intel's Mobileye. NVIDIA, working with researchers and partners around the world, has grown a huge ecosystem of AI frameworks, tools and components, making it easy for developers to quickly build AI software that is tied to NVIDIA's GPUs.
But what if developers and manufacturers want to transition to non-NVIDIA processors? Can we create an open ecosystem and community that provides wider support and avoids the NVIDIA lock-in? Developers want an ecosystem that enables any AI processor to accelerate their application, and that is widely used by AI researchers and software developers. Such an ecosystem would let developers write AI software, then benchmark and profile it on many AI processors, ultimately ending up with the best AI processor for the job.
My background is not in AI but in videogame development. Strangely, a huge amount of AI technology comes indirectly out of videogame technology: NVIDIA's GPUs, for example, were designed for videogame graphics and are programmed in C++, yet AI uses them in a subtly different way. To create the smoothly animated, lifelike graphics expected of today's videogames, game developers have built up a range of techniques that deliver extremely high levels of processing performance while also leaving room for creativity and originality. It is this combination of high performance and creativity that has made GPUs such a great enabler for AI researchers.
But the videogames industry isn't the only industry that relies on very high-performance processors. The DSPs (digital signal processors) that enable today's super-fast internet and mobile networks also provide incredible levels of processing power. At first glance, AI looks more like digital signal processing than like videogames: both DSPs and AI systems take in data, pass it through a series of mathematical operations and output processed results. So why has the AI industry grown out of videogame graphics and not out of digital signal processing? More of the AI processors being designed today are based on DSPs than on GPUs, yet the AI industry overwhelmingly sticks to GPUs. There is something strange going on in AI that needs to be understood. We must learn from the videogames industry and apply that experience to open up the AI accelerator market.
In videogames, GPUs perform a huge amount of work in parallel: mostly drawing lots of pixels at the same time. Different pixels are drawn in different ways: water is drawn differently to grass, for example. A videogame developer uses the CPU (central processing unit) to batch up similar pixels together and then tells the GPU to draw each batch differently. Game developers call each of these batches of GPU work a "draw call", and there are typically hundreds of thousands of draw calls per second in a modern videogame. Each batch of pixels has a complicated set of dependencies that operate on large amounts of off-chip memory. For example, to draw shadows, a videogame first calculates what is called a "shadow map" for each light source; only once the shadow maps are calculated can the GPU draw pixels with shadows. The game engine executes all this work on the CPU, then very rapidly creates many batches of graphics work for the GPU to perform.
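To make the pattern concrete, here is a minimal C++ sketch of that batching loop. The Drawable type and submitFrame function are hypothetical illustrations, not from any real engine; a real engine uses far richer mesh, material and command-buffer abstractions.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical engine types, purely for illustration.
struct Drawable {
    std::string material;  // how this object should be shaded
    int meshId;            // which geometry to draw
};

// The CPU groups similar work together, then issues one "draw call"
// per batch so the GPU can process each batch in parallel.
void submitFrame(const std::vector<Drawable>& scene) {
    std::unordered_map<std::string, std::vector<int>> batches;
    for (const auto& d : scene)
        batches[d.material].push_back(d.meshId);  // batch similar work

    for (const auto& [material, meshes] : batches) {
        // In a real engine: bind the material's GPU state once, then
        // issue a single draw call covering every mesh in the batch.
        std::printf("draw call: %zu meshes using material '%s'\n",
                    meshes.size(), material.c_str());
    }
}

int main() {
    submitFrame({{"water", 1}, {"grass", 2}, {"water", 3}});
}
```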
Digital signal processing has also reached huge performance levels, but the operations are usually much more independent of the CPU than in videogame graphics. A communications device like a smartphone keeps the signal processing on one or more DSPs without much CPU interaction. DSPs were originally optimized to do just one job but, as communications standards have become more complex, devices use more and more DSPs, each optimized for a different part of the standard, and the DSPs mostly coordinate the work among themselves. These systems are usually designed to operate almost entirely independently of the CPU, because that is the most power-efficient arrangement. To maximize performance and minimize power consumption, the DSPs mostly operate as a pipeline, using on-chip memory to hold data and to pass results between the DSPs handling different stages of the communications stack.
Which of these two approaches best matches today's AI algorithms? The most advanced AI algorithms today are deep-learning neural networks, built from layers of tensor operations, with each layer processing the output of the previous layer using a set of "weights" learned from real data through deep-learning "training". At first glance, the DSP approach seems the closest match: put each layer of a neural network on a different DSP, ideally one optimized specifically for the needs of that layer, and let each DSP pass its results on to the DSP for the next layer via on-chip memory. But the layers and weights are too large to fit in on-chip memory, and each layer must complete before the next layer can process its output. So the use of memory is much closer to that of a videogame than of a communications standard.
If one layer needs to wait for the previous one to complete, there is no point putting different layers on different DSPs. The best performance is achieved by executing each layer across all the processor cores and then switching all the cores very rapidly to the next layer. So the execution of a neural network is much more like a videogame than like signal processing.
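Here is a toy C++ sketch of that execution pattern. The runLayer and runNetwork functions are my own simplified illustrations; a real framework tiles, fuses and parallelizes these loops heavily, but the layer-after-layer structure is the same.

```cpp
#include <cstddef>
#include <vector>

// Toy fully-connected layer: out[j] = sum over i of in[i] * w[i][j].
// It is this inner work that an accelerator spreads across every core at once.
std::vector<float> runLayer(const std::vector<float>& in,
                            const std::vector<std::vector<float>>& w) {
    std::vector<float> out(w[0].size(), 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i)
        for (std::size_t j = 0; j < out.size(); ++j)
            out[j] += in[i] * w[i][j];
    return out;
}

// Layers run strictly one after another: the whole machine works on one
// layer, then switches to the next, exactly the GPU-style pattern above.
std::vector<float> runNetwork(std::vector<float> activations,
                              const std::vector<std::vector<std::vector<float>>>& layers) {
    for (const auto& w : layers)
        activations = runLayer(activations, w);
    return activations;
}
```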
You might think that a neural network doesn't need the CPU at all, leaving all execution on a set of DSPs. This is partly true, but a real AI application uses multiple neural networks alongside other algorithms, such as sensor fusion or classical machine vision. For the whole application, a large number of different algorithms, operators and layers need to execute on an AI processor, switching between each algorithm rapidly and varying according to decisions made by previous algorithms. For example, in a self-driving car, one AI operation may guess which areas of an image might be a pedestrian, forwarding those guesses to a deep-learning neural network to make a much more accurate assessment of whether each guess really is a pedestrian. It is the CPU that packages up those guesses into batches of work for the AI processor. So a whole AI application looks far more like a videogame than like a communications standard.
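A sketch of that control flow is below. The types and stub functions (Box, proposeRegions, scorePedestrians) are hypothetical stand-ins for a real vision pipeline, but they show the shape of the CPU's job: package up the guesses and make one batched dispatch to the accelerator.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical types and stubs, purely to show the control flow.
struct Box { int x, y, w, h; };
using Frame = std::vector<float>;

// Stage 1: a cheap classical-vision pass guesses candidate regions.
std::vector<Box> proposeRegions(const Frame&) {
    return {{10, 20, 64, 128}, {300, 40, 64, 128}};
}

// Stage 2: a stand-in for one batched dispatch to the AI processor,
// which scores every candidate with a deep neural network.
std::vector<float> scorePedestrians(const std::vector<Box>& batch) {
    return std::vector<float>(batch.size(), 0.9f);
}

int main() {
    Frame frame(640 * 480);
    // The CPU packages the guesses into one batch for the accelerator,
    // much as a game engine packages work into a draw call.
    std::vector<Box> candidates = proposeRegions(frame);
    std::vector<float> scores = scorePedestrians(candidates);
    for (std::size_t i = 0; i < scores.size(); ++i)
        std::printf("candidate %zu: pedestrian score %.2f\n", i, scores[i]);
}
```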
But this isn’t the only lesson we should learn from GPUs. There is one fundamental difference between the way DSP companies interact with signal processing software developers and the way GPU companies interact with videogame developers.
The early GPU companies knew they needed to entice videogame developers to support their GPUs, so they did everything they could to hook those developers: 1) they gave away free GPUs to almost any games developer they met, 2) they provided APIs and sample code, and 3) they offered developer support, helping videogame developers to write games. But they did not write videogames themselves. They knew that those videogame developers held the key to their success, to the extent that they even encouraged developers to write games that performed badly on their current GPUs. Why? Because they knew that a game that performed badly on the current generation would run great on the next generation of more powerful GPUs. They understood that their business model was to get as many games as possible on their GPUs, both now and in the future, and that videogame developers were a crucial loss-leader for their real market: games consumers. And they knew that games consumers are always looking for the next big thing.
NVIDIA applied what it learnt from videogame developers to AI, encouraging and supporting the world's AI innovators to adopt its processors.
The DSP market is very different. A few big customers buy most of the high-volume DSPs (for products such as smartphones or mobile base stations), while the rest are sold into low-volume projects for niche markets. This means the DSP business requires a focus on the high-volume customers, and ways of avoiding, or outsourcing, support for low-volume niche developers. In the DSP world, courting software developers is bad for business. The worst software developers for a DSP vendor are the ones whose software runs too slowly on the DSP, who consume extensive engineering time, and who will ultimately never buy the DSP that doesn't hit their performance goals.
What does the AI acceleration market look like? Are AI software developers the key to unlocking big sales for AI accelerators? Or are AI software developers a liability, writing software that runs too slowly on the accelerator and so makes it impossible to ever sell? The evidence seems overwhelming that AI is more like videogames here too. Building AI software is a huge task, so software developers need to start working today on current-generation processors. This is where NVIDIA has been so successful: its GPUs are available to consumers, and any AI software developer can build AI software on them right now. How many of the highly hyped AI accelerators that claim they can compete with NVIDIA actually have a software development kit (devkit) available to AI software developers today? Search for AI processor companies and you will find plenty of VCs throwing money at the technology; search for their AI devkits and you will struggle. NVIDIA and AMD GPUs, Intel (including its Movidius Myriad platform) and Google's own TPU are the only widely available devkits with supported software. Software developers using those devkits will probably create software that demands more performance than those devkits can deliver, but that software will drive the sales of the next generation of AI processors. Note that this is also the thinking behind NVIDIA's launch of its new Jetson Nano devkit.
It's not just about having devkits available, however, but also about the openness of the solutions. NVIDIA provides most of its tools for free on its website. AMD makes its tools open-source, and in many cases Intel does too. Intel and NVIDIA both also sell professional development tools for special cases, but those tools plug into the same open ecosystems. These open ecosystems and the easy availability of tools don't just make life easy for software developers; they also simplify developers' investment decisions.
So how did the GPU companies succeed against all these challenges and hurdles? What they did really well was convince videogame developers that the GPU companies were expanding the market for videogames. That is what AI processor companies need to think about: how do they get AI software developers to invest in their platforms and grow the AI software market?
So what are the lessons from the videogames industry?
The GPU companies never said their GPUs were faster or cheaper than CPUs; they said you could create better graphics. This needs to be the message for new AI accelerators: saying you can do AI faster isn't that interesting, because NVIDIA has already achieved massive AI performance. You can't easily beat NVIDIA on price either, because even though the per-unit cost of an AI GPU may be high, the up-front investment in porting software adds a lot to the total cost. An AI accelerator needs to enable better, more complex, more reliable AI than is possible with a GPU. Higher performance per watt helps you get there, but it's a means to an end, not an end in itself. There is plenty of scope here: look at how Tesla combined various licensed IP blocks with its own IP in its FSD chip, all designed around its existing AI software.
The GPU companies knew that videogame developers were loss-leaders, but attracting them was how they brought in the games consumers. In the same way, AI processor companies should be practically giving away their processors to software developers and academics to get the software ported.
The main AI software frameworks should be tested continuously on these new AI processors, with the results available publicly. Yes, this will expose performance differences and bugs, but that is a good thing: we will only fix the problems if people can see them.
For autonomous vehicles, we really need to work on safety. This is a massive challenge that must be tackled from the bottom up: not just theoretical discussions about the age-old "trolley problem", but the real engineering challenges of delivering safe AI software.
We need to build up an open ecosystem of software to rival NVIDIA's, and the only way of enabling that is with open standards. In graphics, this happened with OpenGL and then Vulkan, a recent graphics standard designed by, and for, game developers. We need to work out how to do the same with AI software developers. This is a challenge, since graphics software is an established field and AI software is not, and there are a lot of conflicting opinions. But if we don't, NVIDIA's CUDA lock-in will control the AI software market for decades.
We need to work at this together as an industry. There are no easy answers, mainly because this is pioneering work, and like any pioneers we need to help each other overcome the huge challenges. If we stay isolated, not speaking to each other, then no new AI challenger will be welcomed by developers or customers.
If you want to be part of this open ecosystem, there are many ways you can get involved. Join the standards bodies: MISRA (for safety), Khronos (for the acceleration standards) or ISO C++ (for accelerating general programming). Become part of open-source projects that operate on open standards for AI, whether that’s SYCL, OpenCL or SPIR-V. It’s easy to port existing AI projects from NVIDIA’s CUDA to the SYCL open standard, or to port graph-compiler projects from NVIDIA’s PTX to the open standard SPIR-V. We can achieve so much more by working together.
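To give a flavour of how approachable the open standards are, here is a minimal SYCL kernel: a vector addition written with SYCL 2020-style unified shared memory. It is a generic sketch rather than code from any particular project, but the equivalent CUDA kernel ports across almost line for line.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;  // selects an available device: GPU, CPU, or other accelerator
    const size_t n = 1024;

    // Unified shared memory, visible to both host and device.
    float* a = sycl::malloc_shared<float>(n, q);
    float* b = sycl::malloc_shared<float>(n, q);
    float* c = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The kernel: standard C++, running on whichever device the queue chose.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    sycl::free(a, q);
    sycl::free(b, q);
    sycl::free(c, q);
}
```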
Andrew Richards is CEO and co-founder of Codeplay Software Ltd. (Edinburgh, Scotland) a pioneer in GPU acceleration.
Richards started his career writing videogames in the days of 8-bit computers, progressing to become a lead games programmer at Eutechnyx, where he wrote best-selling titles such as Pete Sampras Tennis and Total Drivin’. He developed early GPU compiler technology and founded Codeplay in 2002. Codeplay has been producing compilers for games consoles, special-purpose processors and GPUs ever since.
Richards is also the chair of the software working group of the HSA Foundation and former chair of the SYCL for OpenCL sub-group of the Khronos Group.
Codeplay is now working on artificial intelligence and safety for self-driving cars.