Davies recently migrated from leading ARM’s graphics and vision business (see Thoughts on Jem Davies leading ARM’s machine learning group).

Davies started our interview by making the point that machine learning computation by way of neural networks is a fundamental shift in computation, and that ARM has been taking its time to make sure its architectural approach is sufficiently general and scalable to have a long life in the market. It has now completed the design of the first hardware implementation, the ML processor, which it will distribute to licensees sometime in the middle of 2018. It is also offering an iteration of its object detection image processor (see ARM launches two machine learning processors). Both come under the Project Trillium banner.

We asked what process node the ML processor core is targeting.

“Machine learning is coming to every market segment we operate in, therefore the IP could be deployed in many different nodes. Nonetheless, the ML processor is aimed at the premium smartphone market, which today implies designs aimed at 7nm,” Davies said. “The 16nm node could be an alternative and then there are also things like smart IP cameras that we are also targeting. And the 28nm node will go on for a long time, so it could turn up there.”

Machine learning will reach from sensors to servers. Source: ARM.

That said, ARM’s engineers have had to make choices about the size of the circuit and how many resources to include. “The first ML processor is a fixed configuration aimed at premium smartphone. Then there will be scalable, configurable IP.” Davies declined to say how large the fixed-configuration core is in a 7nm process but said the intention is that it would easily fit inside an application processor SoC.

When we offered up a size of one square millimeter, Davies said it was of the right order, plus or minus 50 percent. Another clue to the size comes from the power consumption. ARM reckons that the ML processor is capable of more than 4.6 tera operations per second (TOPS) at an efficiency of 3 TOPS per watt.
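As a rough back-of-envelope check on that clue, the two quoted figures can be divided to give an implied power draw. This is only a sketch: it assumes both numbers describe the same operating point, which ARM has not confirmed, and the variable names are ours.

```python
# Figures quoted by ARM in the article.
throughput_tops = 4.6        # "more than" 4.6 tera operations per second
efficiency_tops_per_w = 3.0  # 3 TOPS per watt

# Power = throughput / efficiency.
# This is our estimate, not a figure published by ARM.
implied_power_w = throughput_tops / efficiency_tops_per_w

print(f"Implied power at quoted throughput: about {implied_power_w:.2f} W")
```

On those assumptions the core would draw roughly one and a half watts at the quoted throughput, which is in keeping with a block meant to sit inside a smartphone application processor.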


Although the Object Detection (OD) processor comes partly as a result of ARM’s acquisition of Apical in 2016 (see ARM buys embedded vision firm for $350 million), the design of the ML processor has been done entirely within ARM, Davies said. “We started work on ML as a workload study several years ago, so it’s hard to put a particular start date on it. You start by supporting and understanding the workload and then do trial designs. But we could see that the TOPS-per-watt we were being asked for required a dedicated architecture.”

“But this is part of a sustained effort from ARM. We’ve been contemplating that this is a fundamental shift in computing,” said Laudick. “It’s going to go to servers, phones, digital television, cars and hundreds of other devices.”

So what is the ML processor?

The ML processor has function and layer engines. Source: ARM.

Davies said that ARM is limiting how much it will say about the ML processor right now with plans to reveal more later in 2018. But he was prepared to discuss it in general terms. “The fixed-function engine is a collection of resources able to do matrix multiplication and accumulation efficiently. Just about all neural networks have a lot of matrix arithmetic and on a standard processor you would have a lot of intermediate results to store and load and there’s your power efficiency gone,” Davies explained.

Davies added that ARM’s approach has been to seek out the most useful and most common primitive math operations used in machine learning workloads and package them up efficiently, with appropriate connectivity and memory support within the engine. There is also dedicated support for convolutional neural networks, such as the ability to slide a window of interest across a matrix.
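To make the idea of sliding a window across a matrix concrete, here is a minimal software sketch of the operation in plain Python. It is our illustration of the general technique, not a description of ARM’s hardware; the function name `conv2d_valid` and the sample data are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` across `image` (stride 1, no padding) and
    multiply-accumulate at each window position.

    As in most ML frameworks, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):          # window's top row
        row = []
        for c in range(iw - kw + 1):      # window's left column
            acc = 0                       # running sum for this window
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]
result = conv2d_valid(image, kernel)
print(result)  # [[-4, -4, 2], [-4, -4, 4]]
```

Each output value is a short chain of multiply-accumulate steps over one window. Keeping the running sum close to the arithmetic, rather than storing and reloading intermediate results, is the kind of saving Davies describes.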

8-bit/16-bit integers

ARM has also decided to support only 8-bit and 16-bit fixed-point integer data types. Davies said: “The ML industry is coalescing around smaller fixed-point integers and although some applications use 16-bit floating point the difference in power consumption is enormous. Some research is going below 8-bit but there are power costs there too. One- and two-bit networks tend to have many more nodes. We think 8-bit is a sweet spot for general application and that 16- and 32-bit floating-point machine learning will go into rich devices.”
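For readers unfamiliar with what running a network in 8-bit integers involves, the sketch below shows one common scheme, symmetric linear quantization, in plain Python. ARM has not said which quantization scheme the ML processor or its software uses, so the scheme, the function names and the sample weights are all our assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.01, 1.0]  # hypothetical trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 1, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case error
```

The integers occupy a quarter of the storage of 32-bit floats, and the error introduced is at most half a quantization step per value.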


Davies added that the ARM NN software will cope with the broader range of data types with support kernels for FP16/FP32 to run on other processing engines such as Cortex CPUs and Mali GPUs.

In its launch materials for the ML processor, ARM emphasized that the core is designed to go in equipment at the edge of the network. Running neural networks in “the cloud” means transporting data, which creates problems of latency, security and power efficiency, and therefore also of cost. “We’re optimizing the design specifically for inference. We’re leaving training; they are different workloads. Training will tend to be done in just a few places with massive data sets. But inference must move out to the edge,” said Davies.

Laudick said he saw it as less black and white and more about the size of the training dataset. “If it is learning to identify a cat from 500,000 pictures, we won’t be doing that. But if it is learning to identify a fingerprint from a limited set of instances then we can,” he said.


So is the ML processor the first of many ML instances from ARM, a roadmap of devices?

Davies said: “We’ve aimed this processor core at the top-end of premium smartphone but not quite the top end of edge computing. For example, there will be higher requirements in automotive. The architecture will scale up to server implementations and down to ‘always-on’ devices. Mobile is on the critical path and so is showing up first.”

We also asked if licensees and potential users would be able to control power efficiency and performance through dynamic voltage and frequency scaling (DVFS). “We’ve thrown everything we’ve ever learned about processor design at this; data flow stuff, data compression, DVFS,” said Davies.

For mobile applications the ML processor core will sit in the application SoC as an accelerator and would typically be interfaced via an on-chip bus. But the detail is up to licensees, said Laudick. They can choose to have it closely or loosely coupled to the CPU.

ARM approach to ML from framework to metal. Source: ARM.

Most people will program the ML processor by creating their neural network in their framework of choice – be that TensorFlow, Caffe, Android NNAPI or MXNet – and then using the ARM NN software to take that description and map it to the CPU, GPU and ML processor as appropriate and as possible.

So will the ML processor be compatible with the Khronos NNEF standard?

Khronos, a member-funded organization that grew up in the graphics industry, has released NNEF 1.0 as a provisional specification to enable the transfer of trained networks from the training framework of choice to a wide variety of inference engines. NNEF encapsulates a complete description of the structure, operations and parameters of a trained neural network, independent of the training tools used to produce it and the inference engine used to execute it.

The ML processor doesn’t support NNEF off the bat, but it may do in the future, Davies said. “The framework providers are still championing themselves and ARM NN operates with all the main frameworks. Right now any of those frameworks plus ARM NN provides the means to program. But over time will those framework producers start to support Khronos NNEF? Probably,” said Davies. “And ARM is a customer-driven company.”


We also asked Davies a couple of more general questions about machine learning and the state of the industry.

The first was about optical methods for training and inference. Could the fact that matrix multiplication is highly parallel and relatively homogeneous favour an optical approach to machine learning at much higher energy efficiency? It is notable that two startups have come out of the Massachusetts Institute of Technology specifically to address this (see MIT optical AI spin-off raises funds from Baidu and MIT spin-off raises funds for optical processor). The usual hindrance to such developments is the energy burden of converting from electronics to photonics and then back to electronics.

Davies said: “It is something we ruminate about. But we have R&D engineers and usually when we ask them about these sorts of things they tell us we’ve got at least five more years. As a mass market IP deployment company, we look for what is popular. And electronics in CMOS has a lot of economic advantages in its favour.”

And what about neuromorphics? Are neural networks for training, inference and recognition just a stopping-off point on the way to more neuromorphic computation? And how long will the technology industry stay here?

Davies is emphatic: “Neuromorphic computing is of interest but neural networks have not hit prime time yet and there is a broad consensus on what can be done although the exact detail is still changing with research into more efficient networks. That’s why we focus on the primitive operations. There’s a lot of research into reducing the training workloads, which can make training at the edge more plausible.”

For Jem Davies, machine learning may be the new graphics, but for ARM and the technology industry in general it looks set to be much more important than that.

Related links and articles:

News articles:

ARM launches two machine learning processors

ARM buys embedded vision firm for $350 million

Thoughts on Jem Davies leading ARM’s machine learning group

ARM’s soft launch for machine learning library

ARM acquires ChaoLogix for security reasons


eeNews Europe