World’s fastest deep learning inference software for ARM Cortex-M

By Nick Flaherty

The inference software developed by London-based AI startup Plumerai is an essential component that directs resource management in a similar way to an operating system. With an inference time of 77ms, it has 40 percent lower latency and requires 49 percent less RAM (80Kbits) than TensorFlow Lite for Microcontrollers with ARM’s CMSIS-NN kernels while retaining the same accuracy. This makes it the fastest currently available, says the company.

The inference software builds on TensorFlow Lite for Microcontrollers but does not use the TensorFlow or ARM kernels for the most performance-critical layers. Instead, custom kernel code was developed and optimized for lowest latency and memory usage. This includes optimized code for regular convolutions, depthwise convolutions, fully-connected layers, various pooling layers and more.


To be faster than the heavily optimized ARM Cortex-M specific CMSIS-NN kernels, the developers had to go deep inside the inner loops and also rethink the higher-level algorithms. This includes optimizations such as hand-written assembly blocks, improved register usage, pre-processing of weights and input activations, and template-based loop unrolling.

As well as these generic per-layer-type optimizations, which provided faster operation, the developers also performed specific optimizations for each layer in a neural network. For instance, rather than only optimizing convolutions in general, the inference software makes specific improvements based on the actual values of layer parameters such as kernel sizes, strides and padding.

As the software can run many different types of machine learning models, these optimizations are made together with the compiler. This is achieved by generating code in an automated pre-processing step using the neural network as input. The tool then guides the compiler to do all the necessary constant propagation, function inlining and loop unrolling to achieve the lowest possible latency.
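The article does not show what the generated code looks like, so the following is only a minimal sketch of the general idea, not Plumerai's tool. It assumes a hypothetical `ConvLayer` description and a hypothetical `emit_kernel` generator that bakes the kernel size, stride and output length of one 1-D convolution into C source as literal constants, with the inner loop over kernel taps already unrolled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of one 1-D convolution layer (illustration only)."""
    name: str
    kernel_size: int
    stride: int
    in_len: int


def output_length(layer: ConvLayer) -> int:
    # "Valid" padding: the kernel must fit entirely inside the input.
    return (layer.in_len - layer.kernel_size) // layer.stride + 1


def emit_kernel(layer: ConvLayer) -> str:
    """Emit C source for one layer with its parameters baked in as constants."""
    out_len = output_length(layer)
    # Unroll the inner loop over kernel taps: one multiply-accumulate per tap.
    taps = "\n".join(
        f"        acc += in[o * {layer.stride} + {k}] * w[{k}];"
        for k in range(layer.kernel_size)
    )
    return (
        f"static inline void {layer.name}(const int8_t *in, const int8_t *w, int32_t *out) {{\n"
        f"    for (int o = 0; o < {out_len}; ++o) {{\n"
        f"        int32_t acc = 0;\n"
        f"{taps}\n"
        f"        out[o] = acc;\n"
        f"    }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    print(emit_kernel(ConvLayer(name="conv0", kernel_size=3, stride=2, in_len=9)))
```

Because every bound and index multiplier in the emitted function is a literal, a C compiler has nothing left to discover at run time, which is what makes constant propagation and inlining straightforward in a scheme of this kind.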

Memory usage is an important constraint on embedded devices; however fast or slow the software is, it has to fit in memory to run at all. TensorFlow Lite for Microcontrollers already comes with a memory planner that ensures a tensor only takes up space while there is a layer using it. Plumerai's software reduces memory usage further with a smart offline memory planner that analyzes the memory access patterns of each layer of the network. Depending on properties such as filter size, the memory planner allows the input and output of a layer to partially or even completely overlap, effectively computing the layer in-place.
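To see why overlapping input and output can be safe, consider a toy 1-D convolution with valid padding, processed from the first output to the last. The sketch below is an illustration under those assumptions, not the planner described in the article; `conv1d`, `conv1d_in_place` and `peak_elements` are hypothetical names. Each output is computed fully before it is stored, and it is stored in a slot that no later output reads.

```python
def conv1d(src, weights, stride=1):
    """Reference 1-D convolution (valid padding) into a fresh output list."""
    k = len(weights)
    n_out = (len(src) - k) // stride + 1
    return [
        sum(src[o * stride + t] * weights[t] for t in range(k))
        for o in range(n_out)
    ]


def conv1d_in_place(arena, weights, stride=1):
    """Same convolution, but the output overwrites the start of the input."""
    k = len(weights)
    n_out = (len(arena) - k) // stride + 1
    for o in range(n_out):
        # Read every input this output needs before writing anything.
        acc = sum(arena[o * stride + t] * weights[t] for t in range(k))
        # Slot o is safe to overwrite: later outputs only read
        # indices >= (o + 1) * stride, which is always greater than o.
        arena[o] = acc
    return n_out


def peak_elements(n_in, n_out, overlap):
    """Arena size needed when `overlap` elements are shared by input and output."""
    return n_in + n_out - overlap


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6]
    weights = [1, 0, -1]
    arena = list(data)
    n_out = conv1d_in_place(arena, weights)
    print(arena[:n_out] == conv1d(data, weights))
    print(peak_elements(len(data), n_out, 0), peak_elements(len(data), n_out, n_out))
```

In this toy case the output overlaps the input completely, so the layer needs only as many elements as its input. Layers with other access patterns, such as padded convolutions, may allow only a partial overlap, which is why a planner has to analyze each layer separately.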

As well as ARM Cortex-M, the inference software also works with ARM Cortex-A and RISC-V architectures. Binarized Neural Networks (BNNs) are deep learning models that use only a single bit to encode each weight and activation. Plumerai is building improved deep learning model architectures and training algorithms for BNNs, along with a custom IP-core for customers with FPGAs.
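The single-bit encoding is what makes BNN arithmetic cheap. A common formulation, shown here as a sketch and not necessarily the one Plumerai uses, maps +1 to a set bit and -1 to a cleared bit, so that a dot product of n values becomes an XOR followed by a bit count. The function names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an integer (bit i set means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    # XOR marks the positions where activation and weight disagree.
    mismatches = bin((a_bits ^ w_bits) & mask).count("1")
    # Each match contributes +1 and each mismatch -1.
    return n - 2 * mismatches


if __name__ == "__main__":
    activations = [1, -1, 1, 1]
    weights = [1, 1, -1, 1]
    packed = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
    plain = sum(a * w for a, w in zip(activations, weights))
    print(packed == plain)
```

On real hardware the XOR and bit count operate on whole machine words, so one pair of instructions can stand in for many multiply-accumulates.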

The company is also working with UK microcontroller maker XMOS on developing the BNN technology.

This allows the inference engine to process more frames per second, save more energy, run larger and more accurate AI models and deploy on cheaper hardware.

eeNews Europe