World’s fastest deep learning inference software for ARM Cortex-M

October 15, 2021 // By Nick Flaherty
UK startup Plumerai has developed inference software for ARM Cortex-M microcontrollers that it has benchmarked this week as the fastest available for Binarized Neural Networks and for 8-bit deep learning models.

The inference software developed by London-based AI startup Plumerai is an essential component that directs resource management in a similar way to an operating system. With an inference time of 77ms, it has 40 percent lower latency and requires 49 percent less RAM (80Kbits) than TensorFlow Lite for Microcontrollers with ARM’s CMSIS-NN kernels, while retaining the same accuracy. This makes it the fastest currently available, says the company.

The inference software builds on TensorFlow Lite for Microcontrollers but does not use the TensorFlow or ARM kernels for the most performance-critical layers. Instead, custom kernel code was developed and optimized for lowest latency and memory usage. This includes optimized code for regular convolutions, depthwise convolutions, fully-connected layers, various pooling layers and more.
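
To make the approach concrete, the sketch below shows the kind of layer-specific kernel that could replace a generic implementation for one performance-critical case: a 3x3, stride-1, 8-bit depthwise convolution. The function name, data layout and shift-based requantization are illustrative assumptions, not Plumerai's actual code.

```cpp
// Illustrative only: a minimal 8-bit depthwise convolution kernel of the kind
// that might replace a generic TFLM/CMSIS-NN kernel for one critical layer.
// Layout (HWC), quantization handling and names are assumptions.
#include <algorithm>
#include <cstdint>
#include <cstdio>

// 3x3 depthwise convolution, stride 1, no padding, int32 accumulators,
// int8 output after a crude shift-based requantization.
void DepthwiseConv3x3(const int8_t* input, int in_h, int in_w, int channels,
                      const int8_t* weights, const int32_t* bias,
                      int out_shift, int8_t* output) {
  const int out_h = in_h - 2;
  const int out_w = in_w - 2;
  for (int oy = 0; oy < out_h; ++oy) {
    for (int ox = 0; ox < out_w; ++ox) {
      for (int c = 0; c < channels; ++c) {
        int32_t acc = bias[c];
        for (int ky = 0; ky < 3; ++ky) {
          for (int kx = 0; kx < 3; ++kx) {
            const int8_t in_val =
                input[((oy + ky) * in_w + (ox + kx)) * channels + c];
            const int8_t w_val = weights[(ky * 3 + kx) * channels + c];
            acc += static_cast<int32_t>(in_val) * w_val;
          }
        }
        acc >>= out_shift;  // simplified requantization for illustration
        acc = std::clamp(acc, int32_t{-128}, int32_t{127});
        output[(oy * out_w + ox) * channels + c] = static_cast<int8_t>(acc);
      }
    }
  }
}

int main() {
  // Tiny single-channel example: one 3x3 patch, one output value.
  const int8_t input[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
  const int8_t weights[9] = {1, 0, -1, 1, 0, -1, 1, 0, -1};
  const int32_t bias[1] = {0};
  int8_t output[1] = {0};
  DepthwiseConv3x3(input, 3, 3, 1, weights, bias, /*out_shift=*/0, output);
  std::printf("output[0] = %d\n", output[0]);  // expect -6
}
```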

To be faster than the heavily optimized, ARM Cortex-M-specific CMSIS-NN kernels, the developers had to go deep inside the inner loops and also rethink the higher-level algorithms. This includes optimizations such as hand-written assembly blocks, improved register usage, pre-processing of weights and input activations, and template-based loop unrolling.
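
Template-based loop unrolling is the easiest of these to illustrate. In the rough C++ sketch below (an assumption about the general technique, not Plumerai's source), the kernel dimensions are template parameters, so the compiler can fully unroll the inner loops and keep the accumulator in registers.

```cpp
// Sketch of template-based loop unrolling: kernel extents are compile-time
// constants, so both loops unroll completely and all index math folds away.
#include <cstdint>

template <int KernelH, int KernelW>
int32_t DotProductPatch(const int8_t* patch, const int8_t* weights,
                        int row_stride) {
  int32_t acc = 0;
  // Loop bounds are template parameters, so the compiler unrolls both loops.
  for (int ky = 0; ky < KernelH; ++ky) {
    for (int kx = 0; kx < KernelW; ++kx) {
      acc += static_cast<int32_t>(patch[ky * row_stride + kx]) *
             weights[ky * KernelW + kx];
    }
  }
  return acc;
}

// Explicit instantiations for the kernel shapes that actually occur in a model.
template int32_t DotProductPatch<3, 3>(const int8_t*, const int8_t*, int);
template int32_t DotProductPatch<1, 1>(const int8_t*, const int8_t*, int);
```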

Beyond these generic per-layer-type optimizations, the developers also performed specific optimizations for each layer in a neural network. For instance, rather than only optimizing convolutions in general, the inference software makes specific improvements based on the actual values of layer parameters such as kernel sizes, strides and padding.
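
One way such per-layer specialization can be organized is sketched below: each layer's concrete parameters select a fully specialized kernel once, ahead of the hot path, with a generic version as fallback. The dispatcher and variant names are hypothetical illustrations, not the product's actual structure.

```cpp
// Hedged sketch of per-layer kernel selection based on a layer's exact
// parameters (kernel size, stride, padding).
#include <cstdint>

struct ConvParams {
  int kernel_h, kernel_w, stride, padding;
};

using ConvKernelFn = void (*)(const int8_t* input, const int8_t* weights,
                              int8_t* output);

// Fully specialized variants (bodies elided; each would contain the unrolled,
// parameter-specific inner loops described above).
void Conv3x3Stride1NoPad(const int8_t*, const int8_t*, int8_t*) {}
void Conv1x1Stride2NoPad(const int8_t*, const int8_t*, int8_t*) {}
void ConvGeneric(const int8_t*, const int8_t*, int8_t*) {}

// Resolved once per layer ahead of time, so the inference loop never branches
// on layer parameters at run time.
ConvKernelFn SelectConvKernel(const ConvParams& p) {
  if (p.kernel_h == 3 && p.kernel_w == 3 && p.stride == 1 && p.padding == 0) {
    return Conv3x3Stride1NoPad;
  }
  if (p.kernel_h == 1 && p.kernel_w == 1 && p.stride == 2 && p.padding == 0) {
    return Conv1x1Stride2NoPad;
  }
  return ConvGeneric;  // fallback for shapes without a dedicated variant
}
```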

As the software can run many different types of machine learning model, these optimizations are made together with the compiler. This is achieved by generating code in an automated pre-processing step using the neural network as input. The tool then guides the compiler to perform all the necessary constant propagation, function inlining and loop unrolling to achieve the lowest possible latency.
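
The sketch below is a guess at the general shape such generated code could take: the model's shapes and parameters are emitted as compile-time constants and the network becomes a straight-line sequence of specialized calls, leaving the compiler free to constant-propagate, inline and unroll. A simple ReLU layer stands in for the real kernels; all names and values are placeholders.

```cpp
// Hypothetical output of an automated code-generation step that takes the
// neural network as input and bakes every parameter in as a constant.
#include <cstdint>

// Constants emitted from the model file.
constexpr int kLayer0H = 48, kLayer0W = 48, kLayer0C = 8;

// Generic templated kernel: with every extent known at compile time, the loop
// can be fully unrolled or vectorized and all index math folds to constants.
// A real convolution kernel would follow the same pattern with more code.
template <int H, int W, int C>
inline void ReluLayer(const int8_t* input, int8_t* output) {
  for (int i = 0; i < H * W * C; ++i) {
    output[i] = input[i] > 0 ? input[i] : 0;
  }
}

// Generated top-level inference function: a straight-line sequence of fully
// specialized layer calls, with no run-time interpretation of the model.
void RunModel(const int8_t* input, int8_t* output) {
  ReluLayer<kLayer0H, kLayer0W, kLayer0C>(input, output);
  // ...further generated layer calls would follow here...
}
```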

Memory usage is an important constraint on embedded devices: however fast or slow the inference is, the model's weights and intermediate activations still have to fit within the device's limited RAM and flash.
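
A common way to keep activation memory predictable on a microcontroller is to reserve a single fixed arena and place intermediate buffers at offsets planned offline, so buffers with non-overlapping lifetimes share the same bytes. The sketch below illustrates that general idea only; the sizes and offsets are made-up placeholders, not Plumerai's figures.

```cpp
// Illustrative static memory plan: one compile-time arena, no heap allocation
// at run time. All sizes and offsets are placeholder values.
#include <cstddef>
#include <cstdint>

// Total scratch memory for intermediate activations, sized offline for the
// peak of the model's memory profile.
constexpr std::size_t kArenaSize = 64 * 1024;
alignas(16) static uint8_t g_tensor_arena[kArenaSize];

// Offsets chosen by an offline planner; buffers whose lifetimes do not overlap
// are deliberately placed at the same offset to reuse memory.
constexpr std::size_t kLayer0OutputOffset = 0;
constexpr std::size_t kLayer1OutputOffset = 32 * 1024;

inline int8_t* Layer0Output() {
  return reinterpret_cast<int8_t*>(g_tensor_arena + kLayer0OutputOffset);
}
inline int8_t* Layer1Output() {
  return reinterpret_cast<int8_t*>(g_tensor_arena + kLayer1OutputOffset);
}
```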

