The inference software developed by London-based AI startup Plumerai is an essential component that directs resource management in a similar way to an operating system. With an inference time of 77ms, it has 40 percent lower latency and requires 49 percent less RAM (80Kbits) than TensorFlow Lite for Microcontrollers with ARM’s CMSIS-NN kernels, while retaining the same accuracy. This makes it the fastest inference software currently available, says the company.
The inference software builds on TensorFlow Lite for Microcontrollers but does not use the TensorFlow or ARM kernels for the most performance-critical layers. Instead, custom kernel code was developed and optimized for lowest latency and memory usage. This includes optimized code for regular convolutions, depthwise convolutions, fully-connected layers, various pooling layers and more.
To be faster than the heavily optimized, ARM Cortex-M specific CMSIS-NN kernels, the developers had to go deep inside the inner loops and also rethink the higher-level algorithms. This includes optimizations such as hand-written assembly blocks, improved register usage, pre-processing of weights and input activations, and template-based loop unrolling.
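As an illustration of what template-based loop unrolling can look like, the following is a minimal C++ sketch and not Plumerai's code. The names `Unrolled` and `dot_loop` are hypothetical. The idea is that when the trip count is a template argument, the recursion is resolved at compile time and an optimizing compiler can emit straight-line multiply-accumulate code with no loop counter or branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: dot product of N int8 values, with N fixed at
// compile time. Each instantiation handles one element and delegates the
// remaining N - 1 elements to the next instantiation.
template <std::size_t N>
struct Unrolled {
    static inline std::int32_t dot(const std::int8_t* w, const std::int8_t* x) {
        return static_cast<std::int32_t>(w[0]) * static_cast<std::int32_t>(x[0]) +
               Unrolled<N - 1>::dot(w + 1, x + 1);
    }
};

// Base case that ends the compile-time recursion.
template <>
struct Unrolled<0> {
    static inline std::int32_t dot(const std::int8_t*, const std::int8_t*) {
        return 0;
    }
};

// Runtime-length reference version, for comparison only.
inline std::int32_t dot_loop(const std::int8_t* w, const std::int8_t* x,
                             std::size_t n) {
    assert(w != nullptr && x != nullptr);
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<std::int32_t>(w[i]) * static_cast<std::int32_t>(x[i]);
    }
    return acc;
}
```

Calling `Unrolled<4>::dot(w, x)` gives the same result as `dot_loop(w, x, 4)`. Whether the unrolled form is actually faster depends on the compiler, the optimization flags and the target core, so it should be measured rather than assumed.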
While these generic per-layer-type optimizations provided faster operation, the developers also performed specific optimizations for each layer in a neural network. For instance, rather than only optimizing convolutions in general, the inference software makes specific improvements based on the actual values of layer parameters such as kernel sizes, strides and padding.
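The difference between a general kernel and one specialized on a layer's actual parameters can be sketched in C++ as follows. This is an illustrative assumption about the technique, not the company's implementation; the function names `conv1d_generic` and `conv1d_fixed` are hypothetical, and a one-dimensional convolution is used only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// General version: kernel size, stride and padding are runtime values, so
// the compiler must keep every loop bound and bounds check generic.
inline void conv1d_generic(const std::int32_t* in, std::size_t in_len,
                           const std::int32_t* kernel, std::size_t k,
                           std::size_t stride, std::size_t pad,
                           std::int32_t* out) {
    assert(stride > 0 && in_len + 2 * pad >= k);
    const std::size_t out_len = (in_len + 2 * pad - k) / stride + 1;
    for (std::size_t o = 0; o < out_len; ++o) {
        std::int32_t acc = 0;
        for (std::size_t j = 0; j < k; ++j) {
            // Index into the unpadded input; positions in the zero padding
            // contribute nothing and are skipped.
            const std::ptrdiff_t pos =
                static_cast<std::ptrdiff_t>(o * stride + j) -
                static_cast<std::ptrdiff_t>(pad);
            if (pos >= 0 && pos < static_cast<std::ptrdiff_t>(in_len)) {
                acc += in[pos] * kernel[j];
            }
        }
        out[o] = acc;
    }
}

// Specialized version: the same arithmetic, but every layer parameter is a
// template argument, so loop bounds are constants the compiler can unroll
// and fold.
template <std::size_t InLen, std::size_t K, std::size_t Stride, std::size_t Pad>
inline void conv1d_fixed(const std::int32_t* in, const std::int32_t* kernel,
                         std::int32_t* out) {
    static_assert(Stride > 0 && InLen + 2 * Pad >= K, "invalid layer shape");
    constexpr std::size_t OutLen = (InLen + 2 * Pad - K) / Stride + 1;
    for (std::size_t o = 0; o < OutLen; ++o) {
        std::int32_t acc = 0;
        for (std::size_t j = 0; j < K; ++j) {
            const std::ptrdiff_t pos =
                static_cast<std::ptrdiff_t>(o * Stride + j) -
                static_cast<std::ptrdiff_t>(Pad);
            if (pos >= 0 && pos < static_cast<std::ptrdiff_t>(InLen)) {
                acc += in[pos] * kernel[j];
            }
        }
        out[o] = acc;
    }
}
```

A layer with input length 5, kernel size 3, stride 2 and padding 1 would be invoked as `conv1d_fixed<5, 3, 2, 1>(in, kernel, out)`, producing the same values as the generic call with those arguments. The trade-off is code size: each distinct parameter combination produces its own instantiation.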
As the software can run many different types of machine learning models, these optimizations are made together with the compiler. This is achieved by generating code in an automated pre-processing step using the neural network as input. The tool then guides the compiler to do all the necessary constant propagation, function inlining and loop unrolling to achieve the lowest possible latency.
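One way to picture such a pre-processing step is a small generator that reads the layer parameters of a network and emits source code in which those parameters appear as compile-time constants. The sketch below is a guess at the general shape of the approach, not a description of Plumerai's tool. `LayerSpec` and `emit_inference_source` are hypothetical, as are the `run_model` and `Tensor` names that appear in the emitted text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one layer, as it might be read from a model
// file during the pre-processing step.
struct LayerSpec {
    std::string op;    // name of the kernel template to call, e.g. "conv1d"
    unsigned kernel;   // kernel size
    unsigned stride;   // stride
    unsigned pad;      // padding
};

// Emit C++ source with one call per layer. Each layer's parameters become
// template arguments, which lets the compiler propagate constants, inline
// the kernel and unroll its loops when the generated file is built.
inline std::string emit_inference_source(const std::vector<LayerSpec>& layers) {
    std::string src = "void run_model(Tensor& t) {\n";
    for (std::size_t i = 0; i < layers.size(); ++i) {
        const LayerSpec& l = layers[i];
        assert(l.stride > 0);
        src += "    " + l.op + "<" + std::to_string(l.kernel) + ", " +
               std::to_string(l.stride) + ", " + std::to_string(l.pad) +
               ">(t);  // layer " + std::to_string(i) + "\n";
    }
    src += "}\n";
    return src;
}
```

The returned string would be written to a file and compiled together with the kernel templates, so each network ends up with code specialized to its own layers.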
Memory usage is an important constraint on embedded devices; however fast or slow the