32bit RISC-V cores are customisable for TensorFlowLite AI

By Nick Flaherty

Codasip has launched two 32bit RISC-V processor cores that can be optimised for machine learning applications.

The L31 and L11 are the latest cores optimized for customization to run machine learning neural networks in power-constrained applications such as the IoT edge, and are the first to feature TFLite Micro support. This will be followed by support across Codasip’s wider portfolio of RISC-V cores.

Algorithms for AI/ML are computationally intensive, and custom processors are needed to deliver sufficient performance with the limited resources available in such embedded systems. The L31/L11 embedded cores are designed to run Google’s TensorFlow Lite for Microcontrollers and can be tailored using the Codasip Studio tools.

“Licensing the CodAL description of a RISC-V core gives Codasip customers a full architecture license enabling both the ISA and microarchitecture to be customized. The new L11/31 cores make it even easier to add features our customers were asking for, such as edge AI, into the smallest, lowest power embedded processor designs,” said Zdeněk Přikryl, CTO of Codasip.

To optimize vector memory loads and sequences of convolutional multiplication and accumulation, two custom instructions have been added: mac3, which fuses multiplication and addition into a single clock cycle, and lb.pi, which increments the address immediately after the load. The idea behind both is to reduce the number of clock cycles spent on frequently repeated instruction sequences. Codasip’s CodAL language provides an efficient way to describe both the assembly encoding and the programmer’s-view functionality of each instruction.
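The kind of inner loop these two instructions target can be sketched in C as below. This is an illustrative example, not Codasip code: the function name and shapes are assumptions, and the comments only indicate where the custom instructions would apply.

```c
#include <stdint.h>

/* Sketch of an int8 convolution inner loop of the kind mac3 and lb.pi
 * accelerate. In the base ISA each iteration needs separate load,
 * address-increment, multiply and add instructions; mac3 fuses the
 * multiply and accumulate into one clock cycle, and lb.pi folds the
 * pointer increment into the byte load. */
int32_t conv_mac(const int8_t *input, const int8_t *weights, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        int8_t a = *input++;            /* lb.pi: load byte, post-increment */
        int8_t b = *weights++;          /* lb.pi again for the weight stream */
        acc += (int32_t)a * (int32_t)b; /* mac3: fused multiply-accumulate */
    }
    return acc;
}
```

On a core without the custom instructions the same C compiles to the longer load/add/mul/add sequence; the source does not change, only the cycles spent per iteration.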

AI and ML applications are not well suited to off-the-shelf processors: the data types, quantization and performance needs differ significantly from application to application. Developers using the Codasip Studio tools can customize the processor for their specific system, software and application requirements. At the same time, embedded devices in low-power IoT applications are extremely resource-constrained, with limited memory and a limited instruction set.

With the L31 CPU, Codasip Studio’s built-in profiler provides detailed PPA (Performance-Power-Area) estimates, source code coverage and individual instruction usage. This allows new instructions to be quickly tested and evaluated. The profiler can also estimate the ASIP’s power and area, providing a breakdown for each hardware block in the design. This enables the designer to choose between standard variants of the L31 core and to assess the benefits of quantization using TFLite-Micro.

The base L31 configuration, with a 3-stage pipeline, 32 registers and parallel multipliers but no floating-point hardware, is area efficient but relatively slow, since FP operations have to be emulated in software. Adding a hardware floating-point unit to the L31 solves this, reducing total runtime by almost 85% and power consumption by 42%, at the cost of a 207% increase in silicon area.

TFLite-Micro supports the quantization of neural network parameters and the input data. An int8-quantized model running on the standard integer L31 core achieves almost the performance of a floating-point core, reducing the runtime by more than 80% and further improving power consumption by 77% from the initial level, without the need to increase core complexity and silicon area.

Switching from the floating-point version to int16 or int8 inevitably reduces accuracy, and developers need to check that it has not degraded too much. Both the quantized (int8) and the initial floating-point models were verified on a test set of 10,000 images, with an accuracy of 98.91% for 32bit floating point (fp32) and 98.89% for int8.

Identifying “hot spots” for the standard L31 core running the TFLite image classification task provides hints as to which instructions could be merged or optimized to boost a specific task.

Adding just two new instructions that optimize arithmetic and array loads from memory improves total runtime by more than 10% and reduces power consumption by more than 8% compared to the quantized model running on the standard core. The area increased by just 0.8%, which seems a reasonable customization cost. Using SIMD instructions might give a further performance boost, but would likely increase area significantly.
