Configurable VLIW core boosts energy efficiency for long battery life

Technology News
By Nick Flaherty

“We are finding that computation in cloud data centres is now constrained by cooling requirements, and Google and Amazon are offloading computation to FPGAs to reduce power consumption so they can pack in more compute power,” said Bryan Donoghue, digital system lead at Cambridge Consultants, which is part of the Altran group.

“We have had a number of projects exploring flexible systems with power efficiency approaching that of dedicated hardware,” said Donoghue. “You can build a 16×16 multiplier in 5000 gates, a TeakLite-II DSP takes 100K gates, and Arm’s Cortex-R7 core takes 1.3M gates. So I want to add more flexibility but not go all the way to a CPU.”

So the team has developed a very long instruction word (VLIW) approach, coupling dedicated modules such as a MAC, ALU or FFT block connected via a programmable multiplexer.

The key is that the VLIW instruction can be 100 or 200 bits long, with dozens of mini opcodes that control the modules, the memory interface and the routing between the multipliers, providing a dynamic datapath on a cycle-by-cycle basis.
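The idea of packing many small control fields into one wide instruction word can be sketched in a few lines of Python. The field names and widths below are invented for illustration and are not the actual Cambridge Consultants encoding:

```python
# Hypothetical VLIW control-word layout: each field is a mini opcode
# driving one module, the memory interface or the multiplexer routing.
FIELDS = [
    ("mac_op", 4),        # controls the MAC module
    ("alu_op", 4),        # controls the ALU module
    ("fft_op", 3),        # controls the FFT module
    ("mem_rd_addr", 12),  # memory interface: read address
    ("mem_wr_addr", 12),  # memory interface: write address
    ("mux_route", 8),     # programmable multiplexer routing
]

def encode(word_fields: dict) -> int:
    """Pack the per-module mini opcodes into one wide instruction word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = word_fields.get(name, 0)
        assert value < (1 << width), f"{name} overflows {width} bits"
        word |= value << shift
        shift += width
    return word

total_bits = sum(w for _, w in FIELDS)
print(total_bits)  # 43 bits in this toy layout; the real core uses 100-200
```

Because every module sees its own field each cycle, changing the word changes the datapath configuration cycle by cycle, with no instruction decoder in the conventional sense.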

This gives a lot of flexibility in the development stage. “When you are coding the algorithm and need more modules, you can add those in and experiment,” he said.

The end result is a two-stage pipeline with very low control and datapath overhead. The design philosophy is very different from that of a CPU: rather than running as fast as possible, a 40nm design runs at 100MHz so that the core works at the same speed as the memory. This avoids wait states and lets the team use a low-power library and optimise it even further.

At 100MHz this can give 1GMAC/s of performance, and in many systems the algorithms parallelise well, so multiple cores can be used to reach tens of GMAC/s with tens of milliwatts of power consumption.
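The quoted figures imply a fixed number of MAC operations completing per clock cycle; a quick back-of-envelope check (the core count below is an arbitrary example, not a figure from the article):

```python
# 1 GMAC/s at a 100 MHz clock implies 10 MAC operations per cycle.
clock_hz = 100e6
gmacs = 1e9
macs_per_cycle = gmacs / clock_hz
print(macs_per_cycle)  # 10.0

# With near-linear parallelisation, as the article claims for many
# algorithms, multiple cores scale the throughput directly.
cores = 32
print(cores * gmacs / 1e9)  # 32.0 GMAC/s, i.e. tens of GMAC/s
```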

As compilers aren’t very efficient with VLIW code, the core is programmed in assembler, and the team has also built a set of tools to support the development.

“We have a toolset that helps us build these cores, and we have a big library of the modules to mix and match, and that squirts out the Verilog. We code in assembler rather than C or CUDA – but the competition is Verilog and it’s a lot easier to program in assembler.”

A graphical simulator called Sapphyre can be configured with the chosen modules, allowing developers to choose the datapath. It is bit- and cycle-accurate, which is important for reaching the required performance, and it also produces cycle-by-cycle vectors that are then used as test vectors for the Verilog.
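This flow can be sketched with a toy model: a cycle-accurate simulation records its inputs and expected outputs every clock cycle, and the recording doubles as a stimulus file for a Verilog testbench. The module modelled here (a single MAC) and the vector format are illustrative assumptions, not Sapphyre's actual output format:

```python
# Minimal sketch: a bit- and cycle-accurate model of a MAC datapath
# that emits one test vector per clock cycle.
def simulate_mac(pairs):
    """Accumulate a*b each cycle and record inputs plus expected output."""
    acc, vectors = 0, []
    for cycle, (a, b) in enumerate(pairs):
        acc += a * b
        # one line per clock cycle: cycle, operands, expected accumulator
        vectors.append(f"{cycle} {a:04x} {b:04x} {acc:08x}")
    return vectors

for line in simulate_mac([(2, 3), (4, 5)]):
    print(line)
# 0 0002 0003 00000006
# 1 0004 0005 0000001a
```

A Verilog testbench can then replay the operand columns as stimulus and compare the device-under-test output against the expected column, cycle for cycle.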

“We also have a real time debug monitor embedded in the silicon via the multiplexer – that helps developing code on the actual silicon and it provides visibility of all the data in the system,” he said. “You can take that data and feed it back into the simulator for a replay and that gives great visibility.”

A typical design using the core in 40nm runs at 96MHz and uses 116K gates. This provides 384MMAC/s at 8mW peak and 1mW average power in 0.25mm2 of silicon. This can be used to replace a CPU or DSP core in an ASIC to reduce power and area and boost performance.
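The figures above are internally consistent, as a quick check shows:

```python
# Sanity-check of the reported 40nm figures: 384 MMAC/s at 96 MHz
# implies 4 MAC operations per clock cycle, and 8 mW peak power
# corresponds to 48 GMAC/s per watt of energy efficiency.
mmacs, clock_mhz, peak_mw = 384, 96, 8

macs_per_cycle = mmacs / clock_mhz  # MMAC/s over MHz = MACs per cycle
gmacs_per_watt = mmacs / peak_mw    # (MMAC/s)/mW equals (GMAC/s)/W
print(macs_per_cycle, gmacs_per_watt)  # 4.0 48.0
```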

The core can also be used for machine learning in AI systems, he says. “The modules change to array-based processing modules for CNN layers but the same architecture works well and we are doing work in that space,” he said.


eeNews Europe