Low power implementation for generative AI

By Nick Flaherty

Researchers in the US have run a high-performing large language model on an FPGA that draws about as much power as a lightbulb.

The low power AI technique developed by researchers at the University of California Santa Cruz (UCSC) eliminates the most computationally expensive and memory intensive element of a large language model for generative AI, cutting power consumption to 13W for a billion-parameter LLM.

While this potentially opens up a new generation of low power custom edge AI chips, particularly with small language models (SLMs), the models are based on transformers that are also memory intensive. GPT4 is estimated to have 1.76 trillion parameters and the next generation will have even more.

Energy costs are a key challenge for running the latest LLMs for services such as ChatGPT and GPT4 on GPUs. The research at UCSC eliminates the computationally expensive matrix multiplication layer.

Even with a slimmed-down algorithm and much less energy consumption, the new, open source model achieves the same performance as models such as Meta’s Llama LLM with 2.7bn parameters.

“We got the same performance at way less cost — all we had to do was fundamentally change how neural networks work,” said Jason Eshraghian, an assistant professor of electrical and computer engineering at the Baskin School of Engineering. “Then we took it a step further and built custom hardware.”

Modern neural networks use matrix multiplication, performing operations on matrices of numbers that weigh the importance of particular words or highlight relationships between words in a sentence or sentences in a paragraph. Larger scale language models have trillions of these numbers. “Neural networks, in a way, are glorified matrix multiplication machines,” he said. “The larger your matrix, the more things your neural network can learn.”

To multiply numbers from matrices on different GPUs, data must be moved around, a process which creates most of the neural network’s costs in terms of time and energy. This has been tackled with new architectures such as the all-in-one RISC-V vector and TPU unit developed by SemiDynamics in Barcelona.

The low power AI strategy to avoid using matrix multiplication forces all the numbers within the matrices to be ternary, meaning they can take one of three values: negative one, zero, or positive one. This allows the computation to be reduced to summing numbers rather than multiplying. The matrices are then overlaid and only the most important operations are performed. 
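To illustrate why ternary weights remove the need for multipliers, the following is a minimal Python sketch of a matrix-vector product in which every weight is -1, 0 or +1. It is not the researchers' code: the function names and the example values are hypothetical, chosen only to show that each output reduces to adding or subtracting activations and skipping zeros.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for a ternary W using only additions and subtractions.

    weights: list of rows, every entry is -1, 0 or +1 (illustrative values)
    x: list of activation values
    """
    y = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the activation
            elif w == -1:
                acc -= xi   # -1 weight: subtract the activation
            # 0 weight: nothing to do, the term is skipped
        y.append(acc)
    return y


def dense_matvec(weights, x):
    """Reference version using ordinary multiply-accumulate."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical example data, not taken from the model.
W = [[1, 0, -1, 1],
     [0, -1, 0, 1],
     [-1, -1, 1, 0]]
x = [3, 5, 2, 7]

result = ternary_matvec(W, x)
print(result)  # [8, 2, -6]
```

Both functions return the same result, but the ternary version never executes a multiplication, which is the property that simplifies the hardware.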

In software, the multiply-based and sum-based algorithms can be coded in exactly the same way, but the ternary implementation reduces the complexity of the hardware. “From a circuit designer standpoint, you don’t need the overhead of multiplication, which carries a whole heap of cost,” said Eshraghian.

Although they reduced the number of operations, the researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance.

The researchers initially designed their neural network to run on GPUs, where it used about 10 times less memory and ran about 25 percent faster than other models, with a power consumption of around 700W. Using an optimized kernel during inference, the model’s memory consumption can be reduced by more than 10x compared to unoptimized models.

The researchers then worked with Assistant Professor Dustin Richmond and Lecturer Ethan Sifferman in the Baskin Engineering Computer Science and Engineering department to create custom hardware on an FPGA clocked at 60MHz.

The RTL implementation of the MatMul-free token generation core is deployed on a D5005 Stratix 10 programmable acceleration card (PAC) in the Intel FPGA Devcloud. This uses a single core with 8bit tokens.

The core requires access to a DDR4 interface and MMIO bridges for host control. In this implementation, the majority of resources are dedicated to the provided shell logic and only 0.4% of programmable logic resources are dedicated to logic for core interconnect and arbitration to DDR4 interfaces/MMIO.

The core latency is primarily due to the larger execution time of the ternary matrix multiply functional unit. By instead using the full 512-bit DDR4 interface and parallelizing the TMATMUL functional unit, which dominates 99% of core processing time, a further speed-up of approximately 64× is projected, while maintaining the same clock rate without additional optimizations or pipelining.

The 1.3bn parameter model they used has a runtime of 42ms and a throughput of 23.8 tokens per second. This reaches human reading speed at an efficiency that is on par with the power consumption of the human brain. The performance can be improved by using multiple cores and by optimising the cache management and the IP blocks, for example for the DDR4 interfaces.
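The two figures quoted for the single core are consistent with each other if the 42ms runtime is read as the time to generate one token. That reading is an assumption, and the variable names below are illustrative:

```python
# Figure quoted in the article: 42 ms runtime for the 1.3bn parameter model.
# Assumption: this is the time taken to generate a single token on one core.
runtime_ms = 42.0

# Throughput is the reciprocal of the per-token latency.
tokens_per_second = 1000.0 / runtime_ms

print(f"{tokens_per_second:.1f} tokens per second")  # 23.8 tokens per second
```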

Estimates of multi-core implementation latencies are generated by scaling the overheads of the single core implementation and factoring in the growth of logic to accommodate contention on the DDR4 channels. Each core connects to one of four DDR4 channels, and each additional core connected to a channel will double the required arbitration and buffering logic for that channel. As both the host and core share DDR4 channels, this overhead will scale in proportion to the number of cores attached to the channel.

To mitigate this, future work could bring additional caching optimizations to the core and functional units. With further development and custom silicon, the researchers believe they can further optimize the low power AI technology for even more energy efficiency. However, one limitation is that the MatMul-free LM has not been tested on extremely large-scale models with over 100bn parameters due to computational constraints.

“These numbers are already really solid, but it is very easy to make them much better,” said Eshraghian. “If we’re able to do this within 13 watts, just imagine what we could do with a whole data centre worth of compute power. We’ve got all these resources, but let’s use them effectively.”

The code is available on GitHub.
