Tachyum® has detailed in a white paper how to use its 4-bit Tachyum AI (TAI) and 2-bit effective per weight (TAI2) formats to quantize Large Language Models (LLMs) without accuracy degradation.
Tachyum hardware also enables workable LLMs with 1 bit per weight, albeit with higher degradation than TAI2. Its AI scientists are continuing to improve performance and reduce that degradation as Tachyum looks to bring 1-bit models into the mainstream.
Tachyum addresses massive LLMs, whose scale has increased by more than a thousand times over the past few years. Examples include the ChatGPT-3.5 LLM with 175 billion parameters, the PALM LLM with 530 billion dense parameters and the Switch Transformer with 1.6 trillion sparse parameters.
For instance, a 1.6-trillion-parameter Switch Transformer would require 52 NVIDIA H100 80GB GPUs at $41,789 each, plus 7 Supermicro GPU servers at $25,000 each, for a total of $2,348,028. In contrast, a single-socket Prodigy system costing $23,000 with 2TB of DDR5 DRAM could fit and run such big models and bring them into the mainstream for generative AI applications.
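The arithmetic behind this comparison can be checked with a short Python sketch. The prices, unit counts, parameter count and DRAM size are the figures quoted above. The weights-only footprint formula (parameters × bits per weight ÷ 8) and the function names are illustrative assumptions. The sketch ignores activations, caches and runtime overhead, so it is a rough lower bound rather than a sizing guide.

```python
# Figures quoted in the text above.
GPU_COUNT, GPU_PRICE = 52, 41_789          # NVIDIA H100 80GB GPUs
SERVER_COUNT, SERVER_PRICE = 7, 25_000     # Supermicro GPU servers
PARAMS = 1_600_000_000_000                 # 1.6 trillion parameters
PRODIGY_DRAM_BYTES = 2_000_000_000_000     # 2 TB (decimal) of DDR5 DRAM


def gpu_cluster_cost():
    """Total cost of the GPU configuration quoted in the text."""
    return GPU_COUNT * GPU_PRICE + SERVER_COUNT * SERVER_PRICE


def weight_bytes(params, bits_per_weight):
    """Weights-only footprint in bytes (assumption: no per-block overhead)."""
    return params * bits_per_weight // 8


print(f"GPU cluster cost: ${gpu_cluster_cost():,}")
for bits in (16, 8, 4, 2):
    size = weight_bytes(PARAMS, bits)
    verdict = "fits" if size <= PRODIGY_DRAM_BYTES else "does not fit"
    print(f"{bits:>2}-bit weights: {size / 1e12:.1f} TB, {verdict} in 2 TB")
```

Under these assumptions, the weights exceed 2 TB at 16 bits per weight and fit at 8, 4 or 2 bits per weight.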
AI systems built on Prodigy universal chips with 256PB of DDR5 DRAM (Dynamic Random Access Memory), using FP8 (8-bit floating point) and 4-bit Tachyum AI (TAI) data formats, can fit models of up to 100 quadrillion parameters. Such a system can serve more than 150,000x ChatGPT models or 610,000x PALM2 models. This represents huge possibilities for using LLMs as a mainstream technology in industries ranging from retail and e-commerce, marketing, finance, cyber security and the military to healthcare, including faster drug development and the practical implementation of personalized medicine in hospitals.
Effective deployment of LLMs requires low-bit quantization to minimize model size and inference cost. Low-bit integer formats, like INT8 and INT4, have been the conventional choice; however, the emerging low-bit exponential formats offer a compelling alternative. At reasonable costs, LLMs could be deployed by enterprises small to large across a variety of industries. LLMs could be an integral part of an organization’s web presence to provide an interactive experience, such as enabling visitors to ask questions naturally rather than entering search terms.
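To make the contrast between integer and exponential low-bit formats concrete, here is a minimal Python sketch. It is not the TAI encoding, whose details are not given here. It assumes a generic symmetric INT4 scheme and a generic 4-bit sign-plus-exponent (power-of-two) scheme, and all function names are hypothetical. One appeal of power-of-two weights is that applying a weight to an activation reduces to an exponent adjustment, shown with `math.ldexp`, instead of a full multiplication.

```python
import math


def quantize_int4(weights):
    """Symmetric uniform quantization to integer codes in [-7, 7], one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def quantize_pow2(weights, exp_bits=3):
    """1 sign bit + exp_bits exponent bits: each weight becomes +/- 2**(e_max - k)."""
    max_abs = max(abs(w) for w in weights)
    e_max = math.ceil(math.log2(max_abs)) if max_abs else 0
    levels = 2 ** exp_bits
    codes = []
    for w in weights:
        if w == 0:
            k = levels - 1  # assumption: the smallest magnitude stands in for zero
        else:
            k = max(0, min(levels - 1, round(e_max - math.log2(abs(w)))))
        codes.append((w < 0, k))
    return codes, e_max


def dequantize_pow2(codes, e_max):
    """Rebuild float weights from (sign, exponent offset) codes."""
    return [(-1.0 if neg else 1.0) * 2.0 ** (e_max - k) for neg, k in codes]


def apply_pow2_weight(activation, code, e_max):
    """Scale an activation by a power-of-two weight via an exponent adjustment."""
    neg, k = code
    scaled = math.ldexp(activation, e_max - k)  # activation * 2**(e_max - k)
    return -scaled if neg else scaled


weights = [0.5, -0.25, 0.125, 1.0]
int_codes, scale = quantize_int4(weights)
pow2_codes, e_max = quantize_pow2(weights)
print("INT4 codes:", int_codes, "scale:", scale)
print("power-of-two reconstruction:", dequantize_pow2(pow2_codes, e_max))
```

In this toy example the weights are exact powers of two, so the exponential codes reconstruct them without error. Real weight distributions would show rounding error in both schemes.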
“By combining TAI 4-bit and effective 2-bit weights with FP8 per activation, we are capable of quantizing LLMs without much accuracy degradation,” said Dr. Radoslav Danilak, founder and CEO of Tachyum. “Our techniques avoid expensive multiplication while simultaneously reducing the size of the model by 4x to 8x, enabling generative AI models that can be applied in use cases from complex language modelling tasks, text generation, drug and chip design, few-shot learning and reasoning to protein sequence modelling. Whole new avenues of calculations can be opened with Tachyum AI.”
As a Universal Processor offering industry-leading performance for all workloads, Prodigy enables data center servers to switch seamlessly and dynamically between computational domains (such as AI/ML, HPC, and cloud) with a single homogeneous architecture. By eliminating the need for expensive dedicated AI hardware and dramatically increasing server utilization, Prodigy reduces CAPEX and OPEX significantly while delivering unprecedented data center performance, power, and economics. Prodigy integrates 192 high-performance custom-designed 64-bit compute cores to deliver up to 4.5x the performance of the highest-performing x86 processors for cloud workloads, up to 3x that of the highest-performing GPU for HPC, and 6x for AI applications.
The paper, “Mainstreaming Large Language Models With 2-bit TAI Weights,” is available here.