MENU

Cerebras Launches the World’s Fastest AI Inference

Cerebras Launches the World’s Fastest AI Inference

Technology News |
By Wisse Hettinga



Developers can now leverage the power of wafer-scale compute for AI inference via a simple API

from the press release:

Cerebras Systems, the pioneer in high performance AI compute, announced Cerebras Inference, the fastest AI inference solution in the world. Delivering 1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B, Cerebras Inference is 20 times faster than NVIDIA GPU-based solutions in hyperscale clouds. Starting at just 10c per million tokens, Cerebras Inference is priced at a fraction of GPU solutions, providing 100x higher price-performance for AI workloads

Unlike alternative approaches that compromise accuracy for performance, Cerebras offers the fastest performance while maintaining state of the art accuracy by staying in the 16-bit domain for the entire inference run. Cerebras Inference is priced at a fraction of GPU-based competitors, with pay-as-you-go pricing of 10 cents per million tokens for Llama 3.1 8B and 60 cents per million tokens for Llama 3.1 70B

“Cerebras has taken the lead in Artificial Analysis’ AI inference benchmarks. Cerebras is delivering speeds an order of magnitude faster than GPU-based solutions for Meta’s Llama 3.1 8B and 70B AI models. We are measuring speeds above 1,800 output tokens per second on Llama 3.1 8B, and above 446 output tokens per second on Llama 3.1 70B – a new record in these benchmarks,” said Micah Hill-Smith, Co-Founder and CEO of Artificial Analysis

“Artificial Analysis has verified that Llama 3.1 8B and 70B on Cerebras Inference achieve quality evaluation results in line with native 16-bit precision per Meta’s official versions. With speeds that push the performance frontier and competitive pricing, Cerebras Inference is particularly compelling for developers of AI applications with real-time or high volume requirements,” Hill-Smith concluded.

Inference is the fastest growing segment of AI compute and constitutes approximately 40% of the total AI hardware market. The advent of high-speed AI inference, exceeding 1,000 tokens per second, is comparable to the introduction of broadband internet, unleashing vast new opportunities and heralding a new era for AI applications. Cerebras’ 16-bit accuracy and 20x faster inference calls empowers developers to build next-generation AI applications that require complex, multi-step, real-time performance of tasks, such as AI agents.

If you enjoyed this article, you will like the following ones: don't miss them by subscribing to :    eeNews on Google News

Share:

Linked Articles
10s