
Semidynamics details its all-in-one RISC-V NPU

Semidynamics in Spain has developed a fully programmable Neural Processing Unit (NPU) IP that combines CPU, vector, and tensor processing to deliver up to 256 TOPS for large language models and AI recommendation systems.
The Cervell NPU is based on the open RISC-V instruction set architecture and scales from 8 to 64 cores. This allows designers to tune performance to the requirements of the application, from 8 TOPS (INT8 at 1 GHz) for compact edge deployments up to 256 TOPS (INT4 at 2 GHz) for high-end AI inference in datacentre chips.
This follows the launch of the all-in-one architecture back in December, detailed in this white paper.
“Cervell is designed for a new era of AI compute — where off-the-shelf solutions aren’t enough. As an NPU, it delivers the scalable performance needed for everything from edge inference to large language models. But what really sets it apart is how it’s built: fully programmable, with no lock-in thanks to the open RISC-V ISA, and deeply customizable down to the instruction level. Combined with our Gazillion Misses memory subsystem, Cervell removes traditional data bottlenecks and gives chip designers a powerful foundation to build differentiated, high-performance AI solutions,” says Roger Espasa, CEO of Semidynamics.
Cervell NPUs are purpose-built to accelerate matrix-heavy operations, enabling higher throughput, lower power consumption, and real-time response. By integrating NPU capabilities with standard CPU and vector processing in a unified architecture, designers can avoid the latency of moving data between separate accelerator blocks and maximize performance across diverse AI tasks, from recommendation systems to deep learning pipelines.
The Cervell cores are tightly integrated with the Gazillion Misses memory management subsystem. This enables up to 128 simultaneous memory requests, eliminating latency stalls with over 60 bytes/cycle of sustained data streaming, around 60 GB/s at 1 GHz. There is also massively parallel access to off-chip memory, essential for large model inference and sparse data processing.
This maintains full pipeline saturation even in bandwidth-heavy applications such as recommendation systems and deep learning.
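A back-of-the-envelope Little's-law calculation shows why so many in-flight requests matter: sustained bandwidth equals the number of outstanding requests times the request size, divided by memory latency. The sketch below uses the 128-request figure from the text; the 64-byte line size and 100 ns latency are illustrative assumptions, not Semidynamics figures.

```c
/* Little's-law sketch of why many outstanding misses sustain bandwidth.
 * Line size and memory latency are illustrative assumptions, not
 * Semidynamics figures; only the 128-request count comes from the text. */
#include <stdio.h>

int main(void)
{
    const double outstanding = 128.0;  /* simultaneous requests (from text) */
    const double line_bytes  = 64.0;   /* assumed request size */
    const double latency_ns  = 100.0;  /* assumed off-chip memory latency */

    /* bandwidth = in-flight bytes / latency; bytes per ns equals GB/s */
    double gbytes_per_s = outstanding * line_bytes / latency_ns;
    printf("Sustained bandwidth: %.0f GB/s\n", gbytes_per_s); /* ~82 GB/s */
    return 0;
}
```

With only a handful of outstanding requests, the same arithmetic collapses to a few GB/s, which is why a deep miss-handling subsystem rather than raw clock speed determines sustained throughput here.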
The core is fully customizable: designers can add scalar or vector instructions, configure scratchpad memories and custom I/O FIFOs, and define memory interfaces and synchronization schemes to build differentiated, future-proof AI hardware.
This deep customization at the RTL level, including the insertion of customer-defined instructions, allows companies to integrate their unique IP directly into the design, protecting the ASIC investment from imitation and ensuring the result is fully optimized for power, performance, and area. The development model includes early FPGA drops and parallel verification to reduce development time and risk.
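As a rough illustration of what instruction-level customization can look like from the software side, the hypothetical C snippet below invokes an imaginary fused multiply-accumulate placed in the RISC-V custom-0 opcode space, using the GNU assembler's generic `.insn` directive. Semidynamics' actual extension interface and tooling are not described in the article, so the instruction name, encoding, and semantics here are all assumptions.

```c
/* Hypothetical example: calling a customer-defined RISC-V instruction.
 * Cervell's real extension mechanism is not public; this only shows the
 * generic GNU-as route (.insn) for emitting an instruction in the
 * RISC-V custom-0 opcode space (0x0B). Build with a riscv toolchain. */
#include <stdint.h>
#include <stdio.h>

static inline int32_t custom_macc(int32_t a, int32_t b)
{
    int32_t result;
    /* .insn r opcode, funct3, funct7, rd, rs1, rs2 */
    asm volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                 : "=r"(result)
                 : "r"(a), "r"(b));
    return result;
}

int main(void)
{
    /* On silicon implementing the custom datapath this executes directly;
     * on a stock RISC-V core it traps as an illegal instruction. */
    printf("%d\n", custom_macc(3, 4));
    return 0;
}
```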
| Configuration | INT8 @ 1GHz | INT4 @ 1GHz | INT8 @ 2GHz | INT4 @ 2GHz |
|---------------|-------------|-------------|-------------|-------------|
| C8            | 8 TOPS      | 16 TOPS     | 16 TOPS     | 32 TOPS     |
| C16           | 16 TOPS     | 32 TOPS     | 32 TOPS     | 64 TOPS     |
| C32           | 32 TOPS     | 64 TOPS     | 64 TOPS     | 128 TOPS    |
| C64           | 64 TOPS     | 128 TOPS    | 128 TOPS    | 256 TOPS    |
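The table follows a simple scaling rule: throughput doubles with core count, with clock frequency, and when moving from INT8 to INT4. The short sketch below reproduces the table from that rule; the implied baseline of 1 TOPS per core per GHz at INT8 is inferred from the C8 entry rather than a published per-core specification.

```c
/* Sketch of the scaling relationship implied by the table above.
 * The 1-TOPS-per-core (INT8, 1 GHz) baseline is inferred from the
 * C8 row, not a published spec. */
#include <stdio.h>

int main(void)
{
    const int cores[] = {8, 16, 32, 64};      /* configurations C8 .. C64 */
    const double freqs_ghz[] = {1.0, 2.0};

    for (int i = 0; i < 4; i++) {
        for (int f = 0; f < 2; f++) {
            double int8_tops = cores[i] * freqs_ghz[f]; /* 1 TOPS/core/GHz */
            double int4_tops = 2.0 * int8_tops;         /* INT4 doubles rate */
            printf("C%-2d @ %.0f GHz: INT8 %3.0f TOPS, INT4 %3.0f TOPS\n",
                   cores[i], freqs_ghz[f], int8_tops, int4_tops);
        }
    }
    return 0;
}
```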
