
Fully customisable 4-way RISC-V core for big data

Semidynamics in Spain has developed a customisable 4-way, 64-bit RISC-V core for data centre chips.
The Atrevido 423 RISC-V core has a wide 4-way pipeline for decoding and retiring twice as many instructions as the previous 2-way 223 core. It is also coupled with more functional units, which significantly increases the instructions-per-cycle (IPC) throughput.
Atrevido can be configured as a coherent core with a CHI NoC or as a simpler, non-coherent core connected via an AXI interface. An improved TLB and MMU, together with support for Sv39, Sv48 and Sv57 virtual memory, mean the core is well suited to running applications with large memory footprints under Linux.
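To put the virtual memory modes in perspective, the short sketch below converts the virtual address width of each mode into an address space size. The widths (39, 48 and 57 bits) come from the RISC-V privileged specification; the helper name `virtual_address_space` is purely illustrative and not part of any Semidynamics tooling.

```python
# Virtual address widths in bits for the RISC-V paging modes named in the article.
SV_MODES = {"Sv39": 39, "Sv48": 48, "Sv57": 57}
UNITS = ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB")


def virtual_address_space(mode: str) -> str:
    """Return the virtual address space size of a paging mode as a readable string."""
    size = 1 << SV_MODES[mode]  # number of addressable bytes
    unit = 0
    # Divide by 1024 until the value is below one unit step.
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size} {UNITS[unit]}"


for mode in SV_MODES:
    print(mode, virtual_address_space(mode))
```

Under these widths, Sv39 covers 512 GiB of virtual address space, Sv48 covers 256 TiB and Sv57 covers 128 PiB.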
The out-of-order core comes with a large menu of optional RISC-V extensions. Most notably, it can be configured with the in-house Vector Unit, which fully supports the latest RISC-V vector spec.
The key to the customisation is a large Vector Unit with up to 2048b of computation per cycle for data handling. The Vector Unit is composed of several ‘vector cores’, each roughly equivalent to a GPU core, that perform multiple calculations in parallel. Each vector core has arithmetic units capable of performing addition, subtraction, fused multiply-add, division, square root, and logic operations.
The vector core can be tailored to support different data types: FP64, FP32, FP16, BF16, INT64, INT32, INT16 or INT8, depending on the customer’s target application domain. The size in bits of the largest supported data type defines the vector core width, or ELEN. Customers then select the number of vector cores to be implemented within the Vector Unit, either 4, 8, 16 or 32, catering for a very wide range of power-performance-area trade-offs. Once these choices are made, the total Vector Unit data path width, or DLEN, is ELEN × the number of vector cores. Semidynamics supports DLEN configurations from 128b to 2048b.
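The relationship between data types, ELEN, vector core count and DLEN can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function names are hypothetical, and rejecting combinations that fall outside the 128b to 2048b range is an assumption about how the stated limits apply, not a documented rule.

```python
# Widths in bits of the data types listed in the article.
DATA_TYPE_BITS = {
    "FP64": 64, "FP32": 32, "FP16": 16, "BF16": 16,
    "INT64": 64, "INT32": 32, "INT16": 16, "INT8": 8,
}
VECTOR_CORE_COUNTS = (4, 8, 16, 32)   # options named in the article
DLEN_MIN, DLEN_MAX = 128, 2048        # supported DLEN range named in the article


def elen(supported_types):
    """ELEN is the width of the largest supported data type."""
    return max(DATA_TYPE_BITS[t] for t in supported_types)


def dlen(supported_types, vector_cores):
    """DLEN = ELEN x number of vector cores."""
    if vector_cores not in VECTOR_CORE_COUNTS:
        raise ValueError(f"unsupported vector core count: {vector_cores}")
    width = elen(supported_types) * vector_cores
    # Assumption: configurations outside the stated range are not offered.
    if not DLEN_MIN <= width <= DLEN_MAX:
        raise ValueError(f"DLEN {width}b is outside {DLEN_MIN}b..{DLEN_MAX}b")
    return width


print(dlen(["FP64", "FP32"], 32))  # widest configuration
print(dlen(["FP32", "INT8"], 4))   # narrowest configuration
```

With a 64-bit ELEN and 32 vector cores the data path reaches the 2048b maximum, while a 32-bit ELEN with 4 vector cores gives the 128b minimum.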
The core also offers a second key choice in the Vector Unit: the number of bits in each vector register (known as VLEN) can also be tailored to the customer’s needs. While most other vendors assume that VLEN is equal to DLEN (i.e., a 1X ratio), Semidynamics also offers 2X, 4X and 8X ratios. When VLEN is larger than DLEN, a vector operation takes multiple cycles to execute. For example, when VLEN=2048 and DLEN=512, each vector arithmetic operation takes four clock cycles to execute. This helps to tolerate large memory latencies and to reduce power.
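The cycle count in that example follows directly from the VLEN-to-DLEN ratio. The sketch below shows the calculation. The function name is hypothetical, and the model counts only how many passes a full vector register needs through the data path, ignoring pipeline start-up and any other microarchitectural detail.

```python
# VLEN:DLEN ratios mentioned in the article (1X plus the 2X, 4X and 8X options).
ALLOWED_RATIOS = (1, 2, 4, 8)


def cycles_per_vector_op(vlen: int, dlen: int) -> int:
    """Cycles needed to push one full vector register through the data path."""
    if vlen % dlen != 0:
        raise ValueError("VLEN must be a whole multiple of DLEN")
    ratio = vlen // dlen
    if ratio not in ALLOWED_RATIOS:
        raise ValueError(f"unsupported VLEN:DLEN ratio: {ratio}X")
    return ratio


print(cycles_per_vector_op(2048, 512))  # the article's example: 4 cycles
```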
Other extensions include bit manipulation, crypto, single-, double- and half-precision floating point, and bfloat16. Customers can also choose to protect the data cache with ECC and the instruction cache with parity, if required for their target markets. Furthermore, the Atrevido core is fully compliant with the latest RVA22 RISC-V profile. The cores are process agnostic, with versions already being supplied down to 5nm.
“The Atrevido 423 is particularly well suited for applications that require massive amounts of data. It shines when the data required cannot fit in memory hierarchy levels that are closer to the core (such as L1, L2 or even L3) by tolerating very large latencies without compromising on throughput thanks to our Gazzillion misses technology,” said Roger Espasa, CEO of Semidynamics.
“This can handle up to 128 simultaneous requests for data and track them back to the correct place in whatever order they are returned. Gazzillion allows the core to access memory hierarchy levels far away from the core without an impact in bandwidth or throughput. Effectively, Gazzillion technology removes the latency issues that can occur when using CXL technology to enable far away memory to be accessed at the supercharged rates that it was designed to deliver. This makes Atrevido very well positioned to handle AI and HPC workloads, which typically need to rapidly access very large amounts of data from main memory.”
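A rough way to see why the number of simultaneous requests matters is Little's law: sustained throughput is bounded by the number of requests in flight divided by the memory latency. The sketch below applies that bound. The peak rate of one cache line per cycle and the latency values are illustrative assumptions, not measured Atrevido figures, and the function name is hypothetical.

```python
def sustained_lines_per_cycle(outstanding: int, latency_cycles: int,
                              peak_lines_per_cycle: float = 1.0) -> float:
    """Upper bound on cache lines delivered per cycle (Little's law).

    outstanding          -- maximum number of requests in flight
    latency_cycles       -- round-trip memory latency in core cycles (assumed)
    peak_lines_per_cycle -- assumed ceiling of the memory interface
    """
    return min(peak_lines_per_cycle, outstanding / latency_cycles)


# With 128 requests in flight, the assumed peak holds up to 128 cycles of latency.
for latency in (64, 128, 256):
    print(latency, sustained_lines_per_cycle(128, latency))

# A hypothetical core limited to 8 outstanding misses falls far short.
print(sustained_lines_per_cycle(8, 128))
```

In this simplified model, 128 requests in flight keep the assumed peak rate up to a latency of 128 cycles, and throughput then falls in proportion as latency grows beyond that.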
The scalar crypto extension follows the latest specification (Zks and Zk) and provides high-performance acceleration for cryptographic algorithms such as SHA2-256, SHA2-512, ShangMi 3, ShangMi 4, AES-128, AES-192, and AES-256. The Atrevido 423 constant-time implementation provides security against side-channel attacks while still delivering a high-performance crypto solution.
“Semidynamics has the fastest cores on the market for moving large amounts of data with a cache line per clock at high frequencies even when the data does not fit in the cache. And this can be done at frequencies up to 2.4 GHz on the right node. The rest of the market averages about a cache line every many, many cycles, that is nowhere near Semidynamics’ one every cycle,” said Espasa.
