Semidynamics launches configurable RISC-V vector unit

Technology News
By Nick Flaherty

Semidynamics in Spain has developed a highly configurable out-of-order vector unit with a new architecture to boost the performance of RISC-V processor designs, and is demonstrating it by running Doom.

The Vector Unit pairs with Semidynamics’ 64-bit out-of-order RISC-V Atrevido core and upcoming in-order cores. “We are launching the configurable core with the vector unit support,” Roger Espasa, CEO of Semidynamics, tells eeNews Europe.

“We have software running on the IP, with a Linux system live on an FPGA,” he said. “The cool thing is we decided to go with Doom and vectorised it ourselves, and you can see it is faster with the vectors; you get a very good speed-up,” he said. The video comparison with the core without the vector unit is below.

“Doom is nice as it is so old it doesn’t use the texturing capabilities of today’s GPUs; it just paints pixels into the image buffer. So we took the software, went to the top three subroutines in C and rewrote them using vector intrinsics, compiled with a stock compiler, and it just runs,” he said.
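As a rough illustration of what such a rewrite involves (this is a hypothetical sketch, not Semidynamics’ or id Software’s actual code), a Doom-style span-drawing routine maps texels through a colormap into the framebuffer. Vectorising it means strip-mining the loop into chunks of the hardware vector length; on a real RVV toolchain the chunk body would become intrinsics, but the structure is shown here in portable C:

```c
#include <stdint.h>
#include <stddef.h>

#define VL_MAX 16  /* stand-in for the hardware vector length */

/* Doom-style span draw: each texel is an index into a colormap,
 * and the mapped pixel is stored into the framebuffer. The loop is
 * strip-mined the way an RVV rewrite would be: each outer iteration
 * processes one vector's worth of pixels (a vsetvl-style chunk). */
static void draw_span(uint8_t *fb, const uint8_t *texels,
                      const uint8_t *colormap, size_t n)
{
    size_t i = 0;
    while (i < n) {
        /* pick the chunk size, like vsetvl does with the remaining count */
        size_t vl = (n - i < VL_MAX) ? (n - i) : VL_MAX;
        for (size_t j = 0; j < vl; j++)          /* one vector op's worth */
            fb[i + j] = colormap[texels[i + j]]; /* gather + store */
        i += vl;
    }
}
```

In the vector version, the colormap lookup maps naturally onto an indexed (gather) load, which is why the unit’s gather support discussed below matters for this kind of code.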

“It means that when there is a loop in the programme where they are scaling something, it can scale by a factor of eight. Not everything is vectorised, so you get a nice 3–4x speed-up,” he said.
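The 3–4x overall figure from an 8x vector unit is consistent with Amdahl’s law: the scalar remainder of the frame limits the total gain. A small sketch with illustrative numbers (these are not Semidynamics’ measurements):

```c
/* Amdahl's law: if a fraction p of the work is sped up by a factor s,
 * the overall speed-up is 1 / ((1 - p) + p / s). */
static double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* With ~85% of the frame vectorised at 8x, amdahl(0.85, 8.0) is about
 * 3.9x overall - squarely in the 3-4x range quoted above. */
```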

The design uses the RVV 1.0 RISC-V vector standard with additional, customisable features to provide enhanced data handling capabilities. Each vector core has arithmetic units capable of performing addition, subtraction, fused multiply-add, division, square root and logic operations, and can be tailored to support different data types: FP64, FP32, FP16, BF16, INT64, INT32, INT16 or INT8, depending on the customer’s target application.

“This is two major pieces of work,” said Espasa. “We designed the vector cores with configurable arithmetic units for integer and floating point. The next job is to make them customisable for the customer to define the number of cores, starting at 4 and going up to 32 and that gives you a big block of MAC functions.”

“The second part is the vector load and store instructions. Those instructions are within the core, as we made the choice of making the vector loads and stores coherent with scalar instructions. It should just work; that’s what the specification says. The vector instructions move through the cache pipeline and do all the things a load would do through the memory unit, everything you would expect. So there was a piece of work to ensure that the core supported functions such as gather/scatter.”
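For readers unfamiliar with the terms, gather and scatter are indexed memory operations: a gather pulls elements from scattered addresses into a dense vector register, and a scatter is the inverse for stores. A portable C model of the semantics (a sketch of what RVV’s indexed load/store instructions do in a single vector operation, not the unit’s implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Gather: dst[i] = base[idx[i]] for each lane - a dense register is
 * filled from scattered memory locations in one vector operation. */
static void gather_u32(uint32_t *dst, const uint32_t *base,
                       const uint32_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[idx[i]];
}

/* Scatter: base[idx[i]] = src[i] - the inverse, storing a dense
 * register out to scattered memory locations. */
static void scatter_u32(uint32_t *base, const uint32_t *src,
                        const uint32_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        base[idx[i]] = src[i];
}
```

Because each lane can touch a different cache line, these operations generate many outstanding memory requests, which is why the article notes they pair well with the Gazillion interconnect.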

“But this is complicated to say the least. We have a good connection with our Gazillion [modular interconnect] technology, and vectors want to touch many, many things in memory, which works well with Gazillion. For example, for a vector register gather, you can gather a byte from any register and put it into any other. Then there’s slide up and down, which is very useful for FFTs, moving a vector to the left or right, and there’s compress and expand, which is also a difficult instruction,” he explained.
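The semantics of compress and slide can be modelled in a few lines of portable C. This is an illustrative sketch of what RVV’s vcompress and vslidedown instructions compute, not how the hardware does it; the zero-fill of the slid-out tail is a simplification (RVV’s tail policy is configurable):

```c
#include <stdint.h>
#include <stddef.h>

/* Compress: pack the elements whose mask bit is set into the low
 * elements of dst, preserving order; returns how many were packed. */
static size_t compress_u8(uint8_t *dst, const uint8_t *src,
                          const uint8_t *mask, size_t vl)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Slide down: shift the whole vector left by `offset` lanes,
 * zero-filling the tail (simplified tail handling). */
static void slidedown_u8(uint8_t *dst, const uint8_t *src,
                         size_t offset, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = (i + offset < vl) ? src[i + offset] : 0;
}
```

What makes these "difficult" in hardware is that every output lane can depend on an arbitrary input lane, so a wide crossbar between lanes is needed rather than simple lane-to-lane wiring.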

A key part of the configurability of the vector register is the K value, the ratio of the vector register length to the width of the vector unit. ARM and Intel use a K value of 1, for example a 512-bit vector register for a 512-bit SIMD (single instruction, multiple data) style processor.

“There are other ways of doing this: the vector register can be bigger than the vector unit, and that’s great for power and performance,” he said. “We are the first to offer this multiplying factor. Rather than a SIMD-style processor like ARM and Intel where K=1, K can be 2, 4 or 8 to minimise the power wasted on instruction processing.”

He points to a configuration where K=4, with a 2048-bit register and a 512-bit vector unit. This means the register is filled over four clocks, each moving 512 bits, keeping the hardware fully used for four clocks with no wait states. In the meantime, instructions queue up and instruction gating cuts in, reducing clocking and power consumption.
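The arithmetic behind this is simple: one vector instruction occupies the datapath for K beats, so the front end only needs to issue one instruction every K clocks. A sketch of the relationship (illustrative only):

```c
/* With a register K times wider than the datapath, each vector
 * instruction takes reg_bits / datapath_bits clocks to execute,
 * so the issue logic can be gated for the remaining beats. */
static unsigned beats_per_vector_op(unsigned reg_bits,
                                    unsigned datapath_bits)
{
    return reg_bits / datapath_bits;
}

/* Espasa's example: a 2048-bit register over a 512-bit unit gives
 * beats_per_vector_op(2048, 512) == 4, i.e. K=4. */
```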

“The analogy is the same way that GPUs put tons of registers to get good performance and memory tolerance,” said Espasa. “If you make the vector register large, you get the same benefits, so we recommend this to our customers.”

The company is doing a small test chip, probably in a 12nm or 7nm process, he says. “We have a customer taking the vector unit, but that’s not yet in silicon in 5nm, so we are doing timing and area on that design, and we have another customer in the 12nm range,” he said. “But it is available on an FPGA of your choice to test out.”
