Tachyum plans 3nm Universal processor

By Nick Flaherty



Tachyum has resolved problems with the IP for its 5nm universal processor and is developing a next generation on a 3nm process.

The 5nm Prodigy chip should have taped out last year. “We purchased IP for the DDR5 memory and PCIe and the supplier had an issue,” Radoslav Danilak, founder and CEO of Tachyum, tells eeNews Europe. “We replaced that with Rambus and Alphawave and another for DDR5. That was a nasty surprise but the good news is that the new IP is integrated and ready now.”

Rambus is supplying its PCI Express (PCIe) 5.0 IP, while UK-based Alphawave supplies its AlphaCORE Long-Reach (LR) Multi-Standard-Serdes (MSS) IP, a high-performance, low-power, DSP-based PHY with speeds up to 112Gbps.

The 3nm Prodigy 2 design will add the CXL memory interface and PCIe 6.0 lanes from the same IP suppliers and increase the number of cores. The plan is to sample in 2H24, and this has already been funded as a key European project, says Danilak.

The universal processor is designed from the ground up to run CPU, GPU and AI instructions natively in the same execution core, with two 1024bit vector units and a 4096bit matrix processor.

“We extended the vector lines and removed the penalties in the CPU. When you already have vector units you can do matrix multiplication directly. That was the magic, and it’s easier to programme,” said Danilak. There isn’t much of a penalty for the wide vector units. “The vector unit is less than a quarter of the die, and we have a relatively high clock for operations,” he said.
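Danilak's point that matrix multiplication falls out of wide vector units can be illustrated with a minimal sketch. It is not Tachyum's implementation: the lane count assumes a 1024bit vector holding 16 values of 64 bits each, and the helper names `vector_fma` and `matmul` are hypothetical. Each output row is built purely from vector multiply-accumulate steps.

```python
LANES = 16  # assumption: one 1024bit vector holds 16 values of 64 bits


def vector_fma(acc, scalar, vec):
    """One vector multiply-accumulate: acc[i] + scalar * vec[i] in every lane."""
    return [a + scalar * v for a, v in zip(acc, vec)]


def matmul(a, b):
    """Multiply a (n x k) by b (k x m) using only vector-wide operations."""
    k, m = len(b), len(b[0])
    out = []
    for row in a:
        acc = [0.0] * m
        for j in range(k):
            # Walk the output row in LANES-wide chunks, as a vector unit would.
            for start in range(0, m, LANES):
                stop = start + LANES
                acc[start:stop] = vector_fma(
                    acc[start:stop], row[j], b[j][start:stop]
                )
        out.append(acc)
    return out


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The inner loop never touches a single element on its own, which is why hardware that already has wide vector units can take on matrix work without a separate programming model.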

The largest version of the Prodigy chip has 128 cores running at 5.7GHz and will sample by the end of the year, with production in 2023. The largest chip is designed in two blocks of 64, separated by isolation cells. This means that if there is a defect on one side, the die can be sold as the 64 core version.
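The effect of splitting the die into two isolated halves can be sketched with some simple arithmetic. This is an illustration only: the function names are hypothetical, the probability that a half is defect-free is an invented input and not a Tachyum figure, and the two halves are assumed to fail independently.

```python
def bin_die(left_ok: bool, right_ok: bool) -> str:
    """Classify a die from the state of its two 64 core blocks."""
    if left_ok and right_ok:
        return "128-core"
    if left_ok or right_ok:
        return "64-core"  # the good half is sold, the other is isolated
    return "scrap"


def bin_fractions(p_half_ok: float) -> dict:
    """Expected share of dies in each bin, assuming independent halves."""
    p = p_half_ok
    return {
        "128-core": p * p,
        "64-core": 2 * p * (1 - p),
        "scrap": (1 - p) * (1 - p),
    }


if __name__ == "__main__":
    # Hypothetical input: each half is defect-free 80% of the time.
    for name, share in bin_fractions(0.8).items():
        print(f"{name}: {share:.0%}")
```

Under these assumptions, a die with a defect in only one half is still sold as a 64 core part, and only dies with defects in both halves are scrapped.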

The high clock comes from the way the core is optimised for the software workloads in high performance computing (HPC) and AI. The place and route of the silicon was linked to the gcc compiler to minimise data transfers in and out of the core.

“In about 93% of cases we can do the calculation in the same unit and not have to move the data. But the compiler and hardware has to work together to make sure the data is all on the same unit,” he said.

The result is a chip that is up to 4x faster for integer operations and 30x faster for floating point operations. It is also 12x faster for AI operations as it can handle sparse data efficiently.

“If you have a chip that’s as effective for AI as for CPU you can have 10x more AI and increase the utilisation and use unused capacity for cloud or video or HPC. We believe this will drive utilisation up from the current level of 30 to 40%,” he said.

The largest 128 core die (the T16128-AIX) measures 400 to 500mm², which is half the die size of Nvidia's competing Hopper AI chip, says Danilak. It consumes 950W, and Tachyum has developed reference designs for a server board and for air-cooled and liquid-cooled server racks. The lowest power version, the 32 core, 3.2GHz T832-LP, consumes 180W.
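For a rough sense of scale, the quoted power figures can be divided by the core counts. This is a back-of-envelope sketch only: the `watts_per_core` helper is hypothetical, and the quoted power is assumed to cover the whole chip, including memory and I/O interfaces, so the result is not a measurement of the cores themselves.

```python
# Figures quoted in the article: part name -> (cores, clock in GHz, power in W)
PARTS = {
    "T16128-AIX": (128, 5.7, 950),
    "T832-LP": (32, 3.2, 180),
}


def watts_per_core(cores: int, watts: float) -> float:
    """Chip power divided evenly across cores (ignores uncore power)."""
    return watts / cores


if __name__ == "__main__":
    for name, (cores, ghz, watts) in PARTS.items():
        per_core = watts_per_core(cores, watts)
        print(f"{name}: {per_core:.2f} W per core at {ghz}GHz")
```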

The architecture runs x86, ARM and RISC-V binaries through a translation layer, but there is also a large software ecosystem of applications that have been compiled to run natively, including Linux and TensorFlow.

“Even after binary translation the core is still twice as fast as other processors. Customers are saying they want to move in 12 to 18 months to native applications but can get started now. Our team is porting native HPC applications and we are planning to port all the Linux distributions,” he said.

The company has an esteemed advisory board, including Prof Steve Furber, co-designer of the first ARM processor, and Fred Weber, developer of x86 processors at AMD.

www.tachyum.com
