AMD has launched a graphics processor optimised for high performance computing (HPC) rather than graphics.
The Instinct MI100 accelerator marks the divergence of the two types of GPU. Aimed at supercomputers and data centres, it delivers 11.5 teraflops of 64-bit floating point performance from an array of 120 compute units and 7680 stream processors within a 300W power envelope.
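The headline figures hang together with simple arithmetic. The sketch below is a back-of-envelope check, not AMD's own calculation. It assumes a peak engine clock of about 1502 MHz (not stated in this article), one fused multiply-add (counted as two floating point operations) per stream processor per cycle at 32-bit precision, and 64-bit throughput at half the 32-bit rate.

```python
# Back-of-envelope peak-throughput check for the MI100 figures quoted above.
# ASSUMPTIONS (not from the article): ~1502 MHz peak clock, one FMA
# (2 FLOPs) per stream processor per cycle in FP32, FP64 at half rate.

COMPUTE_UNITS = 120
STREAM_PROCESSORS = 7680
BOARD_POWER_W = 300
CLOCK_HZ = 1.502e9      # assumed peak engine clock
FLOPS_PER_FMA = 2       # a fused multiply-add counts as two operations
FP64_RATE = 0.5         # assumed FP64:FP32 throughput ratio


def peak_tflops(lanes: int, clock_hz: float, rate: float = 1.0) -> float:
    """Peak throughput in teraflops for `lanes` FMA units at `clock_hz`."""
    return lanes * FLOPS_PER_FMA * clock_hz * rate / 1e12


lanes_per_cu = STREAM_PROCESSORS // COMPUTE_UNITS   # stream processors per CU
fp32 = peak_tflops(STREAM_PROCESSORS, CLOCK_HZ)
fp64 = peak_tflops(STREAM_PROCESSORS, CLOCK_HZ, FP64_RATE)
gflops_per_watt = fp64 * 1e3 / BOARD_POWER_W        # peak FP64 per board watt

print(f"{lanes_per_cu} stream processors per compute unit")
print(f"FP32 peak: {fp32:.1f} TFLOPS")
print(f"FP64 peak: {fp64:.1f} TFLOPS")
print(f"FP64 per watt: {gflops_per_watt:.1f} GFLOPS/W")
```

Under these assumptions the script reports 64 stream processors per compute unit, about 23.1 TFLOPS at 32-bit and 11.5 TFLOPS at 64-bit precision, in line with the quoted figure.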
“Each era has unique characteristics for compute,” said Brad McCredie, VP for data centre GPU accelerators at AMD. “We have moved from CPUs carrying the weight of the computation as we needed a boost to keep performance moving forward using general purpose GPUs. We believe we need another boost to move into the exascale era. AI is driving new workloads, again diversifying the workloads that GPUs are carrying.” He added that the MI100 is “the first GPU to break the 10 TFLOP barrier” in 64-bit floating point.
The 120 compute units are arranged in four arrays. They are derived from AMD's earlier GCN architecture and execute streams of data, or wavefronts, that each contain 64 work items. AMD has added Matrix Core Engines to the compute units; these are optimised for operating on matrix data types, from 8-bit integer to 32-bit floating point, boosting throughput and power efficiency.
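Matrix engines of this kind speed up operations of the form D = A×B + C on small blocks of matrices, typically taking lower-precision inputs and accumulating the result at higher precision. The following is a minimal pure-Python model of that idea, not AMD's implementation. It assumes FP16 inputs and an FP32 accumulator; the function names (`to_fp16`, `to_fp32`, `matrix_fma`) are hypothetical and the matrix shapes do not correspond to any specific MI100 instruction.

```python
import struct

# Hypothetical model of a mixed-precision matrix fused multiply-add,
# D = A x B + C. ASSUMPTIONS: FP16 inputs, FP32 accumulation, rounding
# after every addition. Shapes are illustrative only.

Matrix = list[list[float]]


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to IEEE single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matrix_fma(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """Return D = A*B + C with FP16 inputs and an FP32 accumulator.

    a is M x K, b is K x N, c is M x N.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be M x K"
    assert len(c) == m and all(len(row) == n for row in c), "C must be M x N"
    d = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = to_fp32(c[i][j])  # accumulator starts from C, held in FP32
            for p in range(k):
                # FP16 x FP16 products are exact in FP32 (11-bit significands).
                prod = to_fp16(a[i][p]) * to_fp16(b[p][j])
                acc = to_fp32(acc + prod)  # round the running sum to FP32
            row.append(acc)
        d.append(row)
    return d


print(matrix_fma([[1.0, 2.0], [3.0, 4.0]],
                 [[5.0, 6.0], [7.0, 8.0]],
                 [[0.5, 0.5], [0.5, 0.5]]))
```

Small integers survive the round trip exactly, but an input such as 0.1 is rounded to 0.0999755859375 on entry, which is the accuracy traded for higher throughput in low-precision matrix arithmetic.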
The classic GCN compute cores contain a variety of pipelines optimized for scalar and vector instructions. In particular, each CU contains a scalar register file, a scalar execution unit, and a scalar data cache to handle instructions that are shared across the wavefront, such as common control logic or address calculations. Similarly, the CUs also contain four large vector register files, four vector execution units that are optimized for FP32, and a vector data cache. Generally, the vector pipelines are 16-wide and each 64-wide wavefront is executed over four cycles.
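The scalar/vector split and the four-cycle cadence can be illustrated with a toy model. This is a simplification, assuming a single vector instruction is applied to the whole wavefront one 16-lane slice per cycle; the names `issue_vector_op`, `WAVEFRONT_SIZE` and `SIMD_WIDTH` are hypothetical, and the model ignores pipelining, latency and the other three vector units in the compute unit.

```python
# Toy model of a 64-work-item wavefront executing on a 16-lane vector unit.
# Illustrative only: names and the per-lane operation are hypothetical.

WAVEFRONT_SIZE = 64
SIMD_WIDTH = 16


def issue_vector_op(op, operands):
    """Apply `op` to every work item, SIMD_WIDTH lanes per cycle.

    Returns the per-work-item results and the cycles the instruction occupied.
    """
    assert len(operands) == WAVEFRONT_SIZE
    results = [None] * WAVEFRONT_SIZE
    cycles = 0
    for start in range(0, WAVEFRONT_SIZE, SIMD_WIDTH):
        # One cycle: 16 lanes each process one work item in lock-step.
        for lane in range(SIMD_WIDTH):
            item = start + lane
            results[item] = op(operands[item])
        cycles += 1
    return results, cycles


# A value shared by the whole wavefront (here a base address) is the kind of
# work handled once by the scalar unit and then used by every work item.
base = 1000
out, cycles = issue_vector_op(lambda x: base + 4 * x, list(range(WAVEFRONT_SIZE)))
print(cycles)    # 4
print(out[:3])   # [1000, 1004, 1008]
```

With four such 16-lane units per compute unit, each CU has 64 lanes, which is how 120 compute units add up to the 7680 stream processors quoted earlier.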
The Matrix Core Engine adds a new family of wavefront-level instructions, the Matrix Fused Multiply-Add, or MFMA. The MFMA family performs mixed-precision arithmetic and operates on KxN