Knowles gave an indication of the launch timing as well as some further details of the Colossus architecture in a talk given at the Scaled ML Conference held at Stanford University on March 24.
Graphcore’s aim is to produce a combination of programming environment and semiconductor hardware optimized for a broad range of machine learning networks and strategies as applied in the cloud and enterprise datacenters. The startup, founded in 2016, claims Colossus can increase performance on machine learning algorithms by a factor of up to 100 compared with systems based on GPUs, which tend to be the fastest systems available today.
Knowles had previously said that Graphcore’s IPU would be shipping to early-access customers before the end of 2017 with more general availability set to start early in 2018 (see Graphcore’s ‘Colossus’ chip due before end of year). That may still be true. In his Stanford presentation Knowles said that the IPU would launch in the “next few months,” which does not exclude the possibility that samples of the 16nm IPUs have already been shipped.
As part of his presentation at the conference Knowles discussed the philosophy behind building an IPU that is memory-centric and that uses bulk synchronous parallel (BSP) communications between processors – in contrast to conventional designs that separate logic and memory.
Two phases of bulk synchronous parallel (BSP) computation. Source: ScaledML Conference and Graphcore Ltd.
Under Graphcore’s implementation of BSP there is a communications phase, in which all the processors send and receive information as required, followed by a processing phase that produces the results to be communicated in the next cycle. Knowles described BSP as a simple abstraction that is guaranteed free of concurrency hazards, although he acknowledged that load-balancing is key to getting the most efficient performance out of the IPU.
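The alternating pattern Knowles described can be illustrated with a minimal sketch (not Graphcore code): a handful of simulated processors exchange messages, wait at a barrier so no one computes on partial data, then process what they received. All names here are hypothetical.

```python
import threading

N = 4  # number of simulated processors (hypothetical)
barrier = threading.Barrier(N)

values = [i + 1 for i in range(N)]   # local state per processor: 1, 2, 3, 4
inboxes = [[] for _ in range(N)]     # messages delivered during the exchange

def superstep(pid):
    # Communications phase: each processor sends its value to all the others.
    for other in range(N):
        if other != pid:
            inboxes[other].append(values[pid])
    # Barrier: processing only starts once every exchange has completed,
    # which is what rules out concurrency hazards in the BSP model.
    barrier.wait()
    # Processing phase: combine local state with everything received.
    values[pid] += sum(inboxes[pid])

threads = [threading.Thread(target=superstep, args=(p,)) for p in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(values)  # every processor ends up with the global sum: [10, 10, 10, 10]
```

In real BSP hardware the barrier is a global synchronization between supersteps; the thread barrier here simply stands in for it.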
One of the other keys is deterministic communication over a “stateless exchange” mechanism, Knowles said. One implication, he indicated, is that compilation would be applied both to the function of the program and to the inter-processor communications. Threads would be used to hide small local latencies, so relatively few would be needed per processing element.
“Build chips that are mostly memory. In fact, if you do that enough, you can get the power density down…and then you can start bolting these things together,” he told his audience.
In comparison to a GPU-and-DRAM platform, Knowles said: “What you’ll end up with is much less memory, in this example 600Mbytes instead of 16Gbytes, but you can access that 600Mbytes at spectacular bandwidth and zero latency. Can you use that to build a higher performance machine intelligence machine? I hope so because that’s what we’ve built.”
Colossus IPU pair. Source: ScaledML Conference and Graphcore Ltd.
Knowles showed Colossus as an IPU pair with 2,432 processors, each with 256Kbytes of memory, distributed across two die and with 90Tbytes/s of communications bandwidth. The two die are set down on a 300W PCIe card. “The inter-communication is: all-to-all, completely deterministic, compiled, non-blocking, stateless; it’s perfect interconnect,” Knowles said. “We haven’t actually declared what performance this thing will have but it will be at least 200Tflops and about 600Mbytes memory all together over those two chips. So not very much. We have to explore high machine performance with small memory footprint.”
The chips will not use external memory; the whole model stays on the chip pair or cluster during operation. The processor cores use mixed-precision floating point: 16-bit floating point for multiplications and 32-bit for accumulation.
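Why accumulate in 32-bit when the multiplies are 16-bit? A small sketch (illustrative only, not Graphcore code) shows the problem with a pure fp16 running sum: near 2048 the spacing between representable fp16 values is 2.0, so adding 1.0 is silently lost, while a wider accumulator keeps it. The `fp16` helper here uses Python’s stdlib `struct` half-precision format code `'e'`.

```python
import struct

def fp16(x: float) -> float:
    """Round a float to IEEE 754 half precision and back (struct 'e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

values = [2048.0] + [1.0] * 10

# fp16-only accumulation: the spacing near 2048 is 2.0, so each +1.0 rounds away.
acc16 = 0.0
for v in values:
    acc16 = fp16(acc16 + fp16(v))

# Mixed precision: fp16 inputs, wide accumulator (Python float, standing in
# for fp32) preserves the small contributions.
acc32 = 0.0
for v in values:
    acc32 += fp16(v)

print(acc16, acc32)  # 2048.0 vs 2058.0
```

The same effect, magnified across millions of accumulations in a dot product, is what makes fp16-multiply/fp32-accumulate the common compromise for training workloads.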