Google’s second TPU processor comes out
Google’s first TPU was designed to run neural networks quickly and efficiently but not necessarily to train them, which can be a much larger-scale problem. The second TPU, which Google describes as a Cloud TPU, can both train and run machine learning models.
Google doesn’t appear to have said very much about the devices except that each Cloud TPU can provide 180 TFLOPS of floating-point performance and that Cloud TPUs have been designed to work well together in large arrays.
It is not clear whether a TPU is a single ASIC or four identical ASICs on a PCB as shown in the picture at the top of this article. What Google has said is that 64 Cloud TPUs can be put together to form something called a TPU pod capable of up to 11.5 petaflops.
Skyscraper heatsinks, but to cool how much power consumption? Source: Google.
As a benchmark of performance, Google said that a large-scale translation model can be trained in an afternoon on eight TPU-2s, compared with a full day on 32 of the best commercially available GPUs. That represents about a factor-of-ten improvement in performance.
It is also interesting to note that the TPU-1 was benchmarked at a maximum performance of 92 TOPS, which were 8bit integer operations in a systolic array, while the Cloud TPU is benchmarked expressly in terms of floating-point operations.
But what sort of floating-point operations? Full 32bit precision? 16bit half-precision? Is quantization down to 8bit integers still an essential part of the architecture?
As would be expected, these second-generation TPUs can be programmed with TensorFlow, Google’s open-source machine learning framework, which is available on GitHub.
Google also announced that it would make 1,000 of its Cloud TPUs available to machine learning researchers for free via something called the TensorFlow Research Cloud.
While we don’t yet know much about the Cloud TPU, Google has provided more detail about its first TPU, and this may indicate areas where the Cloud TPU is likely to be as good or better.
Next: What about TPU-1?
So Google has revealed that the TPU-1 is implemented in a 28nm process, runs at 700MHz and consumes 40W when running. It is packaged as a single IC on an accelerator card that fits into a SATA hard disk slot and is connected to its host via a PCIe Gen3 x16 bus that provides 12.5Gbytes/s of effective bandwidth.
TPU-1 printed circuit board contains a single TPU-1. Source: Google.
According to a Google report from May 12, 2017, typical neural networks deployed today as software need millions of weights – of the order of 5 million to 100 million, depending on the problem. And every prediction requires multiplying the input data by a weight matrix and applying an activation function, so that is an enormous number of multiplications.
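That prediction step can be sketched in a few lines of Python. This is only an illustrative toy (a single 3-input, 4-output layer with made-up weights and a ReLU activation), not Google's implementation; a real network would repeat this across layers with millions of weights.

```python
def relu(x):
    """A common activation function: max(0, x)."""
    return x if x > 0.0 else 0.0

def predict(inputs, weights):
    """One layer: output[j] = relu(sum over i of inputs[i] * weights[i][j])."""
    n_out = len(weights[0])
    out = []
    for j in range(n_out):
        acc = 0.0
        for i, x in enumerate(inputs):
            acc += x * weights[i][j]   # one multiply-add per weight
        out.append(relu(acc))
    return out

# Toy data: 3 inputs, a 3x4 weight matrix, 4 outputs.
inputs = [1.0, -2.0, 0.5]
weights = [[0.2, -0.1,  0.4, 0.0],
           [0.3,  0.5, -0.2, 0.1],
           [-0.4, 0.2,  0.1, 0.3]]
print(predict(inputs, weights))
```

Even this tiny layer performs 12 multiply-adds; scale the matrix to millions of weights and the motivation for dedicated multiply-add hardware is clear.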
The TPU-1 supports quantization of data from floating point to an 8bit integer representation spread between a maximum and a minimum value, which reduces the memory capacity and bandwidth burden and improves performance. Overall, the TPU-1 was found to deliver between 15x and 30x higher performance and between 30x and 80x better power efficiency compared with running the same neural networks on CPUs and GPUs.
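The min/max style of quantization described above can be sketched as a simple linear mapping of floats onto the 256 codes an 8bit integer can hold. The function names and the [-1, 1] range are illustrative assumptions, not details Google has published.

```python
def quantize(values, lo, hi):
    """Map floats in [lo, hi] linearly onto 8bit integer codes 0..255."""
    return [round((v - lo) * 255.0 / (hi - lo)) for v in values]

def dequantize(codes, lo, hi):
    """Recover approximate floats from the 8bit codes."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]

# Toy weights assumed to lie in [-1.0, 1.0].
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes = quantize(weights, -1.0, 1.0)
print(codes)                           # one byte per weight vs four for float32
print(dequantize(codes, -1.0, 1.0))    # close to, but not exactly, the originals
```

Each weight now occupies one byte instead of four, which is where the memory-capacity and bandwidth savings come from; the price is a small rounding error visible in the dequantized values.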
Next: CISC not RISC
Rather than opt for the Reduced Instruction Set Computer (RISC) model, Google chose a Complex Instruction Set Computer (CISC) model for the TPU-1, so that a single instruction can trigger many multiply-add operations. That is, of course, no guarantee that the Cloud TPU is the same.
TPU-1 block diagram. Source: Google.
The TPU-1 includes a matrix multiplier unit (MXU) that houses 256 by 256 8bit multiply-add units (65,536 in total), together with 24Mbytes of SRAM and multiple hardwired activation functions in an activation unit. The MXU is a systolic array, which means that data flows step by step through the grid of processing elements. It makes an engineering trade-off: reduced control and operational flexibility compared with a conventional CPU in return for much higher operational density. So whereas a CPU can execute only a few operations per clock cycle, the array can perform hundreds of thousands.
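The systolic idea can be illustrated with a toy weight-stationary array, scaled down to a matrix-vector product. Each cell permanently holds one weight; activations are fed in with a one-cycle skew and partial sums flow along each row, so cell (i, j) fires on cycle i + j and no weight is ever re-fetched from memory. This schedule is a textbook sketch of the concept, not Google's actual design.

```python
def systolic_matvec(W, x):
    """Compute y = W @ x on a toy n x n weight-stationary systolic array.

    Cell (i, j) holds W[i][j]. Activation x[j] enters column j with a
    one-cycle skew; partial sums move rightwards along each row, so
    cell (i, j) performs its multiply-add on cycle i + j.
    """
    n = len(W)
    psum = [[0.0] * n for _ in range(n)]
    for cycle in range(2 * n - 1):          # one anti-diagonal wavefront per cycle
        for i in range(n):
            j = cycle - i
            if 0 <= j < n:
                incoming = psum[i][j - 1] if j > 0 else 0.0
                psum[i][j] = incoming + W[i][j] * x[j]   # one multiply-add
    return [psum[i][n - 1] for i in range(n)]            # results exit row ends

W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [5.0, 6.0]
print(systolic_matvec(W, x))   # [17.0, 39.0], same as an ordinary W @ x
```

Note how every cell along a wavefront works in the same cycle: on real hardware that is 65,536 multiply-adds per clock, which is exactly the operational density the trade-off buys.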
TPU-1 die floor plan. The data buffers (blue) occupy 37% of the die; the compute units (yellow) 30%; the I/O (green) 10%; and control (red) just 2%. Control takes a much greater percentage of die area in a CPU or GPU. Source: Google.
The TPU-1 is efficient because it is dedicated to performing neural network calculations, but machine learning is an exceptionally broad landscape, so it will be interesting to find out what trade-offs Google chose to make in the TPU-2/Cloud TPU.
Related links and articles: