In applications where detecting objects quickly is time critical, large batches increase latency and therefore the time to detect an object. Batch=1 is the preferred mode for applications where object detection and recognition is time critical, and large images are preferred for applications where safety is critical. Also, as we’ll see below, batching does not necessarily improve performance as image sizes increase.
Why do larger batches increase throughput for ResNet-50 and other models with small images?
ResNet-50 has the following:
- Input images of 224 by 224 pixels by 3 bytes (RGB) = about 0.15Mbytes
- Weights of 22.7Mbytes
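The input-size arithmetic above is easy to check. A minimal sketch, assuming 1 byte per color channel (as in INT8 inference):

```python
# Back-of-envelope memory sizes for ResNet-50 inference at 1 byte per value.
input_bytes = 224 * 224 * 3   # RGB image, 1 byte per channel
weights_mb = 22.7             # ResNet-50 weight storage, from the text

print(f"input image: {input_bytes / 1e6:.2f} MB")  # ~0.15 MB
print(f"weights:     {weights_mb} MB")
```

So a single frame is small compared with the weights, which is why batching helps for models like this: the weights can be fetched once and reused across many frames.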
All inference accelerators have some number of MB (megabytes) of on-chip SRAM. If total storage requirements exceed the on-chip SRAM, the remainder must be stored in off-chip DRAM.
To better explain the tradeoffs between different amounts of on-chip SRAM, consider three example inference chips with the following capacities:
- Chip A: 8Mbytes
- Chip B: 64Mbytes
- Chip C: 256Mbytes
For example, in TSMC’s 16nm process, 1Mbyte of SRAM occupies about 1.1 square millimeters of die area, which would make Chip C very big and expensive.
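Using the ~1.1 square millimeters per Mbyte figure from the text, the SRAM die area for each example chip works out roughly as follows (a rough estimate only; real die area also includes logic, I/O, and routing):

```python
# Approximate SRAM die area in TSMC 16nm: ~1.1 mm^2 per MB (figure from the text).
MM2_PER_MB = 1.1

for name, sram_mb in [("Chip A", 8), ("Chip B", 64), ("Chip C", 256)]:
    area = sram_mb * MM2_PER_MB
    print(f"{name}: {sram_mb:3d} MB SRAM ~= {area:.0f} mm^2 of die area")
# Chip C's SRAM alone is roughly 282 mm^2, which is why it would be big and expensive.
```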
The 3 things that need to be stored by inference accelerators are:
- Weights for the model (22.7Mbytes for ResNet-50)
- Intermediate activations (the outputs of each layer: ResNet-50 has about 100)
- Code to control the accelerator (we’ll ignore this in the rest of this discussion because almost nobody has disclosed this information)
Performance will be highest if everything fits in SRAM. If not, the choice of what to keep in SRAM and what to keep in DRAM to maximize throughput depends on the nature of the model (the relative sizes of weights and intermediate activations) and on the inference chip architecture (its strengths and weaknesses).
In models such as ResNet-50 and most CNNs, the activations output from Layer N are the input to Layer N+1 and are not needed again.
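This producer-consumer pattern is what makes a small reusable activation buffer possible. A toy sketch of the idea (the lambda "layers" here are hypothetical stand-ins for real convolution layers, not ResNet-50 code):

```python
# In a feed-forward network, the output of layer N is the only input to
# layer N+1, so a single "live" activation variable can be reused throughout.
def run_network(layers, x):
    act = x
    for layer in layers:
        act = layer(act)  # the previous activation is no longer needed
    return act

# Toy "layers" that just transform a number, standing in for conv layers.
layers = [lambda v: v * 2, lambda v: v + 3, lambda v: v * 10]
print(run_network(layers, 1))  # → 50
```

Only the current activation (and the one being produced) must be live at any moment, which bounds the on-chip buffer by the two largest adjacent layers rather than the sum of all layers.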
In ResNet-50, the largest activation is 0.8MB, and the layer immediately before or after it will typically have about half that activation size. Thus, with 1.2MB of temporary activation storage, no activations need to be written to DRAM at batch=1. If all activations had to be written to DRAM instead, almost 10MB of activations would be written out and 10MB read back in per frame. In other words, 1.2MB of on-chip storage for temporary activations avoids 20MB of DRAM bandwidth per frame, almost as much as the 22.7MB of weights themselves.
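The buffer-size and bandwidth arithmetic above can be sketched directly, using the figures given in the text:

```python
# Activation-buffering arithmetic for ResNet-50 at batch=1 (figures from the text).
largest_act_mb = 0.8                 # biggest single layer output
neighbor_act_mb = largest_act_mb / 2 # adjacent layer is roughly half that size
buffer_mb = largest_act_mb + neighbor_act_mb  # 1.2 MB keeps activations on-chip

total_act_mb = 10.0                  # approx. sum of all intermediate activations
dram_traffic_mb = 2 * total_act_mb   # write out + read back if spilled to DRAM

print(f"on-chip activation buffer needed: {buffer_mb:.1f} MB")
print(f"DRAM traffic avoided per frame:   {dram_traffic_mb:.0f} MB")
```

The takeaway: a modest 1.2MB activation buffer saves DRAM traffic comparable to re-fetching the entire 22.7MB weight set every frame.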