Keeping weights on chip only helps if there is room to keep all 22.7MB on chip. Chip A and Chip B do not have enough room to hold everything; only Chip C can hold both the weights and the intermediate activations. However, reading in the 22.7MB of weights takes a long time.
The idea of batching is quite simple. If performance is limited by reading in weights, then processing more images each time the weights are loaded improves throughput. With a batch size of 10, the weights are read in once for every 10 images, so the cost of loading them is spread across 10 times as much work. At 1.2MB of on-chip SRAM needed per image, just 12MB of SRAM for temporary activations allows batch=10 without using DRAM bandwidth for activations.
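The arithmetic behind this can be sketched in a few lines. The 22.7MB and 1.2MB figures come from the text; the function names are purely illustrative, and the sketch assumes the weights are read exactly once per batch and that each image needs the same amount of activation storage.

```python
WEIGHTS_MB = 22.7               # weight size quoted in the text
ACTIVATION_MB_PER_IMAGE = 1.2   # on-chip SRAM needed per image, from the text


def weight_mb_per_image(batch_size: int) -> float:
    """Weight traffic charged to each image when weights load once per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return WEIGHTS_MB / batch_size


def activation_sram_mb(batch_size: int) -> float:
    """SRAM needed to keep every image's temporary activations on chip."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return ACTIVATION_MB_PER_IMAGE * batch_size


for batch in (1, 2, 5, 10):
    print(
        f"batch={batch:2d}  "
        f"weights read per image: {weight_mb_per_image(batch):5.2f} MB  "
        f"activation SRAM: {activation_sram_mb(batch):5.1f} MB"
    )
```

Going from batch=1 to batch=10 cuts the weight traffic per image by a factor of 10, while the activation storage grows by the same factor.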
Chip B has enough on-chip SRAM to hold all of the temporary activations, so batch=10 will definitely improve performance. In Chip A, the smaller activations can be kept on chip and batch=10 will help, but not as much, because the larger activations will still require some DRAM bandwidth.
For many architectures and SRAM capacities, loading weights is the performance limiter. This is why larger batch sizes make sense for higher throughput (although they increase latency). This is true for ResNet-50 and many “simple” models because all of them use small image sizes.
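A rough model makes the throughput-versus-latency trade-off concrete. This is only a sketch under stated assumptions: the DRAM bandwidth and per-image compute time below are hypothetical placeholders, not figures for any of the chips discussed, and the model assumes weight loading is not overlapped with compute.

```python
WEIGHTS_MB = 22.7             # weight size quoted in the text
DRAM_GB_PER_S = 10.0          # ASSUMPTION: hypothetical DRAM bandwidth
COMPUTE_MS_PER_IMAGE = 1.0    # ASSUMPTION: hypothetical compute time per image


def batch_latency_ms(batch_size: int) -> float:
    """Time to finish a whole batch: one weight load plus compute for each image."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # MB divided by GB/s gives milliseconds (treating 1 GB as 1000 MB).
    weight_load_ms = WEIGHTS_MB / DRAM_GB_PER_S
    return weight_load_ms + batch_size * COMPUTE_MS_PER_IMAGE


def throughput_images_per_s(batch_size: int) -> float:
    """Images completed per second at a given batch size."""
    return batch_size / batch_latency_ms(batch_size) * 1000.0


for batch in (1, 2, 5, 10):
    print(
        f"batch={batch:2d}  "
        f"latency: {batch_latency_ms(batch):6.2f} ms  "
        f"throughput: {throughput_images_per_s(batch):7.1f} images/s"
    )
```

Under these assumptions, throughput rises with batch size because the fixed weight-loading time is shared by more images, while latency also rises because no result is available until the whole batch completes.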