Quad cores in mobile platforms: is the time right?
Mobile platforms are following the same evolution we have already seen on PCs: the frequencies of single-core processors increased until they reached the limits of the power consumption budget. At that point, multiple cores were introduced in order to keep increasing computing capability at sustainable power consumption levels.
ST-Ericsson was one of the first companies to introduce dual-core processors back in 2009 and its NovaThor roadmap is based on ARM dual core processors. The company’s analysis shows that the software available today and in the near future is not yet ready to exploit quad cores efficiently. Adding cores is not free of charge in terms of maximum achievable frequency and memory access penalties. This approach has to be well balanced with respect to the software’s capability to exploit increased hardware parallelism.
In this article we’ll consider the benefits and drawbacks of high frequency dual core versus lower frequency quad core architectures. The trade-off is evolving quickly though, both in terms of silicon technology and software, and ST-Ericsson will adapt its plans in the light of future developments. Let’s focus on performance only, leaving power consumption effects as implicit considerations; indeed, quad cores could save power by delivering increased performance at a lower frequency and voltage, but with no performance gain to start with, there is no power to be saved.
Multi cores are not free of charge
The multi core architectures used in most current mobile platforms are Symmetric Multi Processing (SMP) systems, characterized by the fact that all cores are identical and face the same cost for accessing any system resource, including memory. In addition they share a single memory space, which is a fundamental feature required to run SMP-enabled operating systems. The latter feature requires specific circuits to ensure memory coherency among multiple physical memories. In the current generation each core has its own L1 cache that must be kept coherent with the other L1 caches and with a single shared L2 cache. The coherency hardware determines the scalability properties of multi core architectures to a great extent, since it is in the critical path of cache operations. Adding cores inevitably adds cycles for managing the coherency protocol, and these extra delays reduce performance. This is why SMP architectures do not scale indefinitely, not even on super-computers.
Other limits to scalability come from the increased L1/L2 traffic and congestion in accessing the shared L2 cache. Each additional core, and the larger L2 cache needed to serve it, also requires more silicon area. This adds further layout constraints and delays in the memory hierarchy.
ST-Ericsson experiments have shown that for representative loads, this leads to an overall 25-30% performance impact per additional core when moving from dual to quad cores. For simplicity, we’ll represent this performance impact as an equivalent frequency reduction, although in reality it is a combination of core frequency loss and additional cycles lost in various parts of the system.
Software needs to be designed to run in parallel in order to exploit the available cores. There are two kinds of software parallelism: parallel applications and multi-tasking. Parallel applications are governed by the well-known Amdahl’s law of equation 1:
$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}} \qquad (1)$$
which determines the highest achievable speedup S of an application on N processors, where P is the proportion of the application that can be run in parallel, i.e. that scales linearly with the number of processors ((1-P) is the serial proportion). For a given application, it is interesting to compare the achievable speedup of a quad core with that of a dual core, also taking into account the hardware overhead factors described previously. In particular, for a given piece of software (i.e. for a given P), we want to understand when it is beneficial to move to a quad core structure, taking into account the hardware impact on the respective frequencies F, i.e., when
$$F_{quad} \cdot S_4 > F_{dual} \cdot S_2$$
where S2 and S4 are the speedups given by equation 1 for N = 2 and N = 4.
Figure 1 shows the ratio S4/S2, to be compared against Fdual/Fquad to solve the inequality above. The blue horizontal line represents Fdual/Fquad = 1.37, which corresponds to the 25-30% per-core performance overhead described previously. The line intersects the S4/S2 curve at P = 70%, meaning that an application must expose a parallel proportion of at least 70% in order to benefit from a quad core versus running on a 25-30% faster dual core. Now P > 70% is a very demanding requirement! It means that at least 70% of the application’s work is perfectly parallelized, which is very rare.
Figure 1: The ratio S4/S2 compared against Fdual/Fquad when Fquad * S4 > Fdual * S2.
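The break-even point can be checked numerically. The following is a minimal sketch, not ST-Ericsson's analysis tool: the function names are hypothetical, and the only inputs taken from the article are Amdahl's law and the ratio Fdual/Fquad = 1.37.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: ideal speedup on n cores for a parallel fraction p (0..1)."""
    return 1.0 / ((1.0 - p) + p / n)


def quad_beats_dual(p, freq_ratio):
    """True when Fquad * S4 > Fdual * S2, where freq_ratio = Fdual / Fquad."""
    return amdahl_speedup(p, 4) / amdahl_speedup(p, 2) > freq_ratio


def break_even_p(freq_ratio, tol=1e-9):
    """Find the p where S4/S2 equals freq_ratio by bisection.

    S4/S2 grows monotonically from 1 (p = 0) to 2 (p = 1), so this is only
    meaningful for 1 <= freq_ratio <= 2 (an assumption of this sketch).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if amdahl_speedup(mid, 4) / amdahl_speedup(mid, 2) < freq_ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Ratio taken from the article: Fdual / Fquad = 1.37
print(f"break-even P = {break_even_p(1.37):.3f}")
print("P = 0.60 favours quad core:", quad_beats_dual(0.60, 1.37))
print("P = 0.90 favours quad core:", quad_beats_dual(0.90, 1.37))
```

With the article's ratio of 1.37 the break-even value comes out very close to 0.70, in line with the intersection read off Figure 1.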
One could argue that the same reasoning applies if we compare dual core versus single core solutions. This is true, except that the actual values lead to the opposite conclusion: the red line in figure 2 is the same as in figure 1, while the green line represents the same analysis but for dual versus single core. The shapes are the same, but the green line is always considerably higher than the red one, resulting in much lower values of P for which a slower dual core is better than a faster single core. What’s more, the performance penalties of moving from a single core to a dual core architecture are lower than when moving from dual to quad architectures, further lowering the threshold on P to easily sustainable values (i.e., ~30-35%).
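The relationship between the two curves can be reproduced from Amdahl's law alone. This is an illustrative sketch with a hypothetical helper name; the frequency ratios it prints for the 30-35% threshold are derived here from that threshold and are not figures stated in the article.

```python
def speedup_ratio(p, n_more, n_fewer):
    """Ratio S(n_more) / S(n_fewer) under Amdahl's law for parallel fraction p."""
    return ((1.0 - p) + p / n_fewer) / ((1.0 - p) + p / n_more)


# Compare the dual-vs-single curve (green) with the quad-vs-dual curve (red).
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    green = speedup_ratio(p, 2, 1)
    red = speedup_ratio(p, 4, 2)
    print(f"P={p:.1f}  S2/S1={green:.3f}  S4/S2={red:.3f}")

# Frequency ratio Fsingle/Fdual at which a given P is the break-even point.
for p in (0.30, 0.35):
    print(f"P={p:.2f} breaks even at Fsingle/Fdual = {speedup_ratio(p, 2, 1):.3f}")
```

For every P strictly between 0 and 1 the dual-versus-single ratio is above the quad-versus-dual ratio. Under this model, a threshold of 30-35% corresponds to a single core that is roughly 1.18 to 1.21 times faster than the dual core, noticeably less than the 1.37 used for the dual-versus-quad comparison.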
Back to our comparison of dual to quad cores: during our experiments we rarely found applications whose parallel proportion was higher than 70%. Multimedia in general and video in particular are the exception, achieving very high values of P, even up to 90-95%, after considerable parallelization and optimization efforts. But here we get into one of the peculiarities of our context: the relevant multimedia functions in mobile platforms are almost entirely hardware accelerated, either because there is no other way to achieve the required performance or because of the very limited power budget. Non-HW-accelerated multimedia applications could indeed take advantage of quad cores, but their relevance is questionable since they quickly drain the batteries of any mobile device.
Web browsers are among the main performance drivers in mobile computing. Web browsers today achieve speedups in the order of 1.4-1.5 on dual cores, corresponding to P in the range of 55-65%, so well below our threshold of 70%. And the reality is even worse than the theory, because there is no parallel portion P as defined in Amdahl’s law in today’s browsers. Indeed, most of the speedup comes from the induced support activities, such as user interface, multimedia, networking and others, some of which are executed on separate threads, thereby exposing enough system-level parallelism to take good advantage of dual cores (1.4-1.5 speedup).
The different nature of the parallelism involved doesn’t allow us to extrapolate a theoretical speedup of 1.8-2.0 on a quad core, as a direct application of Amdahl’s law would suggest. In fact, we observe speedups in the order of 1.6-1.7 on quad cores, equivalent to a more realistic P-equivalent of 50-55%. We expect significant improvements in browser parallelization, given the strong benefits now achievable on mobile multi cores; however, given the complexity of the software involved, it is certainly going to take a while.
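The P-equivalent quoted above can be obtained by inverting Amdahl's law: solving S = 1 / ((1 - P) + P/N) for P gives P = (1 - 1/S) / (1 - 1/N). The sketch below assumes exactly that inversion; the function name is hypothetical and the inputs are the quad-core speedups reported in the text.

```python
def equivalent_parallel_fraction(speedup, n):
    """Invert Amdahl's law: the parallel fraction P that would yield
    the given speedup on n cores."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if not 1.0 <= speedup <= n:
        raise ValueError("speedup must lie between 1 and n")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Observed quad-core browser speedups quoted in the article.
for s in (1.6, 1.7):
    p_eq = equivalent_parallel_fraction(s, 4)
    print(f"speedup {s} on 4 cores -> P-equivalent = {p_eq:.2f}")
```

A speedup of 1.6 on four cores maps to a P-equivalent of 0.50, and 1.7 maps to about 0.55, matching the 50-55% range given above. Since browsers do not contain a parallel portion in Amdahl's sense, this figure is only an equivalent for comparison, not a measured property of the code.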
The other compute-intensive category to consider is video games. Similar to browsers, most of today’s commercial game engines are not parallel in the sense of Amdahl’s law. We observe a good speedup on dual cores for some of them, typically of the same system-level nature as for web browsers, but certainly below the threshold to motivate the adoption of quad cores.
Besides, so far the bottleneck on mobile video games has been graphics, not the CPU, so for now the motivation has not been strong enough to justify the very significant efforts required for parallelizing such complex systems. We expect this situation to evolve but again it is going to take a while.
How much multi-tasking?
Multi-tasking is the other source of software parallelism that could certainly benefit from multi cores. By multi-tasking, we mean multiple concurrent coarse-grain activities with little or no dependencies between each other. Examples are composite use cases, such as listening to music while web browsing, or multi-tab browsing, where modern implementations spawn separate processes for the different tabs.
There are no mathematical formulas like Amdahl’s law to help here, because the achievable gain depends entirely on the number and the nature of the concurrent tasks. We need to think hard to find plausible use cases that would saturate a next-generation 2 GHz dual core. Listening to music while doing anything else is certainly a common use case, but music playback is not a heavy task, even when streamed from the web.
There are multiple tasks generated by the many widgets in use today, but fortunately, those are also lightweight and don’t even come close to saturating a single core. A heavy use case could be video transcoding for streaming to an external monitor while web browsing or video gaming. However, multimedia functions are mostly hardware-accelerated on mobile devices; otherwise they tend to drain batteries too quickly, regardless of dual vs. quad core considerations. In the end, it is hard to find sufficiently motivating concurrent-use cases.
Time is not ripe, yet
Part of the mobile industry will soon move to quad core platforms following the marketing trend in the PC world. “Four is better than two” is an easy marketing slogan! However, the reality might not be so simple for end users.
As we have seen in this article, there might even be a negative impact, certainly on cost but also on the performance perceived by end users. There are certainly exceptions in some high-end niche markets or specific use cases, but the reality is that today’s software is not yet parallel enough to justify the move. The right time for quad cores will come as software improves. For now we at ST-Ericsson prefer to focus our optimization efforts on devices with faster, lower-power dual cores, which we believe will bring more tangible benefits to the vast majority of consumers.
Marco Cornero is Fellow of Advanced Computing Application Engine & Platform BU at ST-Ericsson – www.stericsson.com.