Thread synchronization techniques for better multicore system power/performance tradeoffs
It’s a common refrain heard among embedded software design teams everywhere, when the team manager declares, “We need better system power management from both the hardware and software, but we also need to optimize the design for increased functionality and performance.”
So, how can a software designer hope to accomplish such a feat when in order to maximize one side of the equation the other side has to be minimized? You just can’t have the best of both worlds.
Interestingly, within embedded multicore system design, unique situations present themselves in which power and performance can complement one another – rather than being in a state of perpetual competition. This article considers one such scenario where a symmetrical multiprocessing (SMP) RTOS, multicore hardware, and power management features combine to facilitate parallel embedded programming.
While power management features of a full-featured RTOS contribute significantly to power efficient design, this article takes an in-depth look at a thread synchronization mechanism based on hardware supplied primitives, which guarantees power savings while improving system performance.
Multicore to the rescue
The embedded world is quickly embracing multicore technology. Almost all major silicon vendors offer multicore chips such as ARM (Cortex A9), Freescale Semiconductor’s PowerPC (QorIQ), and MIPS (1004K MT).
Contrary to what you might think, increased performance is not the only factor behind this trend. Embedded systems cannot afford high performance at the cost of elevating the power budget or compromising legacy software. Embedded silicon providers, therefore, are focusing on a small number of cores (usually no more than eight). Already systems in mobile and networking with handheld devices and high speed switches/routers, respectively, fit this multicore paradigm quite well.
An embedded runtime software solution including the OS, middleware, and the end applications targeting multicore platforms has to take performance, power requirements, and existing software design into consideration. Although parallel computing is a fairly advanced field with many programming, architectural, and compiler technologies available – parallel embedded development poses new challenges because of the variety of hardware and tools involved that underline the need for flexibility.
Thread-level parallelism
With no widely accepted multicore programming standard available, most embedded application developers rely on the thread-level parallelism (TLP) supplied by the RTOS. In theory, this approach promises no change in the legacy code if the RTOS support for the multicore architecture is robust.
Unfortunately overhead is an unavoidable byproduct of TLP, existing in part because of the creation and termination of worker threads as well as the need of synchronization between individual contexts. Typical embedded systems have a limited number of threads and usually a software context is activated as a result of an external event such as an interrupt. This thread then has to complete its execution in stipulated time.
Based on these properties of embedded runtime applications, this article presents a scheme to reduce TLP overhead by using essential thread attributes of preemption and priority in a RTOS. The basic idea is to prioritize parent and child threads at different levels so synchronization time is not required. In situations where synchronization is unavoidable, the proposed scheme utilizes hardware-supplied, power-saving primitives.
Certainly similar problems have been tackled in terms of compiler dependent technologies which include speculative scheduling and OpenMP-based solutions. Another approach is to devise load balancing strategies and new real-time scheduling algorithms to get maximum performance/utilization from multiple resources.
But these schemes often require significant changes in either application logic or the operating system code, which is a significant drawback for a practical system in the sense that a lot of the legacy/certified/robust code is either jeopardized or rendered unusable. Instead of finding an optimal scheduling technique or counting on a fancy compiler technology, a portable scheme that will not affect application code at all and only involves changes in wrapper function of RTOS interface is proposed.
System model to be analyzed
The embedded system example for this discussion is comprised of a multicore platform with M identical (homogenous) cores. The SMP RTOS thread scheduler follows a baseline/traditional approach which is a fixed-priority, preemptive threading model.
This example was selected because this kind of scheduling is the most common in today’s commercially available RTOSes. Such an approach ensures that the most important thread (highest priority) always gets the first opportunity to run.
Any low priority preemptable thread can be kicked out only by a thread of a higher priority. Therefore, an indirect guarantee of real-time operation is available for critical threads. The same priority threads are executed in a round-robin fashion. A simple multicore extension of this model simply takes the highest priority ‘n’ (0 < n < M-1) threads and dispatches them to any of the ‘n’ available free cores.
Parent-child programming model
From the perspective of using as much legacy code as possible, the easiest path to multicore development is via TLP under a parent-child programming model. Under this paradigm application code remains agnostic from parallel programming constructs. The parent-child model does not impact the deadline requirements as long as it can keep TLP under limits.
For instance, a legacy application thread (parent) performing video frame decoding can speed up its execution using TLP by distributing frames among its child threads. These child threads will then execute in parallel on multiple available cores potentially increasing the performance of the system.
There is very little requirement of a code change. A parent needs to span and then wait for all children to finish. If the thread creation (fork) and wait time (join) are small as compared to the thread workload, the deadline requirement of original thread will easily be fulfilled. Fork and join services are usually wrappers around other standard RTOS interfaces explained next.
Fork-join services
A common way for an RTOS to support POSIX-style fork service is to provide a wrapper function around thread creation API with the addition of setting up a mailbox or event associated with this child thread. Information of an event (mail box) is added to thread data (possibly in thread control block). In case of a join call, the RTOS checks the status of event/mailbox to determine if the child has completed execution. This simplistic arrangement results in TLP synchronization overhead since it depends on components of RTOS other than software contexts.
Techniques to reduce RTOS overhead
The proposed scheme in Figure 1 below shows the nodes that represent a thread (either parent or child) while the branches indicate the fork-join operations. Each node is labeled by the priority level (except the last node labeled as E to indicate the end state).
Figure 1: Illustration of proposed fork-join with parent-child model.
Children have one level higher with respect to the parent. Each child responds to a join call from the parent to indicate its completion. The proposed RTOS fork-join services involve the following steps in each system call:
Thread Fork
1. It is endured that the child thread to be forked has a priority one level higher than the parent thread.
2. Preemption of the parent thread is disabled until it finishes forking all children threads. (As soon as the parent enables preemption using an RTOS system call, all the cores will be occupied by the child threads, thus having maximum utilization).
3. The RTOS requests the hardware to set an architecture specific notification event visible to all the cores in the system.
Thread Join
1. When the parent is able to call the join API, this means that one of the child threads has ended its job.
2. The parent can then check the status of the thread by a simple read in its thread control block, make sure it shows an end status, and then return successfully.
On the other hand, if the parent checks the thread status and a state other than “end” is indicated, it means there are more parallel resources than the number of children left.
Otherwise, the parent would not have been able to run again and call join – so it can take a scheduler trip giving other tasks a chance to run. In case parent and children were the only load in the system, put the core on which the join call was made into a low power mode safely – in the knowledge that for now this core has nothing more to do.
Note that while dynamic voltage and frequency scaling (FVFS) is difficult to achieve on embedded multicore processors, almost all vendors support low power CPU modes, increasing the portability of the proposed solution.
A four-core use-case
Let’s consider a parallel multithreaded application use-case consisting of eight child threads on a four-core system. Figure 2 below shows a sequence diagram of RTOS fork-join services supporting this use-case where each column represents a single core. On core zero (C0) parent thread forks (Thread_Fork) eight children threads one by one without leaving control of the processor.
While parent thread is forking children, there are three cores free in the system and they get to run the first three child threads without waiting for the parent to finish. The fourth child has to wait until the parent is done with creating all of the threads.
This is required to ensure the parent does not hop around the cores while creating child threads. As soon as the parent is done spanning the children, it is preempted by the fourth child and from then on, maximum utilization of all the cores is depicted by the rectangular shapes. Idle time of a core is indicated through an absence of any rectangle (empty space).
Once a parent gets control back (on whichever core is free), some of the children will have already completed their job and Thread_Join requires only a status check on a flag in the thread control block.
This is the case for child zero through six where no synchronization is required because of the priority relationship between parent and child threads. The last thread to join the child has not yet finished and since the parent is the lowest priority thread, this means the system is now lightly loaded.
Therefore, the core on which the parent is running is taken to a low power mode. When thread seven finishes it wakes up this core using a hardware assisted event, serving a synchronization purpose.
We can compare this use-case with its counterpart, which relies on RTOS constructs like message passing or events for thread synchronization. First thing to note is that since typically a child inherits its parent’s priority, the parent will be able to create all the children, but then child four has to wait until the parent calls Thread_Join against the first child thread.
If the parent suspends on this call because completion message/event has not arrived from the first child, child four gets a chance to execute on the core vacated by the parent. Of course, this depends on the duration child one takes to complete. As a result, time that should be spent in computation is being wasted on internal OS logic, thus increasing synchronization overhead. Also, every child thread upon completion generates a software event message that the RTOS scheduler has to process, resulting in additional overhead.
When this happens, the parent, in accordance with the program logic, goes through the join routine and will always get suspended if that thread has not finished again resulting in overhead and low utilization of the core. If power management capabilities are not built into the RTOS this scheme will also consume more power.
To verify the proposed solution discussed in this article, a CPU-bound dot product application was developed and tested on an ARM Cortex A9 MPCore with four identical cores each operating on 400MHz. The RTOS used was the Nucleus RTOS from Mentor Graphics compiled with Mentor’s Sourcery GNU tools for ARM EABI.
The results for proposed and message passing (MP) based fork-join schemes are shown in Figure 3 below for matrix sizes of 2048, 4096, and 8192.
Since overhead is a function of number of threads, each matrix size is evaluated against 2, 4, 8, and 16 threads. Time taken in seconds by dot product application is plotted on the vertical axis.
Note that for all cases the technique suggested in this work results in lower execution time with an average of around three percent of savings in overhead in addition to promise of power saving. Since overhead is a fraction of absolute execution time in a highly parallelizable application such as dot product, the savings here are significant.
Faheem Sheikh joined the Embedded Systems Division of Mentor Graphics in 2007, where he is working as a senior technical lead. His current focus is software research and development for symmetric multiprocessor architectures. Faheem has a Masters (2005) and PhD (2009) in computer engineering from Lahore University of Management Sciences, Pakistan. He has more than ten technical publications in leading international conferences and journals.