Efficient debugging of multi-core MCUs

By eeNews Europe

Multi-core systems have long been established in consumer electronics and infotainment. The operating system distributes the individual, independent applications to the CPUs, which are typically all alike (homogeneous). Depending on the current computing load of the cores, the application software, which in most cases is not aware of being executed on a multi-core system, is distributed dynamically at runtime. This should ensure high system performance paired with a CPU load that is as balanced as possible.

Multi-core systems in motor controls, on the other hand, are normally far from homogeneous. In most cases, only some of the CPUs are designed for universal use and the other cores are intended for special tasks. The new AURIX multi-core controllers from Infineon (see Figure 1), for example, comprise up to three TriCore CPUs plus a powerful, programmable, special-purpose timer module.

Figure 1: AURIX multi-core architecture for motor control units with four processor cores

The latter represents the extremely heterogeneous part of the processor. Even the TriCore CPUs are not all identical, so the TriCore part is not homogeneous either. They differ not only in computing power and energy consumption but also in their safety features. Two of the three CPUs are additionally equipped with so-called lockstep cores. These execute the same operations on the same data in the background as their master cores do. By comparing the results of the master and the lockstep core, incorrect behavior in real-life operation, caused by a hardware defect for example, can be detected and immediately corrected.

The challenge for software developers is to distribute the code, grown and matured over years, to the CPUs of the multi-core system. At the same time the correct functionality must still be guaranteed. In addition, when changing to a new computing platform it must always be examined whether boundary values such as latency, response time or energy consumption still comply with the specification. Usually with a new platform new features will also be added, utilizing the more powerful computing resources of the multi-core system. However, the higher communication overhead between the CPUs may have a negative effect on the performance.

Changing over to multi-core is therefore not an easy task. The heterogeneity of multi-core controllers means that the decision about which software part runs on which processor core has to be made during the development phase. In the end this decision is left to the software engineer, who must very carefully consider when, for example, full computing power is needed and when lower energy consumption is required. The task-to-core assignment is therefore fixed at compile time, not at runtime as it is for homogeneous multi-core systems.

Challenges when changing over to multi-core MCUs

Experience has shown that most developers will be confronted with one or another of the following issues when changing over from single-core to multi-core MCUs:

Very often in the past, the communication between individual tasks has been realized via global variables located in shared memory. While the tasks, which are alternately executed on the same CPU, always operate with a valid value of such a variable, this can be completely different on a multi-core system with distributed tasks. Core-local caches and write buffers can, under certain circumstances, delay the writing into the shared memory so variables seen from different cores may sometimes become inconsistent. At worst, tasks will proceed with the wrong value.
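The effect can be illustrated with a toy model. The following is a minimal sketch, assuming a much simplified core with a local write buffer in front of shared memory; the class names, the variable name "speed" and the explicit flush step are hypothetical and do not describe the actual AURIX memory system.

```python
class SharedMemory:
    """Toy shared memory: variable name -> value."""

    def __init__(self):
        self.cells = {}


class Core:
    """Toy core with a local write buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def write(self, name, value):
        # The store lands in the core-local buffer first.
        self.write_buffer[name] = value

    def read(self, name):
        # A core sees its own pending stores, otherwise shared memory.
        if name in self.write_buffer:
            return self.write_buffer[name]
        return self.memory.cells.get(name)

    def flush(self):
        # Stand-in for a barrier or buffer drain: publish pending stores.
        self.memory.cells.update(self.write_buffer)
        self.write_buffer.clear()


memory = SharedMemory()
memory.cells["speed"] = 0
producer, consumer = Core(memory), Core(memory)

producer.write("speed", 120)
stale = consumer.read("speed")   # the other core still sees the old value
producer.flush()
fresh = consumer.read("speed")   # visible once the buffer has been drained
print(stale, fresh)
```

Between the write and the flush, the two cores disagree about the value of the same variable, which is exactly the window in which a task can proceed with the wrong value.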


Figure 2: Deadlock situations are a typical issue when single-core applications are ported to multi-core systems


One of the most common issues found in multi-core applications is deadlock. As illustrated in Figure 2, tasks 1 and 2 both require two resources, A and B, and try to reserve them. With an unfavorable sequence of reservations, task 1 may hold resource A while task 2 holds resource B. If both tasks then try to reserve the second required resource, each blocks waiting for the other. This problem was not visible on a single-core system because both tasks were executed sequentially and the reservation sequence of one task was completely finished before the other started.
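The interleaving from Figure 2 can be replayed step by step. The sketch below is a host-side illustration, not target code: it uses Python's threading.Lock as a stand-in for the two resources and non-blocking acquires so that the replay itself cannot hang. The function name and the idea of passing the reservation order as a tuple are assumptions made for this example.

```python
import threading


def replay(order_task1, order_task2):
    """Replay two tasks that each reserve two resources.

    Both tasks first reserve their first resource, then try the second.
    Returns (task1_got_both, task2_got_both).
    """
    locks = {"A": threading.Lock(), "B": threading.Lock()}

    # Step 1: each task reserves its first resource.
    got1_first = locks[order_task1[0]].acquire(blocking=False)
    got2_first = locks[order_task2[0]].acquire(blocking=False)

    # Step 2: each task that holds its first resource tries the second one.
    got1_both = got1_first and locks[order_task1[1]].acquire(blocking=False)
    got2_both = got2_first and locks[order_task2[1]].acquire(blocking=False)
    return got1_both, got2_both


# Opposite reservation order: each task holds what the other one needs.
deadlocked = replay(("A", "B"), ("B", "A"))

# Same reservation order: task 1 gets both, task 2 simply waits for A.
ordered = replay(("A", "B"), ("A", "B"))
print(deadlocked, ordered)
```

With opposite orders neither task obtains both resources, which on a real system means both block forever. With an identical reservation order one task always makes progress; agreeing on such an order is a common remedy, although the article itself does not prescribe one.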

At the point where we are confronted with any of these issues, the debug tool comes in. To find the reasons for these bugs we need to see the memory contents each task is working with, or what is currently being executed by each CPU. In both cases the state of the complete application, or of the involved tasks, at a certain point in time needs to be observed. This is where the problem lies: usually there is no global time reference for a multi-core system.

Since each processor core may run at a different clock rate, it is more than questionable whether stopping all CPUs, the basis of traditional stop and go debugging, is sufficiently synchronous. The delay between the time the halt request is signaled to the chip and the time the CPUs have completed that request and stopped entirely depends on the core signaling method that is implemented (see Figure 3).

Figure 3: Different parts of debug infrastructure may be responsible for run-control synchronization between the cores. But only on-chip triggers keep the delays short enough.

Depending on where the signaling is initiated, latencies can occur that are unacceptable for some applications. For example, if one CPU hits a breakpoint and the signaling is done either by the debug software or by the access device, the system view obtained once all other cores have halted as well will no longer reflect the context of the breakpoint. Only special debug hardware on the chip, which allows so-called cross triggering between all CPUs, guarantees an almost synchronous stop. For exactly that purpose, the new AURIX extends the established On-Chip Debug System (OCDS) of Infineon’s TriCore architecture with a dedicated trigger switch. Depending on the selected clock rates of the individual cores, the TriCore CPUs as well as the timer module can be halted in the same cycle or with a delay of only a few cycles.
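A rough calculation shows why the signaling path matters. The sketch below estimates how many clock cycles a second core keeps running between the break event and its own halt. All numbers are illustrative assumptions chosen for this example; they are not measured AURIX or UDE figures.

```python
def cycles_executed(clock_mhz, halt_delay_ns):
    """Clock cycles a core still runs between the break event and its halt."""
    # MHz * ns = cycles * 1000, hence the integer division.
    return clock_mhz * halt_delay_ns // 1000


# Illustrative assumptions only.
CORE_CLOCK_MHZ = 200
TOOL_DRIVEN_DELAY_NS = 100_000   # halt relayed by host software: assumed 100 us
ON_CHIP_DELAY_NS = 20            # halt relayed by on-chip cross trigger: assumed 20 ns

tool_skew = cycles_executed(CORE_CLOCK_MHZ, TOOL_DRIVEN_DELAY_NS)
chip_skew = cycles_executed(CORE_CLOCK_MHZ, ON_CHIP_DELAY_NS)
print(tool_skew, chip_skew)
```

Under these assumptions the tool-driven halt lets the other core run on for tens of thousands of cycles, while the on-chip trigger keeps the skew to a handful of cycles.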

Synchronization is only one aspect of stop and go debugging. Another is the behavior of the whole system when a breakpoint is hit or the developer single-steps through the code. Until now there was no need to discuss this point. However, with multiple processor cores and multiple concurrent tasks it is important to know what effect a break or single step should have. If a single task is in focus, it may be very helpful if only the CPU executing that task is stopped at a breakpoint while all others keep running. Sometimes it might be an advantage, or even essential, if all processor cores or a particular group are stopped. If several CPUs are halted and you want to carry out a single step on one of those cores, it is unclear what a single step means for the other cores. Heterogeneous CPUs with different clock frequencies or pipeline architectures may take different amounts of time to execute one instruction, so there is a serious danger of losing synchronization. It is up to the debug tool to take appropriate precautions and to offer the user the possibility to debug both a single processor core and multiple cores in a group.

PLS has accepted this challenge and integrated special multi-core run control management into its Universal Debug Engine (UDE). This allows individual CPUs to be combined into a so-called core group and controlled synchronously. The user can precisely define how the grouped CPUs behave in the event of a break, go or single step, so that synchronous run control is always guaranteed.
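The idea of a core group can be sketched as a small model. This is a hypothetical illustration of the concept only and not the UDE interface: the class and method names are invented for this example, and the chosen policy (a break on one member halts all members, a single step advances every halted member by one step) is just one of the behaviors a user might configure.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    running: bool = True
    steps_taken: int = 0


@dataclass
class CoreGroup:
    cores: list = field(default_factory=list)

    def on_breakpoint(self, core):
        # A breakpoint on any member halts the whole group;
        # a core outside the group only halts itself.
        if core in self.cores:
            for member in self.cores:
                member.running = False
        else:
            core.running = False

    def go(self):
        for member in self.cores:
            member.running = True

    def single_step(self):
        # Every halted member advances by exactly one step.
        for member in self.cores:
            if not member.running:
                member.steps_taken += 1


cpu0, cpu1, cpu2 = Core("cpu0"), Core("cpu1"), Core("cpu2")
group = CoreGroup([cpu0, cpu1])

group.on_breakpoint(cpu0)   # cpu0 hits a breakpoint, cpu1 stops with it
group.single_step()         # both grouped cores step together
state = [(c.name, c.running, c.steps_taken) for c in (cpu0, cpu1, cpu2)]
print(state)
```

The core outside the group keeps running throughout, which corresponds to the case where only the task in focus should be stopped.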

Beyond stop and go

It has to be said that stop and go debugging, as previously discussed, involves risks. Consider the following example: two tasks each initiate an event, and the sequence of these events is crucial.

On a single-core system, the developer realizes that the two events are triggered in the wrong sequence and attempts to find the reason with the aid of the debugger. The debugger needs to execute several debug operations on the target, such as reading out memory or register contents. The necessary target access via the debug interface competes with normal program execution and, as a consequence, influences the runtime behavior of the system. The symptom, however, remains unchanged. Even during the debug session, the sequence of events is still incorrect and can be observed by the debugger (see Figure 4, upper diagram).

A multi-core system may behave completely differently. Assume both tasks run on different CPUs but produce the same timing bug. The debugger likewise affects the runtime behavior. However, if the debug operations on the individual cores differ in timing, the debugger may ‘correct’ the malfunction of the application (see Figure 4, lower diagram). Such problems, which are closely linked to concurrency, are often difficult to reproduce and to observe. It is not unusual for errors to be simply hidden, or for previously unobservable errors to be revealed, by the influence of the debugger. Hence, it is important to keep this influence on the system behavior as low as possible.

Figure 4: The influence of the debugger on the runtime behavior may lead to unexpected effects such as the ‘disappearance’ of errors.
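The two diagrams in Figure 4 can be reduced to a few lines of arithmetic. The following sketch assumes that event A is required to occur before event B and that the debugger simply adds a delay to the task it accesses; the function name and all timing values are assumptions made for this illustration.

```python
def event_order(t_a_us, t_b_us, debug_delay_a_us=0, debug_delay_b_us=0):
    """Return the observed order of events A and B, including debugger delays."""
    a = t_a_us + debug_delay_a_us
    b = t_b_us + debug_delay_b_us
    return "A before B" if a < b else "B before A"


# The bug: B fires at 40 us, A only at 50 us, although A should come first.
without_debugger = event_order(50, 40)

# Single core: debug accesses slow down the one CPU, so both tasks are
# delayed by the same amount and the faulty order is preserved.
single_core = event_order(50, 40, debug_delay_a_us=30, debug_delay_b_us=30)

# Multi-core: the accesses hit the two cores differently, here the core
# running task B is delayed more, and the faulty order disappears.
multi_core = event_order(50, 40, debug_delay_a_us=5, debug_delay_b_us=30)
print(without_debugger, single_core, multi_core, sep=" | ")
```

In the multi-core case the observed order becomes correct only while the debugger is attached, so the error seems to vanish as soon as someone looks for it.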

Keeping the debugger’s influence low is most efficiently achieved with trace, which allows non-intrusive observation of the system at runtime. However, as with synchronous run control, dedicated on-chip hardware is required for trace in order to capture the instruction flow of the individual CPUs or even data transfers between the cores. The collected data are transferred to the debugger via a trace interface for subsequent analysis. The big advantage of trace, besides leaving the runtime behavior unaffected, is that you do not observe just a single point in time (at a breakpoint, for example) but can also look back into the past. Hence, causes of errors that do not coincide with the moment the symptom becomes visible can be found much more easily.

The other side of the coin is that on-chip trace produces a huge amount of data that has to be transferred off-chip to the debugger. Even though special trace ports promise transfer rates of up to 10 Gbit/s in the near future, with increasing clock frequencies and a growing number of processor cores to be observed, the amount of data may soon require transfer rates of 100 Gbit/s. To tackle this discrepancy, solutions other than simply writing out the data immediately are needed.

At the moment, most up-to-date on-chip trace architectures use on-chip trace memories, which by nature have sufficiently high bandwidth to the internal trace sources. In this case, a standard debug interface is sufficient to transfer the buffered data. However, the memory size is very limited (one or two megabytes), and the possible recording time is therefore very short.
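A back-of-the-envelope estimate makes the trade-off concrete. In the sketch below only the 2 MB buffer size comes from the text; the number of cores, the clock frequency and the amount of trace data generated per cycle are assumptions, and real trace units compress their output, so actual figures will differ.

```python
def trace_rate_bits_per_s(cores, clock_hz, bits_per_cycle):
    """Raw trace data rate if every core emits bits_per_cycle each cycle."""
    return cores * clock_hz * bits_per_cycle


def recording_time_us(buffer_bytes, rate_bits_per_s):
    """How long a trace buffer lasts at a given data rate, in microseconds."""
    return buffer_bytes * 8 * 1_000_000 / rate_bits_per_s


# Assumed workload: 3 cores at 300 MHz, 8 bits of trace per core and cycle.
rate = trace_rate_bits_per_s(cores=3, clock_hz=300_000_000, bits_per_cycle=8)

# Buffer size taken from the upper end mentioned in the text: 2 MB.
window_us = recording_time_us(2 * 1024 * 1024, rate)
print(rate / 1e9, "Gbit/s,", round(window_us), "us")
```

Under these assumptions the raw rate is 7.2 Gbit/s and the buffer is full after roughly 2.3 milliseconds of unfiltered recording.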

An interesting approach for making optimal use of the trace memory, despite its limited capacity, is the use of filter and trigger mechanisms, as realized in the Infineon Multi-Core Debug Solution (MCDS), for example. The fundamental idea is to restrict the trace to the really interesting parts, for example accesses to a variable or the entry into a reservation routine. In the end, this on-chip preprocessing simplifies the analysis and post-processing in the debug tool. At best, there is no need for a time-consuming search for the error location in a huge trace record.
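The principle of trigger and filter can be shown with a small software model. This is a sketch of the idea only and does not reflect how MCDS is configured: the event format, the function name and all addresses are hypothetical. Recording starts when a given routine is entered (the trigger), and from then on only data accesses to a watched address range are stored (the filter).

```python
from collections import namedtuple

TraceEvent = namedtuple("TraceEvent", "cycle core kind address")


def capture(events, trigger_pc, watch_range, capacity):
    """Store only watched data accesses that occur after the trigger."""
    armed = False
    buffer = []
    for event in events:
        if not armed:
            # Trigger: start recording once the routine at trigger_pc runs.
            armed = event.kind == "exec" and event.address == trigger_pc
            continue
        # Filter: keep data accesses to the watched addresses only.
        if event.kind in ("read", "write") and event.address in watch_range:
            buffer.append(event)
            if len(buffer) == capacity:
                break   # trace memory is full
    return buffer


events = [
    TraceEvent(10, "cpu0", "write", 0x70000010),  # before trigger: dropped
    TraceEvent(12, "cpu1", "exec", 0x80000400),   # trigger: routine entered
    TraceEvent(15, "cpu1", "read", 0x70000010),   # kept
    TraceEvent(16, "cpu0", "exec", 0x80000800),   # no data access: dropped
    TraceEvent(18, "cpu0", "write", 0x70000200),  # outside range: dropped
    TraceEvent(21, "cpu0", "write", 0x70000010),  # kept
]
watched_variable = range(0x70000010, 0x70000014)  # one hypothetical 32-bit variable

captured = capture(events, 0x80000400, watched_variable, capacity=16)
print([(e.cycle, e.core, e.kind) for e in captured])
```

Of six events only two reach the buffer, so the limited trace memory covers a much longer stretch of execution and the later analysis starts from the relevant accesses.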

A picture is worth a thousand words

Another challenge that must not be underestimated is of a more visual nature. Compared with a single-core system, the amount of debug information the user has to process, analyze and interpret is immense, whereas the display capacity of workstations and monitors is limited. The debugger has to find a balance between the information that must be displayed at the same time and the information that can be omitted in favor of clarity. To make work easier for developers, modern debuggers such as the Universal Debug Engine (UDE) from PLS therefore make use of the framework concept. Although a specific core debugger, tailored to the actual processor core, exists for each CPU, the user sees none of this. The core debuggers are encapsulated as components in the framework, which serves as a link to the graphical interface. This makes it possible to debug different targets, or even different heterogeneous processor cores in the case of multi-core, within a single user interface. Core-to-core interactions can therefore be visualized simply and understood more easily by the user. The intelligent window management of user interfaces and frameworks such as Eclipse additionally helps users choose the presentation that is most useful for them. Individual windows for code, variables or memory contents of the particular CPUs can even be grouped or highlighted with different, core-specific colors (see Figure 5).

Figure 5: Powerful graphical user interfaces and frameworks like Eclipse are very helpful to present debug information of multiple processor cores to the users.
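The framework concept can be outlined in a few classes. The sketch below is a generic illustration of a component-based design and not the architecture of UDE: all class names, the describe_state method and the example values are hypothetical. Each core type gets its own debugger component, and the framework presents them through one common interface.

```python
from abc import ABC, abstractmethod


class CoreDebugger(ABC):
    """Common interface that every core-specific component offers."""

    def __init__(self, core_name):
        self.core_name = core_name

    @abstractmethod
    def describe_state(self):
        """Return a short, displayable summary of the core state."""


class CpuDebugger(CoreDebugger):
    def __init__(self, core_name, program_counter):
        super().__init__(core_name)
        self.program_counter = program_counter

    def describe_state(self):
        return f"PC=0x{self.program_counter:08X}"


class TimerModuleDebugger(CoreDebugger):
    def __init__(self, core_name, active_channels):
        super().__init__(core_name)
        self.active_channels = active_channels

    def describe_state(self):
        return f"{self.active_channels} channels active"


class Framework:
    """Links the core debuggers to a single user interface."""

    def __init__(self):
        self._components = []

    def register(self, debugger):
        self._components.append(debugger)

    def overview(self):
        # The user interface only ever talks to the common interface.
        return {d.core_name: d.describe_state() for d in self._components}


framework = Framework()
framework.register(CpuDebugger("cpu0", 0x80000400))
framework.register(TimerModuleDebugger("timer", 3))
view = framework.overview()
print(view)
```

Because the user interface depends only on the common interface, a very different core such as a timer module can be shown next to a CPU without the user noticing that separate core debuggers are at work.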

The examples above demonstrate that on-chip debug support is essential for multi-core based control units. Combining it with a flexible, component-based debugger architecture, such as the Universal Debug Engine (UDE) from PLS under a powerful user interface such as Eclipse, puts you on the safe side right from the beginning.

About the author: Jens Braunes studied Computer Science at the Dresden University of Technology (TU Dresden), Germany. He joined the development team at PLS in 2005 and has since been responsible, among other things, for the development of software support for multi-core trace.
