Best practices for designing high-throughput, real-time SoC systems
Achieving hard real-time performance – microsecond-level response with less than 1 µsec of jitter – on such a system requires careful trade-off analysis and system partitioning. It is also essential to plan for the ever-increasing complexity of future SoCs. There are three main approaches to this kind of design – Asymmetric Multi-Processing (AMP), hypervisors, and Symmetric Multi-Processing (SMP) with core isolation (Figure 1) – from which system designers can choose to optimise hybrid SoC systems.
Asymmetric multi-processing
AMP is fundamentally a port of multiple Operating Systems (OSs) onto physically different processor cores. An example would be to run a bare-metal OS dedicated to real-time tasks on the first core and a full-featured OS, such as embedded Linux, on the other cores. Most of the time, the initial porting of the OSs onto the cores is straightforward. However, the start-up code and the management of shared resources, such as memory, caches and peripherals, are very error-prone. When multiple OSs access the same peripheral, their behaviour becomes non-deterministic and the system can become extremely time-consuming to debug. Hence, careful protection, using an architecture such as ARM TrustZone, often needs to be in place.
To add more complexity, message passing between the OSs requires shared memory and must be managed together with the other protection measures. Because the cache is usually not shareable between different OSs, message passing has to go through non-cached memory regions, which adds latency and jitter to the overall performance. It is also a poor software architecture from a scalability viewpoint, as it requires significant re-porting whenever the number of cores increases.
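To make the mechanism concrete, the following is a minimal sketch of the kind of single-producer/single-consumer mailbox that AMP message passing requires. On real hardware the structure would be placed in a non-cached shared-memory region agreed on by both OSs (with memory barriers at the points noted); here it sits in ordinary memory so the logic is portable, and all names are illustrative, not from the reference design.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative AMP mailbox: one OS produces, the other consumes.
 * On hardware, `mbox_t` would live in an uncached shared region. */
#define MBOX_SLOTS     8
#define MBOX_MSG_BYTES 64

typedef struct {
    volatile uint32_t head;   /* written only by the producer OS */
    volatile uint32_t tail;   /* written only by the consumer OS */
    uint8_t slots[MBOX_SLOTS][MBOX_MSG_BYTES];
} mbox_t;

static int mbox_send(mbox_t *m, const void *msg, uint32_t len)
{
    uint32_t next = (m->head + 1) % MBOX_SLOTS;
    if (next == m->tail || len > MBOX_MSG_BYTES)
        return -1;                       /* full, or message too large */
    memcpy(m->slots[m->head], msg, len);
    /* On ARM, a DMB barrier belongs here so the payload is visible
     * to the other core before the index update. */
    m->head = next;
    return 0;
}

static int mbox_recv(mbox_t *m, void *out, uint32_t len)
{
    if (m->tail == m->head)
        return -1;                       /* empty */
    memcpy(out, m->slots[m->tail],
           len > MBOX_MSG_BYTES ? MBOX_MSG_BYTES : len);
    m->tail = (m->tail + 1) % MBOX_SLOTS;
    return 0;
}
```

Note that every transfer through such a mailbox costs an uncached read and write, which is exactly the latency and jitter penalty described above.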
Hypervisors
A hypervisor is a low-level software layer that runs directly on the hardware and manages multiple independent OSs on top of it. Although the initial porting effort is similar to AMP, the benefit is that the hypervisor hides the non-trivial details of resource management and message passing. The drawback is the performance overhead of the extra software layer, which degrades both throughput and real-time performance.
Symmetric multi-processing
SMP with core isolation runs a single OS on multiple cores with internal core partitioning. An example is to instruct an SMP OS to assign a real-time application to the first core and the remaining, non-real-time applications to the other cores. This approach is very scalable, as an SMP OS is designed to scale seamlessly to an increasing number of cores. Because all cores are managed by a single OS, message passing between cores can happen at the L1 data cache level, resulting in faster communication with less jitter.
Core isolation reserves a core for the hard real-time application, shielding it from the effects of the other, high-throughput cores and preserving the low-jitter real-time response. This is generally a good software architecture decision because it lets designers focus on which OS to use instead of re-inventing error-prone, low-level software to manage multiple OSs. The initial porting may require some effort when starting from multiple OSs; starting from an SMP architecture requires much less.
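The mechanism behind core isolation is task-to-core affinity, which every SMP OS exposes in some form. The sketch below uses the Linux `sched_setaffinity()` call purely as an illustration of the idea – pinning the real-time task to core 0 so the scheduler keeps other work off it; under VxWorks the same intent is expressed through its own task-affinity API, and the function name here is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for CPU_* macros on Linux */
#endif
#include <sched.h>

/* Core-isolation sketch (Linux analogue): pin the calling real-time
 * task to core 0, leaving the remaining cores for throughput work.
 * Returns 0 on success, -1 on failure. */
int pin_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);        /* core 0 reserved for the real-time task */
    return sched_setaffinity(0, sizeof set, &set);  /* 0 = self */
}
```

With the real-time task pinned this way and all other tasks given the complementary affinity mask, the isolated core never context-switches away from the real-time work.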

Figure 1 Comparison of AMP, hypervisor and SMP with core isolation.
Optimising a high-throughput, real-time SoC with SMP
Based on this analysis of the alternatives, SMP with core isolation offers the best architecture for optimising high-throughput, real-time SoC systems. The architecture we consider is a system similar to Figure 3, where an I/O data stream comes into an SoC, undergoes some form of processing in the cores, and is returned to the I/O with a low-jitter, low-latency real-time response. In addition, the SoC contains multiple cores that run other throughput-intensive applications simultaneously.
First, it is essential to understand what a real-time response (loop time) consists of:
– Transfer new data to the system memory from an I/O (DMA)
– Processor detects the new data in the system memory (Core Isolation)
– Copy the data to a private memory (memcpy)
– Compute on the data
– Copy the result back to the system memory (memcpy)
– Transfer the result back to an I/O (DMA)
Because the overall jitter and latency are the accumulation of these six steps, it is essential to optimise each one. With an RTOS such as VxWorks with core isolation, the polling/interrupt response can be bounded in the nanosecond range (step 2). Data computation is application-specific and fairly predictable (step 4). Therefore, we focus on the trade-offs of the Direct Memory Access (DMA) transfers and the memcpy operations (steps 1, 3, 5 and 6). There are two major ways to transfer the data: with or without cache coherency. The two methods have very different consequences for both the DMA and the memcpy.
As Figure 2 shows, although cache-coherent transfer (using the ARM Accelerator Coherency Port (ACP)) gives the DMA a longer path to complete, the processor then only needs to access the L1 cache to obtain the transferred data. The memcpy time is therefore significantly lower with cache coherency, which far outweighs the small degradation in DMA performance. For designers, this means that cache-coherent transfer yields much shorter latency, and lower jitter, thanks to the direct cache access.

Figure 2 Memcpy and DMA performance with/without cache coherency.
Case study: Best-practice SoC design
A complete system can be demonstrated with a reference design on a Cyclone V SoC FPGA development kit. The device combines a dual-core ARM Cortex-A9 hard processor system (HPS) and 28-nm FPGA fabric in a single chip. The hardware and software architectures are summarised below and illustrated in Figure 3.

Figure 3 Experimental reference design.
Hardware architecture
The hardware architecture comprises:
– Two DMA engines that transfer data from the FPGA I/O to the ARM processors and vice versa; both are connected to the ACP to transfer data directly to/from the ARM processor cache
– A real-time control unit IP that initiates the message passing between the ARM processors and the DMA engines in the fastest way possible
– A jitter monitor that measures the real-time performance and jitter by directly probing the DMA signals, achieving an accuracy of ±6.7 nsec
Software architecture
The software architecture comprises:
– The VxWorks real-time OS running in SMP mode on the dual-core ARM processor
– Core isolation, used to assign the real-time application to the first core and the non-real-time applications to the second core
– A real-time application that continuously fetches data from the I/O, computes on it and sends the results back to the I/O
– Non-real-time applications that stress the ARM core and other I/O by continuously running FTP transfers and decrypting the data
Results
Experiments were run on data sizes ranging from 32 bytes to 2,048 bytes. Each size was run millions of times to build a histogram of the loop time and analyse the jitter (the difference between the maximum and the minimum loop time). As Figure 4 shows, even with heavy FTP traffic running on the second core, microsecond-level latency with less than 300 picoseconds of jitter was achieved over millions of test runs. The jitter varies somewhat with data size, but it stays within 200 picoseconds, which is insignificant.
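The jitter figure above reduces a whole histogram of loop times to a single number: the spread between the longest and shortest observed iteration. As a sketch, with sample values that are purely illustrative:

```c
#include <stdint.h>

/* Jitter as defined above: max minus min over the collected loop
 * times (e.g. in nanoseconds). */
static uint64_t jitter(const uint64_t *samples, int n)
{
    uint64_t lo = samples[0], hi = samples[0];
    for (int i = 1; i < n; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    return hi - lo;
}
```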

Figure 4 Experimental results of the real-time loop and the jitter.
The same FTP application was also run on VxWorks SMP utilising both cores, and achieved close to a 2x speed increase. Reserving a core for real-time work therefore does cost throughput, and this becomes a trade-off between throughput and the hard real-time application. However, an AMP solution exhibits the same degradation due to its hard partitioning of the cores, with much less scalability as the number of cores increases.
Conclusion
Designing a balanced SoC system with high-throughput and real-time applications requires a number of trade-off considerations, such as:
– DMA data transfer
– Cache coherency
– Message passing between the processor cores and the DMA
– OS partitioning
– Software scalability with an increasing number of processor cores
In this work, we showed a “best-practice” system design using SMP with core isolation and cache-coherent transfer, achieving low-latency, low-jitter real-time performance while maintaining software scalability for future SoC generations.
Nick Ni is an Embedded Applications Engineer with Altera.