
How to achieve earlier, faster subsystem performance analysis and debugging
This article discusses technologies and techniques that make it possible to model realistic traffic that taxes the interconnect early in the design process, so that performance bottlenecks can be identified and resolved quickly.
Introduction
Corner cases—those exceptional, unexpected scenarios or sequences of events that wreak havoc on otherwise well-behaving designs—happen. While you may not be able to prevent corner cases, you can take steps to model them in order to debug the hardware in your design to minimise their impact.
Understanding system performance calls for a considerable investment in testbenches that you can use to put your system through corner-case scenarios that can cause performance bottlenecks. Done manually, this can involve weeks or even months of testbench coding. And this doesn’t include accommodating changes in the design. Who can afford this investment in time and resources? What’s more, once you’ve detected the performance bottlenecks, how can you efficiently find and debug the causes?
Fortunately, there are technologies and techniques available that help you automate testbench creation and accurately model the kind of traffic that a given design is anticipated to experience. With these insights, you can productively accomplish cycle-accurate performance analysis of bandwidth and latency in your design.
Cycle-accurate performance analysis
Traditionally, one way to generate the kind of realistic traffic that will stress a system-on-chip (SoC) interconnect has involved a lot of waiting. After all, it’s only at the end of the register-transfer level (RTL) simulation stage that you would have in place all of the intellectual property (IP) and associated software drivers. Of course, the closer you are to the end of your design cycle, the costlier it is to make changes.
Another solution is to model all of the IP in SystemC and run early versions of the software on top. There are many limitations to this approach, not the least of which is that the models are not cycle-accurate. Worse still, many components of the SoC infrastructure may be extremely complex and, in many cases, supplied by third parties (the ARM CoreLink CCI-400 Cache Coherent Interconnect is one example). This limits the availability of models and may force analysis to be deferred until the RTL stage.
Ideally, you would run performance analysis simulations with the cycle-accurate RTL of the interconnect subsystem. In this approach, critical IP blocks such as the DDR controller are included, while the dependency on the availability of other IP is removed by replacing those blocks with traffic synthesisers that drive realistic traffic patterns representing the replaced IP.
Coupling this approach with a tool capable of automating the creation of the necessary testbench would greatly reduce the effort and risk associated with manual testbench creation. This is especially true as experience shows that interconnect configuration frequently changes during the design cycle.
GUI-based tool automatically generates testbenches
Cadence’s Interconnect Workbench is a tool with two major capabilities. One: it automatically generates testbenches tailored for functional verification and performance analysis of complex interconnect subsystems. Two: the tool provides a powerful GUI for analysing the performance metrics collected while running simulations using the generated testbenches. These testbenches use Cadence Verification IP to replace selected IP blocks in your design and gain access to faster simulation and a higher level of control over simulation traffic. Verification IP monitors can assess traffic at each of your interconnect ports. Making cumbersome spreadsheets redundant, the GUI has built-in filters for choosing the masters, slaves, and paths that you want to evaluate. Rather than running multiple, lengthy simulations, the tool can quickly identify the critical paths for debugging.
By using Interconnect Workbench on its SoC, one leading communications technology company reduced its interconnect verification effort from eight man-months down to one man-month, gaining important insights into latency, bandwidth, and outstanding transaction depth.
Here’s a summary of what Interconnect Workbench can do:
– Automatically generate Universal Verification Methodology (UVM)-compliant performance and verification testbench code from ARM CoreLink AMBA Designer output (interconnect fabric RTL and IP-XACT metadata)
– Deliver cycle-accurate performance analysis, plus a performance analysis cockpit that lets you visualise, discover, and debug system performance behaviours
– Collect all transactions and verify the correctness and completeness of data as it passes through the SoC interconnect fabric, via integration with Cadence Interconnect Validator Verification IP
Figure 1: Data flow through Interconnect Workbench. RTL, Verification IP, and traffic pattern descriptions move into the tool, which automatically generates a testbench for simulation. As other variations of SoCs are generated, the tool can generate additional testbenches. The performance GUI provides an overview of simulation results. Performance metrics can also be collected from manually created testbenches, as long as they include an instance of the Interconnect Validator.
Top-down debug
Exposing performance bottlenecks usually requires driving hundreds of thousands, if not millions, of transactions through your system. It’s an enormous amount of data. Therefore, an ideal performance debug tool should provide you with the ability to:
– View the aggregated data from many simulation runs
– Apply sophisticated filtering (for example, the ability to look at a specific datapath or view performance broken down by virtual networks running through your system)
– Identify any transactions that diverge from the expected performance criteria
Once such transactions are identified, the tool should allow you to easily go to the lowest level of detail, reviewing waveforms for hints on what could have impacted performance.
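To make the third point concrete, here is a minimal sketch, in UVM SystemVerilog, of the kind of check that flags transactions breaking a latency budget so they can be cross-referenced against waveforms. The transaction record, field names, and 500 ns threshold are hypothetical illustrations, not the tool's actual implementation.

    import uvm_pkg::*;
    `include "uvm_macros.svh"

    // Hypothetical record of a completed transaction observed by a monitor.
    class perf_txn extends uvm_object;
      time   start_time, end_time;
      string path_name;                      // e.g. "big_cluster -> DDR"
      `uvm_object_utils(perf_txn)
      function new(string name = "perf_txn");
        super.new(name);
      endfunction
    endclass

    // Subscriber that flags any transaction exceeding an assumed latency budget.
    class latency_checker extends uvm_subscriber #(perf_txn);
      time max_latency = 500ns;              // assumed per-path budget
      `uvm_component_utils(latency_checker)
      function new(string name, uvm_component parent);
        super.new(name, parent);
      endfunction
      function void write(perf_txn t);
        time latency;
        latency = t.end_time - t.start_time;
        if (latency > max_latency)
          `uvm_warning("PERF", $sformatf("%s: latency %0t exceeds budget %0t",
                                         t.path_name, latency, max_latency))
      endfunction
    endclass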
Interconnect Workbench meets these criteria, and also lets you display design events alongside bandwidth and latency charts. For example, there might be dips in bandwidth observed in the SoC; however, by overlaying the DDR refresh events from inside the DDR controller, one can quickly “see” that most of the dips are to be expected. You might then investigate bandwidth dips that don’t coincide with a DDR refresh. Looking at the screenshot below, the correlation between the DDR refresh event (Event_REF_DMC_m0) and high latency on the big_cluster ACE interface can be clearly seen.
Figure 2: Interconnect Workbench allows charting of the latency of individual ARM AMBA masters over time. In addition, it is a very simple matter to also gather event information from the design under test (DUT) and display it on any graph that has time as its X-axis. As seen here, the DDR refresh event is shown and the time correlation against latency peaks becomes visually obvious.
As you spend months developing your design, you gain an understanding of how your system behaves under stress. Along the way, you can use Interconnect Workbench to build in checks and run them as batch processes overnight to further evaluate performance and ensure that late design changes don’t break the performance limits investigated earlier in the development process. These checks are then run as part of the RTL check-in regression process.
Starting with basic characterisation, the latency and bandwidth limits of each path, from each master interface to DDR, are tested in isolation. The value of this testing is that system assembly and configuration errors can be swiftly spotted and corrected. Interconnect Workbench accelerates this onerous task by automating the generation of path-by-path tests, enabling a performance regression suite whose checks ensure that every new RTL check-in is vetted for performance issues and that any violations are reported.
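As an illustration of what such a regression check might look like, the sketch below compares measured per-path bandwidth and latency against a limit table at the end of a test, so that a regression run fails loudly. The class, field names, and reporting style are assumptions for this example, not the code that Interconnect Workbench generates.

    // Per-path limits kept in associative arrays keyed by path name.
    class path_limits;
      real min_bw_gbps [string];   // minimum acceptable bandwidth, Gbps
      real max_lat_ns  [string];   // maximum acceptable average latency, ns

      function void check(string path, real measured_bw_gbps, real measured_lat_ns);
        if (min_bw_gbps.exists(path) && measured_bw_gbps < min_bw_gbps[path])
          $error("PERF REGRESSION: %s bandwidth %0.2f Gbps below limit %0.2f Gbps",
                 path, measured_bw_gbps, min_bw_gbps[path]);
        if (max_lat_ns.exists(path) && measured_lat_ns > max_lat_ns[path])
          $error("PERF REGRESSION: %s latency %0.1f ns above limit %0.1f ns",
                 path, measured_lat_ns, max_lat_ns[path]);
      endfunction
    endclass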
Examining SoC traffic workloads
In addition to characterisation, it is essential to explore some typical SoC traffic workloads to ensure that the Quality of Service (QoS) aspects of the interconnect and DDR are correctly configured. Interconnect Workbench provides traffic synthesisers to make this task simple. By using a few UVM constraints, Verification IP for AMBA Protocols will generate specified bandwidth, read/write ratio, burst types, etc. A simple example of a big and little cluster workload is shown below.
Figure 3: Traffic synthesisers can be configured to generate defined levels of requested bandwidth for any of the ARM AMBA masters. In this example, we have specified a big_cluster WRITE bandwidth of 3.5 Gbps and a READ bandwidth of 2.5 Gbps. As seen here, the WRITE bandwidth is achieved exactly, whereas the READ bandwidth fluctuates. Understanding and debugging these fluctuations is a key step in understanding the operational behaviour of any system.
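The traffic synthesisers expose their own configuration fields, but as a hedged illustration of the kind of UVM constraints involved, the Figure 3 workload might be captured along these lines (the class and field names here are invented for the sketch):

    import uvm_pkg::*;
    `include "uvm_macros.svh"

    // Invented names for illustration; the Cadence traffic synthesisers have
    // their own configuration knobs, which are not shown here.
    class cluster_workload extends uvm_object;
      rand int unsigned write_bw_mbps;   // requested WRITE bandwidth, Mbps
      rand int unsigned read_bw_mbps;    // requested READ bandwidth, Mbps
      rand int unsigned read_pct;        // share of READ transactions, percent
      rand int unsigned burst_len;       // beats per burst

      constraint big_cluster_c {
        write_bw_mbps == 3500;           // 3.5 Gbps, as in the Figure 3 scenario
        read_bw_mbps  == 2500;           // 2.5 Gbps
        read_pct inside {[40:60]};
        burst_len inside {1, 4, 8, 16};  // restrict to typical burst lengths
      }

      `uvm_object_utils(cluster_workload)
      function new(string name = "cluster_workload");
        super.new(name);
      endfunction
    endclass

A sequence running on the traffic synthesiser would then read these randomised fields and pace its read and write bursts accordingly.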
By tuning the configuration of the traffic synthesisers, a first approximation to corner-case scenarios can be developed. Also, bandwidth and latency extremes can be measured, as can average bandwidth figures. Given that the traffic synthesisers are built using UVM, these use cases can be as simple or as complicated as required in order to model system behaviour.
Interconnect Workbench provides a sophisticated analysis GUI to accelerate the debug of the vast amounts of data generated from these simulations. By loading the complete accumulated results from all simulations, you can spot and debug corner-case bandwidth and latency issues. Tools such as the latency distribution analysis shown in Figure 4 allow multiple simulation results to be accumulated and the slowest transactions identified in a click. Combined with intelligent path filtering, these tools provide the power needed to rapidly drill down into corner-case scenarios.
Figure 4: Latency distribution analysis is performed with Interconnect Workbench. The charts are clickable to identify the transactions in each latency bucket with full transaction details. Right-clicking can also add markers to the waveform window, further aiding debug.
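Conceptually, a latency distribution is built by bucketing the latency of every completed transaction and keeping a handle on the outliers. The sketch below illustrates the idea with hypothetical bucket boundaries; it is not the tool's implementation.

    // Accumulate a latency histogram and remember when the slowest
    // transactions started, so markers can later be placed on a waveform.
    class latency_histogram;
      int unsigned bucket[5];            // counts per latency range
      time         slowest[$];           // start times of outlier transactions

      function void sample(time latency, time start_time);
        if      (latency < 100ns) bucket[0]++;
        else if (latency < 200ns) bucket[1]++;
        else if (latency < 400ns) bucket[2]++;
        else if (latency < 800ns) bucket[3]++;
        else begin
          bucket[4]++;
          slowest.push_back(start_time);
        end
      endfunction
    endclass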
Summary
Gaining a deeper understanding of system performance need not require the time-consuming, manual process of building testbenches to model corner-case scenarios. Technologies are now available for engineers to model realistic interconnect traffic flows—quite early in the design process—to quickly identify and resolve performance bottlenecks.
