
Tackling large-scale SoC and FPGA prototyping debug challenges
When designing complex ASICs, such as a highly integrated system-on-chip (SoC), engineers are highly motivated to perform comprehensive verification under conditions as close to real-world operation as possible, to ensure that bugs can be identified and corrected before final tapeout. The source of this motivation, of course, is the high cost and long time required to re-spin an ASIC.
While discovering and tracking down the root cause of bugs can be challenging in the best of circumstances, inherent limitations in the verification technologies available to ASIC designers make the job much harder, as each involves a variety of tradeoffs. Now, however, new technologies are emerging that promise much more efficient and less time-intensive debug processes using FPGA prototypes.
Debugging requirements
A SoC consists of both hardware and software controlling the microcontroller, microprocessor or DSP cores, peripherals and interfaces. The design flow for a SoC aims to develop the hardware and software in parallel. Most SoCs are developed from pre-qualified hardware blocks for the hardware elements, together with software drivers that control their operation. Also important are protocol stacks that drive industry-standard interfaces like USB or PCI Express. The hardware blocks are put together using EDA tools while the software modules are integrated using a software development environment.
In order to debug the SoC, all these various pieces need to be pulled together into a functional unit and tested extensively. When this functional verification turns up bugs, the root causes need to be tracked down and either fixed or worked around. Further, if time permits, it can be useful to conduct optimization testing to identify and correct performance bottlenecks in the design. Core requirements for optimal SoC debugging include:
- Full RTL-level visibility with minimal dependence on FPGA CAD flows
- Fast, efficient identification of the root cause of bugs
- Run at full speed to enable verification with real-world signals
- Time-correlated signals across different clock domains
- Scale seamlessly to support large, complex designs
- Enable efficient debugging of firmware and application-level software
- Verify hardware and software interactions at the system-level
ASIC prototyping approaches and tradeoffs
The three main approaches traditionally used to verify and debug both hardware and software for SoC designs are simulation acceleration, emulation and FPGA-based prototyping technology. To date, all three fail to meet all the requirements outlined above for ASIC debugging.
With high capacity and fast compilation times, acceleration and emulation are powerful technologies that provide wide visibility into systems. Both technologies, however, operate slowly, on the order of MHz, which may be significantly slower – up to 100X slower – than the ASIC’s operating frequency. As a result, emulators and accelerators cannot use real-world stimulus for testing without specialized interface circuits, nor are they practical for developing a system’s software. For instance, at 100X slower, emulating 15 minutes of real-time operation would take more than 24 hours. Acceleration and emulation boxes are also very large and expensive ($1M+) and thus beyond the means of many chip manufacturers.
FPGA-based prototyping platforms use FPGAs directly to enable engineers to validate and test at, or close to, a system’s full operating frequency with real-world stimulus. FPGA-based prototypes are small enough to be portable and used at an engineer’s desk. They are also much less expensive than emulators or accelerators, on the order of $75K. Given these advantages, it’s no wonder that about 80 percent of all ASICs and SoCs are prototyped in FPGAs.
But that doesn’t mean that FPGAs are the ideal platform either. Some teams use an emulator in addition to building up an FPGA prototype. That may be changing.
FPGA prototyping debug challenges
An FPGA-based prototype is a hardware-based implementation of an ASIC design that operates at high clock frequencies that closely represent the final ASIC while enabling non-intrusive monitoring of internal signals. Figure 1 shows the process for instrumenting and observing an FPGA-based prototype. Depending upon the size of the ASIC, the design may span multiple FPGAs. To test the system, engineers partition their RTL design among the FPGAs. Probes are added directly to the RTL to make specific signals available for observation. This instrumented design is then synthesized and downloaded to the FPGA prototype platform.

Figure 1. To monitor internal signals, probes are added directly to the RTL.
When the system is run, the RTL-based probe connected to each of the instrumented signals collects the signal’s value at each clock cycle. To enable the system to run at its full operating frequency and collect signal data in real-time, the data is stored in a trace buffer in FPGA block RAM. An analyzer connected to the prototype then downloads the information collected from each of the instrumented signals from block RAM, giving engineers offline visibility into the system.
The chief limitation to date of this approach is that instrumenting signals requires the use of significant amounts of block RAM and LUTs within the FPGA. Both of these resources are constrained by fixed availability on the FPGA, as well as by the fact that the majority of these resources are required by the ASIC or SoC design itself. For example, while an FPGA may have 96 block RAM, the ASIC design may require 86 of them, leaving only 10 for use in debugging.
Three primary factors influence the number of block RAM and LUTs required to instrument a system: the number of accessible signals, observation width, and trace depth. For example, the deeper the trace depth, the more block RAM that will be needed. How a debugging system uses these block RAMs impacts the efficiency of instrumentation and defines how much visibility engineers have into the system. The ability to probe more signals reduces how often the system must be recompiled. A wider observation width means more signals can be viewed with each run, potentially enabling faster identification of root causes. Finally, the ability to capture long traces is crucial for identifying and locating bugs. The types of bugs that are not caught during verification may require thousands or millions of cycles to manifest. Verifying software-driven functionality may span millions of clock cycles as well.
With traditional tools, engineers have to balance each of these factors and rarely achieve the robust level of visibility they need in a single pass. Designers must consider how long it takes to recompile the system between debug iterations. Because instrumenting code involves synthesis and place and route, adjusting which signals are probed requires the system to be recompiled. Even when an incremental recompile is possible, recompiling commonly takes from 8 to 18 hours and is typically performed overnight. If new probes are needed during the day, the process is often a “go-home event,” as the new results will not be ready until the next day.
The standard debugging tools offered by FPGA vendors such as ChipScope and SignalTAP can probe a maximum of 1,024 signals and require extensive LUT and memory resources. For example, 29 block RAM are required to capture even a shallow trace depth of just 1,024 words (assuming a 36 Kb block RAM size). It is likely that this may be too short a time frame for many types of errors.
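The block RAM arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only, assuming the 36 Kb block RAM size stated in the text and one bit of storage per signal per captured cycle; the helper name is not part of any vendor tool.

```python
# Sketch: estimating the block RAM cost of a capture buffer.
# Assumes 36 Kb per block RAM and one bit per signal per cycle.
import math

BLOCK_RAM_BITS = 36 * 1024  # 36 Kb per block RAM

def block_rams_needed(num_signals: int, trace_depth: int) -> int:
    """Total capture bits divided by one block RAM's capacity, rounded up."""
    total_bits = num_signals * trace_depth  # one bit per signal per cycle
    return math.ceil(total_bits / BLOCK_RAM_BITS)

# The article's example: 1,024 signals at a trace depth of 1,024 words.
print(block_rams_needed(1024, 1024))  # 29 block RAM
```

The same function also shows how quickly depth eats into the budget: doubling the trace depth doubles the block RAM count.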
To create a longer buffer, fewer signals can be captured, enabling a deeper trace with the same number of block RAM. However, this introduces new problems. Trying to locate a bug with only 32 probes in a complex system with over 10 million RTL-level signals, for example, is like randomly opening the pages of a dictionary and hoping to find a specific word.
The use of fewer probes also increases the number of iterations required to locate bugs. As each iteration also requires synthesis and place and route, “go-home events” start to dominate debug time. This can stretch debugging of a single issue over weeks or months, leading to significant scheduling delays. In fact, if a bug is particularly difficult to uncover, it may be necessary to develop a workaround and tape out with known bugs.
To increase the number of signals that can be instrumented, some tool vendors employ a mux network. A full crossbar mux would give concurrent access to any combination of probed signals on the ASIC, but such an approach quickly becomes impractical in terms of the die area required to implement the crossbar. For this reason, an n-input simple mux is commonly used. For example, an 8-1 mux can take 1,024 signals and mux them into 8 pre-defined groups of 128 signals each. This enables the total number of observable signals to be 8 times greater for the same number of block RAM. However, signals from different groups cannot be observed in the same run, so engineers have to spend time carefully creating the signal groups or risk having to re-run the FPGA CAD flow.
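The grouping constraint can be sketched as follows, using the 8-group, 1,024-signal example above. The signal names and helper functions are hypothetical, chosen only to make the restriction concrete.

```python
# Sketch of the 8-to-1 mux grouping described above: 1,024 probed signals
# divided into 8 fixed groups of 128. Only one group is observable per run,
# so two signals in different groups cannot be captured together.

def make_groups(signals, num_groups):
    """Split the probe list into equal, pre-defined groups."""
    group_size = len(signals) // num_groups
    return [signals[i * group_size:(i + 1) * group_size]
            for i in range(num_groups)]

def observable_together(groups, sig_a, sig_b):
    """Two signals can share a capture run only if they share a group."""
    return any(sig_a in g and sig_b in g for g in groups)

signals = [f"sig_{i}" for i in range(1024)]
groups = make_groups(signals, 8)
print(len(groups), len(groups[0]))                      # 8 128
print(observable_together(groups, "sig_0", "sig_5"))    # True (same group)
print(observable_together(groups, "sig_0", "sig_200"))  # False (different groups)
```

This is exactly why group assignment must be planned up front: if a bug turns out to involve `sig_0` and `sig_200` together, the only recourse is to regroup and recompile.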
The bottom line is that ASIC prototype debug involves compromise. Emulators offer a rich debug environment, but lack speed and involve considerable expense. FPGA prototypes are cost-effective, yet for larger SoC designs traditional tools haven’t kept pace with growing complexity, with the show stopper being signal visibility at the RTL level due to resource constraints. If this latter problem could be solved, would emulators still have a place?
Improving the FPGA debug process
With the preceding analysis in mind, Tektronix set about to address the need for improved FPGA prototyping tools with the formation of an embedded instrumentation group. The goal was to bring full RTL-level visibility to FPGA-based debugging. That goal has now been accomplished with the release of Certus 2.0.
Using only software and RTL-based embedded instruments, this debug platform employs a highly efficient multi-stage concentrator that serves as the basis for an observation network. This reduces the number of LUTs required per signal, increasing the number of signals Certus can probe in a given space. This architecture, coupled with advanced routing algorithms, keeps the size of the concentrator to a minimum and enables Certus to provide effectively the flexibility of a full crossbar mux while requiring no more die area than a standard simple mux. In practical terms, Certus enables engineers to instrument tens of thousands of signals using fewer FPGA resources than a standard FPGA-based debug tool requires to instrument 1,024 signals.
The focus on RTL-level signals creates even more efficiency. With an RTL-based design, many signals are equivalent. Consider a flip flop that drives an I/O. If one directly observes the state of the flip flop, one can also infer the state of the I/O. For most circuits, there are between 3 and 5 inferable signals for each signal that is directly observed. All of these relationships are automatically calculated and made visible through the Certus analysis tools. Thus, if a system is instrumented for 30K signals, engineers will be able to observe the equivalent of ~100K signals without any need to evaluate the low-level details of the design.
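This kind of inference can be pictured as a traversal over direct-drive connections in the netlist. The sketch below is a simplified illustration of that idea under stated assumptions; the toy netlist and function names are hypothetical, not taken from the Certus analysis tools.

```python
# Illustrative sketch of signal-equivalence inference: if a captured signal
# drives others through simple connections (e.g. a flip flop driving an I/O
# pad), their values can be inferred from the one observation.

def inferable_signals(netlist, observed):
    """Follow direct-drive edges from each observed signal to collect all
    signals whose values are implied by the capture."""
    seen = set(observed)
    stack = list(observed)
    while stack:
        sig = stack.pop()
        for driven in netlist.get(sig, []):
            if driven not in seen:
                seen.add(driven)
                stack.append(driven)
    return seen

# Toy netlist: each key directly drives the listed signals.
netlist = {
    "ff_q":    ["io_pad", "buf_out"],
    "buf_out": ["led"],
}
print(sorted(inferable_signals(netlist, ["ff_q"])))
# ['buf_out', 'ff_q', 'io_pad', 'led'] -- one probe yields four visible signals
```

The 3-to-5 ratio quoted in the text is the practical payoff: every directly observed signal typically carries several inferable ones along with it.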
In addition, Tektronix has simplified instrumentation by providing access to multiple probes with a single selection. For example, Certus provides the ability to select all flip flops, all interfaces, all inputs or outputs, and all state registers without needing to recompile the system. Typically, however, engineers would instrument all interfaces and registers to ensure access to all of the key signals in a design.
The ability to instrument every relevant signal and then view any combination of signals breaks through one of the most critical prototyping bottlenecks. As noted earlier, recompile time is a major consideration in determining how quickly a bug can be identified and resolved. More comprehensive signal access minimizes the impact of recompiling on the debugging process as shown in Figure 2. Instead of multiple hours between debug sessions, engineers can select a whole new set of signals to monitor in a matter of minutes by reconfiguring the infrastructure over the JTAG port.

Figure 2. Since synthesis and place and route are not required to access new signals, time-consuming “go-home” events are eliminated from the debug cycle.
Improving capture depth
Once the signals are instrumented, the focus turns to capture depth. To extend the size of capture buffers, some debugging tools utilize off-chip memory or high-speed cables. The challenge is that this creates a bandwidth bottleneck, since time-division multiplexing techniques are required to offload data. This restricts the maximum capture frequency to less than 20 MHz and/or the number of probed signals to a few thousand, which in turn forces the system under test to be clocked at a lower frequency.
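The bandwidth constraint can be made concrete with a back-of-envelope calculation. The figures below are illustrative assumptions, not specifications of any particular product or cable.

```python
# Back-of-envelope sketch of the offload bottleneck described above: when
# trace data is streamed off-chip, the link must carry one bit per probed
# signal per capture cycle. All figures here are illustrative assumptions.

def required_gbps(num_signals: int, capture_mhz: float) -> float:
    """Link bandwidth (Gb/s) needed to stream num_signals at capture_mhz."""
    return num_signals * capture_mhz * 1e6 / 1e9

# Probing a few thousand signals at tens of MHz quickly outruns debug links.
print(required_gbps(2000, 20))   # 40.0 Gb/s for 2,000 signals at 20 MHz
print(required_gbps(2000, 200))  # 400.0 Gb/s at a 200 MHz system clock
```

Either the capture frequency or the signal count has to give, which is exactly the tradeoff the off-chip approach forces.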
An alternative approach, as implemented in Certus, is to use compression technology to improve the information density of captured data and minimize the number of block RAM dedicated to debug. Conventional FPGA-based debug tools consume an entry in block RAM for every signal on every clock cycle, quickly exhausting the limited block RAM. As it turns out, however, many signals exhibit recurring patterns that can be greatly compressed without any loss of fidelity.
Since no single compression algorithm delivers optimum results by itself, Certus instead uses a compression cocktail. It dynamically uses a variety of compression and data packing algorithms to minimize the block RAM needed to accurately capture a signal. The effects of this compression are dramatic. Trace depth is typically extended by 100X or more. The effectiveness of compression depends upon the signal and can even exceed millions of cycles per block RAM (see Table 1).
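As a simplified illustration, a plain run-length encoder shows why recurring patterns compress so well. Certus's actual algorithm mix is not public, so this sketch stands in for the general idea only; it is not one of its documented components.

```python
# Minimal run-length-encoding sketch: runs of identical samples collapse
# into (value, run_length) pairs, so a mostly idle signal that would fill
# block RAM cycle-by-cycle shrinks to a handful of entries.

def rle_encode(samples):
    """Collapse runs of identical samples into [value, run_length] pairs."""
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return runs

# An idle enable line that pulses once in a million cycles.
trace = [0] * 999_999 + [1]
encoded = rle_encode(trace)
print(len(trace), len(encoded))  # 1000000 2
```

A million-cycle trace reduced to two entries is the extreme case, but it shows how a mostly quiet control signal can exceed millions of cycles per block RAM, as Table 1 indicates.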

Table 1. Because many signals exhibit recurring patterns, compression techniques can significantly improve capture depth.
Multiple domain analysis
For a complex SoC, the design may be partitioned across several FPGAs running in parallel on the debugging platform. In addition, each FPGA may have multiple clock domains. A typical ASIC prototype, then, might have on the order of 20+ clock domains. Such complex systems have the potential for complex problems. In the past, when issues have spanned multiple chips and clock domains, engineers have had to trace the relevant signals separately and then manually piece together the data to correlate events, a tedious and error-prone process.
Certus addresses the complexity of multi-chip, multi-domain debugging by automating time correlation. Through the use of integrated analyzer software, Certus can collect trace data from across the system and align it in time to provide a system-wide, time-correlated view. In contrast to the manual approach, the relevant data can be captured with a single instrumentation configuration.
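Conceptually, time correlation amounts to merging per-domain traces onto one shared reference timeline. The sketch below illustrates that idea under an assumed timestamping scheme; it is not how Certus is documented to work internally, and the domain names are invented.

```python
# Sketch of time-correlating captures from multiple clock domains: each
# sample carries a timestamp in a common reference timebase, so per-domain
# traces can be merged into one system-wide, time-ordered view.
import heapq

def correlate(traces):
    """Merge per-domain (timestamp_ns, signal, value) traces, each already
    sorted by time, into one timeline ordered by the shared reference time."""
    return list(heapq.merge(*traces, key=lambda e: e[0]))

cpu_domain = [(0, "cpu_req", 1), (50, "cpu_req", 0)]   # e.g. a CPU clock domain
bus_domain = [(12, "bus_ack", 1), (62, "bus_ack", 0)]  # an asynchronous bus domain
merged = correlate([cpu_domain, bus_domain])
print([e[1] for e in merged])  # ['cpu_req', 'bus_ack', 'cpu_req', 'bus_ack']
```

The interleaved result is what an engineer previously had to assemble by hand from separate captures; automating it removes a tedious and error-prone step.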
It’s also worth noting that since Certus works at the RTL-level, it can be used in conjunction with the CAD flow used for the design. Available software tools guide engineers through the process of signal selection and instrumentation and can be fully automated as part of an implementation flow. Verification can also be automated to enable multiple capture runs using different instrumentation configurations. This is especially important for verification and development of software applications, which are typically developed in parallel to hardware. FPGA-based prototyping enables hardware and software to be tested together so that the software is fully tested and available when the silicon ships.
Conclusion
With the capability to bring full RTL-level visibility to FPGA-based prototypes, the Certus ASIC prototyping design platform provides a single tool that supports accelerated development, verification, and debugging of ASIC hardware and software. Companies that in the past considered using an emulator or accelerator may no longer need to make that investment. Given the significant cost and performance benefits of FPGA-based prototypes, it may be time to rethink the traditional views of both technologies.
Certus can be used with all high-end Xilinx and Altera FPGAs and all commercially available FPGA prototyping boards regardless of the specific I/O or FPGA topology. Through a standard JTAG interface, Certus requires no special I/O or connectors to achieve fully synchronized debugging across multiple FPGAs and clock domains.
About the author
Brad Quinton is the Chief Architect for the Tektronix Embedded Instrumentation Group. He can be reached at brad.quinton@tektronix.com.
