
Hybrid execution – the next step in the evolution of hardware-software co-development
Over the past decade, the software content that semiconductor companies must address has multiplied several-fold. Where providing some core drivers and managing an ecosystem of operating system (OS) providers was sufficient in the late 1990s to win a socket in the mobile space, today the contenders providing application processors have to deliver the chip with multiple OSs already ported, up and running, and ready to be adopted by system customers.
Unfortunately, in classic development flows, hardware and software – while ultimately derived from joint requirements – diverge during development, and in the worst case integration does not happen until a “big-bang” integration test at the very end.
The upper portion of Figure 1 shows such a disconnected hardware-software development flow from requirements through preliminary design, unit coding, testing and integration. The system integration at the very end often brings surprises that cannot be overcome without significant re-development. The industry has been striving for years to achieve a fully agile system development flow as indicated in the lower portion of Figure 1. Integration ideally should happen early and then be repeated often.
Figure 1 – The need for an agile hardware/software development flow.
Software has become the long tail of development cycles, and its efficient development and testing are of great concern. Design teams attempt to develop software as early as possible on whatever representation of the hardware they can get their hands on – achieving what the industry sometimes calls the great “shift to the left”. In an ideal world, software development would be enabled at the very start of a chip-development project, but in reality users face various development options across levels of abstraction and different execution engines.
Hardware execution engines to enable early software development
Figure 2 illustrates the situation, outlining the various development engines that project teams consider using to bring hardware and software together as early as possible.
Figure 2 – There is no one-size-fits-all solution
During a chip-development project, verification and software development are mainly done on four core execution engines, described in the following sections.
Virtual prototypes
Virtual prototypes are transaction-level representations of the hardware, able to execute the same code that will be loaded on the actual hardware, and often executing at well above 100 MIPS on x86-based hosts running Windows or Linux. To the software developer, virtual prototypes look just like the hardware because the registers are represented correctly, while functionality is accurate but abstracted. As an example, processor pipelines and bus arbitration are not represented with full accuracy. While virtual prototypes can be made available early in the design flow, project teams need to consider that modelling takes time and effort for which the return on investment (ROI) has to be weighed carefully. Especially for designs with a large percentage of IP reuse, remodelling existing RTL may be an infeasible hurdle when an IP provider does not supply transaction-level models for the licensed IP. In addition, even when transaction-level models have been developed, keeping them synchronised with an implementation that is still changing requires effort that often is not invested, leaving the initial models out of step with the final implementation. As a result, for smaller designs virtual prototypes may not even be considered by project teams.
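As an illustration of what “registers represented correctly, functionality abstracted” means in practice, here is a minimal sketch, assuming a SystemC/TLM-2.0 modelling style of the kind commonly used for virtual prototypes, of a hypothetical timer peripheral. The register names, offsets and timing annotation are invented for illustration and are not taken from any particular IP.

```cpp
// Hypothetical timer model: address-accurate registers, abstract behaviour.
#include <cstdint>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>

struct TimerModel : sc_core::sc_module {
    tlm_utils::simple_target_socket<TimerModel> socket;

    // Software-visible registers (hypothetical offsets).
    static const unsigned CTRL  = 0x0;
    static const unsigned COUNT = 0x4;
    uint32_t ctrl  = 0;
    uint32_t count = 0;

    SC_CTOR(TimerModel) : socket("socket") {
        socket.register_b_transport(this, &TimerModel::b_transport);
    }

    // Blocking transport: the same register accesses the driver issues on
    // silicon, but functionally abstracted and only loosely timed - no
    // pipeline or bus-arbitration detail.
    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        uint32_t* data = reinterpret_cast<uint32_t*>(trans.get_data_ptr());
        switch (trans.get_address()) {
        case CTRL:
            if (trans.is_write()) ctrl = *data; else *data = ctrl;
            break;
        case COUNT:
            // Abstract behaviour: return a notional count rather than
            // modelling every clock tick.
            if (trans.is_read()) *data = ++count;
            break;
        default:
            trans.set_response_status(tlm::TLM_ADDRESS_ERROR_RESPONSE);
            return;
        }
        delay += sc_core::sc_time(10, sc_core::SC_NS);  // rough access time
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};
```

The real driver reads and writes CTRL and COUNT exactly as it would on silicon; only the machinery behind those accesses is simplified, which is where the speed of a virtual prototype comes from.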
RTL simulation
Register transfer level (RTL) simulation executes the same hardware representation that is later fed into logic synthesis and implementation. It is the main vehicle for hardware verification: it executes in the hertz range, but it is fully accurate, as the RTL becomes the golden model for implementation, and it allows detailed debug of the hardware. Its limited speed, however, makes it infeasible for larger-scale software development such as OS bring-up – at hertz-range throughput, the billions of cycles needed to boot an OS translate into months of wall-clock time. Where RTL simulation is used for software development at all, it is for small-scale work on drivers and the lower layers of the software stack. Because RTL is the “golden” description from which the implementation can be derived automatically using logic synthesis, RTL simulation is the minimum that project teams require for verification of the hardware.
Emulation
Emulation executes the design on specialised hardware – verification computing platforms – into which the RTL is mapped automatically and in which hardware debug is as capable as in RTL simulation. Interfaces to the outside world (Ethernet, USB, and so on) can be made using rate adapters or virtualised interfaces. In-circuit emulation takes the full design and maps it into the verification computing platform, allowing much higher speed – up into the MHz range – and enabling hardware/software co-development. The stimulus is the same as the actual chip will see after implementation, for example an OS booting on it, together with connections to the chip’s environment. Processor-based emulation has the key advantage of fast bring-up with a software-like, predictable compile into the emulator, as well as observability and control otherwise only known from simulation.
Still, when considering emulation as an add-on to traditional RTL simulation, project teams need to assess whether the additional cost and effort provide the right ROI. For complex designs that require lots of verification cycles and carry large software content, emulation has become a must-have, because the risk of not verifying before tape-out that an OS boots correctly, and that hardware and software interact as specified, is simply too big.
FPGA-based prototyping
FPGA-based prototyping uses an array of FPGAs into which the design is mapped directly. Because the design has to be partitioned, re-mapped to a different implementation technology, and re-verified to confirm that the result is still exactly what the incoming RTL represented, the bring-up of an FPGA-based prototype can be cumbersome and take months (as opposed to hours or minutes in processor-based emulation). Hardware debug is mostly an offline process. In exchange, execution speed reaches the tens-of-megahertz range, making software development a realistic use case, with JTAG adapters connected in the same way they will be to the actual chip on a development board. The return on investment of FPGA-based prototyping depends heavily on the amount of software content to be developed. The number of users who can develop software on emulation is limited by its availability and speed; the lower cost of FPGA-based prototyping makes it applicable to more software developers, as long as the effort to bring up a design in it is not prohibitively large.
More engines as project resources
In addition to the four core engines, Figure 2 shows three more engines that project teams can choose from:
OS simulators or software development kits (SDKs) abstract the hardware even further than virtual prototypes, requiring cross-compilation of the software, but in exchange allow cost-effective distribution to large populations of software developers. Probably the most prominent examples are the Android SDK and the iOS SDK for the iPhone.
Formal Analysis and Verification is great for IP, works on early RTL and is exhaustive. However, it does not scale well with design size and does not extend easily to software execution.
Finally, the actual chip – once back from fabrication – is used on prototyping boards for software development with debuggers attached through JTAG-based hardware connections. Efficient software development and debug are possible because the prototyping board runs at the intended speed. However, hardware insight is limited to the on-chip instrumentation that the hardware developers have provided. If a defect in the chip cannot be masked and patched by the software executing on it, an expensive re-spin is required.
Hardware/software development flows enabled by hybrid execution of engines
Key tasks to be performed during hardware/software development include system modelling and trade-offs, early software development, IP selection and design verification, SoC and sub-system verification, gate-level, timing and power signoff, HW/SW validation for the SoC and bare-metal software, software integration and QA, as well as system and silicon validation. They are indicated in Figure 3 along a design flow from specification to post-silicon validation. On the vertical axis they are aligned along the hardware/software stack, from hardware IP through sub-systems, systems-on-chip (SoC) and the SoC in the actual system, to the bare-metal software, operating systems (OSs) with their drivers, middleware and software applications executing on it.
Figure 3 – Sweet spots for execution engines to run verification and software development
The landscape of development tasks is overlaid with the development engines mentioned previously, aligned by their sweet spots. The connections into the portions of the software stack that can be executed on them are indicated by lines and bullets.
From here it is easy to see why hybrids – combinations of engines – can be considered the next step in the evolution of hardware/software development.
First, “simulation acceleration” executes a mix of RTL simulation and hardware-assisted verification, with the test bench residing on the host and the design under test (DUT) executing in hardware. As the name indicates, the primary use case is accelerating simulation of the RTL. This combination allows engineers to use the advanced verification capabilities of language-based test benches with a faster DUT that is mapped into the hardware accelerator. Typical speed-ups over RTL simulation can reach or exceed 1000x, although throughput is typically limited to tens of kilohertz.
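As a rough illustration of this split, the following C++ sketch shows a host-side test bench exchanging transactions with the accelerated DUT through a proxy. BusTransaction, AcceleratedDutProxy and the loopback stub are hypothetical stand-ins for a vendor-specific co-emulation interface, not any particular product API.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Illustrative transaction the test bench exchanges with the DUT.
struct BusTransaction {
    uint64_t address;
    std::vector<uint8_t> data;
    bool is_write;
};

// Hypothetical proxy hiding the physical link into the accelerator; in a
// real flow this call would cross into the hardware through a transactor.
// A simple loopback memory stands in here so the sketch is self-contained.
class AcceleratedDutProxy {
public:
    BusTransaction transport(const BusTransaction& req) {
        BusTransaction rsp = req;
        if (req.is_write) memory_[req.address] = req.data;
        else              rsp.data = memory_[req.address];
        return rsp;
    }
private:
    std::map<uint64_t, std::vector<uint8_t>> memory_;
};

// The test bench itself remains pure software: generate stimulus, hand it to
// the hardware-mapped DUT, and check the response.
int main() {
    AcceleratedDutProxy dut;
    const std::vector<uint8_t> pattern = {0xAB, 0xCD, 0xEF, 0x01};
    for (uint64_t addr = 0; addr < 0x40; addr += 4) {
        dut.transport({addr, pattern, true});                   // write
        BusTransaction rsp = dut.transport({addr, {}, false});  // read back
        if (rsp.data != pattern) return 1;                      // checker flags mismatch
    }
    return 0;
}
```

The point of the split is that the stimulus generation and checking stay in the flexible software environment, while the cycle-hungry DUT execution moves into the accelerator.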
Second, the combination of RTL and transaction-level model (TLM) simulation combines the accuracy of RTL simulation with the speed of TLM simulation. Given the intrinsic speed limitations of RTL simulation, this combination is especially beneficial when only a small part of the design executes as RTL, providing the accuracy that the lower portions of the software stack may require.
Third, the combination of processor sub-systems in virtual platforms at the TLM level with RTL execution in emulation or FPGA-based prototyping similarly makes use of the accuracy of RTL execution in conjunction with the speed of TLM simulation. With the intrinsically higher speed of hardware-assisted execution, more of the chip can remain at RTL accuracy, which overcomes some of the modelling challenges for virtual platforms indicated earlier. Instead of remodelling existing RTL at the transaction level, the existing RTL can be executed in emulation or FPGA-based prototyping.
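At the boundary between the two worlds, both of these hybrid combinations rely on transactors that convert between transactions on the virtual-platform side and pin-level activity on the RTL side. The SystemC sketch below shows the idea for a hypothetical request/acknowledge bus; the signal names and handshake are invented for illustration, and production flows use vendor-supplied transactors for this conversion.

```cpp
// Transactor sketch: TLM-2.0 socket facing the virtual platform, pin-level
// interface facing the RTL (simulated, emulated or in an FPGA).
#include <cstdint>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>

struct TlmToPinTransactor : sc_core::sc_module {
    tlm_utils::simple_target_socket<TlmToPinTransactor> tlm_side;

    // Pin-level side, as it would connect to the RTL signals.
    sc_core::sc_out<bool>               req;
    sc_core::sc_out<sc_dt::sc_uint<32>> addr;
    sc_core::sc_out<sc_dt::sc_uint<32>> wdata;
    sc_core::sc_out<bool>               we;
    sc_core::sc_in<bool>                ack;
    sc_core::sc_in<sc_dt::sc_uint<32>>  rdata;

    SC_CTOR(TlmToPinTransactor) : tlm_side("tlm_side") {
        tlm_side.register_b_transport(this, &TlmToPinTransactor::b_transport);
    }

    // Convert one transaction from the virtual platform into a pin-level
    // request/acknowledge handshake towards the RTL. Assumes the initiator
    // calls blocking transport from a SystemC thread, so wait() is legal.
    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        uint32_t* data = reinterpret_cast<uint32_t*>(trans.get_data_ptr());
        addr.write(static_cast<uint32_t>(trans.get_address()));
        we.write(trans.is_write());
        if (trans.is_write()) wdata.write(*data);
        req.write(true);

        wait(ack.posedge_event());               // hand over to the RTL side
        if (trans.is_read()) *data = rdata.read().to_uint();
        req.write(false);

        trans.set_response_status(tlm::TLM_OK_RESPONSE);
        (void)delay;  // timing is left to the pin-level handshake in this sketch
    }
};
```

In a simulation-only hybrid the pin-level side connects to RTL in the same simulator; in the emulation or FPGA hybrids, the same conversion happens at the link into the hardware platform.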
Towards agile hardware/software development
With the obvious need for early integration of hardware and software as indicated in Figure 1, the question remains why the industry has not solved this integration challenge yet. There are both business and technical reasons.
On the business side, the often-complex considerations project teams face in deciding which engines to deploy – as described above – show that an agile hardware/software development flow is certainly possible if all engines are deployed. The challenge is that the number of designs offering an appropriate ROI for such an investment is fairly limited. If the complexity to be verified is not high enough, or the software content is not big enough, deploying all the engines introduced above in conjunction with each other is simply not economically feasible.
On the technical side, the key challenge was described earlier in the section on virtual prototyping. Unlike the register transfer level, from which the remainder of the digital design flow can be derived automatically, the transaction level is often simply not a golden representation from which the full hardware implementation can be derived automatically. High-level synthesis is part of an emerging solution, but it is still focused mainly on the block level. Once it is augmented with automation for integration and verification of full chips containing both new and re-used IP blocks, we may arrive at the transaction level as a new “golden” model.
In parallel, the emergence of hybrid engine usage, as motivated above, allows earlier integration of hardware and software. Recent examples of companies such as Broadcom, PMC Sierra, Samsung and Ricoh using Palladium XP in conjunction with RTL simulation for simulation acceleration resulted in execution speed-ups of between 40x and 1000x, enabling significantly more verification cycles and earlier software development on early hardware representations. Using Palladium XP in conjunction with VSP virtual prototyping, companies such as Broadcom and NVIDIA reported up to 60x faster boot-up of OSs such as Android, Linux and Windows, as well as up to 10x faster software execution once the OS has been brought up. As a result, software can be integrated with hardware much earlier than with emulation by itself.
While the industry works towards moving the “golden” representation from the register transfer level to the transaction level, hybrid engine combinations – simulation acceleration, TLM simulation with RTL simulation, and TLM simulation with hardware-accelerated RTL execution – are certainly the next step in the evolution towards fully agile hardware/software development.
Frank Schirrmeister is Group Director Product Marketing, System and Software Realization Group, at Cadence Design Systems.
