
Optimizing high-performance CPUs, GPUs and DSPs? Use logic and memory IP
CPUs, GPUs and DSPs typically each have unique performance, power and area targets for each new silicon process node. Each new generation brings a new set of challenges to SoC designers and a new set of opportunities for higher-performance, more power-efficient IP that helps them deliver the last megahertz of performance while squeezing out the last nanowatt of power and the last square micron of area. SoC designers first need to be aware of the advances in logic and memory IP, and then they must know how to take advantage of these advances in the key components of their chips, using the latest EDA flows and tools, to stay ahead of their competitors.
In this article we describe available logic library and memory compiler IP and a typical EDA flow for hardening processor cores. We present techniques for using those logic libraries and memory compilers within the design flow to optimize processor area, then describe methods that use the same elements to optimize processor performance and power consumption. The article finishes with a preview of how FinFET technology will affect logic and memory IP and its use in hardening optimal CPU, GPU and DSP cores.
Why Different PPA Goals for CPU, GPU and DSP Cores?
CPU, GPU and DSP cores co-exist in an SoC and are typically optimized to different points along the performance, power and area (PPA) axes.
For example, CPUs are typically tuned first for high performance at the lowest possible power, while GPUs, because of the relatively large amount of silicon area they occupy, are usually optimized for small area and low power. GPUs can take advantage of parallel algorithms that reduce the operating frequency, but at the cost of silicon area—GPU logic can account for up to 40 percent of the logic on an SoC. Depending on the application, a DSP core may be optimized for performance, as in base station applications that handle many signals, or for area and power, as in handset applications.
Logic Libraries for High-Performance and Low-Power Core Optimization
With synthesizable CPU, GPU and DSP cores, today’s high-performance standard cell libraries and EDA tools can achieve an optimal solution without a new library having to be designed for every processor implementation. To optimally harden a high-performance core, designers need the following components in a standard cell library:
Sequential cells
Combinational cells
Clock cells
Power minimization libraries and power optimization cells for non-critical paths
Sequential Cells
The setup time plus the clock-to-output delay of flip-flops is sometimes referred to as the “dead” or “black hole” time. Like clock uncertainty, this time eats into the portion of every clock cycle that could otherwise be doing useful computational work. Multiple sets of high-performance flip-flops are required to optimally manage this dead time: delay-optimized flops rapidly launch signals into critical-path logic clusters, while setup-optimized flops serve as capture registers to extend the available clock cycle. Synthesis and routing optimization tools can be effectively constrained to use these multiple flip-flop sets for maximum speed, resulting in a 15-20 percent performance improvement.
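To make the arithmetic concrete, the time available for combinational logic is the clock period minus the launch flop’s clock-to-output delay, the capture flop’s setup time and the clock uncertainty. A minimal sketch of this budget (the delay values are illustrative placeholders, not characterized library data):

    # Illustrative timing budget for a register-to-register path.
    # All values in nanoseconds; the numbers are placeholders, not
    # real library characterization data.

    def usable_cycle(period_ns, clk_to_q_ns, setup_ns, uncertainty_ns):
        """Time left for combinational logic in one clock cycle."""
        dead_time = clk_to_q_ns + setup_ns + uncertainty_ns
        return period_ns - dead_time

    # A 1 GHz clock with a generic flop pair...
    standard = usable_cycle(1.0, clk_to_q_ns=0.12, setup_ns=0.08, uncertainty_ns=0.05)
    # ...versus a delay-optimized launch flop and a setup-optimized capture flop.
    optimized = usable_cycle(1.0, clk_to_q_ns=0.08, setup_ns=0.05, uncertainty_ns=0.05)

    print(f"standard:  {standard:.2f} ns of logic time per cycle")
    print(f"optimized: {optimized:.2f} ns of logic time per cycle")
    print(f"gain:      {(optimized - standard) / standard:.0%}")

Every picosecond recovered from the dead time is a picosecond the synthesis tool can spend on deeper logic or a faster clock.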
Combinational Cells
Optimizing register-to-register paths requires a rich standard cell library that includes the appropriate functions, drive strengths, and implementation variants. Even though Roth’s “D-Algorithm” (IBM 1966) demonstrated that all logic functions can be constructed from a single NAND gate, a rich set of optimized functions (NAND, NOR, AND, OR, inverters, buffers, XOR, XNOR, MUX, adders, compressors, etc.) is necessary for synthesis to create high-performance implementations. Advanced synthesis and place-and-route tools can take advantage of a rich set of drive strengths to optimally handle the different fanouts and loads created by the design topology and physical distances between cells.
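NAND completeness is easy to demonstrate, and doing so also shows why a NAND-only netlist is inefficient: a function that synthesis could map to a single XOR cell takes four NAND gates and three logic levels. A quick Python sketch:

    # All logic functions can be built from NAND alone, but at a cost in
    # gate count and logic depth -- which is why rich cell libraries matter.

    def nand(a, b):
        return 1 - (a & b)

    def xor_from_nands(a, b):
        """XOR built from four NANDs (three logic levels deep)."""
        t = nand(a, b)
        return nand(nand(a, t), nand(b, t))

    for a in (0, 1):
        for b in (0, 1):
            assert xor_from_nands(a, b) == a ^ b
    print("4-NAND XOR matches a single XOR cell on all inputs")

A single optimized XOR cell replaces that entire cluster with one cell delay, which is exactly the kind of substitution a rich library enables.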
Multiple threshold-voltage (Vt) and channel-length cells offer additional options for the tools, as do variants of these cell functions such as tapered cells that are optimized for minimal delay in typical processor critical paths. Having these critical-path-efficient cells and computationally efficient cells, such as AOIs and OAIs, available from the standard cell library provider is critical, but so is having a design flow tuned to take advantage of these enhanced cell options. Additionally, high drive-strength variants of these cells must be designed with special layout considerations to effectively manage electromigration when operating at gigahertz speeds.
To help the tools make the correct cell selections and minimize cycle time, it is often necessary to employ don’t_use lists to temporarily “hide” specific cells from the tools. Grouping multiple signals with similar constraints and loads can also make a major difference in synthesis efficiency. Attaining the absolute maximum performance from a design requires the tools and flows to be pushed at different steps in the design flow (e.g., initial synthesis, incremental synthesis, clock tree synthesis, placement, routing, physical optimization). These optimization techniques can typically provide a 15-20 percent performance improvement.
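Don’t_use lists are ordinarily maintained as tool scripts, and generating them programmatically keeps them consistent across flow steps. A sketch of one way to do that; the cell-name patterns are hypothetical, and the emitted set_dont_use/get_lib_cells syntax assumes a Tcl-based synthesis flow such as Design Compiler’s:

    # Sketch: generate a don't_use script from cell-name patterns.
    # The patterns and cell names below are hypothetical examples;
    # substitute the naming convention of the actual library.

    import fnmatch

    def dont_use_script(all_cells, hide_patterns):
        """Emit one set_dont_use line per cell matching a hide pattern."""
        lines = []
        for pattern in hide_patterns:
            for cell in fnmatch.filter(all_cells, pattern):
                lines.append(f"set_dont_use [get_lib_cells */{cell}]")
        return "\n".join(lines)

    cells = ["NAND2_X1", "NAND2_X16", "DFF_X1", "DLY4_X2", "XOR2_X32"]
    # Initially hide very high drive strengths and delay cells.
    print(dont_use_script(cells, ["*_X16", "*_X32", "DLY*"]))

Later flow steps can regenerate the script with a smaller hide list, un-hiding cells as optimization targets tighten.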
Clock Cells
High-performance clock driver variants are tuned to provide the minimum delay to reduce clock latency and minimize clock uncertainty caused by skew and process variability. These can include clock buffers tuned for symmetrical rise/fall times and clock inverters for minimum power. Clock tree synthesis tools must be robust enough to handle the PPA tradeoffs of these variants in order to use them effectively.
Intelligent use of integrated clock gating cells (ICGs), in multiple functional and drive-strength variants, is critical to minimizing clock tree power, which can easily consume 25 to 50 percent of the dynamic power in an SoC.
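Dynamic power scales as P = αCV²f, and the clock is the highest-activity net on the chip, which is why gating it pays off so directly. A back-of-the-envelope sketch (all values are illustrative placeholders):

    # Back-of-the-envelope clock-tree power: P = alpha * C * V^2 * f.
    # The values below are illustrative placeholders, not silicon data.

    def dynamic_power_mw(activity, cap_pf, vdd_v, freq_mhz):
        # pF * V^2 * MHz yields microwatts; divide by 1000 for milliwatts.
        return activity * cap_pf * vdd_v**2 * freq_mhz / 1000.0

    clock_cap_pf = 400.0   # total switched clock-tree capacitance
    vdd_v = 0.9
    freq_mhz = 1000.0

    ungated = dynamic_power_mw(1.0, clock_cap_pf, vdd_v, freq_mhz)
    gated = dynamic_power_mw(0.4, clock_cap_pf, vdd_v, freq_mhz)  # 60% gated off

    print(f"ungated clock tree: {ungated:.0f} mW")
    print(f"gated clock tree:   {gated:.0f} mW")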
Power Optimization Cells
A power optimization kit consists of all the logic cell functions needed to implement the power optimization techniques for the SoC. These techniques include clock gating, shut down, deep sleep, multiple voltage domains, dynamic voltage and frequency scaling (DVFS), state retention and voltage biasing. The kit contains all of the necessary circuits to perform power optimization functions and provides the EDA views, such as UPF and CPF, that the EDA tools in a low-power design flow use to correctly construct and validate the design.
Memory Compilers for High-Performance Core Optimization
The memories that temporarily store the CPU, GPU and DSP’s data and instructions are frequently in the critical timing path that limits the processor’s maximum clock frequency. These “cache” memories are relatively large, both to accommodate the width of the processor’s instructions and data and to improve application performance by reducing the number of relatively slow accesses to data located either on the SoC’s internal bus or in external memories. In this context, memory performance is typically controlled by the sum of the setup time of the address and the access time of the data. The timing of the memory clock with respect to the address information can be optimized (skewed) to obtain some boost in maximum clock frequency. There are also many smaller memories (or register files) that are used as scratch pads for intermediate results or for storing dynamic lookup tables. Memory compilers are used differently in CPUs, GPUs and DSPs and require multiple configurations to address the memory needs of an SoC, as can be seen in Figure 6.
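Since memory performance is governed by the sum of address setup time and data access time, the achievable clock frequency of a memory path follows directly, including the modest boost available from skewing the memory clock. A minimal sketch with illustrative numbers:

    # Sketch: maximum clock frequency of a path through a cache memory.
    # Timing values are illustrative, not from a characterized compiler.

    def fmax_mhz(setup_ns, access_ns, useful_skew_ns=0.0):
        """Cycle time ~ address setup + data access, minus any useful
        skew gained by advancing the memory clock vs. the address."""
        cycle_ns = setup_ns + access_ns - useful_skew_ns
        return 1000.0 / cycle_ns

    print(f"no skew:   {fmax_mhz(0.25, 0.60):.0f} MHz")
    print(f"with skew: {fmax_mhz(0.25, 0.60, useful_skew_ns=0.10):.0f} MHz")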
CPUs typically have a limited number of memory configurations that are in the critical path. Fast cache memory instances—single-port memories with fast setup and access times—enable significant CPU speed improvement. Long-channel devices can be used to keep leakage low, but speed is the primary criterion.
GPUs typically have a large number of smaller, different memory configurations that are used for temporary storage of the graphic data that the GPU processes for variations in colors, textures and shading. The massively parallel architectures that lend themselves to GPU problems often require two-port memories that act as FIFOs in these multi-stage processors, which typically run at a fraction of the speed of CPUs but consume several times their silicon area. Power and area are the primary criteria in GPU development.
DSPs used in base station applications typically must handle many processes at once, so absolute performance is the target, with requirements similar to those of CPUs. DSPs used in portable handsets require memory performance similar to GPU performance levels.
To maximize the battery life of portable products and minimize the waste heat produced by line-powered devices, the memories’ low-power modes should match the processor core’s low-power strategy:
Voltages for operation, as well as state retention in various low-power modes
Clock gating of computing blocks and memories and clock distribution to wake-up logic
Maintenance of data and logic states to provide rapid, controlled resumption from power-saving modes
As with the standard cell library, the memories’ operating conditions will dramatically affect the maximum clock frequencies they can achieve. In addition to possibly working at overdrive voltage levels, the memories’ data retention behavior must be characterized for all of the CPU’s power-saving modes.
EDA Tool Flow for High-Performance Core Optimization
Although there are multiple EDA tool sets and flows that can be used to harden CPU, GPU and DSP cores, in this article we will use the Synopsys Galaxy tools with the High Performance Core (HPC) scripting methodology as the basis for discussion.
The HPC scripts are a library- and core-agnostic reference methodology built on the latest tools, including Design Compiler Graphical and Synthesis Placement Guidance, with a single set of user-configurable scripts for each step. These scripts include a configurable handoff between Design Compiler Graphical and IC Compiler, with common scripts for setup tasks and tool-specific scripts. They enable designers to set up the tools for a specific processor configuration, floorplan and performance target, and are kept up to date with the latest tool versions. HPC scripts are also available for the Lynx Design System, allowing designers to visualize the design flow and manage the project and the runtime execution of the EDA tools.
Minimizing Silicon Area in CPU, GPU and DSP Cores
Every designer wants to minimize the silicon area of the logic and memory circuitry of an SoC because silicon area translates directly into cost.
Optimizing Logic Area
High-speed libraries (on the right in Table 1) are tuned with a performance sweet spot for CPUs and base station DSPs. The taller cell height allows larger transistors that can drive higher currents into deeper pipelines and/or larger capacitive and resistive loads at the highest frequencies. Additional pickup points give routers greater access to cell pins. High-density libraries (on the left in Table 1) are tuned with a density and power sweet spot for GPUs and handset DSPs. The shorter cell height enables denser logic for circuits that have shallower pipelines and/or are clocked at lower frequencies. Routability must be considered in the design of these higher-density cells to avoid routing congestion—especially in GPUs, where parallel architectures bring many signals together, making it difficult for routers to find pickup points. An understanding of router algorithms and techniques (both signal and power) is critical for designing router-friendly high-density cells.
Synthesis does a very good job of minimizing standard cell area, favoring smaller cells in its selection given appropriate constraints. Area can usually be minimized by synthesizing with the high-density, lower-Vt libraries. The low-Vt libraries (and libraries with minimum channel lengths) provide more drive current, so smaller drive strengths (e.g., 1X) are selected more frequently than the higher drive strengths that have larger cell area (e.g., 2X, 4X). Selecting high-density cells with lower Vts can save area, but at the expense of increased leakage.
Optimizing Memory Area
Memories can take up the majority of the die area of an SoC, so memory density can be critical to the overall SoC cost. Additional cache memory can also speed up effective CPU throughput, so system tradeoffs must be carefully analyzed. As with standard cells, there are memory compilers optimized for high speed (typically with larger bitcells) and for high density (typically with smaller bitcells). Given a fixed number of words and bits, one of the major factors in memory area is the bitcell used for the memory array (other factors include column multiplexing, banking, decode options and test options). The memory array dominates the area of large memories; the periphery, which contains all of the driving, sensing and control logic, dominates the area of very small memories. Selecting the optimal memory given its size, performance and power requirements can be a challenge, so many instances should be generated, sorted and compared using automation. When minimizing memory area, it is also important to factor in the memory test logic, which can be incorporated more efficiently into the memory itself by the memory compiler than by placing and routing it with random standard cells.
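The “generate, sort and compare” step lends itself to simple automation. A sketch of the idea, where the candidate instance figures are hypothetical stand-ins for what a compiler datasheet or query tool would report:

    # Sketch: rank candidate memory-compiler instances for one logical
    # memory (fixed words x bits). The figures are hypothetical examples.

    candidates = [
        # (description, area_um2, access_ns, leakage_uw)
        ("HD bitcell, mux4, no banking", 52000, 0.95, 38.0),
        ("HD bitcell, mux8, 2 banks",    49500, 1.10, 36.5),
        ("HS bitcell, mux4, no banking", 61000, 0.70, 55.0),
    ]

    target_access_ns = 1.0

    # Keep the instances that meet timing, then pick the smallest.
    feasible = [c for c in candidates if c[2] <= target_access_ns]
    best = min(feasible, key=lambda c: c[1])
    print(f"selected: {best[0]} ({best[1]} um^2, {best[2]} ns access)")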
Ultra-High Density 2-Port Memories for Power and Area Minimization of GPUs
GPU cores use many instantiations of single-port and two-port memories. The many FIFOs in embedded GPUs are implemented using two-port (one dedicated read port and one dedicated write port) memories. Reducing the area and leakage of these two-port memories directly benefits the GPU cores. Using the Ultra-High-Density Two-Port Register File (UHD 2P RF) instead of the High-Density Two-Port Register File (HD 2P RF) is an effective way of reducing area and power in GPUs. The tradeoff is speed, but the resulting speed is still higher than most of today’s embedded GPU cores require, so it is an acceptable tradeoff.
The UHD 2P RF uses a 6-transistor bitcell instead of an 8-transistor bitcell to deliver 1R1W functionality. The UHD 2P RF has the read and write ports on the same clock, whereas the HD 2P RF has different clock inputs for the read and write ports.
Having described how the combination of logic libraries and embedded memories within an EDA design flow can be used to optimize area in CPU, GPU or DSP cores, we now explore methods by which the same elements can be used to optimize the performance and power consumption of these processor cores.
Maximizing Performance in CPU, GPU and DSP Cores
Clock frequency is the most highly publicized attribute of CPU, GPU and DSP cores. Companies that sell products that employ CPU cores often use clock frequency as a proxy for system-level value. Historically, for standalone processors in desktop PCs, this technique has had value. However, for embedded CPUs it’s not always easy to compare one vendor’s performance number to another’s, since the measurements are heavily influenced by many design and operating parameters. Often, those measurement parameters do not accompany the performance claims made in public materials and, even when the vendors make them available, it’s still difficult to compare the performance of two processors not implemented identically or measured under the same operating conditions.
Further complicating matters for consumers of processor IP, real-world applications have critical product goals beyond just performance. Practical tradeoffs in performance, power consumption and die area — to which we refer collectively as “PPA” — must be made in virtually every SoC implementation; rarely does the design team pursue frequency at all costs. Schedule, total cost and other configuration and integration factors are also significant criteria that should be considered when selecting processor IP for an SoC design.
Understanding the effect that common processor implementation parameters have on a core’s PPA and on other important criteria, such as cost and yield, is key to putting IP vendors’ claims in perspective. Table 3 summarizes the effects that a CPU core’s common implementation parameters may have on its performance and other key product metrics.
Elusive Critical Paths
To achieve optimal performance, designers must reduce the delay in the critical paths of the CPU, GPU and DSP designs. These critical paths can be in the register-to-register paths (logic) or in the memory access paths to/from the L1/L2 caches. Critical paths can move between memory and logic during the design process—chasing them can feel like playing Whac-A-Mole®—but well-characterized logic and memory IP, a solid EDA flow and mastery of design techniques can help designers achieve timing closure.
High Performance Critical Path Optimization Techniques
Performance of CPU critical paths can be maximized by selecting the high-speed logic libraries and memories and by applying a range of design techniques: starting from a proper floorplan, careful library usage, incremental synthesis, script settings, path-group optimization and over-constraining.
One of the best ways to minimize these critical paths is to start with a good initial floorplan that minimizes the physical distance between the memory I/O pins and the critical registers within the processor logic. The ability to change this floorplan as the design progresses and engineering tradeoffs are made is critical to achieving the goals. A good floorplan, based on the number of cores and the rest of the high-performance core’s interconnectivity requirements, can minimize the physical distances at the top level of the design and reduce timing bottlenecks.
Library usage refers to selecting the best library architecture (in this case high speed) and the optimal Vt and channel-length libraries to introduce into the synthesis and place-and-route flows. It also refers to the practice of using don’t_use lists to encourage the tools to select the highest-performance cells by “hiding” some of the more area-efficient cells, trading area for performance.
Incremental compile techniques include running synthesis multiple times, sometimes introducing different synthesis options and additional libraries or cells to improve performance with each run.
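Conceptually, an incremental-compile loop looks like the sketch below; the synth_tool command line and script names are hypothetical placeholders for whatever the actual flow uses:

    # Sketch of an incremental-compile loop. The tool name, options and
    # script names are hypothetical placeholders.

    import subprocess

    def run_synthesis(script, extra_libs):
        """Invoke the synthesis tool with one script and extra libraries."""
        cmd = ["synth_tool", "-f", script] + [f"-lib={lib}" for lib in extra_libs]
        subprocess.run(cmd, check=True)

    # Each pass enables more aggressive libraries on the failing paths.
    passes = [
        ("initial_compile.tcl",     []),
        ("incremental_compile.tcl", ["lvt_highspeed"]),
        ("incremental_compile.tcl", ["lvt_highspeed", "ulvt_tapered"]),
    ]

    for script, libs in passes:
        run_synthesis(script, libs)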
Minimizing Power in CPU, GPU and DSP Cores
CPU, GPU and DSP cores must achieve high degrees of performance while consuming a minimal amount of energy to extend battery life and fit into lower-cost packaging. Power is often the most important constraint, making the design challenge one of getting the best possible performance within the available power budget. Each new silicon process generation brings a new set of challenges for providers of logic library and memory compiler IP and a new set of opportunities to create more power-efficient IP.
Taking Advantage of Multiple VTs and Channel Lengths for Power Optimization
Most silicon processes support multiple VTs and at 28 nanometer (nm) and smaller nodes, multiple transistor gate lengths with the same gate pitch. This process feature enables multi-channel libraries without the area penalty of designing to the worst-case channel length to achieve footprint compatibility. These interchangeable libraries facilitate late-stage leakage recovery performed by automatic place-and-route tools and very fine granularity in power optimization. Additional VT cells (ultra-high VT, ultra-low VT) provide even more granularity, but with increased costs due to wafer add-ons.
With all of the possible library options, the amount of data presented to the synthesis and place-and-route tools can seem overwhelming. The aggressive use of don’t_use lists (initially hiding both very low and very high drive-strength cells) and the proper sequencing of libraries provide an efficient methodology for identifying the optimal set of high-speed and high-density logic libraries and memory compilers that will achieve the optimum performance and power tradeoffs at minimum cost. These methodologies are effective on many different circuit types—CPUs, GPUs, DSPs, high-speed interfaces—and depend on the specific circuit configuration and process options being used. With a good understanding of the synthesis and place-and-route flow, designers can determine the optimal library combination and sequence for a given configuration of a design.
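One way to express such a sequence is as an explicit schedule of which libraries each flow step may use, from which the per-step don’t_use lists are derived. A sketch with hypothetical library names:

    # Sketch: a library-enable schedule across flow steps, from which
    # per-step don't_use lists are derived. Library names are hypothetical.

    all_libs = {"hd_svt", "hd_lvt", "hs_svt", "hs_lvt", "hs_ulvt"}

    schedule = {
        "initial_synthesis":  {"hd_svt", "hs_svt"},           # low-leakage baseline
        "incremental_synth":  {"hd_svt", "hs_svt", "hs_lvt"}, # speed up failing paths
        "route_optimization": all_libs,                       # full granularity
        "leakage_recovery":   all_libs,                       # swap back to higher Vt
    }

    for step, allowed in schedule.items():
        hidden = sorted(all_libs - allowed)
        print(f"{step}: hide {hidden if hidden else 'nothing'}")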
Acquiring specific libraries for each different type and configuration of CPU, GPU and DSP core implemented on an SoC can be inefficient and costly. A properly designed portfolio of logic cells and memory instances can deliver optimal PPA results when hardening processor cores if it includes a full selection of efficient logic circuit functions, the right set of variants and the right set of drive-strength granularities.
Multi-bit Flip-Flops
Using multi-bit flip-flops is an effective method of reducing clock power consumption and is now fully supported in mainstream EDA flows. Multi-bit flip-flops can significantly reduce the number of individual loads on the clock tree, reducing the overall dynamic power of the clock tree. Area and leakage power savings can also be achieved simply by sharing a single set of clock inverters among the flip-flop bits.
Multi-bit flip-flops provide a set of additional flops that have been optimized for power and area with a minor tradeoff in performance and placement flexibility. The flops share a common clock pin, which decreases the overall clock loading of the N flops in the multi-bit flop cell, reduces area with a corresponding reduction in leakage, and reduces dynamic power on the clock tree significantly (up to 50 percent for a dual flop, more for quad or octal).
Multi-bit flip-flops are typically used in blocks that are not in the critical path at the highest chip operating frequency. Applications range from small, bus-oriented registers of SoC configuration data that are only clocked at power-up, to major datapaths that are clocked every cycle, with a number of variants in between. SoC designers use the replacement ratio—how many of the standard flops in the design can be replaced by their multi-bit equivalents—and the resulting PPA improvements to determine their overall chip power and area savings. The single-bit flip-flops to be replaced with multi-bit flip-flops must have the same function (clock edge, set/reset, and scan configuration).
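A simple model of the chip-level effect, assuming the up-to-50-percent clock-power reduction for a dual-bit flop mentioned above (the remaining inputs are illustrative placeholders):

    # Sketch: estimate chip-level clock-power savings from multi-bit
    # flip-flop mapping. Inputs are illustrative placeholders.

    total_flops = 200_000
    replacement_ratio = 0.70       # fraction mapped to dual-bit equivalents
    dual_bit_clock_saving = 0.50   # per-flop clock-power saving (up to 50%)
    flop_clock_power_mw = 120.0    # clock power consumed at the flops

    saved_mw = flop_clock_power_mw * replacement_ratio * dual_bit_clock_saving
    print(f"replaced flops: {int(total_flops * replacement_ratio):,}")
    print(f"clock power saved at the flops: {saved_mw:.0f} mW "
          f"({saved_mw / flop_clock_power_mw:.0%})")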
FinFETs Extend the Power Curve at 16nm/14nm
FinFETs are replacing planar FETs (also called “planar CMOS”) in today’s 16/14-nm process nodes. The clear advantages of FinFETs over planar transistors include excellent short-channel control, which leads to lower leakage through reduced drain-induced barrier lowering (DIBL) and short-channel effects, and lower Vt variability due to lower channel doping. There is also less variability caused by random dopant fluctuations, and the lower operating voltage can yield dynamic power savings of up to 50 percent.
FinFET logic libraries and memories bring their own set of challenges: the widths (and channel lengths) of the fins are quantized, and body biasing (often used to lower leakage or perform process compensation) is totally ineffective. Also, the higher parasitics of the three-sided gate that enable lower leakage can increase dynamic power. There are potential self-heating issues, and the thermal aspects of electrostatic discharge (ESD) must also be managed. Finally, degradation and aging can be a factor, as PBTI and NBTI look worse than they did with planar transistors. Designing logic libraries and memories that shield SoC designers from these challenges requires expertise in TCAD, device and parasitic extraction, transistor modeling, FinFET-specific layout and place-and-route tools. SoC design at the 16/14-nm FinFET node will be hard enough without adding device-level concerns to the designer’s load.
Conclusion
Each new SoC process generation brings a new set of challenges and a new set of opportunities for logic library and memory compiler IP to enable optimal SoC PPA. SoC designers need to be aware of, and know how to take advantage of, advances in library IP using the latest EDA tools. This is especially true when hardening CPU, GPU and DSP cores for high performance, low power and small area. A single source of logic and memory IP that enables this optimization can significantly reduce the time required to harden cores to SoC-specific requirements. Synopsys delivers effectively architected, efficiently designed, accurately modeled logic libraries and memory compilers, thoroughly integrated into EDA flows, silicon-proven and rapidly delivered through an experienced worldwide support infrastructure.
