Dynamic partitioning speeds memory characterization
The ubiquitous and ever-increasing presence of microprocessors in system-on-chip (SoC) designs has led to a significant proliferation of embedded memories. A single chip can contain more than 100 embedded memory instances, including ROMs, RAMs and register files, which together consume up to half the die area. Some of the most timing-critical paths may start, end, or pass through these memory instances, so the models for these memory elements must accurately account for process, voltage and temperature variability in timing and power to enable trustworthy chip verification. Memory characterization is the process of abstracting a memory design into an accurate timing and power model, most commonly used by downstream implementation and signoff flows. Ad hoc approaches to memory characterization do not accurately model the data required for faithful SoC signoff, delaying tapeout and increasing the total cost of the design.
Memory characterization often requires hundreds of SPICE simulations. The number of memory instances per chip and the need to support a wide range of process, voltage and temperature (PVT) corners make these simulations a daunting task. The growing size of memory instances and their sensitivity to process variation add further dimensions to an already challenging undertaking. Moreover, the need to create library variants for high-speed, low-power and high-density processes makes it imperative to automate the memory characterization flow.
Overview of memory characterization methodologies
Broadly speaking, there are two main methodologies for memory characterization: relying on memory compiler-generated models, or characterizing individual memory instances. Within instance-based characterization there is an assortment of approaches, including dynamic simulation, transistor-level static timing analysis and ad hoc divide and conquer.
Memory compilers construct memory instances by abutted placement of pre-designed leaf cells (bit-columns, word and bit line drivers, column decoders, multiplexers, sense amplifiers and so on), with routing cells used where direct connection is not feasible. The compiler also generates a power ring, defines power pin locations and creates the various electrical views, netlists and any additional files required for downstream verification and integration.
Memory compilers do not explicitly characterize the instances they generate. Instead, they create models by fitting timing data to polynomial equations whose coefficients are derived from characterizing a small sample of memory instances. This approach enables memory compilers to generate hundreds or thousands of unique memory instances, differing in address size, data width, column/row density and performance, but the accuracy of the resulting models is poor.
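To make the fitting idea concrete, the sketch below (in Python) fits a handful of characterized sample instances and then predicts the delay of an instance the compiler never simulated. The sample data and the assumed basis functions (log2 of word count plus data width) are entirely hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np

# Hypothetical measurements: (words, bits, clock-to-output delay in ns)
samples = [
    (256, 16, 0.42), (256, 32, 0.45), (1024, 16, 0.51),
    (1024, 32, 0.55), (4096, 16, 0.63), (4096, 32, 0.68),
]

# Assumed model form: delay ~ c0 + c1*log2(words) + c2*bits
A = np.array([[1.0, np.log2(w), b] for w, b, _ in samples])
y = np.array([d for _, _, d in samples])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_delay(words, bits):
    """Estimate delay for an instance the compiler never simulated."""
    return coeffs @ np.array([1.0, np.log2(words), bits])

print(f"Predicted delay for a 2048x24 instance: {predict_delay(2048, 24):.3f} ns")
```

Any instance whose behavior departs from the assumed basis inherits the fitting error, which is the root of the accuracy problem described above.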
To safeguard against chip failure due to inaccurate models, the memory compiler adds margins. However, these margins can lead to more timing closure iterations, increased power and larger chip area. In addition, the fitting approach does not work well for the advanced current-based models, effective current source model (ECSM) and composite current source (CCS), which are commonly used for timing, power and noise analysis at 40 nm and below.
To overcome the inaccuracies of compiler-generated models, design teams resort to instance-specific characterization over a range of PVT corners. This is a much more time-consuming process that yields more accurate results. Even so, limitations in the characterization approach and in available resources often mean the accuracy improvement falls short of what it could be, while the cost remains high.
Approaches for memory characterization
One method for instance-based memory characterization is to treat the entire memory as a single black box and characterize the whole instance using a FastSPICE simulator. The advantage of this method is that it enables the creation of accurate power and leakage models that truly represent the activity of the entire memory block, and the simulations can be distributed across a number of machines to reduce turnaround time. Unfortunately, this approach is not without disadvantages: a FastSPICE simulator trades off accuracy for performance. Further, the black box approach still requires users to identify probe points for characterizing timing constraints. For a custom memory, the characterization engineer can get this information from the memory designer, but it is not available from memory compilers. Finally, this method does not work well for generating some of the newer model formats, such as noise models, and cannot be scaled to generate the process variation models needed for statistical static timing analysis (SSTA).
A second approach is to use transistor-level static timing analysis (STA) techniques, which rely on delay calculators to estimate the delay of sub-circuits within the memory block and identify the slowest paths. The advantages of this method are fast turnaround time and the fact that it does not require vectors to perform timing analysis. However, STA techniques are prone to reporting false timing violations, which require further analysis with SPICE/FastSPICE simulators to determine whether they are of real concern.
The STA approach is also hampered by its reliance on pattern matching, which must be constantly updated as new circuit structures are introduced across different designs and new revisions. The presence of analog structures such as sense amps in the clock-to-output critical paths makes the STA setup more demanding still. Further, the underlying use of transistor-level delay calculators, which are process- and technology-dependent, undermines the claimed SPICE accuracy, because it assumes that the delays of active elements and of the RC parasitic elements can be separated. This is no longer the case given the parasitic elements between finger devices that typically result from extraction on memory structures in advanced process nodes.
Finally, a static delay calculator is in general severely compromised, in runtime, accuracy, or both, by the presence of large transistor channel-connected regions. A memory array is therefore arguably the worst possible application for static timing analysis, especially in the presence of the power-gating methodologies commonly adopted for memory designs at 40 nm and below.
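To see why the separability assumption matters, consider a minimal sketch in which a static calculator sums independent per-stage contributions: a gate delay plus the Elmore delay of each stage's RC load. All values are hypothetical; the point is the structure of the computation, which breaks down once parasitics couple adjacent stages:

```python
# One hypothetical stage: (gate_delay_ps, [(R_ohms, C_fF), ...])
stages = [
    (12.0, [(150.0, 2.0), (150.0, 1.5)]),   # clock buffer and its wire
    (18.0, [(300.0, 3.0)]),                 # word-line driver
    (25.0, [(500.0, 4.0), (500.0, 4.0)]),   # sense path segment
]

def elmore_delay_ps(rc_ladder):
    """Elmore delay of an RC ladder: each capacitor sees the total
    resistance between itself and the driver."""
    total_ps, r_seen = 0.0, 0.0
    for r_ohms, c_ff in rc_ladder:
        r_seen += r_ohms
        total_ps += r_seen * c_ff * 1e-3   # ohm * fF = 1e-3 ps
    return total_ps

# The separability assumption: path delay is the sum of independent
# per-stage contributions. Inter-stage parasitics break this premise.
path_delay = sum(gate + elmore_delay_ps(rc) for gate, rc in stages)
print(f"Static path delay estimate: {path_delay:.2f} ps")
```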
Another approach is to statically divide the memory into a set of critical paths, characterize each of these paths using an accurate SPICE simulator, and then integrate the electrical data from each component back into a complete memory model. The advantage of this approach is the accuracy gained from SPICE simulations with silicon-calibrated models, and the simulations can be distributed across a computer network, with each critical path simulated independently, as the sketch following the next paragraph illustrates.
The disadvantage is that for advanced memories, in which there is significant coupling or a virtual power supply network, the circuitry making up the critical path grows too large for a SPICE simulator to complete in a reasonable amount of time. In addition, SSTA model generation, especially for mismatch parameters, becomes prohibitively expensive in terms of turnaround time with such a large circuit. Further, the static-divide approach is challenged by the need to correctly identify the clock circuitry and memory elements, and to trace critical paths through analog circuitry such as sense amps across memory architectures with varying circuit design styles.
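The distribution step itself is the easy part: each extracted critical-path deck is an independent job. The sketch below shows one plausible way to fan jobs out across local cores; the `spice` binary and the deck file names are placeholders, not a real flow:

```python
from concurrent.futures import ProcessPoolExecutor
import subprocess

def run_spice(deck):
    # Invoke the SPICE simulator on one critical-path deck; "spice" is
    # a stand-in for whatever simulator binary the flow actually uses.
    subprocess.run(["spice", deck], check=True)
    return deck

if __name__ == "__main__":
    decks = [f"path_{i:03d}.sp" for i in range(64)]  # one deck per path
    with ProcessPoolExecutor(max_workers=8) as pool:
        for done in pool.map(run_spice, decks):
            print("finished", done)
```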
Requirements for a new characterization methodology
The variation in memory architectures and usage styles (multiple ports, synchronous or asynchronous operation, scan, bypass, write-through, power-down and so on) requires a significant amount of manual effort from the characterization engineer to guide the process, creating stimulus files, simulation decks and other needed data. This makes the characterization process even slower and more error prone.
Memories now contain greater functionality, particularly structures and techniques to minimize both static and dynamic power consumption: for example, header and footer cells, power-off modes and, in some cases, dynamic voltage and frequency scaling for multi-domain power management. In addition, fine-line process geometries require more comprehensive characterization to fully capture coupling effects. The abstracted models must therefore faithfully incorporate the effects of signal integrity and noise analysis, which in turn requires more extracted parasitic data to remain in the netlist, increasing SPICE simulation runtimes and reducing productivity.
There’s a clear and pressing need for a new memory characterization approach, with the following characteristics:
- The new solution must offer high productivity, especially for large memory arrays, in order to accelerate the availability of models to design and verification teams. The characterization must be straightforward to configure and control, as well as offer high throughput.
- The generated models should be of the utmost precision: the error introduced into parametric timing, noise and power data by the characterization process itself must be minimized, certainly to under 0.5%.
- The solution must be tightly integrated with a leading-edge SPICE/FastSPICE simulation platform for uncompromised accuracy and scalable performance.
- The models must provide robust and comprehensive support for all the analysis and signoff features required by the downstream tools for corner-based and statistical verification methodologies.
- The models must be consistent with other characterized components, such as standard cells, I/O buffers and other large macros, in the selection of switching points, characterization methodologies and general look and feel.
Dynamic partitioning—memory characterization for 40 nm and below
The existing methods of memory characterization, either block-based or static divide and conquer, can be augmented to address current characterization challenges using dynamic partitioning (see Figure 1).
Rather than relying on static path tracing, dynamic partitioning leverages a full-instance transistor-level simulation, using a high-performance, large-capacity FastSPICE simulator and acquisition vectors specific to the data being collected, to derive a record of circuit activity. The critical paths for each timing arc, such as from the clock to the output bus or from the data or address buses and clock to each probe point, can be derived from the simulation results. The probe points where the clock and data paths intersect can be derived automatically from a graph traversal of the circuit, without design-dependent transistor-level pattern matching. A “dynamic partition” can then be created for each path, including all the necessary circuitry along with any “active” side-paths such as coupling aggressors.
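Conceptually, partition extraction can be pictured as a traversal that keeps only the circuitry that actually switched during the acquisition vectors. The sketch below is illustrative only, not a product API: the netlist graph, node names and activity record are all hypothetical, and a real flow would operate on extracted transistor-level data rather than a toy dictionary:

```python
from collections import deque

def extract_partition(graph, activity, start):
    """Collect every active node reachable from `start`; inactive
    fanout is excluded (it is later tied off, not simulated)."""
    partition, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        if node in partition or node not in activity:
            continue
        partition.add(node)
        frontier.extend(graph.get(node, ()))
    return partition

# Toy netlist graph: a clock path plus one active coupling aggressor.
graph = {
    "clk": ["wl_driver"],
    "wl_driver": ["bitcell", "aggressor"],
    "bitcell": ["sense_amp"],
    "aggressor": ["sense_amp"],     # active side path kept in partition
    "sense_amp": ["q"],
    "idle_column": ["sense_amp"],   # never toggles: excluded
}
activity = {"clk", "wl_driver", "bitcell", "aggressor", "sense_amp", "q"}
print(sorted(extract_partition(graph, activity, "clk")))
```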
The dynamic partitioning technique is particularly effective for extracting critical paths through circuitry that contains embedded analog elements, for example, sense amplifiers along the clock-to-output path. Figure 2 shows a dynamic partition for a memory clock-to-output path containing fewer than a thousand transistors.
Dynamic partitioning benefits
Another benefit of dynamic partitioning stems from the fact that memories, like many analog blocks, have partition boundaries that can shift during operation: a “read” operation following a “write” to a different location exercises quite different parts of the instance. By offering a more flexible and general-purpose solution, dynamic partitioning provides superior results in these situations. Once this comprehensive partitioning is complete, accurate SPICE simulations are performed independently on the decomposed partitions, and the assembled results faithfully represent the timing behavior of the original, larger instance.
The dynamic partitioning approach uses the full-instance FastSPICE simulation to determine large-scale characteristics such as power and partition connectivity, while the much smaller individual partitions, each typically containing fewer than a thousand transistors, are simulated using a SPICE simulator to ensure the highest levels of accuracy. The approach requires tight integration with a high-performance, high-capacity FastSPICE simulator that can accommodate multi-million-bit memories with RC parasitic elements while delivering superior turnaround performance. Communication between the full-block and partitioned sub-block simulations ensures identical initial conditions and DC solutions, which is essential for obtaining precise and consistent data. This analysis and partitioning technology can be coupled with sophisticated job control for parallel execution and distribution across the network.
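A hedged sketch of this two-level flow is shown below. Every function is an illustrative stub standing in for the real FastSPICE and SPICE invocations; the netlist name, vector file, arcs and numeric results are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

def fastspice_full_run(netlist, vectors):
    # Stub: one full-instance FastSPICE run would yield block-level
    # power, the activity record, and the DC operating point.
    activity = ["clk->q", "addr->q", "d->q_setup"]   # arcs that toggled
    dc_solution = {"vdd_internal": 0.99}             # volts, placeholder
    power_mw = 1.2                                   # placeholder
    return activity, dc_solution, power_mw

def spice_partition_run(job):
    # Stub: accurate SPICE on one small partition, seeded with the
    # full-block DC solution so both simulation levels agree.
    arc, dc_solution = job
    return arc, 42.0                                 # delay in ps (stub)

if __name__ == "__main__":
    activity, dc, power = fastspice_full_run("ram_2048x32.spi",
                                             "read_write.vec")
    with ProcessPoolExecutor() as pool:
        arcs = dict(pool.map(spice_partition_run,
                             [(a, dc) for a in activity]))
    print(f"block power: {power} mW; per-arc delays (ps): {arcs}")
```

Seeding each partition with the full-block DC solution is the design choice that keeps the small, fast SPICE jobs consistent with the behavior of the complete instance.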
In addition, the partitions can be presented to standard cell library characterization platforms as super-cells, enabling the generation of current source models (CCS and ECSM) for timing, noise and power analysis, as well as the generation of statistical timing models using the same proven statistical characterization methods applied to standard cells. The consistent application of a library-wide characterization approach ensures interoperability between all instances in the design, eliminating hard-to-find timing errors due to library element incompatibilities.
Because each partition is simulated using a SPICE simulator, the generated timing models are nearly identical to those obtained by simulating the complete block in SPICE, which is impractical for all but the smallest memory instances. The only source of error relates to tying off inactive gates and wires. Table 1 compares results from characterizing the entire block with the black box FastSPICE approach and results from dynamic partitioning against golden SPICE results. The accuracy loss due to dynamic partitioning is less than 1.5% for delay, transition and constraints, while the black box FastSPICE approach shows differences of up to 4.5%.
Table 1: Accuracy of Black Box FastSPICE, Dynamic Partitioning vs. SPICE
In addition to delivering accuracy, dynamic partitioning improves the CPU time and total turnaround time for memory characterization by an order of magnitude or more. Table 2 shows the total CPU time for a memory instance characterization using the black box FastSPICE approach compared with the dynamic partitioning approach.
Table 2: Performance improvement from dynamic partitioning
The dynamic partitioning approach can be deployed quickly, either for instance-based characterization or integrated into a memory compiler. The additional information required is minor: either stimulus or functional truth-tables derived from the memory datasheet. The approach is applicable to all flavors of embedded memory, such as SRAM, ROM, CAM and register files, as well as to custom macro blocks such as SERDES and PHYs.
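As an illustration of how little input is needed, a datasheet-style truth table can drive stimulus generation along the following lines. The pin names, polarities and the table itself are hypothetical, standing in for whatever the actual datasheet specifies:

```python
# Hypothetical datasheet truth table for a single-port synchronous RAM.
TRUTH_TABLE = {
    # (CEN, WEN) -> operation at the rising clock edge
    (0, 0): "write",
    (0, 1): "read",
    (1, 0): "standby",
    (1, 1): "standby",
}

def cycle_for(operation, addr, data=None):
    """One clock cycle of pin values exercising the given operation."""
    cen, wen = next(pins for pins, op in TRUTH_TABLE.items()
                    if op == operation)
    vec = {"CEN": cen, "WEN": wen, "A": addr}
    if data is not None:
        vec["D"] = data
    return vec

# Write a word, then read it back, so both the write and read paths
# toggle and can be captured in the activity record.
stimulus = [cycle_for("write", addr=0x1F, data=0xA5A5),
            cycle_for("read", addr=0x1F)]
print(stimulus)
```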
Applying dynamic partitioning to large memory instances eliminates the errors seen in solutions that rely on transistor-level static timing analysis, that use the black box FastSPICE approach, or that divide the memory statically. This new methodology provides the most precise and comprehensive models for design, verification and analysis available today, while enabling very fast runtimes, often an order of magnitude faster for large macros.
With this superior level of precision, designers can be much more selective in the application of timing margins during design and verification. As a result, they can achieve improved design performance, optimal die area and closer correlation with manufactured silicon. When coupled with a proven standard cell library characterization platform, a consistent characterization methodology can be used for cells and macros, critical for electrical signoff and analysis of large and complex SoC designs. Utilizing dynamic partitioning, designers benefit from the comprehensive, accurate and efficient models the approach provides, along with the significant speedup in total turnaround time.
About the authors
Federico Politi is a software architect responsible for the Cadence Liberate MX memory characterization tool. Prior to joining Cadence, Federico was a senior architect at Altos DA and the original author of Liberate MX. Prior to that, he held R&D positions at Magma DA and Circuit Semantics, Inc. Federico is an expert in circuit analysis, logic abstraction and transistor-level timing analysis. He holds an MSEE from the University of Trieste, Italy and has 14 years of experience in EDA.
Ahmed Elzeftawi is a senior product marketing manager at Cadence Design Systems, supporting library characterization and modeling products. Prior to joining Cadence, Ahmed held various roles in product development, applications engineering and technical marketing at Mentor Graphics. Ahmed holds a bachelor of science in electrical and communications engineering from Cairo University and an MBA from Santa Clara University. He has 13 years of experience in EDA.