Eight tips to accelerate SoC physical design at RTL
Silicon design usually suffers from a disconnect between the needs of logical designers and those of physical architects. This disconnect leads to costly iteration loops to reconcile incompatible choices made by design teams working in isolation. In this article, we will review the essential points to consider in order to ensure a smooth transition between the logical and physical worlds.
Logical and physical implementation context
Let’s have a look at the situation. At the beginning of the design process, the SoC’s initial representation is captured based on the functional description of the circuit and the logical architecture suitable to achieve the functionality and performance. This is usually expressed as shown in figure 1.
Fig. 1: Typical SoC today.
When it comes to physical implementation – with the assumption that flattening the entire design is not an option due to the size of modern SoCs and the limited capacity of place and route tools at deep submicron nodes – we have to determine a suitable hierarchy for the backend implementation to result in an optimal design.
The traditional approach was to mimic, in the physical domain, the hierarchy inherited from the RTL coming from the logical assembly of the SoC – see figure 2a. The main drawbacks of this approach are the huge complexity of the top-level floor plan, the overall synchronization of the sub-blocks’ development, and the final timing convergence.
Fig. 2: Different physical implementation strategies of the SoC.
Recently, it became more common to harden some specific parts of the design, creating the adequate level of hierarchy in the RTL and flattening the remaining part of the SoC – see figure 2b. The benefit of this methodology is that it limits the complexity of the top-level floor plan. However, since the bus fabric is implemented at the top level, timing convergence remains a challenge, and the wire dominant nature of the bus potentially makes the silicon utilization less than optimal.
To improve on the previous approach, a designer can choose not to have any logic at the top level of the circuit by pushing all the circuit components, including the bus fabric, into the physical partitions, leaving only inter-partition connections at the top level – see figure 2c. Alternatively, the designer could connect those physical partitions by abutment (inserting feedthroughs for the connections that have to traverse a physical partition). This approach leads to highly optimized use of the silicon (as the wire-dominant bus is merged into blocks that are more gate intensive), along with predictable timing closure, provided a number of good design practices have been followed.
Tips for optimized SoC realization
At the beginning of the implementation process, it is important to identify the degrees of freedom that exist: The IO ring (typically predefined) provides strong constraints to the placement of interface blocks, while the elements that primarily connect to the internal bus have more flexibility for their location within the die.
After this step, a partitioning study must be started in order to define the proper grouping of IPs within physical partitions, at a stage where enough information is available (at pre-RTL stage, at RTL, or at netlist level) depending on tools available and desired precision.
Balancing the physical partition area
If at all possible, having physical partitions of similar size simplifies the top level floor plan along with the pin distribution at the partition border. Accommodating blocks of extremely different sizes often requires creating complex rectilinear shapes leading to a more congested routing.
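As a rough illustration of the balance criterion, the sketch below scores a candidate partitioning by the ratio between its largest and smallest partition areas. The function name, the partition names, and the area figures are all hypothetical; a real flow would take its area estimates from synthesis or floor planning.

```python
def area_imbalance(partition_areas):
    """Return the largest/smallest area ratio; 1.0 means perfectly balanced."""
    areas = list(partition_areas.values())
    if not areas or min(areas) <= 0:
        raise ValueError("every partition needs a positive area")
    return max(areas) / min(areas)


# Hypothetical partition areas in mm^2 (illustrative numbers only).
candidate = {"cpu_ss": 4.0, "gpu_ss": 5.0, "modem_ss": 4.0, "periph_ss": 1.0}
print(area_imbalance(candidate))  # 5.0
```

A ratio far above 1 hints that the smallest partition might be merged into a neighbor, or the largest one split, before the floor plan is drawn.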
Minimizing top level routing
Reducing the number of wires to be routed at the top level (or using feedthroughs) is a good design practice that makes the top level easier to implement. To achieve this, it helps to clone the specific logic blocks that generate many wires distributed across the die (such as reset and DFT controllers, or some clock generators).
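One simple way to compare candidate groupings is to count the nets that would have to be routed at the top level, that is, the nets touching more than one physical partition. The sketch below assumes a hypothetical connectivity table (net name to the IP blocks it touches) and a hypothetical block-to-partition mapping; none of these names come from a real design.

```python
def top_level_nets(nets, partition_of):
    """Return the sorted names of nets that touch more than one partition."""
    crossing = []
    for name, blocks in nets.items():
        partitions = {partition_of[block] for block in blocks}
        if len(partitions) > 1:
            crossing.append(name)
    return sorted(crossing)


# Hypothetical connectivity: net name -> IP blocks it touches.
nets = {
    "rst_n": ["rst_ctrl", "cpu", "dsp", "usb"],
    "cpu_axi": ["cpu", "l2"],
    "usb_irq": ["usb", "cpu"],
}

# Grouping A: USB sits in its own partition.
grouping_a = {"rst_ctrl": "P0", "cpu": "P0", "l2": "P0", "dsp": "P1", "usb": "P2"}
# Grouping B: USB is grouped with the CPU it interrupts.
grouping_b = {"rst_ctrl": "P0", "cpu": "P0", "l2": "P0", "dsp": "P1", "usb": "P0"}

print(top_level_nets(nets, grouping_a))  # ['rst_n', 'usb_irq']
print(top_level_nets(nets, grouping_b))  # ['rst_n']
```

In this toy case the reset net still crosses partitions under both groupings, which is the kind of widely distributed signal for which cloning the controller is worth considering.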
Minimizing high speed connections
If the connections between physical partitions can be low-speed connections, the final timing convergence will be easier. Clever IP grouping choices will help achieve this goal.
Optimizing the clock distribution
Enforcing “clock confinement” (i.e. keeping each synchronous clock domain bounded inside a single physical partition, as opposed to distributing a synchronous clock SoC-wide) is, together with minimizing high speed connections, the best strategy to ease the top level timing convergence. If properly achieved, along with reasonable timing budgeting at the physical partition border, it confines the timing challenges within the physical partitions, where they can be solved locally. As a result, the final assembly of the SoC will not exhibit problems that require reopening partitions that are already timing clean.
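Clock confinement can be checked mechanically once each block's clocks and partition are known. The following is a minimal sketch under that assumption; the block names, clock names, and the two input tables are hypothetical and would in practice be extracted from the design database.

```python
from collections import defaultdict


def unconfined_clocks(block_clocks, block_partition):
    """Map each clock that reaches several partitions to those partitions."""
    reach = defaultdict(set)
    for block, clocks in block_clocks.items():
        for clock in clocks:
            reach[clock].add(block_partition[block])
    return {clk: sorted(parts) for clk, parts in reach.items() if len(parts) > 1}


# Hypothetical design data: block -> clocks it uses, block -> partition.
block_clocks = {
    "cpu": ["clk_cpu"],
    "l2": ["clk_cpu"],
    "usb": ["clk_usb"],
    "dma": ["clk_cpu", "clk_usb"],
}
block_partition = {"cpu": "P0", "l2": "P0", "usb": "P1", "dma": "P1"}

print(unconfined_clocks(block_clocks, block_partition))
# {'clk_cpu': ['P0', 'P1']}
```

Here the DMA block pulls `clk_cpu` into a second partition; moving it next to the CPU, or making its CPU-side interface asynchronous, would restore confinement in this example.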
Honoring power domains
Multiplication of independent power domains (either standby area or DVFS domains) in circuits for the mobile market creates additional constraints to honor when defining the physical implementation strategy. Special care must be taken to ensure that different power domains will be properly isolated and that feedthrough signals (which can possibly cross power domain boundaries) will be buffered on the proper power supply.
Optimizing the bus fabric architecture
All the techniques listed above will drive the design team to rethink the bus fabric architecture according to the physical implementation needs. Doing this will allow the design team to match bandwidth and latency requirements and take each of the previous points into account during the final optimization of the bus design.
Physically driven grouping of IPs has major implications for the bus architecture which cannot be anticipated at the early design stage. For a given class of traffic, IP connection slots will be permuted to allow the proper IP grouping to take place. The timing optimization of the bus design (pipeline insertion) cannot be done until this final version of the bus is available.
Verification of clock and power domain crossing
A single clock domain crossing error can kill the functionality of the entire SoC by introducing non-deterministic behavior caused by flip-flop metastability. This leads to random failures depending on PVT conditions, which are extremely costly to debug at silicon level. This verification can’t be delayed until the final SoC assembly but needs to occur in a hierarchical way; it must be an intrinsic part of the handoff of the IP. Special care must be taken, as the integration context of an IP may change the clock relationships with respect to the original IP designer’s assumptions.
All asynchronous/synchronous assumptions have to be carefully reviewed. It is also important to eliminate synchronous connections between different power domains, as those are almost impossible to close across the variety of possible voltage scenarios.
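The last rule lends itself to a simple screening pass. The sketch below flags connections whose endpoints share a clock but sit in different power domains. Treating "same clock name" as "synchronous" is a deliberate simplification, and all block, clock, and power domain names are hypothetical.

```python
def risky_crossings(connections, clock_of, power_of):
    """List (src, dst) pairs that share a clock yet cross power domains."""
    return [
        (src, dst)
        for src, dst in connections
        if clock_of[src] == clock_of[dst] and power_of[src] != power_of[dst]
    ]


# Hypothetical design data.
connections = [("cpu", "l2"), ("cpu", "pmu"), ("usb", "cpu")]
clock_of = {"cpu": "clk_core", "l2": "clk_core", "pmu": "clk_core", "usb": "clk_usb"}
power_of = {"cpu": "PD_CORE", "l2": "PD_CORE", "pmu": "PD_AON", "usb": "PD_IO"}

print(risky_crossings(connections, clock_of, power_of))
# [('cpu', 'pmu')]
```

A dedicated CDC or power-intent checker does far more than this, but even such a coarse list can be reviewed at IP handoff rather than at final assembly.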
Completeness of timing constraints
The quality of the timing constraints that will be used to sign off the design is another key item to watch. The creation of the final SoC level SDC is a combination of a bottom up and top down approach. Timing exceptions are inherited from the IP, and clock constraints are propagated from the top of the SoC. Ensuring the coherency between the various sets of SDCs is not a trivial task, especially when the design is built from multiple sources (3rd party IPs, internal design reuse, new design blocks, etc.) and the team is located on different continents (which is the case for most modern designs). Here, communication between the team members is the challenge.
Those timing constraints have to be validated early in the design process for all modes of operation to drive the physical implementation. If properly budgeted at the border of each physical partition (and signed off at the partition level after implementation), then the final assembly and chip timing closure will not be a major challenge.
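To make the idea of budgeting at the partition border concrete, here is a minimal sketch that splits one clock period across a source partition, a top-level route, and a destination partition. It ignores setup time, clock skew, and uncertainty, and the function name, the default 50/50 split, and the numbers are assumptions made for illustration only.

```python
def budget_path(clock_period_ns, top_route_ns, src_share=0.5):
    """Split the time left after top-level routing between two partitions."""
    remaining = clock_period_ns - top_route_ns
    if remaining <= 0:
        raise ValueError("top-level route alone exceeds the clock period")
    src_budget = remaining * src_share
    dst_budget = remaining - src_budget
    return {
        "src_budget_ns": src_budget,  # logic allowed before the source port
        "dst_budget_ns": dst_budget,  # logic allowed after the destination port
        # Delay seen outside each partition, i.e. everything it does not own.
        "src_external_ns": clock_period_ns - src_budget,
        "dst_external_ns": clock_period_ns - dst_budget,
    }


# Hypothetical 2 ns clock with 0.5 ns reserved for the top-level route.
print(budget_path(2.0, 0.5))
# {'src_budget_ns': 0.75, 'dst_budget_ns': 0.75,
#  'src_external_ns': 1.25, 'dst_external_ns': 1.25}
```

The two external values are the kind of figures that could be turned into `set_output_delay` and `set_input_delay` constraints on the partition ports, so that each partition can be closed and signed off on its own.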
Restructuring the design RTL
We discussed the benefit of restructuring the design hierarchy to fit with optimized implementation.
The question now is: at which stage of the design should this restructuring occur? There are two main reasons to restructure at the RTL:
Benefit from RTL synthesis at physical partition level
a) Propagating the input tie-up, tie-down, and floating outputs through the hierarchy will reduce the size of the netlist to implement. This reduction comes for free during the synthesis process, while it is more complex to implement at the netlist level, where scanned flip-flops prevent this optimization from occurring within a stitched scan chain.
b) In-context synthesis with wire load models estimated from the predicted size of the physical partition will produce an “easier to implement” netlist with respect to timing criteria.
c) Physically-aware DFT insertion (scan ordering and compressor insertion) will reduce congestion spots.
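The effect described in point a) can be pictured with a toy constant propagation pass. The sketch below is not how a synthesis tool is implemented; it only shows, on a hypothetical three-gate netlist, how a single tied-off input makes downstream gates constant and therefore removable.

```python
def propagate_constants(gates, tied):
    """Return every net known to be constant once tie-offs are propagated.

    gates maps an output net to (operator, input nets); tied maps a net to 0 or 1.
    """
    known = dict(tied)
    changed = True
    while changed:
        changed = False
        for out, (op, inputs) in gates.items():
            if out in known:
                continue
            vals = [known.get(net) for net in inputs]  # None means unknown
            result = None
            if op == "AND":
                if 0 in vals:
                    result = 0
                elif all(v == 1 for v in vals):
                    result = 1
            elif op == "OR":
                if 1 in vals:
                    result = 1
                elif all(v == 0 for v in vals):
                    result = 0
            elif op == "NOT":
                if vals[0] is not None:
                    result = 1 - vals[0]
            else:
                raise ValueError(f"unsupported operator: {op}")
            if result is not None:
                known[out] = result
                changed = True
    return known


# Hypothetical netlist: test_en is tied low at the partition boundary.
gates = {
    "n1": ("AND", ["a", "test_en"]),
    "n2": ("OR", ["n1", "b"]),
    "n3": ("NOT", ["n1"]),
}
print(propagate_constants(gates, {"test_en": 0}))
# {'test_en': 0, 'n1': 0, 'n3': 1}
```

In this example the gates driving `n1` and `n3` become constants and can be dropped, while `n2` still depends on `b`.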
What enables RTL restructuring is the ability to start high-level floor planning and timing evaluation from the RTL representation. There are now reliable tools on the market that perform this task with adequate precision and generate the directives that drive the RTL and SDC manipulation. The sequence of operations is depicted in figure 3.
Fig. 3: Moving from original RTL to restructured RTL.
To better control schedule predictability
Starting this physical activity at the RTL is another way to discover early on (at a time when the RTL design may still be influenced) blocking points that could end up in long fixing iterations if discovered at the final stages of the design closure. This is called “prototyping.” A coarse grain synthesis followed by a fast placement generates a representation of the whole SoC, which is used to evaluate congestion and timing issues early on and to determine corrective actions. It is the starting point of the partitioning activity we discussed above.
The benefit here is to eliminate the loops from final timing analysis to RTL design, replacing them with local loops that are easier to control.
The case study highlighted below shows the advantage of allocating time for “prototyping” the physical implementation of the whole SoC in advance (scenarios 2 and 3 compared to scenario 1 in figure 4), along with the additional benefit of doing this prototyping at the RTL rather than at the netlist level (scenario 3 compared to scenario 2).
Fig. 4: Case study on prototyping benefits.
Design complexity and market pressures make the die area and schedule control of modern SoCs important success criteria. It is therefore crucial to anticipate and validate critical design decisions as soon as possible. It is now possible to do this assessment at the RTL to bring more physical awareness within the RTL design phase to further optimize the silicon utilization and bring more controllability in the implementation process.
About the author
Francois Rémond is Solutions Architect at Atrenta www.atrenta.com