Slash SoC power consumption in the interconnect

Technology News |
By eeNews Europe

A modular approach to SoC interconnect slashes power consumption with unit-level clock gating.

Power management has only grown in importance for system-on-chip (SoC) developers, yet one crucial area is often overlooked: the interconnect. Most power management efforts focus on the computational aspects of the SoC, but designers who adopt a more modular interconnect could reduce die size, alleviate routing congestion, and, by doing so, cut overall chip power consumption by as much as 0.7 milliwatts. A reduction this significant could be a game-changer in next-generation systems for mobility and power-conscious data center applications.

The modular concept differs from other types of interconnect because it consists of a distributed architecture of switches, buffers, firewalls, pipe stages, and clock and power domain crossings. By using a universal transport protocol between all of the separate units on the chip, the modular approach enables designers to implement unit-level clock gating, which eliminates clock tree switching power wherever no traffic is present.

Modular network-on-chip (NoC) technology also reduces power consumption by localizing logic, minimizing long wires, and keeping capacitance low. Designers who want to further enhance the power management abilities of their SoC design can reduce the area and leakage power consumption of the chip by using the simplicity of a NoC transport protocol to serialize data paths and thereby minimize logic.

Low Power Consumption
The top-level interconnect fabrics commonly used today typically rely on long wires that draw a disproportionate amount of power relative to the amount of logic area consumed on a chip. A clock tree is usually the greatest power sink within an interconnect, and clock gating provides the greatest potential for reducing it. Leakage power is the second greatest power sink, and reducing the logic area needed for the fabric can minimize leakage.

Designers considering a modular NoC interconnect will learn about the power and area benefits of the localization of clock tree management, data path serialization, and precisely located pipe stages.

Busses and Crossbars: Brief History of Interconnect
The history of interconnect fabrics shows how the philosophy of modular NoC design came to be, and addresses issues of scalability.

An SoC is a chip that combines a CPU with peripherals, and developers created interface protocol standards to link these elements together. With the advent of additional bus masters, connections to the peripherals were shared. Controlling access to the bus required a central arbiter, such as those used in board-level protocols.

Figure 1: A shared bus with an arbiter shows how access control requires a central arbiter.

Over time, SoC designs added more and more IP cores. As designs became more complex, they required more bus interfaces. When activity on a chip ramps up, a bus master can waste significant time waiting for access to the bus, even when different masters are requesting transactions to different slaves.

To combat this wait time, crossbar switches were created to allow concurrent accesses between different masters and slaves within on-chip interconnects. Below is a logical diagram showing four masters performing four simultaneous transactions to four different slaves.

Figure 2: The logical view of a crossbar switch in SoC design demonstrates the relationship of a multiplexer at each slave.

Physically, a crossbar switch is implemented with a multiplexer (mux) at each slave. Each mux is coupled with an arbiter in a distributed arbitration scheme.
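
As a rough illustration of the per-slave mux and arbiter described above, the following Python sketch models one cycle of distributed arbitration. The function name `crossbar_cycle`, the round-robin policy, and the integer master and slave IDs are assumptions made for this example; the article does not specify an arbitration policy.

```python
from collections import defaultdict


def crossbar_cycle(requests, last_grant):
    """Arbitrate one cycle of a crossbar with one arbiter-mux per slave.

    requests:   dict mapping master id -> slave id requested this cycle
    last_grant: dict mapping slave id -> master id granted most recently
                (round-robin state, updated in place; assumed policy)
    Returns a dict mapping slave id -> master id granted this cycle.
    """
    # Group the requesting masters by the slave they target.
    by_slave = defaultdict(list)
    for master, slave in requests.items():
        by_slave[slave].append(master)

    grants = {}
    for slave, masters in by_slave.items():
        masters.sort()
        prev = last_grant.get(slave, -1)
        # Round-robin: pick the first requester after the previous winner,
        # wrapping around to the lowest id if there is none.
        winner = next((m for m in masters if m > prev), masters[0])
        grants[slave] = winner
        last_grant[slave] = winner
    return grants


state = {}
# Four masters targeting four different slaves are all served concurrently.
concurrent = crossbar_cycle({0: 0, 1: 1, 2: 2, 3: 3}, state)
```

When two masters target the same slave, only one is granted per cycle and the other waits, which is the one case where the crossbar still serializes accesses.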

Figure 3: The implementation of a 4 master, 6 slave crossbar shows how the size of routing full data paths around the SoC is impractical.

This approach scales up to several master and slave interfaces. However, beyond a certain number, the size of routing full data paths around the SoC becomes impractical for place and route.

Figure 4: An SoC floor plan grows more complex as feature lists grow and IP blocks are added.

For more complex chips with more than several master/slave interfaces, it is necessary to design separate interconnects in multiple physical regions, depending on the placement of IP core groupings. Bridges linking between regions provide necessary connectivity between masters and slaves.

Figure 5: Interconnect of 4 masters by 6 slaves with a bridge carries logic delay overhead.

Bridges add cycles of latency to data transactions because they carry a logic delay overhead.

Crossbar interconnects fix a system architecture problem of concurrent accesses but create physical implementation issues in chips with large numbers of master and slave IP blocks.

Modular Design and NoCs
To reduce latency, addresses can be decoded at the master interface and converted to a simple route ID. An on-chip network of arbiter-muxes and router-demuxes can use simple route IDs and spread the distribution of routing by linking simple pseudo-switch muxes around the chip. This allows better placement of the interconnect logic. Placement is increasingly important for the growing number of wires in chips because it makes routing easier.
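
The idea of decoding the address once at the master interface and then routing on a simple ID can be sketched as follows. The address map, the route ID encoding, and the names `decode_route_id` and `hop_ports` are all hypothetical, chosen only to show the principle for a fabric with six targets.

```python
from math import prod

# Hypothetical address map: (base address, size in bytes, route id).
# Route id 0 is a large memory region; ids 1..5 are small peripherals.
ADDRESS_MAP = [(0x0000_0000, 0x8000_0000, 0)] + [
    (0x8000_0000 + i * 0x1000, 0x1000, i + 1) for i in range(5)
]


def decode_route_id(address):
    """Full address decode, done once at the master interface."""
    for base, size, route_id in ADDRESS_MAP:
        if base <= address < base + size:
            return route_id
    raise ValueError(f"address {address:#x} is not mapped")


def hop_ports(route_id, fanouts):
    """Output port chosen at each router-demux along the path.

    fanouts lists the number of outputs of each router stage. Each stage
    only extracts its own digit from the route id (an assumed mixed-radix
    encoding), so no stage needs the full address.
    """
    ports = []
    for i, fanout in enumerate(fanouts):
        stride = prod(fanouts[i + 1:])  # product of later stages, 1 if none
        ports.append((route_id // stride) % fanout)
    return ports


route = decode_route_id(0x8000_1004)   # peripheral with route id 2
path = hop_ports(route, [2, 3])        # two router stages: 2-way, then 3-way
```

Because each stage examines only a few bits of the route ID, the switching logic stays small and can be placed wherever the floor plan needs it.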

NoC interconnect addresses both problems, the added latency of bridges and the routing difficulty of large crossbars, and it has become widely used in advanced designs such as mobile phone applications processors and digital TV and set-top box controllers.

Figure 6: Interconnect of 4 masters by 6 slaves with a NoC

Designers have been asked to integrate more features in their SoCs, so the demands on interconnect technology have grown. To keep pace, the following features are in high demand:

  • Interfaces to different transaction protocols
  • Switches (demux-routers and arbiter-muxes)
  • QoS (priority)
  • Buffers
  • Data path serialization
  • Statistics probes
  • Debug tracing
  • Firewalls
  • Register slices (pipe stages)
  • Clock domain crossings
  • Voltage domains
  • Power domains

These have caused new challenges in interconnect design.

Designers want IP to be reusable and reconfigurable. Supporting the growing feature requirements within the logic of crossbars creates complexity and can slow critical paths. Furthermore, many wires are toggled even for a small volume of traffic, which consumes a disproportionate amount of power. A reusable modular interconnect design, by contrast, offers advantages in simplicity, speed, area, and power efficiency by overcoming the complexities of older bus and crossbar technology.

Transaction, Transport and Physical Layers
NoC technology employs a 3-layer protocol with the transaction layer serving as the highest. It performs the reads and writes requested using AMBA, PIF, OCP, or other industry standard protocols. It is also the interface visible to the designers of the IP blocks connected through the interconnect.

The transport layer protocol in NoC is managed by network interface units (NIUs). It creates one or more packets for each transaction. All packets have a header. Read data and write data packets include the data payload after the header. The packet header encodes addresses, transaction parameters, and sideband signals as fields. The NIU controls outstanding transactions and tagged sequences. The header format is minimal, and optimized differently for each NoC. The header is used at each pseudo-switch within the interconnect to route requests from initiators to targets and responses from targets to initiators. The request and response paths are independent, which eliminates logic and architectural dependencies, making deadlocks impossible.
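
A minimal sketch of how an NIU might turn a transaction into a packet is shown below. The header fields, their names, and the function `packetize_request` are assumptions for illustration only; as noted above, real header formats are optimized differently for each NoC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    """Hypothetical packet header; field names are illustrative only."""
    route_id: int   # used by each switch to route the packet
    opcode: str     # "RD" or "WR"
    address: int    # target address of the transaction
    length: int     # number of bytes read or written
    tag: int        # lets the NIU match responses to outstanding requests


def packetize_request(route_id, opcode, address, data=b"", length=0,
                      tag=0, beat_bytes=8):
    """Build a request packet as a list: header first, then payload beats.

    A write request carries its data after the header. A read request is
    header-only; the data travels in the response packet instead.
    """
    if opcode == "WR":
        header = Header(route_id, "WR", address, len(data), tag)
        beats = [data[i:i + beat_bytes]
                 for i in range(0, len(data), beat_bytes)]
        return [header, *beats]
    if opcode == "RD":
        return [Header(route_id, "RD", address, length, tag)]
    raise ValueError(f"unknown opcode {opcode!r}")


write_packet = packetize_request(2, "WR", 0x8000_1000, data=bytes(16), tag=7)
read_packet = packetize_request(0, "RD", 0x0000_2000, length=64, tag=8)
```

Here a 16-byte write on an 8-byte data path becomes one header followed by two payload beats, while the read request consists of the header alone.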

Figure 7: Multiplexing of address/control signals with data between the transaction interface and the packet transport interface simplifies interconnect design.

The modular design enables transported packets to be transferred on the physical layer using a very simple protocol. The protocol consists of the following signals:

  • Data [N bits] (driven by the sender)
  • Valid [1 bit] (driven by the sender)
  • Ready [1 bit] (driven by the receiver)

“Valid” and “Ready” implement flow control, which enables back-pressure feedback. This simple handshake protocol exists between all units of the NoC. Standardizing on a simple interface allows units to be connected interchangeably, in the style of children’s plastic interlocking building blocks.
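
The handshake can be modeled in a few lines of Python. This is a behavioral sketch rather than RTL, and the function name `simulate_link` and the per-cycle `ready_pattern` input are assumptions of the example: a word moves across the link only in a cycle where the sender asserts Valid and the receiver asserts Ready.

```python
from collections import deque


def simulate_link(words, ready_pattern):
    """Simulate a Valid/Ready link, one cycle per entry in ready_pattern.

    words:         data words the sender wants to transfer, in order
    ready_pattern: iterable of booleans, the receiver's Ready per cycle
    Returns (received words, list of cycle indices where a transfer occurred).
    """
    pending = deque(words)
    received = []
    transfer_cycles = []
    for cycle, ready in enumerate(ready_pattern):
        valid = len(pending) > 0          # sender has data to offer
        if valid and ready:               # handshake completes this cycle
            received.append(pending.popleft())
            transfer_cycles.append(cycle)
        # Otherwise the sender holds its data (back-pressure) or is idle.
    return received, transfer_cycles


# The receiver stalls in cycle 1, so the second word waits until cycle 2.
received, cycles = simulate_link([0xA, 0xB, 0xC], [True, False, True, True])
```

The stalled cycle shows back-pressure at work: the sender simply holds its word until Ready returns, with no extra signaling needed.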

Clock Tree Gating
With well-known chip design methodologies, it is possible to gate the clock at each flip-flop during cycles in which toggling is not required. This is applicable to the flops in all interconnect technologies; however, it does not address clock tree power consumption.

The clock tree is a single signal and therefore much narrower than data paths. However, to reach all physically distributed flops, the clock tree has a lot more metal than each data path bit. Since clocks, by definition, toggle twice per clock cycle, the clock tree typically consumes significantly more power than data paths.

In a crossbar, every clock net toggles even when and where data is not flowing. While it is theoretically possible to achieve some clock gating to all crossbar logic in cycles when no data is transferred anywhere in the crossbar, it is impractical. It would require a large clock gating mux of multiple distant signals to generate enable signals back to several distant flops.

Therefore, building the interconnect from atomic modules of combinatorial logic allows unit level clock gating with much finer granularity than is possible within a monolithic crossbar.

Figure 8: Unit Level Clock Gating using combinatorial logic is possible by building the interconnect through a modular approach.

Registers within and between units only toggle when the valid handshake signal is asserted, indicating that data traffic is present. Gating logic is local to each unit, making paths shorter and minimizing the muxing required to generate the enable signal. Clock gating is distributed, and each module of the modular interconnect is gated off for idle clock cycles, regardless of the state of the rest of the system. This gives nearly ideal minimum switching power consumption.
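
To make the effect concrete, the sketch below counts how many flop clock events occur with and without unit-level gating. The unit names, flop counts, and traffic pattern are invented for illustration and do not come from a real design; the function name `clocked_flop_cycles` is likewise hypothetical.

```python
def clocked_flop_cycles(activity, flops_per_unit, gated):
    """Count flop clock events over a run of cycles.

    activity:       list with one entry per cycle; each entry is the set of
                    unit names whose valid signal is asserted in that cycle
    flops_per_unit: dict mapping unit name -> number of flops in the unit
    gated:          True for unit-level gating (a unit is clocked only when
                    it sees valid), False for an ungated fabric
    """
    total = 0
    for active_units in activity:
        for unit, flops in flops_per_unit.items():
            if not gated or unit in active_units:
                total += flops
    return total


# Hypothetical units and traffic: one transfer through all three units,
# two idle cycles, then activity in the CPU-side NIU only.
FLOPS = {"niu_cpu": 100, "switch0": 50, "niu_ddr": 100}
ACTIVITY = [{"niu_cpu", "switch0", "niu_ddr"}, set(), set(), {"niu_cpu"}]

ungated = clocked_flop_cycles(ACTIVITY, FLOPS, gated=False)  # 4 * 250
gated = clocked_flop_cycles(ACTIVITY, FLOPS, gated=True)     # 250 + 100
```

In this toy trace the gated fabric clocks 350 flop-cycles against 1000 for the ungated one. The real saving depends entirely on the traffic pattern and on how the units are partitioned.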

Other Benefits of Modularity
Aside from clock gating, other benefits include improved use of mixed threshold voltage (Vt) synthesis, reduced leakage power consumption, improved logic simplicity, and localization.

The ability to insert pipe stages anywhere between small modules to meet timing requirements with minimum latency improves the ability of synthesis tools to close timing. With greater margin, synthesis promotes fewer paths from default high Vt cells to faster low Vt cells. In this way, pipelining between the elements of a modular design reduces leakage.

Furthermore, easier timing closure also improves the use of EDA tools to optimize for minimum area. (A smaller die area reduces leakage power.)

A 64-bit AXI transaction interface protocol requires at least 272 wires. For the modular approach, a 64-bit packet interface requires 148 wires (64 bits data + 8 byte enables + ready + valid = 74 in each of the request and response networks). As a result, packetizing transactions to transport them between initiators and targets reduces wire count within the chip floorplan by a factor of 1.8 (272/148 = 1.8).
Because this approach uses a simple physical layer protocol for interfaces between units, it is easy to change the serialization of packet data. All that is required is a simple mux and register to reduce the data path width.
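
The wire-count comparison above can be reproduced directly. The 272-wire figure for AXI is taken from the text as given; the function name `packet_interface_wires` is an assumption of this sketch.

```python
AXI_64BIT_WIRES = 272  # minimum quoted in the text for a 64-bit AXI interface


def packet_interface_wires(data_bits):
    """Wires for a packet interface: request plus response networks.

    Each network carries the data bits, one byte enable per data byte,
    and the two handshake signals (ready and valid).
    """
    byte_enables = data_bits // 8
    per_network = data_bits + byte_enables + 1 + 1
    return 2 * per_network


packet_wires = packet_interface_wires(64)        # 2 * (64 + 8 + 1 + 1) = 148
reduction = AXI_64BIT_WIRES / packet_wires       # about 1.84
print(f"{packet_wires} wires, reduction factor {reduction:.1f}")
```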

Changing the serialization of data paths to be no larger than needed to meet bandwidth requirements in different parts of the chip reduces the interconnect logic area for all parts of the chip that require less than the maximum bandwidth. Generally, a large majority of the top level interconnects in most chips do not require maximum bandwidth.
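
A behavioral sketch of width conversion follows. It assumes the narrow width divides the wide width evenly and sends the least significant slice first; both choices, and the function names, are assumptions of the example rather than details from the article.

```python
def serialize(word, wide_bits, narrow_bits):
    """Split one wide word into narrow beats, least significant first.

    Models the mux that selects one slice of the wide word per cycle.
    """
    if wide_bits % narrow_bits != 0:
        raise ValueError("narrow width must divide the wide width")
    mask = (1 << narrow_bits) - 1
    beats = wide_bits // narrow_bits
    return [(word >> (i * narrow_bits)) & mask for i in range(beats)]


def deserialize(beats, narrow_bits):
    """Reassemble the wide word from narrow beats (the inverse operation)."""
    word = 0
    for i, beat in enumerate(beats):
        word |= beat << (i * narrow_bits)
    return word


# A 64-bit word crosses a 16-bit link in four cycles.
beats = serialize(0x1122334455667788, 64, 16)
restored = deserialize(beats, 16)
```

The narrower link needs a quarter of the data wires and registers, at the cost of four cycles per word, which is an acceptable trade wherever the link's bandwidth requirement is low.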

By localizing units such as the muxes between interfaces, the average length of wires between units is shorter. That means that less current is consumed due to the capacitance of wires. It also simplifies the back end layout process by reducing connectivity dependencies between logic that is necessarily placed at great distances.

Results on Set-Top Box (STB) SoC
A hypothetical modular NoC interconnect for a mid-range set-top box SoC supporting 1080p120 video display demonstrates the advantages of the modular approach. The model uses an interconnect of 11 master and 6 slave NIUs and consumes a logic area of 183k gates.

Clock-gated switching activity is analyzed for three scenarios. The first is a worst-case video processing scenario where the video decoder, set for 120 Hz display output, and the CPU combine to heavily load the system and consume nearly all available DDR memory bandwidth.

The second scenario depicts average-case video decode complexity. The third scenario represents web browsing with no video decode and a modest display rate of 30 frames per second.

Table 1: Video decode activity of a set-top box chip provides analysis of the effects of clock-tree gating.

Crossbars have to be enabled for every cycle during DDR activity, so the modular design reduces clock toggling, and with it switching power, by a factor of 2.3 in the first scenario, 2.5 in the second, and 3.4 in the third.
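
Reading each factor above as the ratio of crossbar toggling to modular NoC toggling (an interpretation, since the underlying table data is not reproduced here), the equivalent percentage saving is 1 - 1/factor. The scenario labels and the function name in this short sketch are assumptions.

```python
# Reduction factors quoted in the text for the three scenarios.
FACTORS = {"worst-case video": 2.3, "average video": 2.5, "web browsing": 3.4}


def percent_saved(factor):
    """Toggle saving in percent, assuming factor = crossbar / modular NoC."""
    return 100.0 * (1.0 - 1.0 / factor)


for scenario, factor in FACTORS.items():
    print(f"{scenario}: {factor}x fewer toggles, "
          f"about {percent_saved(factor):.0f}% saved")
```

Under that reading, the three scenarios correspond to savings of roughly 57%, 60%, and 71% of clock toggles.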

In a standby scenario, modular NoC interconnects have demonstrated even greater toggle savings over crossbars. Furthermore, larger chips have more master NIU logic that accesses the same limited, shared resources. Such chips have a larger number of flops gated for a larger percentage of time. As a result, toggle savings for a modular NoC design improve with increased chip size.

Reducing Clock Tree Power Consumption
A modular NoC significantly reduces the power requirements of the top-level interconnect fabric in highly integrated chips. By localizing clock gating, clock tree power is consumed only along the routes on which data is transferred and only during the cycles when it is transferred. This greatly reduces clock tree power consumption. Furthermore, localized serialization minimizes the data path logic needed to support the bandwidth requirements of each link, which in turn reduces area and therefore leakage. Additionally, modularity allows fine granularity of pipelining in order to close timing without wasted margin, which allows the synthesis tools to use smaller, more efficient gates.
