Overcoming the embedded CPU performance wall

By eeNews Europe



The physical limitations of current semiconductor technology have made it increasingly difficult to achieve frequency improvements in embedded processors, and so designers are turning to parallelism in multicore architectures to achieve the high performance required for current designs.

Current status of multicore SoC design and use
Over the last few years there has been an increase in microprocessor architectures featuring multi-threading or multicore CPUs. They are now the rule for desktop computers and are becoming common even in the high-end embedded market. This increase is the result of processor designers’ desire to achieve higher performance. But silicon technology is reaching its limits for raising performance through clock frequency alone, so the ever-increasing demand for processing power now depends on architectural solutions such as replicating processor cores inside microprocessor-based systems-on-chip (SoCs).

Moore’s law states that the number of transistors that can be fit onto a square inch of silicon doubles every two years, as the size of transistors shrinks. It was postulated in 1965 by Gordon E. Moore, who at that time was Fairchild Semiconductor’s Director of R&D and later co-founded Intel. Moore originally projected a doubling every year and revised the period to two years in 1975.

Although the word “law” is used to describe his projection, Moore’s prediction is not a law of physics but a conjecture based on empirical observation of the technology in the 1960s and 1970s. In the short history of modern computing there have been many guesses and predictions, and not a few of them have proved wrong. That makes Moore’s law all the more impressive: it has remained accurate from the time it was first postulated up to the present, and it is expected to hold for at least another decade.

Moore’s law continues to hold because the ability to shrink the size of the components on a chip has enabled designers to continuously increase the density of transistors in processors, memories, and other devices. With smaller transistors you can add more functional units to a processor and build more complex architectures in the same die area.

Thanks to this higher density, techniques like branch prediction or out-of-order execution are now common features in modern processors, even though they are resource hungry. This leads to improved IPC (Instructions Per Cycle), i.e. improved instruction throughput, which together with clock rate is one of the two fundamental sources of overall processor performance. A smaller transistor size also allows higher clock rates. When you shrink the gate length of a transistor by a factor of 1/k, transistor switching time, and with it circuit delay, is reduced by roughly the same factor, so you can achieve a clock rate multiplied by a factor of k. Operating at higher frequencies, processors achieve higher performance, but at a cost.

However, designers are now encountering practical limits to this progression. Increasing the density of transistors and the frequency on a chip has side effects that become more severe as transistor size shrinks. The two of primary concern, and the main barriers to further progress, are higher power consumption and higher transmission delays.

Power consumption on a chip
The power consumption on a chip and the associated heat dissipation are becoming a big barrier for hardware designers. With the constant increase in the number of transistors, current processors demand a considerable amount of energy in a very small area. This means a high power density that must be dissipated as heat. And it is not only the number of transistors: high operating frequencies also have a serious impact on power consumption, as we will see next.

To get an idea of the evolution of these parameters in the last decades, Figure 1 shows transistor count and operating frequency increments for x86 Intel architectures over a period of 20 years, starting with the 80386 architecture, the first 32-bit x86 processor.

Figure 1: Transistor count and frequency for the X86 architecture

Note that both parameters are shown on logarithmic scales, which shows how steep their growth has been. With respect to power, Figure 2 shows typical power dissipation for these processors, this time on a linear scale.

Figure 2: Power consumption of succeeding generations of X86 processors

The increase in the number of transistors continues. Some of the latest Intel Core i7 processors feature more than 2200 million transistors. The dissipated power also increases slightly, depending on the model, reaching values of 130 W. However, clock frequency in these new processors is not increasing and remains around 3.5 GHz.

One of the reasons for this stagnation is that current integrated circuits have reached physical limits of power density, generating as much heat as the chip package is able to dissipate, and consequently hardware designers have had to limit frequency increments. It is true that Intel has never sacrificed performance for power efficiency, but now physical consequences leave them with no option but to look carefully at power consumption.

Some equations better demonstrate how frequency and transistor count affect power consumption on a chip. A few simple mathematical relationships will make it clear why these parameters are so important in today’s designs.

The following equation shows how power dissipation on a chip relates to operating frequency and other factors:

P = A·C·V²·f + A·t·V·Isc·f + V·Ileak

This is the expression for power dissipation in CMOS technology, the dominant semiconductor technology for integrated circuits today. The first part (addend) of the equation accounts for the dynamic power consumption on the chip, i.e. the power consumed in charging and discharging capacitive loads when transistors switch, which represents the useful work performed by the chip. A is the activity factor, meaning the proportion of transistors that switch in each cycle (since not all transistors have to switch every clock cycle); C is the capacitive load of the transistors; V is the supply voltage; and f is the frequency.

The second addend in the equation also accounts for dynamic power, although it is a smaller contribution, in this case caused by the transitory short-circuit current (Isc) that flows through transistors from voltage source to ground during the finite rise or fall time t. The last addend accounts for the static power consumption, i.e. the power consumption due to leakage current (Ileak), and it is the only one present in a circuit that is powered but inactive. It applies to the whole circuit independently of the transistors’ state, and therefore the activity factor does not appear in this addend.
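To make the three addends concrete, the sketch below evaluates them for one set of inputs. The function name and every numeric value are hypothetical, chosen only to give round results; they are not measurements of any real processor.

```python
def cmos_power(a, c, v, f, t_sc, i_sc, i_leak):
    """Evaluate the three addends of the CMOS power expression, in watts.

    a      -- activity factor (fraction of the load switched per cycle)
    c      -- switched capacitive load, farads
    v      -- supply voltage, volts
    f      -- clock frequency, hertz
    t_sc   -- rise/fall time during which short-circuit current flows, seconds
    i_sc   -- short-circuit current, amperes
    i_leak -- leakage current, amperes
    """
    dynamic = a * c * v ** 2 * f             # charging/discharging loads
    short_circuit = a * t_sc * v * i_sc * f  # transient supply-to-ground path
    static = v * i_leak                      # leakage, independent of activity
    return {
        "dynamic": dynamic,
        "short_circuit": short_circuit,
        "static": static,
        "total": dynamic + short_circuit + static,
    }


# Hypothetical chip: 100 nF switched load, 1.0 V supply, 3 GHz clock.
full_speed = cmos_power(0.1, 100e-9, 1.0, 3e9, 20e-12, 50.0, 10.0)
half_speed = cmos_power(0.1, 100e-9, 1.0, 1.5e9, 20e-12, 50.0, 10.0)

for name, result in (("3.0 GHz", full_speed), ("1.5 GHz", half_speed)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Halving the frequency halves both dynamic addends but leaves the static addend untouched, which is why leakage matters more and more as the other terms are brought under control.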

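If we look at the first term of the equation we can see why power has been increasing only linearly while frequency has been increasing exponentially (a straight line on the logarithmic scale of Figure 1). The reason is the quadratic dependence on the voltage: each reduction in supply voltage offsets a large part of the power added by higher frequency.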

Engineers have been able to continuously reduce this voltage from 5V down to below 1V, which has helped them to control dissipated power without losing performance. Unfortunately, many factors are interdependent and engineers have to make trade-offs constantly. For example, imagine we want to decrease dynamic power consumption on a chip (considering only the first term of the equation) by reducing the supply voltage, initially fixed at 2V. If we are able to reduce it to 1.7V, that is only a 15% decrease in voltage, but we get a significant 28% decrease in power. However, reducing supply voltage has a side effect on the maximum frequency of the circuit, which also depends on the threshold voltage of the transistors (the voltage at which a transistor switches on):

fmax ∝ (V − Vth)² / V

Here fmax is the maximum operating frequency, V is the supply voltage, and Vth is the threshold voltage.

In our example, if you had a threshold voltage of 0.5V and the circuit was operating at a frequency of 4GHz, you would have to reduce the threshold voltage to approximately 0.32V in order to maintain the same operating frequency at the lower supply voltage. However, this might not be feasible, since threshold voltage depends on technological parameters, and beyond some specific value it cannot be reduced without changing the semiconductor manufacturing process. With the threshold voltage unchanged, maximum frequency would drop to about 3GHz, a 25% decrease.
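The numbers in this example can be checked with a few lines of Python. This is a sketch that assumes the commonly used approximation that maximum frequency is proportional to (V − Vth)² / V, which reproduces the figures quoted above; the function names are illustrative only.

```python
import math


def dynamic_power_ratio(v_new, v_old):
    """Dynamic power scales with the square of the supply voltage."""
    return (v_new / v_old) ** 2


def relative_fmax(v, v_th):
    """Assumed proportionality: f_max ~ (V - Vth)^2 / V."""
    return (v - v_th) ** 2 / v


def scaled_frequency(f_old, v_old, v_new, v_th):
    """Maximum frequency after a supply change with Vth held fixed."""
    return f_old * relative_fmax(v_new, v_th) / relative_fmax(v_old, v_th)


def threshold_for_same_frequency(v_old, v_new, v_th_old):
    """Threshold voltage needed to keep f_max unchanged at the new supply."""
    target = relative_fmax(v_old, v_th_old)
    return v_new - math.sqrt(target * v_new)


saving = 1.0 - dynamic_power_ratio(1.7, 2.0)
f_new = scaled_frequency(4e9, 2.0, 1.7, 0.5)
v_th_new = threshold_for_same_frequency(2.0, 1.7, 0.5)

print(f"dynamic power saving: {saving:.1%}")          # about 28 %
print(f"f_max with Vth fixed: {f_new / 1e9:.2f} GHz")  # about 3 GHz
print(f"Vth needed for 4 GHz: {v_th_new:.2f} V")       # about 0.32 V
```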

On the other hand, even if you were able to reduce supply and threshold voltage without affecting performance, leakage current depends exponentially on threshold voltage:

Ileak ∝ exp(−Vth / VT), where VT = k·T / q

The voltage VT is the thermal voltage, which depends on the absolute temperature T; k is the Boltzmann constant and q is the magnitude of the electrical charge on an electron. At room temperature (300K) the thermal voltage is around 26 mV, rising toward 30 mV at typical chip operating temperatures. For threshold voltages that are large compared to the thermal voltage the effect on leakage current is negligible, but for small ones, around 100mV, the effect becomes relevant.
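A short calculation shows how sharp this exponential dependence is. The sketch below assumes the simplest form of the relation, leakage proportional to exp(−Vth / VT), and ignores device-specific factors (real transistors have a subthreshold slope factor that makes the growth less steep), so the results are only indicative.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELECTRON_CHARGE = 1.602176634e-19   # C


def thermal_voltage(temperature_k):
    """Thermal voltage VT = k*T/q, in volts."""
    return BOLTZMANN * temperature_k / ELECTRON_CHARGE


def relative_leakage(v_th, temperature_k):
    """Leakage up to a constant factor, assuming I_leak ~ exp(-Vth/VT)."""
    return math.exp(-v_th / thermal_voltage(temperature_k))


def leakage_increase(v_th_old, v_th_new, temperature_k):
    """Factor by which leakage grows when Vth changes at a fixed temperature."""
    return (relative_leakage(v_th_new, temperature_k)
            / relative_leakage(v_th_old, temperature_k))


print(f"VT at 300 K: {thermal_voltage(300.0) * 1e3:.1f} mV")
print(f"VT at 330 K: {thermal_voltage(330.0) * 1e3:.1f} mV")
# Lowering Vth from 0.5 V to 0.32 V, as in the earlier example:
print(f"leakage grows by a factor of about {leakage_increase(0.5, 0.32, 300.0):.0f}")
```

Under this simplified model, the threshold reduction needed to hold 4GHz in the earlier example multiplies leakage roughly a thousandfold, which is why that route runs out quickly.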

Moreover, the thermal voltage is not the only quantity that depends on temperature: threshold voltage usually also varies with temperature, and the two variations combine in their effect on leakage current. Higher leakage current means higher static power consumption, which imposes a practical limit on how far the voltage reduction technique can be pushed.

Figure 3 shows these effects for two different temperatures. The first curve, with T=300K, is the exponential dependence on threshold voltage just described. The second curve, with T=330K, is an estimate that takes into account the variation of threshold voltage as temperature rises. The abscissa still represents nominal threshold voltage, but the real threshold voltage of the transistor is shifted toward lower values by the effect of temperature and therefore has a stronger effect on leakage current.

Figure 3: Effect of threshold voltage and temperature on leakage current

Leakage current also depends on gate insulator thickness. With very thin gate dielectrics, electrons can tunnel across the insulation, generating tunneling currents and leading to high power consumption. This effect is very important in current semiconductor process technology, given gate lengths of 32nm and below.

Of course, the core of a processor is not the only component on a chip that consumes energy. Memories, for example, also consume a considerable amount of energy and modern processors dedicate a large area of the die to incorporate several levels of cache memory.

Engineers apply several design techniques to reduce leakage current or the activity factor of the memory (the A factor in the power dissipation equation shown) and in this way they mitigate power consumption.

For example, the hierarchical organization in levels of cache not only improves data access time, it also helps in reducing power consumed, since smaller, nearer caches require less energy than larger, further ones. With this organizational solution it is possible to reduce power while preserving performance. In line with this idea, another commonly used solution is to organize memory into banks for efficiency. In this case it is possible to activate only the bank being accessed and thereby save energy.

However, looking for higher performance is not always the right thing to do. Sometimes it is appropriate to reduce power at the cost of some throughput. There are processors dedicated to specific applications that are always doing the same kind of calculations, for example DSPs. Audio processing, digital filters, or data compression algorithms are typical applications on these devices, which are assessed by how much energy an operation requires and how long the processor takes to perform it.

A processor that takes more time than another to execute an algorithm but consumes less power can, in the end, be more energy efficient. A metric employed for measuring this efficiency is MIPS/W (Million Instructions Per Second per Watt). Although the MIPS metric has to be taken with care, in general devices with higher MIPS/W are considered more efficient, and this is especially interesting for embedded devices, particularly battery-powered ones. Indeed, there is now increasing interest and pressure to have energy-efficient processors in the world of servers and data centers as well.
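The following sketch compares two processors on this basis. Both processors and all their figures are hypothetical, chosen only to show that the slower device can finish the same job with less energy.

```python
def mips_per_watt(mips, watts):
    """Efficiency metric: million instructions per second, per watt."""
    return mips / watts


def task_time_seconds(instructions, mips):
    """Time to retire a given number of instructions."""
    return instructions / (mips * 1e6)


def task_energy_joules(instructions, mips, watts):
    """Energy = power x time for a fixed amount of work."""
    return watts * task_time_seconds(instructions, mips)


WORK = 1e9  # instructions in the (hypothetical) algorithm

# (name, MIPS, watts) for two made-up processors
for name, mips, watts in (("fast", 2000.0, 10.0), ("frugal", 800.0, 2.0)):
    print(
        f"{name}: {mips_per_watt(mips, watts):.0f} MIPS/W, "
        f"{task_time_seconds(WORK, mips):.2f} s, "
        f"{task_energy_joules(WORK, mips, watts):.1f} J"
    )
```

Since a watt is a joule per second, MIPS/W is the same quantity as millions of instructions per joule, so the processor with the higher MIPS/W always spends less energy on a fixed amount of work.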

Transmission delays on a chip
The other main factor limiting further increases in transistor density and frequency on a chip is wire transmission delay. The very high frequencies used in modern processors, on the order of gigahertz, mean that a clock cycle lasts only a fraction of a nanosecond. This short cycle time is becoming a problem for signal propagation.

Reducing feature size on a chip has decreased the gate length and capacitance of transistors and so increased clock rates, overcoming capacity-bound constraints. But wires on a chip are becoming slower because of higher resistance and capacitance. Wires are now narrower and thinner, and the smaller cross-sectional area results in higher resistance.

With smaller area and hence less wire surface, surface-related capacitance decreases but the distance between neighboring wires is also being reduced and this produces a higher coupling capacitance. Coupling capacitance increases at a faster pace than surface capacitance decreases, thus counteracting its effect and producing a combined effect of higher overall wire capacitance.

Wire transmission delay is directly proportional to the product of the wire’s resistance and capacitance, Rw × Cw, so each new technology generation that shrinks feature size brings higher wire delays. With faster clock rates and slower signal propagation along wires, the distance that a signal can travel, and hence the chip area that can be reached in a single clock cycle, is reduced. This leads to a new situation in which designs are communication-bound.
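A rough model illustrates how the reachable distance shrinks. The sketch below assumes an unrepeated wire treated as a distributed RC line, for which the Elmore delay is Rw × Cw / 2 with Rw and Cw both growing linearly with length. The per-length resistance and capacitance values and the generation-to-generation scaling factors are hypothetical.

```python
import math


def wire_delay(r_per_m, c_per_m, length_m):
    """Elmore delay of a distributed RC wire: (r*L) * (c*L) / 2, in seconds."""
    return 0.5 * r_per_m * c_per_m * length_m ** 2


def reach_in_one_cycle(r_per_m, c_per_m, clock_hz):
    """Longest wire whose delay still fits in one clock period, in metres."""
    cycle = 1.0 / clock_hz
    return math.sqrt(2.0 * cycle / (r_per_m * c_per_m))


# Hypothetical starting point: 0.2 ohm/um, 0.2 fF/um, 3 GHz clock.
r, c, f = 2e5, 2e-10, 3e9
for generation in range(3):
    reach_mm = reach_in_one_cycle(r, c, f) * 1e3
    print(f"generation {generation}: reach = {reach_mm:.2f} mm per cycle")
    # Assumed scaling per generation: thinner wires double the resistance,
    # closer neighbours add 10 % capacitance, and the clock rises by 40 %.
    r, c, f = r * 2.0, c * 1.1, f * 1.4
```

Because delay grows with the square of wire length, a modest rise in resistance and clock rate shrinks the reachable area much faster than it shrinks the reachable distance.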

For a fixed micro-architecture this would not be a big problem, since circuit area would shrink quadratically with feature size. But in order to make the most of smaller transistor size and get higher IPC, designers develop more complex micro-architectures, making deeper pipelines, adding more execution units, and using large micro-architectural structures. Now, higher delays in communications across the chip put a practical limit on the size and even the placement of these structures, and on the maximum operating frequency.

As an example, the branch misprediction pipeline of the Intel Pentium 4 required twice as many stages as that of the Pentium III. With higher clock rates and wire delays, the pipeline has to be divided into smaller pieces that do less work in each stage. But wire delays had become so large that two of the Pentium 4 pipeline stages were extra stages needed only to drive signals from one stage to the next, since so much of the clock cycle was spent by the signal in reaching the next stage that too little time remained to perform the required computation.

A similar example of how wire delay affects a design can be found in the Advanced Microcontroller Bus Architecture (AMBA) specification from ARM. The Advanced System Bus (ASB), introduced in the first AMBA specification and designed to interconnect high-performance system modules, uses bidirectional buses and a master/slave architecture.

In the second AMBA specification, the Advanced High-performance Bus (AHB) was introduced to improve support for higher performance and as a replacement for ASB. In this new bus specification, among other changes, bidirectional buses were replaced by a multiplexed bus scheme. At first glance this modification would seem to add unnecessary wires and complexity to the circuit. But the effect of wire delays in very high performance systems sometimes makes it necessary to introduce repeater drivers (as seen in the Pentium 4 case). This is possible in the unidirectional buses that make up a combined multiplexed bus, but it is very hard in bidirectional buses.

The challenges ahead
We have seen the two main restrictions that technology imposes on continuing to apply Moore’s law and improve processor performance. But technology is constantly evolving. Scaling down feature sizes has enabled increased transistor density and frequency, and designers are still managing to shrink transistor size and increase the number of transistors on a chip to more than a billion.

Predictions were that semiconductor process technology would reach 35nm gate lengths in 2014, but in fact manufacturers have been producing chips at 22nm since 2011. Power dissipation and transmission delay problems are motivating everyone in the industry to investigate new materials for making transistors, and new organizational and architectural solutions are already being applied in modern processors. High-k gate oxides (k refers to the dielectric constant of a material) are replacing the silicon dioxide gate dielectric used for decades, allowing physically thicker insulators that provide the same gate capacitance and so keep leakage currents under control.

New use of low-k dielectrics makes it possible to reduce coupling capacitance and therefore transmission delays. Traditional micro-architectures implementing a single and large monolithic core are evolving to simpler multicore micro-architectures to allow mainly local communications and thus avoid large delays.

Recently, some chip manufacturers, such as Intel, have announced three-dimensional transistors. Intel’s new Ivy Bridge family of processors, the successor to the Sandy Bridge family, is based on a new tri-gate transistor technology that boosts processing power while reducing the amount of energy needed.

In a tri-gate transistor the conducting channel is a thin silicon fin that rises from the substrate, with the gate wrapped around three of its sides instead of lying flat on top as in the previous planar structure. This gives the gate better control over the channel and lowers leakage. It is distinct from three-dimensional die stacking, in which circuit blocks are placed on top of each other to shorten the distance between them and reduce wire delay effects. According to Intel, its 22nm 3-D Tri-Gate transistors consume less than half the power when operated at the same clock frequencies as planar transistors on 32nm chips, exceeding what is typically achieved from one process generation to the next.

Multicore architectures are evolving quickly. For example, Tilera has developed the first 100-core processor on a single chip! To achieve such a level of integration Tilera combines a processor with a communications switch in a unit that its designers call a “tile.” By replicating such tiles across the silicon, the company builds a chip whose tiles form a mesh network. Processors are usually connected to each other through a bus, but as the number of processors increases this bus quickly becomes a bottleneck. With a Tilera tiled mesh, every processor gets a switch and they all talk to each other as in a peer-to-peer network. In addition, each tile can independently run a real-time operating system. Alternatively, you can group multiple tiles together to run an operating system like SMP Linux.
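To get a feel for how a mesh spreads traffic that a shared bus would serialize, the sketch below counts links and hops in a square mesh of tiles. It assumes simple dimension-order (XY) routing, where a message travels along one axis and then the other; this is an illustrative assumption, not a description of Tilera's actual switch.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hops between two tiles under XY routing (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(side):
    """Mean hop count over all ordered (source, destination) tile pairs."""
    tiles = list(product(range(side), repeat=2))
    total = sum(mesh_hops(s, d) for s in tiles for d in tiles)
    return total / len(tiles) ** 2


def mesh_link_count(side):
    """Number of tile-to-tile links in a side x side mesh."""
    return 2 * side * (side - 1)


SIDE = 10  # 10 x 10 = 100 tiles
print(f"tiles: {SIDE * SIDE}")
print(f"links that can carry traffic at once: {mesh_link_count(SIDE)}")
print(f"average hops per message: {average_hops(SIDE):.1f}")
```

A single shared bus carries one transfer at a time regardless of core count, whereas the mesh trades a few hops of latency per message for many links that can be busy simultaneously.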

Research is also under way on graphene transistors, each of which is made from a sheet of carbon just one atom thick. Theoretically, these transistors could reach very high operating frequencies, toward 1 THz (1000 GHz), and it may even be possible to manufacture them on flexible substrates. There are still many challenges for this technology, though, and we will probably have to wait several years to see these advances become reality.

Conclusion
The problem now facing the industry is how to take full advantage of this huge parallel processing power. But the embedded software industry is already developing powerful tools to help build the new and complex many-core applications world.

Proposals like OpenMP and MPI for shared and distributed memory architectures, or OpenCL (Open Computing Language), the open standard for parallel programming of heterogeneous systems, are very promising. With OpenCL you can develop software for systems with a mix of multicore CPUs, GPUs, and even DSPs. But probably the biggest challenge is to change programmers’ mindsets to learn how to write highly parallel and reliable software in these systems.
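The change of mindset largely consists of splitting work into independent pieces and combining partial results. The sketch below shows that pattern, in the spirit of a parallel-for with a reduction, using only Python's standard library. It is an illustration of the decomposition, not of OpenMP, MPI, or OpenCL themselves, and the function names are hypothetical. Note that CPython threads will not speed up CPU-bound work like this because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, parts):
    """Split data into at most `parts` contiguous chunks of similar size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """Work done independently by each worker on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Map chunks onto workers, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum_of_squares, chunked(data, workers))
        return sum(partials)


values = list(range(1000))
assert parallel_sum_of_squares(values) == sum(x * x for x in values)
print("parallel and sequential results match")
```

Each worker touches only its own chunk, so no locking is needed; the only shared step is the final reduction, which is the property that lets this kind of code scale across cores in a language or runtime without such a lock.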

Julio Díez is a software engineer with fifteen years of experience, mainly in the embedded world. He has spent the last six years developing communication and security software for embedded systems, including the first secure communication system in its class for the Spanish NSA. He is interested in multicore architectures, operating systems, software design, and parallel programming. He holds a bachelor’s degree in telecommunications engineering from the Technical University of Madrid, Spain. You can reach him at juliod73@yahoo.es.

Editor’s note: This article originally appeared on Embedded.com.
