Predicting thermal runaway – Part 1

By eeNews Europe

In Part 1 of this two-part article, we discuss what thermal runaway is and some settings in which it is of concern (and some in which it is not). In Part 2, we’ll discuss how to predict it, which is the first step toward preventing it.

Thermal runaway may occur in a semiconductor packaging application when the power dissipation of the device in question increases as a function of temperature. More specifically, it describes the situation in which no nominally steady-state operating point of the device can be established under the influence of the specific thermal system. Ordinarily, of course, a device that dissipates a fixed amount of power can always achieve a steady-state operating condition, though the junction temperature attained may fall beyond recommended limits. If the thermal system around the device is characterized as having a steady-state thermal resistance, then we can describe this equilibrium (or steady-state) condition as follows:

$$T_J = Q\,\theta_{Jx} + T_x \qquad \text{(Equation 1)}$$

where TJ is the junction temperature in °C, Q is the device power dissipation in watts, θJx is the steady-state thermal resistance of the system in °C/W, and Tx is the thermal ground (reference) temperature in °C.

For present purposes, it is crucial to use a fixed reference temperature in the model; that is, the temperature that provides the thermal “ground” for the system. Thus, in an air-cooled system, we would take Tx to be the ambient air temperature, Ta. In a water-cooled system in which the device of interest is mounted securely to a water-cooled block, it would be more appropriate to use the coldplate temperature, Tc, as the reference (or, if the system is very efficient and the thermal resistance from the housing of the device to the coolant is negligible in comparison to the device’s internal resistance from junction to housing, perhaps Tc is simply the housing temperature).

From Equation 1, we can see that a small perturbation in power will result in a small perturbation in the junction temperature. If the power level briefly rises above the equilibrium value and then returns, the equilibrium temperature will eventually be restored. Likewise, if the power falls and returns, the temperature will drop and then return. This is because we have defined the system to be capable (at steady state) of dissipating exactly the amount of power needed to achieve the original steady-state junction temperature. Indeed, solving Equation 1 for power, given the other values, we have:

$$Q = \frac{T_J - T_x}{\theta_{Jx}} \qquad \text{(Equation 2)}$$

Note also that the rate of change of system power dissipation with respect to junction temperature, dQ/dTJ, is given by:

$$\frac{dQ}{dT_J} = \frac{1}{\theta_{Jx}} \qquad \text{(Equation 3)}$$

From Equations 2 or 3, then, we see that for a small increase in TJ, the system (described by θJx) can dissipate slightly more power than the original Q.
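
The steady-state relations above can be checked with a few lines of Python. This is only a sketch: it takes Equation 1 to be TJ = Q·θJx + Tx and Equation 2 to be its rearrangement for Q, and the numerical values (20 °C/W, 25 °C, 2 W) and function names are hypothetical, chosen purely for illustration.

```python
def junction_temperature(q_watts, theta_jx, t_x):
    """Steady-state junction temperature (degC) for a fixed power (Equation 1)."""
    return q_watts * theta_jx + t_x


def system_power(t_j, theta_jx, t_x):
    """Power (W) the thermal system can remove at junction temperature t_j (Equation 2)."""
    return (t_j - t_x) / theta_jx


# Hypothetical values, for illustration only.
THETA_JX = 20.0  # degC/W
T_X = 25.0       # degC, thermal ground
Q = 2.0          # W

t_j = junction_temperature(Q, THETA_JX, T_X)

# Extra power the system can remove if the junction is 1 degC hotter;
# by Equation 3 this should be close to 1/THETA_JX watts.
extra = system_power(t_j + 1.0, THETA_JX, T_X) - Q

print(t_j, extra, 1.0 / THETA_JX)
```

The last line prints the junction temperature, the extra dissipation capacity for a one-degree rise, and the slope 1/θJx for comparison.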

Straight-line devices and systems
We can graphically illustrate the identification of the device/system operating point. We begin with a device that produces a fixed amount of power regardless of its temperature (see Figure 1). With power as the vertical axis, and temperature as the horizontal axis, this means we have a “device” line that is horizontal. The “system” line is a straight line of slope 1/θJx (Equation 3), intersecting the horizontal axis at thermal ground, Tx. Where the device line and system line cross is the nominal operating point (TJ, Q).

Figure 1: This plot of power as a function of junction temperature shows the device line above which the system cools itself and below which the system heats up.

Wherever the system line is higher than the device line, more power is leaving the system than is being introduced by the device, so the system cools. Conversely, wherever the device line is above the system line, the system heats up. We thus see that to the right of the equilibrium operating point, more power can be dissipated by the system than the device is producing, and to the left, more power is coming in than may be dissipated. In either case, one might picture the resulting imbalance in power as a restoring force that causes the junction temperature to move toward the operating point.
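
The “restoring force” picture can be made concrete with a simple time-stepping sketch. Note the assumptions: the article's argument is purely steady state, so the lumped thermal capacitance `c_th`, the time step, and all numerical values below are hypothetical additions used only to let the temperature evolve.

```python
def net_power(t_j, q_device, theta_jx, t_x):
    """Device power minus system power (W); positive means the junction heats."""
    return q_device - (t_j - t_x) / theta_jx


def settle(t_start, q_device, theta_jx, t_x, c_th=0.5, dt=0.1, steps=2000):
    """Euler-integrate c_th * dT/dt = net_power and return the final temperature.

    c_th (J/degC) is a hypothetical lumped thermal capacitance; it sets how
    fast the junction moves, not where it ends up.
    """
    t_j = t_start
    for _ in range(steps):
        t_j += dt * net_power(t_j, q_device, theta_jx, t_x) / c_th
    return t_j


# Constant-power device: 2 W into 20 degC/W with thermal ground at 25 degC.
# The operating point from Equation 1 is 65 degC.
print(settle(85.0, 2.0, 20.0, 25.0))  # started to the right of the operating point
print(settle(45.0, 2.0, 20.0, 25.0))  # started to the left of the operating point
```

Both runs should converge on the same operating point, one by cooling and one by heating.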

Power dissipation as a function of temperature
What if the power dissipation is a function of temperature, however? Clearly, if the power falls as temperature increases, and rises as temperature decreases, the system will still tend to restore the nominal equilibrium value in the face of small perturbations (see Figure 2). Note that it doesn’t matter how steeply the power falls with temperature: the system line will always remain above the device line for temperatures above the operating point, and below it for temperatures below the operating point. Hence all such systems are stable. (The possibility of oscillatory behavior is not excluded, but is beyond the scope of this discussion.)

Figure 2: For cases in which power decreases as a function of increasing temperature, the system tends to restore itself to equilibrium after small perturbations.

On the other hand, if the power increases with temperature, things get more interesting. Power may, for example, increase with temperature without causing any particular difficulty. Indeed, if the device line has a constant slope that differs from that of the system line, the two lines will always intersect at a unique point, and if the slope of the device line is smaller than that of the system line, the intersection continues to represent a stable operating point (see Figure 3). Note: If the lines happen to be parallel, then either there is no valid operating point at all or, if the lines coincide, the operating point is undefined.

Figure 3: The slope of the device line in this plot of power versus junction temperature is shallow enough that the system can maintain stable operation.

What happens in this system if the power or temperature makes a brief excursion away from equilibrium? Suppose the temperature goes up. According to the device curve, the device will now dissipate more power; but according to the system curve, the system can dissipate even more power. Hence, as with the previously considered situations, the system will tend to drive itself back towards the original operating point, as it will with small downward perturbations.

Now consider a system with a much steeper slope on the device line (see Figure 4). The device and system lines still have a unique intersection, so it would seem that the system should have a valid operating point. Not so, unfortunately. Consider what happens if there is a small perturbation in temperature. According to the system line, a small increase in temperature results in a small increase in the ability of the system to dissipate power. According to the device line, however, that same small perturbation in temperature results in an even larger increase in power dissipation. Hence the system cannot dissipate that much power, even at the increased temperature, so the temperature will rise even more, the dissipation will increase even more, and so on, until catastrophe results. (It may also be interesting to consider a negative perturbation: If the temperature drops slightly, the system line says the system is capable of dissipating less heat, but the device line says the device will produce even less power, so the system becomes a thermal “black hole” and sucks all the energy out of the universe! Obviously one or more of our simplistic assumptions in the model breaks down at some point.)

Figure 4

Figure 4, therefore, illustrates the essence of thermal runaway. And even though we’ve hypothesized brief perturbations, as if time had something to do with it, it is a concept essentially grounded in steady state system descriptions. Certainly time enters the picture if we are able to detect the onset of runaway and change the system quickly enough (for instance, turn off the power supply, changing the device line to a constant zero, independent of temperature).
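
The comparison of slopes for straight device lines can be sketched as follows. The parameterization of the device line (a power `q_ref` at a reference temperature `t_ref`, plus a constant slope) and all numbers are hypothetical; only the stability rule, device slope smaller than system slope, comes from the discussion above.

```python
import math


def linear_operating_point(q_ref, t_ref, slope_dev, theta_jx, t_x):
    """Intersect the device line Q = q_ref + slope_dev * (T - t_ref)
    with the system line Q = (T - t_x) / theta_jx.

    Returns (t_j, q, stable), or None when the lines are parallel.
    """
    slope_sys = 1.0 / theta_jx
    if math.isclose(slope_dev, slope_sys):
        return None  # parallel: no intersection, or the lines coincide
    t_j = (q_ref - slope_dev * t_ref + slope_sys * t_x) / (slope_sys - slope_dev)
    q = (t_j - t_x) * slope_sys
    # Stable only if the system line is steeper than the device line.
    return t_j, q, slope_dev < slope_sys


# System: 20 degC/W (slope 0.05 W/degC), thermal ground 25 degC.
# Device lines pass through 1 W at 50 degC.
print(linear_operating_point(1.0, 50.0, 0.01, 20.0, 25.0))  # shallow line, as in Figure 3
print(linear_operating_point(1.0, 50.0, 0.10, 20.0, 25.0))  # steep line, as in Figure 4
print(linear_operating_point(1.0, 50.0, 0.05, 20.0, 25.0))  # parallel lines
```

In both of the first two cases an intersection exists; only the third element of the result distinguishes a usable operating point from a runaway condition.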

Devices with non-linear power characteristics
We can make even more interesting observations when we start looking at device lines that are not straight. Consider a system featuring a device curve with a negative second derivative, which means that although power does increase with temperature, its slope decreases with temperature. Although there are two intersections between the system and device lines, the lower operating point cannot be maintained (see Figure 5). As in the case modeled in Figure 4, the relative slopes are in the wrong relationship, resulting in non-restoring perturbations. In fact, if the device can be initialized to some temperature above the lower point, operation will “run away” to the upper point, which is stable for the same reason as the operating point in Figure 3. If the system cannot be initialized to at least the lower intersection temperature, then it cannot be powered up at all.

Figure 5: A nonlinear device whose power dissipation rises with temperature at a decreasing rate can achieve stable operation at the upper intersection.

What if the device line never reaches the system line at any point, however? With no intersections, there are no potential stable operating points; indeed, this represents a system that can never be powered up, because no matter the temperature, given the cooling system, the device cannot produce enough steady-state power to keep itself warm, so to speak (see Figure 6). If we decrease the slope of the system line (meaning we increase thermal resistance), eventually a stable operating point may be established.

Figure 6: Reducing the cooling capability of the system around the device shown in Figure 5, for example by increasing thermal resistance, can allow the device to establish a stable operating point.

So much for the device featuring a power versus junction temperature curve with a negative second derivative. Now, let’s consider a device in which the slope of the curve increases with temperature. Again, we see two intersections between the device and system lines (see Figure 7). In this case, however, only the lower operating point is stable. If junction temperature is perturbed above the upper intersection, it will run away and never return. Anywhere between the upper and lower points, it will fall back to the lower point.

Figure 7: A device for which power dissipation rises nonlinearly with increasing junction temperature may achieve stable operation at low junction temperatures but enter thermal runaway at high junction temperatures.
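
For a curved device line, the intersections generally have to be found numerically. The sketch below scans a temperature range for crossings and classifies each one by which way the device curve crosses the system line. The device curve used here (0.5 W at 25 °C, doubling every 20 °C) is a hypothetical example with a positive second derivative, not a characteristic taken from any real part.

```python
def find_operating_points(device, theta_jx, t_x, t_lo, t_hi, n=2000):
    """Scan [t_lo, t_hi] for crossings of device(T) and the system line.

    Returns a list of (t_j, stable) pairs. A crossing is stable when the
    device curve goes from above the system line to below it as T rises.
    """
    def excess(t):
        # Device power minus system power: > 0 heats, < 0 cools.
        return device(t) - (t - t_x) / theta_jx

    points = []
    step = (t_hi - t_lo) / n
    for i in range(n):
        lo, hi = t_lo + i * step, t_lo + (i + 1) * step
        heats_lo = excess(lo) > 0.0
        if heats_lo != (excess(hi) > 0.0):
            for _ in range(60):  # bisection on the sign change
                mid = 0.5 * (lo + hi)
                if (excess(mid) > 0.0) == heats_lo:
                    lo = mid
                else:
                    hi = mid
            points.append((0.5 * (lo + hi), heats_lo))
    return points


def device(t_j):
    """Hypothetical device curve: 0.5 W at 25 degC, doubling every 20 degC."""
    return 0.5 * 2.0 ** ((t_j - 25.0) / 20.0)


# 20 degC/W system: two crossings, near 45 degC (stable) and 65 degC (unstable).
print(find_operating_points(device, 20.0, 25.0, 25.0, 100.0))
# 40 degC/W system: the device curve stays above the system line, so no crossings.
print(find_operating_points(device, 40.0, 25.0, 25.0, 100.0))
```

The second call corresponds to the no-intersection case, in which the device produces more power than the system can remove at every temperature in the scanned range.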

In a more extreme case, the device line never intersects the system line (see Figure 8). As in the case shown in Figure 6, the system does not have a steady-state operating point. Unlike that earlier example, however, any attempt to power up this system will result in thermal runaway, because at every temperature, the device produces more power than the cooling system can dissipate.

Figure 8: A nonlinear device exhibiting power versus junction-temperature behavior with a positive second derivative can exhibit such extreme heating that it has no operating point, but overheats the minute it is turned on.

Finally, we consider the perfect runaway system (see Figure 9). The device line, again with a positive second derivative, intersects the system line at exactly one point. For smooth curves, this must occur at tangency, that is, where the two curves have the same slope. Because the two curves meet there, power in equals power out, so there is no tendency to heat or cool at that exact point; because their slopes are also equal, a small perturbation produces, to first order, no restoring imbalance. We call this the point of neutral stability. Although negative perturbations will push the system back towards the neutral point, we can’t expect this system to stay at equilibrium indefinitely. Even an infinitesimal perturbation upwards means there will be more power in than the system can dissipate, hence runaway will commence.

Figure 9: In the perfect runaway system, the device line and system line intersect tangentially, providing a point of neutral stability that will allow device operation but can be easily perturbed into thermal runaway.
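
The tangency condition can be written as a pair of simultaneous equations. In this sketch, Q_D(T_J) denotes the device curve and T_R the junction temperature at the tangent point; both symbols are introduced here for illustration, and the system line is assumed to be the straight line Q = (T_J − T_x)/θJx.

```latex
% Power balance at the tangent point (the curves meet):
Q_D(T_R) = \frac{T_R - T_x}{\theta_{Jx}}

% Equal slopes at the tangent point (the curves touch without crossing):
\left.\frac{dQ_D}{dT_J}\right|_{T_J = T_R} = \frac{1}{\theta_{Jx}}

% With a positive second derivative, the device curve lies on or above
% the system line everywhere, so the net power is never negative:
Q_D(T_J) - \frac{T_J - T_x}{\theta_{Jx}} \ge 0 \quad \text{for all } T_J
```

The third line is why a downward perturbation is recovered (the junction heats back up to T_R) while an upward one is not.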

Same device, different straight-line systems
Throughout the preceding discussion, we’ve focused on changes in the device characteristics as a means of gaining an understanding of the nature of the thermal runaway phenomenon itself. For these purposes, we have considered the system line to be fixed. By now, the significance of the relationships between the slopes of the device and system lines, and the relative position of one line above the other, should be clear. (It should also be clear that whether either or both lines are straight or curved is not of direct importance to the concept.)

In semiconductor applications, the problem will likely be turned around. Typically, we will have a particular device, chosen primarily for the sake of its electrical properties, and we will want to know what it will take to keep it operating at an acceptable temperature. If, as a result of the thermal properties of the package, we simply cannot design an acceptable thermal system in which to place it, we may have to select another device with more favorable thermal characteristics. Restrictions on the device’s thermal characteristics are generally secondary, however, and we won’t know if they’re significant until we try to design in the device we want.

Accordingly, let’s consider systems for which the device curve is fixed (in these examples having a positive second derivative, which, as we’ll see later, is true of “power-law” devices), and see what we can learn about the system characteristics with respect to thermal runaway. For simplicity, we will continue to assume that system lines are perfectly straight, characterized by a constant slope (the reciprocal of the system thermal resistance) and a specific reference temperature (thermal ground).

Let’s consider a generic power-law device with three possible linear cooling scenarios (see Figure 10). System line A, which represents the original system, is characterized by its slope (the reciprocal of its junction-to-ground thermal resistance, θJx1) and its x-intercept (the thermal ground, Tx). System line A makes two intersections with the device line, but as we discussed earlier, only the lower temperature intersection represents a stable operating point. There is plenty of distance between the stable point and the unstable point, so we have some confidence that such a thermal system will work for this application.

Figure 10: As these three system curves show, performance stability of a system can be strongly affected by changes to reciprocal thermal resistance (slope of the lines) and ambient temperature (x-intercept).

There are, however, two important factors in a thermal design. What if ambient temperature (thermal ground) goes up? And what if the thermal resistance of the system increases, or simply proves to be higher than planned? First, let’s consider the question of increasing ambient temperature. We can show this graphically by maintaining the slope of the system line but shifting it to the right to increase ambient temperature to Ty. At this point, the device line is tangent to the system line, indicating perfect runaway conditions (system line B).

Clearly, we cannot expect to operate our system safely if the ambient conditions approach the critical ambient temperature, Ty, which is associated with a particular runaway temperature, TR1. System line C, on the other hand, tells us how much worse the system can be in terms of thermal resistance (i.e., a bigger thermal resistance, θJx2) at a fixed ambient Tx and yet still avoid runaway. This gives us a completely different, and lower, potential runaway temperature, TR2. A surprising thing we will discover about power-law devices is that the temperature offset between thermal ground and thermal runaway (for any system line tangent to the device curve) is fixed based on the power law, and is actually independent of the thermal system!
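
The fixed-offset claim can be checked numerically under one explicit assumption. Part 2 is where the power-law device is defined; the sketch below simply assumes, for illustration, an exponential device curve Q = q0·exp((T − t0)/lam), with hypothetical parameter values. For each thermal resistance it finds the tangent point from the equal-slope condition, then recovers the thermal ground from the power balance.

```python
import math


def tangent_point(theta_jx, q0=0.5, t0=25.0, lam=30.0):
    """Tangent system line for an ASSUMED device curve q0 * exp((T - t0) / lam).

    Returns (t_x, t_r): the thermal ground of the tangent system line and
    the junction temperature at the tangent (runaway) point.
    """
    # Equal slopes: (q0 / lam) * exp((t_r - t0) / lam) = 1 / theta_jx
    t_r = t0 + lam * math.log(lam / (theta_jx * q0))
    q_r = q0 * math.exp((t_r - t0) / lam)
    # Power balance on the system line: q_r = (t_r - t_x) / theta_jx
    t_x = t_r - theta_jx * q_r
    return t_x, t_r


for theta in (20.0, 40.0):  # hypothetical thermal resistances, degC/W
    t_x, t_r = tangent_point(theta)
    print(theta, t_x, t_r, t_r - t_x)
```

Under this assumed curve, the higher thermal resistance gives a lower runaway temperature, while the offset between thermal ground and runaway stays at `lam` for both system lines.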

This is not to say, however, that the lower of these two potential runaway scenarios is more likely to occur. What should be clear is that as the operating margin of the designed system (i.e., the placement of the stable actual operating point with respect to the two runaway conditions) decreases, the two runaway points converge, and the system will not be able to tolerate perturbations in either ambient temperature or slope.

In Part 2 of this article, we will discuss what constitutes a power-law device and how we actually predict thermal runaway.

About the author
Roger Stout is a senior research scientist and charter employee of ON Semiconductor. He received his BSE in Mechanical Engineering at ASU in 1977, and went on as a Hughes Fellow to earn his MSME at the California Institute of Technology in 1979. He then joined Motorola in the equipment engineering side of the semiconductor business, which after about four years evolved into factory automation and control engineering. In about 1990, he took on the responsibility for thermal characterization of ASIC products. Roger holds six patents, and has been a registered Professional Engineer (Mechanical) in the state of Arizona since 1983.

