Predicting thermal runaway – Part 2
In Part 1 of this two-part article, we explored the theory of thermal runaway and some settings in which it is of concern (and some in which it is not). In Part 2, we will discuss power-law devices, straight-line systems, and how to predict thermal runaway.
Thermal runaway occurs in semiconductors when the power dissipation of the device in question increases as a function of temperature. As we showed in Part 1 of this article, under a certain set of conditions, a device may be unable to establish a nominally steady-state operating point. In such a case, it may either fail to operate at all or it may slide into thermal runaway.
Before we can discuss how to predict thermal runaway, it is worth asking when a device power characteristic can be treated as a power law at all. Simple devices such as diodes and rectifiers, and even many transistors under certain conditions, typically have a current vs. temperature characteristic at fixed voltages that follows a power law. Leakage current in diodes, for instance, is often described using the rule of thumb that leakage doubles for every increase in temperature of 10°C. This particular power law could be expressed as:
where I is current, Io is leakage current at 0°C, and T is junction temperature. If the behavior is really a power law, obviously we can figure out what the leakage is at 0°C, given its value at any other temperature.
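Spelled out with the symbols just defined, the doubling rule of thumb would take the following form. This is a reconstruction from the wording of the rule (doubling per 10°C, referenced to 0°C), with T in °C:

```latex
I(T) = I_o \cdot 2^{\,T/10}
```

Setting T = 0 returns Io, and every additional 10°C multiplies the leakage by two, as the rule requires.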
Alternatively, we can express Equation 4 in terms of the base of the natural logarithms, as in:
So we see that any power-law behavior, that is, behavior described by a geometric increase in the dependent variable (e.g., a factor of 2) for a linear increase in the independent variable (e.g., every 10°C), is just another exponential function, as in Equation 6:
From this example, it may be seen that we can define the “strength” of the power law in terms of that parameter λ. If device behavior is governed by a power law, then knowing the value of the dependent variable (in this case, current) at any two corresponding values of the independent variable (in this case, temperature), we can say:
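The relationships described in the preceding paragraphs can be collected in one place. What follows is a reconstruction from the prose rather than a copy of the original typeset equations; the subscripts 1 and 2 for the two reference points are an assumption of this sketch.

```latex
\begin{align}
I &= I_o \, 2^{T/10} = I_o \, e^{(\ln 2 / 10)\, T} \\
I &= I_o \, e^{T/\lambda},
  \qquad \lambda = \frac{10}{\ln 2} \approx 14.4\,^{\circ}\mathrm{C}
  \quad \text{(for the doubling rule)} \\
\lambda &= \frac{T_2 - T_1}{\ln\!\left( I_2 / I_1 \right)}
\end{align}
```

In this form λ has units of temperature: it is the temperature rise that multiplies the current by a factor of e.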
So far in this example, we’ve only talked about current, and we need power. Given the premise that a diode under reverse bias (i.e., leakage mode) sees a constant voltage (reasonably assumed to be independent of the temperature of the particular diode), its power dissipation Q, therefore, is also a power law:
where VR is reverse voltage and Qo is power dissipation at 0°C. Note that under constant current, which is the primary operating condition for many applications of diodes, the power law relationship does not hold with temperature. Indeed, it is typical for diodes that at a fixed current, voltage goes down linearly with temperature. As a result, diodes in constant-current operation display linearly decreasing power with increasing temperature, and, as we showed in Figure 2 of Part 1, are essentially immune to thermal runaway concerns.
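For the reverse-bias case, the power law described above can be written compactly. The expression below is reconstructed from the definitions of VR, Io, Qo and λ given in the text, and assumes VR is constant with temperature, as the text does:

```latex
Q(T) = V_R \, I(T) = V_R \, I_o \, e^{T/\lambda} = Q_o \, e^{T/\lambda},
\qquad Q_o = V_R \, I_o
```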
For the power law situation, however, we may calculate the slope of the device curve by taking its derivative, that is:
Thus we see that the slope increases with increasing temperature, and all the previous discussions based on device curves with positive second derivatives apply. In particular, Figure 10 of part 1 provides the setting for the subsequent mathematical development.
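The slope argument can be checked directly. Assuming the device line has the form Q = Qo e^{T/λ} (a reconstruction from the text's definitions), the first and second derivatives are:

```latex
\begin{align}
\frac{dQ}{dT} &= \frac{Q_o}{\lambda} \, e^{T/\lambda} = \frac{Q}{\lambda} \\
\frac{d^2 Q}{dT^2} &= \frac{Q_o}{\lambda^2} \, e^{T/\lambda} = \frac{Q}{\lambda^2} > 0
\end{align}
```

Both are positive for any positive Qo and λ, so the device curve is increasing and convex everywhere.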
We are now in a position to find the mathematical solution to “perfect runaway” in a power-law device cooled by a linear thermal system. To begin, let’s recap the two pertinent equations.
the linear system line:
where θJx is the steady-state thermal resistance of the system and Tx is thermal ground (in °C); and
the power-law device line:
In this system of two equations, we have two variables, junction temperature T and power dissipation Q. We also have four independent parameters: the thermal ground reference, Tx, the thermal resistance of the cooling system, θJx, the reference power level Qo, and the strength λ of the power-law function itself. By finding an exact solution to this set of equations, that is, a pair of values (T, Q) that satisfies both equations, we will also create some relationships between the various parameters.
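For reference, the two governing equations can be written out as follows. These forms are reconstructed from the parameter list above (Tx, θJx, Qo, λ) and should be read as a sketch consistent with the text rather than the original typesetting:

```latex
\begin{align}
\text{system line:} \quad & Q = \frac{T - T_x}{\theta_{Jx}} \\
\text{device line:} \quad & Q = Q_o \, e^{T/\lambda}
\end{align}
```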
For clarity in solving the equations, it will be helpful to substitute variables. Let us define:
where z is the non-dimensional junction temperature.
With this new variable, we can rewrite our two equations as follows. From Equation 2, we have:
the linear system line, and from Equation 8 we have:
in the power-law device line (curve).
Equation 14 suggests that we might choose to define the non-dimensional power q as:
thus, we can make a final restatement of the two governing equations as follows.
the device line:
the system line:
Eliminating q from between Equation 16 and Equation 17, we are left with a remarkably simple relationship defining points z that are the intersection points (i.e., the operating points) of our system:
We can interpret our transformed problem, originally seen in Figure 10 of part 1, as the pure exponential function and its intersection with a straight line of slope k, passing through the origin of the (z, q) coordinate system (see figure 11).
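The change of variables can be followed step by step. The derivation below is a reconstruction: it assumes z is measured from thermal ground in units of λ, which is what makes the system line pass through the origin, and the explicit expression for k is derived here rather than quoted from the article.

```latex
\begin{align}
z &= \frac{T - T_x}{\lambda}
   \quad\Longrightarrow\quad T = T_x + \lambda z \\
\text{system line:}\quad Q &= \frac{\lambda z}{\theta_{Jx}} \\
\text{device line:}\quad Q &= Q_o \, e^{T_x/\lambda} \, e^{z} \\
q &= \frac{Q}{Q_o \, e^{T_x/\lambda}} \\
\text{device line:}\quad q &= e^{z} \\
\text{system line:}\quad q &= k z,
   \qquad k = \frac{\lambda}{\theta_{Jx} \, Q_o \, e^{T_x/\lambda}} \\
\text{operating points:}\quad e^{z} &= k z
\end{align}
```

All four physical parameters collapse into the single slope k, which is why the discussion that follows can be carried out in terms of k alone.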
Finding the runaway point
Now, for points representing perfect runaway of this system (i.e., the system line being tangent to the device-line), there is an exact solution. It begins by recognizing a unique property of the pure exponential function, namely, that the slope of the function is equal to its value at every point. In other words, everywhere on the device line:
Indeed, we may make an even more interesting observation for lines tangent to the pure exponential curve (see Figure 12). Since the slope may be defined as the “rise over the run,” it must be the case that the “run” is exactly unity for every tangent extended to its z-intercept. That is, in general the distance between the z-coordinate of any point on the curve, and the corresponding z-intercept of the tangent through that point, is:
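The unit-run property follows in one line from the slope being equal to the value. In the sketch below, z_int is a name introduced here for the z-intercept of the tangent drawn through the point (z, q) on the device line:

```latex
\frac{dq}{dz} = e^{z} = q
\qquad\Longrightarrow\qquad
z - z_{\mathrm{int}} = \frac{q}{dq/dz} = 1
```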
Thus, for zR1, the runaway point arising from a shift in the original ambient:
So we find the maximum ambient causing this runaway (from Equation 21) to be:
[Figure caption: The x-axis displacement for every tangent extended to the z-intercept of the device line is unity, reducing the non-dimensional runaway junction temperature to ln(k1)-1.]
For zR2 the runaway point arising from an increase in system thermal resistance, since the original ambient is represented by z = 0, we now find from Equation 21 that:
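The non-dimensional results of this section can be summarized as follows. This is a reconstruction from the tangent property described above; k1 denotes the slope of the original system line, while k2 (the critical slope for a fixed ambient) and z_{x,max} (the shifted thermal ground) are labels assumed for this sketch.

```latex
\begin{align}
\text{ambient shift, slope } k_1 \text{ fixed:}\quad
  & e^{z_{R1}} = k_1 \;\Longrightarrow\; z_{R1} = \ln k_1,
  \qquad z_{x,\max} = \ln k_1 - 1 \\
\text{resistance increase, ground fixed at } z = 0:\quad
  & z_{R2} = 1, \qquad k_2 = e^{z_{R2}} = e
\end{align}
```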
We should now return to the original, dimensional, form of the problem to fully understand the significance of Equations 21 through 26. First, converting Equation 21 back into the original coordinates, we can express runaway temperature TR as:
This is the result previously expressed verbally, namely for every perfect runaway system line (i.e., the system line is tangent to the device line), the distance between thermal ground and the runaway point is actually fixed based solely on the power-law strength, and doesn’t depend on the particular value of thermal ground or the system line’s slope! This delta is precisely the strength of the power law, λ.
From Equation 23, we find:
This expression tells what the runaway temperature would be for a system with the same slope as the actual system, as ambient temperature increases until runaway occurs. Clearly it depends not only on the system thermal resistance θJx, but also on the power-law characteristics of the device (and not merely on λ, the strength, but also on the specific reference power level, Qo). So again from Equation 27, we have:
If instead we assume a fixed thermal ground, and want to know the largest thermal resistance for which the system will remain stable, Equation 26 tells us that:
and this runaway temperature, from Equation 25, is the fixed λ above the original thermal ground; that is:
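Converted back to dimensional form, the results discussed in the last few paragraphs would read as follows. These expressions are reconstructed by substituting z = (T − Tx)/λ and k = λ/(θJx Qo e^{Tx/λ}) into the tangent conditions; those two definitions, and the labels T_{x,max} and θ_{Jx,max}, are assumptions of this sketch.

```latex
\begin{align}
T_R - T_x &= \lambda
  && \text{(any perfect-runaway tangent)} \\
T_{R1} &= \lambda \, \ln\!\left( \frac{\lambda}{\theta_{Jx} \, Q_o} \right)
  && \text{(fixed } \theta_{Jx} \text{, ambient rising)} \\
T_{x,\max} &= T_{R1} - \lambda \\
\theta_{Jx,\max} &= \frac{\lambda}{Q_o} \, e^{-\left( T_x/\lambda + 1 \right)}
  && \text{(fixed } T_x \text{, resistance rising)} \\
T_{R2} &= T_x + \lambda
\end{align}
```

Note that Tx drops out of the expression for T_{R1}, while Qo appears in T_{R1}, T_{x,max} and θ_{Jx,max}.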
Power laws based on power
Finally, even though a real device may not follow the particular rule of thumb quoted earlier, if it follows a power law at all (hence the strength can be calculated), these general results will apply. For completeness, let us also express the power-law strength directly in terms of power (rather than current). Instead of Equation 7, we might prefer, therefore:
It may be noted that any two reference temperatures and power levels may be used to calculate λ, but a zero-temperature power level must be used in Equations 28, 31, and 32. If neither of those reference temperatures is zero, then either one may be used to calculate Qo, once λ has been calculated, using:
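A short Python sketch of this two-step calculation is given here. It assumes the device follows Q = Qo e^{T/λ}; the function names and the sample data points are hypothetical and are not taken from any data sheet.

```python
import math

def power_law_strength(t1, q1, t2, q2):
    """Return lambda (in degC) from two (temperature, power) points.

    Assumes Q = Qo * exp(T / lambda), so lambda = (T2 - T1) / ln(Q2 / Q1).
    """
    return (t2 - t1) / math.log(q2 / q1)

def reference_power(t_ref, q_ref, lam):
    """Back-calculate Qo, the power at 0 degC, from one reference point."""
    return q_ref * math.exp(-t_ref / lam)

# Hypothetical points: power doubles between 25 degC and 35 degC
lam = power_law_strength(25.0, 1.0e-3, 35.0, 2.0e-3)
q0 = reference_power(25.0, 1.0e-3, lam)
print(f"lambda = {lam:.2f} degC, Qo = {q0:.3e} W")
```

Either reference point gives the same Qo when the data truly follow a power law, which makes a quick consistency check on data-sheet values.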
The operating point itself
All these runaway limits themselves are obviously of some interest, but they remain hypothetical in the sense that if the system is designed to operate in a safe region, the operating point itself will be safely removed from these theoretical runaway points. As indicated in the preceding development, the system may still experience restoring forces for brief power or temperature excursions taking the junction temperature all the way up to the upper intersection between the system and device lines, considerably beyond either runaway temperature TR1 or TR2. The problem, of course, is that if the perturbation lasts for “very long” (somewhat vaguely defined), the system will actually attempt to find a new steady state at this upper point, which, once reached, is not stable. Worse, if the perturbation is actually caused by a permanent shift in either ambient or system resistance, then the two intersection points, and the two disparate runaway points, will converge toward a single dangerous operating/runaway point somewhere in the middle.
So how much margin does one actually have in a specific design? To answer that, one must solve Equation 19 for the particular thermal ground and system resistance of interest. Due to its transcendental nature, Equation 19 has no general closed-form solution, so it must be solved graphically or numerically. Even so, some general statements can be made. First, it has already been observed that the critical value for perfect runaway, at the designed thermal ground, results when k = e (Equation 26). This means that for any proposed combination of thermal ground and system resistance, if k > e, there are going to be two intersection points, the lower one being the stable operating point. On the other hand, if k < e, there will be no solution at all.
The first step in finding the actual operating point is therefore to compute k and see if there is a solution. (It is probably easier to calculate the ratio k/e to see if it’s greater than unity, since you probably don’t remember e, and it’s built into your calculator or spreadsheet anyway!) If there is not a solution, there are three possible approaches:
(a) lower ambient: Equation 31 gives the maximum ambient tolerable (with no margin), given the device and the originally supposed system resistance;
(b) lower system thermal resistance: Equation 32 gives the maximum tolerable theta (with no margin) given the device and the originally supposed thermal ground;
(c) a different device must be used, and given that most devices are going to have a roughly similar power-law strength, this means finding a device with a lower Qo – almost certainly meaning one with a larger piece of silicon inside.
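The k/e check and the limits behind approaches (a) and (b) can be sketched in Python. The closed-form expressions in the code are reconstructed from the article's definitions, assuming z = (T − Tx)/λ and k = λ/(θJx Qo e^{Tx/λ}); the function name and the numeric inputs are hypothetical.

```python
import math

def runaway_limits(lam, q0, theta, t_x):
    """Margins for a device Q = q0*exp(T/lam) on a linear system line.

    lam   : power-law strength, degC
    q0    : device power at 0 degC, W
    theta : system thermal resistance, degC/W
    t_x   : thermal ground (ambient), degC
    """
    # Non-dimensional slope of the system line (reconstructed definition)
    k = lam / (theta * q0 * math.exp(t_x / lam))
    # Runaway temperature when ambient rises at fixed theta
    t_r1 = lam * math.log(lam / (theta * q0))
    return {
        "k_over_e": k / math.e,   # above 1: an operating point exists
        "t_r1": t_r1,
        "max_ambient": t_r1 - lam,
        "max_theta": lam / (q0 * math.exp(t_x / lam + 1.0)),
        "t_r2": t_x + lam,        # runaway temperature at fixed ground
    }

# Hypothetical device and system values, chosen only for illustration
limits = runaway_limits(lam=18.0, q0=1.0e-4, theta=100.0, t_x=75.0)
for name, value in limits.items():
    print(f"{name}: {value:.3f}")
```

Evaluating the function again at the returned max_theta, or at the returned max_ambient, gives a k/e ratio of exactly one, which is the no-margin condition the list refers to.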
A quantitative example
Consider a particular rectifier in an SMB package. From its data sheet we find the following information:
Assuming reverse bias operating mode, Equations 34 and 35 give us, for the two reverse voltages (note similarity in λ at the two temperatures).
The data sheet also gives a psi-JL value of 25°C/W, so let’s suppose that in an actual application we can achieve a system thermal resistance of 100°C/W, in still air, and with other heat sources in the vicinity. Let’s also suppose that we’re going to have a worst-case ambient of 75°C. Given all these values, we can now compute all the remaining quantities for which we’ve previously derived expressions:
For the lower-voltage application, we’ve got lots of margin. First, the k/e ratio is much larger than unity, so we know we’ve got a stable operating point. Second, given our choice of 100°C/W for θJx, max ambient is comfortably high, and runaway temperature exceeds maximum rated junction temperature. Third, we’re nowhere near the limiting system resistance (1055°C/W), so its associated runaway temperature of only 92.9°C is no concern.
On the other hand, in the higher-voltage application, we have problems. First, the k/e ratio says there is no solution for the operating point. We could have concluded this indirectly by noting that our chosen theta exceeds the max theta calculated for the given ambient, and that our chosen ambient exceeds the maximum ambient calculated for the given theta. Unfortunately, we can’t lower the worst-case ambient, but upon further consideration, we believe that a theta of 80°C/W is quite realistic. Recomputing values, we now find that k/e has risen to 1.609, maximum ambient has risen to 83.5°C, and runaway temperature based on this theta will be 101.3°C. Both runaway temperatures are well below maximum rated temperature for the device, but we won’t know our “real” margins without calculating the actual operating point, and its unstable upper partner. So, using the goal-seek feature of a spreadsheet program, or the iterative procedure outlined in the sidebar, we find the two mathematical solutions of Equation 19, for k = 1.609e, to be z = 0.312 and z = 2.315.
These translate into a stable operating point at 80.6°C (and 0.09 W), and its unstable partner at 116.3°C (0.69 W). So no matter how you look at it, even this 80°C/W system design will succumb to thermal runaway somewhat below the maximum operating temperature of the device. We have at least established some reasonable margins around the design point.
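For readers without a spreadsheet to hand, the two roots can also be found with a few lines of bisection. The sketch below is not the sidebar procedure; it assumes the intersection equation has the form e^z = k z, and the function name is hypothetical.

```python
import math

def operating_points(k, tol=1e-10):
    """Return (lower, upper) roots of exp(z) = k*z, or None if k < e."""
    if k < math.e:
        return None  # the system line misses the device curve entirely

    def f(z):
        return math.exp(z) - k * z

    def bisect(lo, hi):
        # f changes sign on [lo, hi]; halve the bracket until it is tiny
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if (f(lo) > 0) == (f(mid) > 0):
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # f(0) = 1 > 0 and f(1) = e - k <= 0, so the lower root lies in [0, 1].
    # Grow the upper bracket until f turns positive again.
    hi = 2.0
    while f(hi) <= 0:
        hi *= 2.0
    return bisect(0.0, 1.0), bisect(1.0, hi)

z_low, z_high = operating_points(1.609 * math.e)
print(f"stable z = {z_low:.3f}, unstable z = {z_high:.3f}")
```

Because k/e = 1.609 is itself a rounded figure, the computed roots agree with the quoted 0.312 and 2.315 only to about three significant figures.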
A seeming paradox
Interestingly, Equation 30 shows that for a given device in a given thermal system, thermal runaway temperature does not depend directly on the value of thermal ground. It does, however, require defining the system thermal resistance to encompass the entire thermal path from junction down to thermal ground. A specific example will serve to illustrate what may appear to be a nonsensical consequence, but which in fact will emphasize this requirement, once understood.
Consider two systems based on the same device exposed to exactly the same set of local conditions – junction temperature, lead temperature, and power dissipation – but with two different system resistance values. According to Equation 30, the differing system resistance values will give rise to vastly different runaway temperatures (see Figure 13). Clearly, the only difference in the two applications is the combination of thermal ground and total system thermal resistance. This raises an important question: If the device is experiencing what appears to be the exact same boundary conditions, how could it possibly be sensitive to the thermal resistance beyond those boundaries, and thus have different runaway values? Or, restating the question, if we increase the power level in Case A until TJ reaches 125°C, why will runaway occur, whereas if we increase the power level in Case B until TJ reaches 125°C, it will remain stable?
The answer lies in the very reason that the total system thermal resistance must be used in calculating runaway temperature. Consider first what happens in Case A if the power is increased by a small increment, say 0.1 W. Thermal ground, of course, remains fixed at 25°C (that’s why it’s thermal ground). TL, on the other hand, rises by 10°C, and TJ rises by 15°C. Consider now Case B for the same small increment of power of 0.1 W. Again, thermal ground remains fixed (this time at 74.9°C). Now, however, TL rises by only 0.02°C, and TJ rises by only 5.02°C.
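The arithmetic behind these temperature rises is easy to reproduce. The sketch below uses resistances inferred from the rises quoted above, not stated in the text: 50°C/W junction-to-lead in both cases, and lead-to-ground values of 100°C/W for Case A and 0.2°C/W for Case B. The function name is hypothetical.

```python
def temperature_rises(delta_q, theta_jl, theta_lx):
    """Return (lead rise, junction rise) in degC for a power step in W.

    theta_jl : junction-to-lead resistance, degC/W
    theta_lx : lead-to-thermal-ground resistance, degC/W
    """
    lead_rise = delta_q * theta_lx
    junction_rise = delta_q * (theta_jl + theta_lx)
    return lead_rise, junction_rise

# Resistances inferred from the quoted rises (assumptions, see lead-in)
case_a = temperature_rises(0.1, theta_jl=50.0, theta_lx=100.0)
case_b = temperature_rises(0.1, theta_jl=50.0, theta_lx=0.2)
print("Case A (lead, junction):", case_a)
print("Case B (lead, junction):", case_b)
```

The junction-to-lead drop is identical in the two cases; the whole difference in junction sensitivity comes from the path between the lead and thermal ground.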
We haven’t been explicit about the device power-law parameters (though they went into the runaway calculation, and into the very fact that the operating points of Case A and Case B are what they are). Indeed, changing the power must be viewed as a temporary disturbance to the system anyway, since you can’t actually change the power at all and still remain in equilibrium – that’s why the power law device has a specific operating point in a particular context.
We can see this, however: The fact that a small perturbation in power pushed Case A’s TJ up by 15°C, whereas that same perturbation elevated Case B’s TJ by only one-third that amount, suggests that Case A is much more sensitive to small perturbations. (This could have been turned around to show that it only takes about a third as much power in Case A to cause a similar increment in TJ as is required in Case B. So we’d say Case B is more “robust” to power fluctuations.) We can then make the argument that because this is a power-law device, small perturbations in Case A are more dangerous than in Case B. This may be translated into a lower runaway temperature for Case A.
Another way to think about it is that if we’d designed our two systems to run at a very small margin below the runaway temperature of Case A, then due to this differing sensitivity of TJ to power, a very small perturbation in Case A would send it into runaway, whereas the same small perturbation in Case B is just that, a small perturbation.
There is yet a third way to look at it. Consider how little the lead temperature changes in Case B, for the same increment in power as in Case A. Because the lead is so much closer to thermal ground in Case B, it makes sense that at a given operating point, power and temperature at the junction should be much more firmly anchored. Potential runaway must therefore be farther away from the operating point, even though the operating point itself is the same in both applications.
Avoiding thermal runaway
We’ve focused on power-law devices for this discussion, in short because it allowed us to make some precise statements about their behavior, even if in a somewhat idealized thermal system. Nevertheless, it should be clear that regardless of the actual shape of the device line or the system line, the concepts driving a thermal design with respect to thermal runaway are generally applicable, even if they must be implemented graphically or through some other iterative method of calculation.
Perhaps the most surprising result was that for a perfect runaway system, the offset between thermal ground and runaway temperature depends solely on the power-law strength (see Equation 27). We should never design a system to operate at precisely the runaway point, however, so although this result is an interesting mathematical fact, it is only of secondary importance. In fact, there are two different theoretical runaway points – one based on holding system resistance constant and varying ambient, and the other based on holding ambient constant, and varying the system resistance. The latter is always the lower of the two, and is at the fixed power-law strength λ, from the nominal thermal ground. In many situations, relatively speaking, there will be much more margin in system resistance than in thermal ground, so the former runaway point will be the one of primary interest.
Thermal runaway has sometimes been stated exclusively in terms of the relative slopes of the device and system lines. That is, if:
then runaway will occur. It should be evident from this study that this is somewhat of an oversimplification. If both device and system lines are straight lines, it is surely true. If either or both lines are curved, however, then not only the relative slopes, but also their second derivatives and the specific placement of the lines relative to each other, will be factors. More particularly, for power-law devices, there is not just a stable operating point and potentially two runaway points, but also an unstable higher operating point. It is actually this upper, unstable operating point that limits the junction temperature and spells disaster if junction temperature exceeds this value. The system can tolerate small perturbations in ambient or system resistance, so long as the conditions do not persist. When conditions depart from the nominal for a sufficiently long period of time, however, the two operating points begin to converge, bringing the two runaway points inward with them. All four points thus coalesce at a common value and runaway then occurs. Clearly, therefore, the precise manner and duration by which the system is perturbed from the operating point dictates the exact temperature at which runaway ultimately occurs.
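In the notation of this article, the slope-based runaway criterion referred to at the start of this discussion would presumably read as follows. This is a reconstruction from the verbal description (device-line slope exceeding system-line slope), assuming the system line has slope 1/θJx:

```latex
\left. \frac{dQ}{dT} \right|_{\mathrm{device}} \;>\;
\left. \frac{dQ}{dT} \right|_{\mathrm{system}} = \frac{1}{\theta_{Jx}}
```

For the power-law device the left-hand side is Q/λ, so the criterion is first met where Q = λ/θJx, which is the tangency condition used to locate the runaway points.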
It is also certain that one can never have a stable operating point when the condition of Equation 36 is violated locally; for instance, the upper point just referred to in power-law device operation. As we saw in the case of devices with negative second derivatives, however, this only means that the lower of two theoretical operating points cannot be maintained. The system may “run away” from the lower point, but not necessarily to oblivion, only to the upper, stable point.
Accounting for all these factors, in many applications, thermal runaway is likely to occur at a temperature far below the maximum rated temperature of the device. In other words, if thermal runaway is a failure mode of the application, it may be the most stringent constraint on the thermal design.
About the author
Roger Stout is a senior research scientist and charter employee of ON Semiconductor. He received his BSE in Mechanical Engineering at ASU in 1977, and went on as a Hughes Fellow to earn his MSME at the California Institute of Technology in 1979. He then joined Motorola in the equipment engineering side of the semiconductor business, which after about four years evolved into factory automation and control engineering. In about 1990, he took on the responsibility for thermal characterization of ASIC products. Roger holds six patents, and has been a registered Professional Engineer (Mechanical) in the state of Arizona since 1983.