
Eng-Tips is the largest engineering community on the Internet

Intelligent Work Forums for Engineering Professionals

MTBF Degradation of CPUs After Sustained Overtemp

Status
Not open for further replies.

CliffMichael

Electrical
Sep 7, 2002
6
I am seeking reliability white papers or other relevant information on MTBF calculations, or even a "seat-of-the-pants" assessment, of the long-term degradation of microprocessors used in common PC servers after sustained ambient overtemperature conditions.

Background: An operating server room's HVAC system failed over a weekend. Recorded ambient and CPU temps approached 80-100 degrees C. After the HVAC was restored, all systems appeared to function normally.

Question: For a given life expectancy of microprocessors, is it possible to assess or calculate a percentage loss-of-life reliability quotient?

Thanks, Cliff Michael
cliffmichael@netscape.net
918 625-1563
 
48 to 72 hours at a moderately high temperature like that won't have a significant impact on a processor's reliability, but it will have some. For instance, "Intel's internal goal is that the failure rates of systems in service be less than 1% cumulative at 7 years and less than 3% cumulative at 10 years." Now, I think that's pretty aggressive when you consider how many tens of millions of processors they ship per year.

While I'd have to spend a few moments researching current thinking on activation energies for current microprocessor technologies, let's just assume that an aggressive rule of thumb for the acceleration of thermally activated failure mechanisms is 10x for every 10 degC above some base temperature, and that the base ambient is 50 degC (not unlikely for a CPU's local ambient in a server). Then even the 100 degC temp (I'm assuming that's ambient, not die temp; if it's die temp, "who cares" is the answer) results in only 3 days * 10 * 5, or 150 days, of life reduction.

Now, we're playing with statistics here, and calculating the actual impact on a single device is of course impossible (and even on your room full of devices). Given that a server's useful life is probably under 5 years, this 5-month impact is probably not much to worry about since, depending on the supplier, the design life is 7 to 20 years...

Other things besides the processor are far more likely to show degraded performance due to that overtemp condition, such as electrolytic capacitors. So watch your power supplies...
 
Oops, for accuracy's sake, let me correct this. If the acceleration factor is 10x per 10 degC, then a 50 degC increase in temperature would result in a 10^5 (100,000x) acceleration factor, not 10x5. Fortunately, the more widely accepted and utilized rule-of-thumb acceleration factor is more like 2x for every 10 degC; that results in a more reasonable 2^5, or 32x, acceleration factor, or about 96 days of equivalent aging. Still inconsequential. That's why processors should be able to withstand 1000 hours of life test at 125 degC with zero fallout.
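The corrected arithmetic above is easy to sketch. This is a hypothetical helper, not a vendor formula; it just applies the rule-of-thumb that each 10 degC above the reference ambient multiplies the aging rate by some base factor (2x per the common rule, 10x per the aggressive one):

```python
def acceleration_factor(t_stress_c, t_ref_c, base=2.0):
    # Lifetime-consumption multiplier: one factor of `base`
    # for every 10 degC above the reference ambient.
    return base ** ((t_stress_c - t_ref_c) / 10.0)

def equivalent_aging_days(duration_days, t_stress_c, t_ref_c, base=2.0):
    # Days of "normal-temperature" life consumed during the excursion.
    return duration_days * acceleration_factor(t_stress_c, t_ref_c, base)

# 3 days at 100 degC ambient against a 50 degC reference:
print(equivalent_aging_days(3, 100, 50))        # 2x/10C rule: 96.0 days
print(equivalent_aging_days(3, 100, 50, 10.0))  # 10x/10C rule: 300000.0 days
```

The second line makes the original slip obvious: under the 10x rule the factor is 10^5, not 10*5, which is why the 2x rule is the one normally quoted.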

Sorry for any confusion.

Mike
 
Hi, I've been walking around with a question in my head, and after seeing this thread I thought to ask it here. Apart from abnormal conditions (high temperature, etc.), does electronic equipment age? Or is aging just the accumulated effect of, say, overtemps, spikes, surges, and so on? And are 'aging' and 'degradation' the same thing?



 
In short, electronics does age independently of overstress events.

Classic failure-rate analysis is governed by the Arrhenius equation. Thus, all non-catastrophic overstresses represent accelerations of the basic failure rate; hence, burn-in can be used to accelerate failures and/or to predict what the failure rate will be under given conditions.
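The Arrhenius relationship mentioned above gives the acceleration factor between two temperatures as AF = exp((Ea/k)(1/T_use - 1/T_stress)), with temperatures in kelvin. A minimal sketch, assuming a typical (illustrative, not mechanism-specific) activation energy of 0.7 eV:

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    # Acceleration factor of the stress temperature relative to the
    # use temperature; ea_ev is an assumed activation energy (real
    # failure mechanisms span roughly 0.3-1.0 eV).
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_use_k - 1.0 / t_stress_k))

# 1000 h of life test at 125 degC relative to a 55 degC use ambient:
af = arrhenius_af(55, 125)
print(round(af, 1), round(1000 * af), "equivalent use hours")
```

With these assumed numbers the factor works out to roughly 70-80x, which is why a 1000-hour high-temperature life test can stand in for several years of field use.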


TTFN
 
Electronic equipment does age. There is generally a shelf life associated with each component that makes up the equipment. Most notable are electrolytic capacitors, which tend to dry out over time.
 
I may stir up a hornet's nest here, but over the last five years (maybe more) a general belief has developed that temperature alone does not affect the reliability of semiconductors as much as was once thought, especially not VLSI. Also, surface-mount components seem to be more reliable than their through-hole counterparts.

The real killers for electronics reliability are temperature cycling and vibration. These create mechanical failures in areas we don't usually pay enough attention to: solder joints and plated through-holes to inner layers, for example.
 
sreid, no hornet's nest at all. It's my experience that mechanical failure is the first-order problem with most assemblies; MIL-HDBK-217 and Bellcore-based reliability calculations/estimates reflect that fact. It's not "electronics reliability," though, that's disproportionately affected by temperature cycling and vibration; it's mechanical reliability. So that's one reason good design practice might include part-count reduction and integration: fewer solder joints.

Reliability of semiconductor devices is affected by both thermal and electric-field stress. Hot-electron effects, for instance, are only mildly affected by temperature, but raise the voltage and watch out! Punch-through in short-channel devices (and similar problems like leakage) is voltage-driven; that's the main reason VDD keeps going down in new device generations. They'd fail in a second operating at 5 V!

Mike

--
Mike Kirschner
Design Chain Associates, LLC
 