Space Shuttle Endeavour launch delays - electronic box failures.


DHambley (Electrical)
About the "electronic box failures" which are preventing the space shuttle launch - There should be NO failures of any system for Christs sake! We spend so much tax money for engineering firms to "prove" that every system is ultra-reliable for aerospace, so these continual failures are unacceptable.

If you are involved with the design of spacecraft systems, please respond with your opinion about why these "computer failures" and "electronic failures" have been continually happening for the last 30 years, every time a shuttle is about to launch.

My opinion: NASA is more impressed with compliance with its paperwork bureaucracy than with actual calculations. I have argued until I'm sick of it with NASA reliability engineers who would rather blindly follow an incorrect parameter on a data sheet than the actual laws of physics.
I've seen it happen: after a design review with NASA shows that everything is perfect because it complies with their requirements, you go down to the production floor, turn on the system to run a test, and it breaks. It freaking BREAKS, one day after "proving" to NASA that it's going to last 20 years.
 
There seems to be some confusion between random failure and infant mortality.

Despite nearly four decades of service, I would consider the Shuttle to be still in initial low-rate production, at best. The other observation one should make is that the whole point of pre-launch testing is to minimize the number of failures DURING flight. Better that they find failures now rather than in the middle of something more life-threatening. We have had relatively few in-flight failures, and no failures that have crippled the Shuttle, so, knock on wood, the paperwork seems to be correct, overall.
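To make the distinction concrete, here is a minimal sketch (my own illustration, with invented numbers): "infant mortality" is a decreasing hazard rate that burns off early in life, while "random failure" is a constant hazard rate. A Weibull life model with shape < 1 captures the former; shape = 1 collapses to the constant-rate (exponential) case.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) for a Weibull(shape, scale) life model."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# shape < 1: infant mortality (hazard falls as the weak parts are weeded out)
# shape = 1: random failure (constant hazard, the usual electronics assumption)
for hours in (1, 10, 100, 1000):
    infant = weibull_hazard(hours, shape=0.5, scale=1000.0)
    random_rate = weibull_hazard(hours, shape=1.0, scale=1000.0)
    print(f"{hours:>5} h   infant h(t)={infant:.5f}   random h(t)={random_rate:.5f}")
```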

TTFN

FAQ731-376
Chinese prisoner wins Nobel Peace Prize
 
A "zero-failure" design's impracticality is not the cost, but the SWaP (size weight and power).

All things, mechanical and electronic, have failures. Therefore, making something failure-free over its intended lifetime would require sufficient redundancy, spares, and data encoding to maintain the appearance of a lack of failures under every mission condition. That much extra hardware would most likely have made the Shuttle unflyable and unlaunchable.
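To put rough numbers on that (purely illustrative; nothing here comes from Shuttle data): suppose a single box has a 1-in-1000 chance of failing during a mission and you want to drive the system-level chance toward "never". Every added redundant copy multiplies the size, weight, and power spent on that one function.

```python
def mission_failure_prob(p_single, n_copies):
    """System fails only if every redundant copy fails (independence assumed)."""
    return p_single ** n_copies

p_box = 1e-3      # assumed per-mission failure probability of one box
target = 1e-9     # an arbitrary "practically never fails" target

n = 1
while mission_failure_prob(p_box, n) > target:
    n += 1
print(n)  # 3 -> three copies of the box, i.e. 3x the SWaP for that one function
```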


TTFN

FAQ731-376
Chinese prisoner wins Nobel Peace Prize
 
Greg, sorry to say, your rude response shows who is actually "clueless" here. You didn't comprehend the question: why do you think these failures continue to occur? You obviously don't know, do you? So don't bother to comment.
 
Ah, did baby get upset?

So, in your undoubtedly vast experience of system build/integration, how would you avoid infant mortality and build problems with prototype parts? Why would that approach be cheaper than NASA's?


I wouldn't mind seeing some explanation for the exact cluelessness of your post. "There should be NO failures of any system, for Christ's sake!" Ridiculous. Childish. Petulant.

Cheers

Greg Locock


New here? Try reading these, they might help FAQ731-376
 
One saying from all the jobs I've worked on: "Those who haven't broken anything haven't done anything." It's aimed more at complaining managers than anything else, but a modified version of it applies here. Failures happen; be happy the guys who designed the Built-In Test Equipment did a good job.

I hear Thomas Edison was a huge screwup; all his crap failed until he got it right.
 
"A "zero-failure" design's impracticality is not the cost, but the SWaP (size weight and power). "


I sat and thought about that for a few days, IRstuff, and in general I agree with you: for a system that "must not fail", one builds in hefty safety margins to guard against the unknowns in the design.

But for a flightweight system, one can design closer to the margins if you have the luxury of time (or at least manpower) and money, so basically money. Test a bazillion possible ideas to find the most reliable combinations, test those across the entire design space of possible conditions and combinations of same, iterate...

It never really happens, but it IS possible. Something close to that happened in some places, for some systems, during WW2, and something somewhat less comprehensive during the Apollo build-up.
 
Electronics, unfortunately, are different from mechanical structures. A mechanical structure must experience stress in order to fail; no stress, no failures, for the most part. Electronics are fundamentally different, which is why they are modelled with a constant failure rate: a portion of the failures are completely independent of applied stress, or, more precisely, the mere act of applying power and operating them induces the failures.

That said, one can decrease the failure rate by cooling the electronics, so a device that's, say, 50°C cooler would have roughly 320 times longer MTBF, but it still would not be infinite.
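For the curious, that temperature claim follows from an Arrhenius-type acceleration model. The activation energy and temperatures below are assumptions for illustration, not from any particular part's data sheet.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def mtbf_multiplier(t_hot_c, t_cool_c, ea_ev=1.0):
    """Arrhenius acceleration factor: how much longer the MTBF gets when run cooler."""
    t_hot = t_hot_c + 273.15
    t_cool = t_cool_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_cool - 1.0 / t_hot))

# Dropping a junction from 75 C to 25 C with Ea ~ 1 eV:
print(round(mtbf_multiplier(75, 25)))  # ~270x, the same ballpark as the figure above
```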

TTFN

FAQ731-376
Chinese prisoner wins Nobel Peace Prize
 
"Mechanical structure must experience stress in order to fail, i.e., no stress, no fails, for the most part. Electronics are fundamentally different. That's why they are modelled with a constant failure rate model, because there are a portion of failures are completely independent of stress, or more precisely, just the mere fact of applying power and operating them induces the failures. "

Err, right, but the application of power creates thermal gradients, which then generate thermal stresses. Designing those little semiconductor chips to better tolerate (for instance) repetitive thermal gradients will reduce failure rates dramatically. OK, there are also other things that happen due to temperature alone (diffusion of dopant layers, for one), but if that is a design constraint, one can design around it or optimize the design to mitigate its effect.
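As a rough illustration of why taming the thermal swing pays off so well, the Coffin-Manson relation often used for solder joints and interconnects says cycles-to-failure scale as a power of the temperature swing. The exponent and baseline numbers below are assumed, not measured.

```python
def cycles_to_failure(delta_t_c, n_ref=10_000, delta_t_ref_c=100.0, m=2.5):
    """Coffin-Manson style estimate: Nf = Nf_ref * (dT_ref / dT)^m."""
    return n_ref * (delta_t_ref_c / delta_t_c) ** m

print(round(cycles_to_failure(100.0)))  # 10000 cycles at a 100 C swing (reference point)
print(round(cycles_to_failure(50.0)))   # ~56600 cycles if the swing is halved
```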

In my mind, the biggest problem with optimizing an electronic design is the sheer number of variables. How many bazillion individual junctions do uCs have these days, and is there any way we could analyse those IC configurations, and the software that runs on them, to give the optimal distribution of power across the chip face? I know Intel does that to some degree, but they must design for a certain "nominal" user type. What if that chip were the core of a new launch-vehicle controller; would it be worth the cost to provide an optimal design for the control IC and software...

Greg, that quote is just a 21st century paraphrase of Murphy's law.
 
"but if that is a design constraint, one can design around it or optimize the design to mitigate its effect."

Not really. You'd have to use design rules from 20 years ago, which would result in chips that are 400 times less dense; hence the earlier point about SWaP. And, as you may recall, even devices from 20 years ago were not immune to failures.
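The 400x figure is just the square of the linear shrink; assuming roughly 1 µm design rules circa 1990 versus roughly 50 nm rules today (both round numbers on my part):

```python
old_rule_nm = 1000   # assumed ~1 micron design rules, circa 1990
new_rule_nm = 50     # assumed ~50 nm rules today
print((old_rule_nm / new_rule_nm) ** 2)  # 400.0: the same circuit needs ~400x the area
```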

To date, no one has ever designed a process or device that was completely immune to failure; I don't think it can physically be done. Failure can be minimized by grossly increasing SWaP, but it cannot be eliminated.

TTFN

FAQ731-376
Chinese prisoner wins Nobel Peace Prize
 
"You'd have to use design rules from 20 years ago"

Maybe. Or maybe you would have to use a new set of design rules, based upon research and testing appropriate to current-scale technology. I know some chip-design houses do this type of research; I don't know what the outcome was. I do know that if you don't do the research, you can count on not knowing what those "new" design rules might be.

OK, I do get your point, IRstuff. Yes, the quicker/cheaper/better route to high reliability is to use old, clunky hardware and known systems that can be designed for robustness (vs. bleeding-edge hardware that nobody really knows very well). But to be clear, I don't think true "zero failure" is possible either, just that one can achieve (say) an order of magnitude of reduction in failure rate for n orders of magnitude more spending on design refinement. And at some point the risk of failure within the mission lifetime can be reduced to an acceptable level... if you have done the math and the testing, and have a clear number in mind for "acceptable level".
 
The research you refer to has already been done, and the result is the technology as we know it today. The failure rates of individual transistors have dropped drastically over the course of IC development in the last 40 years. That's one of the fundamental reasons that MIL-HDBK-217, as originally written, is no longer useful; its predictions of MSI and LSI device failures based on SSI failure rates were completely out of whack.

A typical failure is junction spiking, where the metallization diffuses into the semiconductor, penetrates the junction, and shorts it out. Different metallization chemistries and barrier materials have already been developed to counter that, which became necessary when junction depths dropped from the 1.5 microns of 30 years ago to the 0.1-micron regime of today. The only mitigation available beyond that is to reduce the current flow through the junction, which would slow everything down and require massively more devices running in parallel to make up for the lost speed. However, the additional devices would still add to the overall failure rate, and at some point the redundancy-control electronics would become a significant fraction of the operational circuitry and would also contribute to the failure rate.
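To put made-up numbers on that last point: even with triple-modular redundancy, the voting/redundancy-control logic sits in series with everything else and can end up dominating the failure budget. The failure rates below are invented purely for illustration.

```python
import math

LAMBDA_CORE = 100e-9    # assumed failures/hour for one functional block
LAMBDA_VOTER = 20e-9    # assumed failures/hour for the redundancy-control logic

def tmr_unreliability(lam_core, lam_voter, hours):
    """TMR block: fails if 2 or more of the 3 cores fail, or if the voter fails."""
    p = 1.0 - math.exp(-lam_core * hours)        # one core failing during the mission
    two_or_more = 3 * p**2 * (1 - p) + p**3      # at least two of three cores down
    p_voter = 1.0 - math.exp(-lam_voter * hours)
    return 1.0 - (1.0 - two_or_more) * (1.0 - p_voter)

print(tmr_unreliability(LAMBDA_CORE, LAMBDA_VOTER, hours=10_000))  # ~2e-4, dominated by the voter
```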

I spent a big chunk of my early career on fault-tolerant processor systems, and there were always structures or places that could not be adequately protected, no matter what you did.

TTFN

FAQ731-376
Chinese prisoner wins Nobel Peace Prize
 
Following you pretty well, IRstuff, though this is not my area of expertise at all. I do know that what you say about the old MIL-HDBK-217 is true. I will have to accept your original statement then, at least regarding the latest generation of uCs.

"The only mitigation available " ... I.e. the one we've found so far. If it was important enough, who knows what new techniques might be found if money was no object? Do some of the newer 3D stacking gate technologies help mitigate, i.e. having reduced current flow per gate, but more gates per junction?

 
Not really. The 3D thrust is more out of desperation, since we're (finally) near the actual limits of the physics: the gate dielectrics cannot get much thinner, the junctions cannot be placed much closer, and the transistor widths cannot be much smaller. As it is, we've blown through what we thought were fundamental limits 30 years ago, and by quite a margin.

There's always the possibility of a breakthrough, but we've generated several generations of PhDs whose careers were based on the study of semiconductor technology, and no such breakthrough has come out of all that brainpower. But who knows?

TTFN

FAQ731-376
Chinese prisoner wins Nobel Peace Prize
 
I had a physics prof back in the early '80s who made this comment on Shuttle failures/delays: "When you take something with ten million components, and the odds of failure are 10,000,000:1, what do you think is going to happen?"

I guess that is how Greg's comment is derived.
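For what it's worth, the arithmetic behind that quip checks out:

```python
n = 10_000_000            # components
p = 1 / 10_000_000        # each one's chance of acting up
print(1 - (1 - p) ** n)   # ~0.63, roughly 1 - 1/e: a failure somewhere is the expected outcome
```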

Scott

I really am a good egg, I'm just a little scrambled!
 