When do you know for sure?

Skogsgurra · Aug 12, 2004

This "class" of question has bothered me over the years:

We have an intermittent fault in some kind of a plant or equipment. Failures occur sometimes once or twice a day and sometimes there are one or two months between failures.

There is no possibility to provoke the system, so an accelerated test cannot be done.

Is there any standard method to have a "probable availability figure" and for how long does one have to observe the system to be able to produce such a number with, say, a 95 or even 99 percent confidence?

My thinking is that one would have to observe the system for many years to be able to say anything with some certainty. Does anyone have an idea about this?

BML · Aug 12, 2004

If you have historical data you can try to fit them to an exponential or Weibull distribution. Then keep updating your curve fit as you get new data. You will likely never know for sure the exact parameters of the distribution, since several values close together will likely yield the same approximate answer. In that case, you can look for best and worst case responses to try and put boundaries on the parameters.

Also, keep in mind that you may not be able to provoke the failure because you don't really know what is causing it. There may be some as-yet unnoticed input or combinations of inputs providing the stimulus for failure, so don't count that out yet.

Hope this helps!

Skogsgurra · Aug 12, 2004

Thanks, BML.

That would help if I wanted to describe the distribution. And perhaps that is what I need to do to be able to find how many observations (or how long a time) I need before knowing with some certainty that the problem has been eliminated.

I think that the distribution is purely random in most cases and it is for such a failure that I would like to know if there is any known method to tell the "boss" (customer, production manager, utility CEO etc) for how long one has to wait before one can say that "that's it". The only input in such a case is often notes saying that failures have occurred at such and such a time.

We often do not have more than ten or twenty observations taken over one or two years and we want to tell the "boss" that we have to observe the system for x months before we can say that the measures taken to rectify the problem have had the desired effect. It is the "x" and the corresponding confidence number I am looking for.

You are probably right. We should plot the historical data to get some picture of the failures. Problem is that I have no idea how to do that. Our statistics course was rather meager, I'm afraid.

GregLocock · Aug 12, 2004

I agree with BML, a Weibull distribution is probably the best (easy) way to examine this problem.

Perhaps less daunting is just to work out the time between failures for each occurrence. If you then plot that as a distribution graph you will be able to see what shape the distribution is. If it looks vaguely bell shaped (I'll be most surprised) then you can, say, work out the standard deviation and the mean.
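
A minimal sketch of that first step, assuming the notes can be typed in as date and time strings. The log entries below are invented for illustration and the two helper functions are hypothetical names. It turns the timestamps into times between failures in days and prints a crude text histogram.

```python
from datetime import datetime

# Hypothetical log entries. Replace with the times from your own notes.
failure_log = [
    "2003-01-04 08:00",
    "2003-01-04 20:00",
    "2003-01-06 20:00",
    "2003-02-15 20:00",
    "2003-02-16 08:00",
    "2003-04-17 08:00",
    "2003-04-20 08:00",
]

def time_between_failures(stamps, fmt="%Y-%m-%d %H:%M"):
    """Return the gaps between consecutive failures, in days."""
    times = sorted(datetime.strptime(s, fmt) for s in stamps)
    return [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]

def bin_counts(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        start = int(v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return counts

bin_width = 10  # days per histogram bar
tbf_days = time_between_failures(failure_log)
hist = bin_counts(tbf_days, bin_width)
for start in sorted(hist):
    print(f"{start:3d}-{start + bin_width:3d} days | {'#' * hist[start]}")
```

A pile of short gaps plus a long tail, as in this invented data, is the kind of shape that rules out a bell curve at a glance.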

However, just from your description so far we know you have a lot of infant mortality (that is, when you fix a breakdown you introduce new problems) so a normal distribution is unlikely.

Here's an intro to the bathtub curve

http://www.itl.nist.gov/div898/handbook/apr/section4/apr451.htm

Here's something less daunting

http://www.itl.nist.gov/div898/handbook/apr/section4/apr411.htm

That website is a terrific resource by the way.

Cheers

Greg Locock

Skogsgurra · Aug 13, 2004

Thanks Greg,

I shall dig into that. There is one thing about these failures that I didn't mention: They are not breakdowns, only failures to operate properly. The equipment works after a reset. So it is more about infrequent disturbances and not about infant mortality.

This situation occurs in many different processes. I have had it in the signalling system for a subway, in a paper machine drive, in a telephone exchange etcetera.

The measures taken are normally filtering, tightening terminals, better grounding and so on. In one case we even changed the cable type. It helped. The problem is that we cannot say with any certainty that it helped. At least not without waiting three or four (or five?) times the longest interval between earlier failures. And, with three months between episodes, that could mean at least a year's waiting. It would be nice to know when we can send the invoice and also tell the customer that we really helped.

This situation does not occur very often. Most times we find an obvious reason for the problems and then we don't have to wait and see.
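
The "three or four (or five?) times" rule of thumb above has a simple counterpart if one assumes the failures arrive at a constant average rate (a Poisson process, which is an assumption and not something the notes in this thread establish). If nothing had changed, the chance of seeing no failure during a time t is exp(-t / MTBF), so a failure-free wait of -MTBF * ln(1 - confidence) is what it takes to reject "nothing changed" at that confidence. The sketch below uses a made-up MTBF of 30 days, and `failure_free_wait` is a hypothetical helper name.

```python
import math

def failure_free_wait(old_mtbf, confidence):
    """Failure-free time needed to reject 'rate unchanged' at the given
    confidence, assuming a constant failure rate before the fix."""
    return -old_mtbf * math.log(1.0 - confidence)

old_mtbf_days = 30.0  # hypothetical mean time between failures before the fix
wait_95 = failure_free_wait(old_mtbf_days, 0.95)
wait_99 = failure_free_wait(old_mtbf_days, 0.99)
print(f"95 %: {wait_95:.0f} days ({wait_95 / old_mtbf_days:.1f} x MTBF)")
print(f"99 %: {wait_99:.0f} days ({wait_99 / old_mtbf_days:.1f} x MTBF)")
```

Under that assumption the multiplier is about 3 for 95 percent and about 4.6 for 99 percent, applied to the old mean interval and not to the longest one. It also treats the old MTBF as exactly known, which ten or twenty observations do not really justify, so the true wait is somewhat longer.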

melone · Aug 13, 2004

The first thing that I would do is try and figure out the EXACT failure mechanisms. If you can't recreate the problem, then you will have a VERY difficult time determining whether you have solved the problem.

Skogsgurra · Aug 13, 2004

Agree, melone. But sometimes you are not allowed to do that. A paper machine with USD 10 000+ production per hour is not something you are allowed to stop for investigations.

You have to work indirectly by analysing possible causes and apply probable mitigations and see if it influences the failure rate. It is this latter thing that I would like to find a more scientific approach to. If there is one.

melone · Aug 13, 2004

Then you have to decide what is more important, fixing the problem or keeping the machine running! Just because you can't take the machine down on a whim doesn't mean that you can't come up with a logical, practical, creative set of tests that can be run at one time, thus making the most of the time the machine "sits idle".

Think of it this way: how much are they spending in wasted product and the salary of the people working to solve this problem? Eventually keeping the machine running will be more costly than allowing you to get in there and truly see what's going on. Once you understand the true root cause of the problem, you turn the machine back on, and sleep well at night KNOWING that you will not run into this problem again. Does management want to solve the problem, or just "band-aid" it?

BML · Aug 16, 2004

Melone is correct. Sometimes it's hard to explain to someone that "fixing" an infrequent and low-cost problem isn't the right thing to do. It's also difficult sometimes to get someone to agree to allow you to examine the equipment, because "they have to keep it running."

Ultimately, if you want to use statistics based on MTBF, you will have to wait until the next failure occurs, compare the probability of that TBF to the MTBF before you made a change, and see if there is a significant difference.

If you are not in a position to do that, then you have to get more creative. For example, do a fault tree and reverse fault tree analysis on the equipment to try and figure out everything that could possibly go wrong. Then try to narrow that down to a list of significant factors (by DOE, for example). Of those, some can be controlled, and some not. Control the ones you can, and monitor the ones you can't. Then you should be able to at least simulate the effect on performance of changing one of the control factors. This should let you estimate a new MTBF. But again, validation will require that you actually measure the results.
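
One way to put a number on "a significant difference" between the periods before and after a change is sketched below. It is a conditional binomial test for two Poisson rates, which assumes a constant failure rate within each period. This is a swapped-in standard technique, not a method anyone in the thread has proposed. The counts and durations are invented, and `rate_drop_p_value` is a hypothetical helper name.

```python
import math

def rate_drop_p_value(failures_before, time_before, failures_after, time_after):
    """One-sided p-value: the chance of seeing this few failures after the
    change if the failure rate had in fact stayed the same."""
    total = failures_before + failures_after
    # Under 'no change', each failure lands in the 'after' period with this chance.
    share_after = time_after / (time_before + time_after)
    return sum(
        math.comb(total, k) * share_after**k * (1 - share_after) ** (total - k)
        for k in range(failures_after + 1)
    )

# Hypothetical record: 15 failures in 18 months before, 1 failure in 9 months after.
p_one_failure = rate_drop_p_value(15, 18.0, 1, 9.0)
# Same record but with no failure at all in the 9 months after the change.
p_no_failure = rate_drop_p_value(15, 18.0, 0, 9.0)
print(f"p-value with 1 failure after:  {p_one_failure:.4f}")
print(f"p-value with 0 failures after: {p_no_failure:.4f}")
```

A small p-value means the quiet period would be unlikely if nothing had changed. It says nothing about why the rate dropped, so it does not replace finding the cause.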

Hope this helps!

Skogsgurra · Nov 5, 2004

Thanks a lot guys!

Now, we have that problem again. A lifting bridge that was rebuilt about ten years ago has started to behave erratically. It used to happen once or twice a month. And now it hasn't happened for about a month. We have arranged logging of movements, brakes, motor currents etc on both sides of the bridge. It took several days to do that.

Now, the customer says that the bridge "seems to be fixed" and says that no logging is needed any more. We have been told to remove the equipment - without having any results. The bridge operator (the national sea-way authority) understands no electricity and no statistics. Nor does the bridge owner (the national road and bridge authority).

We refuse to take the equipment down before we know what is going on. And this is for several reasons:
A) We have invested (or rather the bridge operator has) a lot of money and there is no return on that investment.
B) Next time the bridge gets stuck in open position (two national roads going across it) there will be a lot of media coverage - and not very positive...
C) We have a reputation that we get to the bottom of all tasks that we undertake - and we do not want to have that reputation spoiled.

Once again: How do you tell a next to ignorant customer about these things? We can not educate him in statistics. Is there a way?

GregLocock · Nov 5, 2004

$

Work out the total cost to society of a bridge malfunction, compare it with the cost of monitoring. That'll give you some sort of break even analysis.
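
A minimal sketch of such a break-even comparison follows. Both cost figures are placeholders chosen only to make the arithmetic visible; they are not estimates for any real bridge, and `break_even_events_per_year` is a hypothetical helper name.

```python
def break_even_events_per_year(monitoring_cost_per_year, cost_per_malfunction):
    """Malfunction rate at which the expected yearly cost of malfunctions
    equals the yearly cost of monitoring."""
    return monitoring_cost_per_year / cost_per_malfunction

monitoring_cost = 20_000.0    # placeholder: yearly cost of keeping the logging
malfunction_cost = 250_000.0  # placeholder: cost to society of one malfunction
threshold = break_even_events_per_year(monitoring_cost, malfunction_cost)
print(f"Break-even at {threshold:.2f} malfunctions per year")
```

If the expected number of malfunctions per year is above the threshold, the monitoring costs less than the risk it addresses, on the assumption that the logged data actually leads to a fix.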

I've just done a course in statistics. Sadly it doesn't answer your question, but a Weibull analysis is our main tool for looking at life type data for small numbers of samples.

Cheers

Greg Locock

Skogsgurra · Nov 6, 2004

Thanks Greg,

I just wrote them a letter regarding cost comparison. I threw in media reactions as well. I think that will start them thinking...
