Eng-Tips is the largest engineering community on the Internet

Grid MTBF or Failure Rate

Status
Not open for further replies.

GaryRudolph

Computer
Nov 25, 2008
2
CA
We have a grid of 20 computers where each piece of data is written to at least two computers. A failure or loss of data would occur when two servers fail at the same time. If each computer has an MTBF of 45k hours, how do I go about calculating the system MTBF, where a system failure only occurs when 2 or more nodes fail at the same time?

We can assume 0 time to repair, because within less than a minute the backup data becomes primary and the primary creates a backup across the remaining members of the grid.

Thanks, Gary
 
I would approach it with Monte Carlo simulation. Model it in the computer. Run a few million simulations of it. Perform necessary statistics.
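A minimal sketch of what such a simulation could look like, in Python. The function name, the cycle-based model, and the assumptions (independent exponential lifetimes, data loss only when a second node fails within a fixed window after a first failure, grid fully restored after the window) are illustrative choices, not something stated in the thread.

```python
import math
import random


def simulate_loss_mtbf(n_nodes, node_mtbf_h, window_h, n_cycles, seed=0):
    """Monte Carlo estimate of the mean hours between data-loss events.

    Assumptions (illustrative, not from the thread): node lifetimes are
    independent and exponential, and data is lost only if a second node
    fails within window_h hours of a first failure. After the window the
    grid is treated as fully restored to n_nodes.
    """
    rng = random.Random(seed)
    elapsed_h = 0.0
    losses = 0
    for _ in range(n_cycles):
        # Time until any one of the n_nodes fails.
        elapsed_h += rng.expovariate(n_nodes / node_mtbf_h)
        # Time until one of the remaining n_nodes - 1 also fails.
        second_h = rng.expovariate((n_nodes - 1) / node_mtbf_h)
        if second_h < window_h:
            losses += 1
            elapsed_h += second_h
        else:
            elapsed_h += window_h
    # No observed loss means the estimate is unbounded for this run.
    return elapsed_h / losses if losses else math.inf


# A 100-hour window is deliberately exaggerated so that losses show up
# in a short run; it is not the replication time described in the thread.
print(simulate_loss_mtbf(20, 45_000.0, 100.0, 200_000))
```

With a window of about a minute (1/60 hour), the chance of a loss per cycle is roughly 19 × (1/60) / 45000, or about 7 in a million, so a direct simulation would need a very large number of cycles before the estimate settles.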
 
If you have 20 computers, how is it that you have a system failure if only two go down? Are you referring to the two computers to which you have just written the last data stream? In other words, do you really have 10 data streams and always write each stream to the same two computers, or is each data stream written to at least 2 computers chosen at random from the 20?

**********************
"Pumping accounts for 20% of the world’s energy used by electric motors and 25% to 50% of the total electrical energy usage in certain industrial facilities." - DOE statistic (Note: Make that 99.99% for pipeline companies)
 
Data is federated across all 20 computers, with each piece of data on exactly 2 computers. So, if I lose 2 computers at the same exact time (or within the replication time, which is a matter of seconds), I will have data loss. If I lose 1, the backup replicates to the other servers, and I then lose another, no problem. Basically, think of 4M pieces of data federated across 20 computers. The federation scheme guarantees even distribution.

I'm really only concerned with losing 2 computers at the same exact time.
 
Some quick calculations here.

There are 20C2 or 190 different possible computer/computer pairings for a specific data set at a given time. Of the 190 pairings, your definition of success is only interested in 1 of those pairings. Based on that, I would say the reliability is officially "very high".
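The pair count can be checked in a couple of lines of Python. The variable names are illustrative, and `p_specific_pair` simply restates the "1 of 190" reading used in this reply.

```python
import math

n_nodes = 20
pairs = math.comb(n_nodes, 2)      # unordered pairs of computers (20C2)
p_specific_pair = 1 / pairs        # chance a random pair is one given pair

print(pairs)  # 190
```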

To put some more meat on this, if you assume the computer failure rates follow an exponential distribution (not sure about how accurate it is for computers), then the reliability of 1 computer is exp(-t/MTBF).

Since the exponential distribution is continuous, the probability of the second failing at exactly the same time approaches 0. Let's be conservative and assume it's one hundredth of the first probability. This implies a higher reliability for the second computer.

The two computers would operate much like a parallel system, giving a 2-computer system reliability of: 1-(1-exp(-t/MTBF))(1-exp(-t/MTBF)).
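As a rough sketch, the single-computer and two-computer expressions can be written as small Python functions. The function names and the choice of hours as the time unit are assumptions for illustration. Note that this treats the pair as a parallel system with no repair, as the reply does.

```python
import math


def node_reliability(t_h, mtbf_h):
    """Probability one computer survives to time t_h (exponential model)."""
    return math.exp(-t_h / mtbf_h)


def pair_reliability(t_h, mtbf_h):
    """Probability at least one of two identical computers survives to t_h."""
    q = 1.0 - node_reliability(t_h, mtbf_h)   # one computer has failed by t_h
    return 1.0 - q * q                        # both must fail for the pair to fail


# Example: one year (8760 hours) with the 45k-hour MTBF from the question.
print(node_reliability(8760.0, 45_000.0), pair_reliability(8760.0, 45_000.0))
```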

Apply this probability to the binomial distribution to see what the probability of occurrence is: 20C2 * [1-(1-exp(-t/MTBF))(1-exp(-t/MTBF))]^1 * [(1-exp(-t/MTBF))(1-exp(-t/MTBF))]^189. This quickly approaches 0.
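To see how small this term is, here is a sketch that evaluates the expression numerically. It reads the expression as 20C2 × R_pair^1 × (1 − R_pair)^189, where R_pair is the 2-computer reliability given earlier in this reply; that reading, and the function names, are assumptions.

```python
import math


def pair_reliability(t_h, mtbf_h):
    """2-computer parallel reliability, 1 - (1 - exp(-t/MTBF))^2."""
    q = 1.0 - math.exp(-t_h / mtbf_h)
    return 1.0 - q * q


def binomial_term(t_h, mtbf_h, n_nodes=20):
    """Evaluate 20C2 * R_pair^1 * (1 - R_pair)^189 for the given time."""
    pairs = math.comb(n_nodes, 2)             # 190 pairings
    r_pair = pair_reliability(t_h, mtbf_h)
    return pairs * r_pair ** 1 * (1.0 - r_pair) ** (pairs - 1)


# Example: one year (8760 hours) with a 45k-hour MTBF.
print(binomial_term(8760.0, 45_000.0))
```

For short times the result underflows to 0.0 in double precision, which matches the remark that the term quickly approaches 0.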

I would never expect to see this failure mode occur in the lifetime of anyone here, based strictly on the computers' reliability performance. Best of luck!
 