Grid MTBF or Failure Rate

GaryRudolph · Nov 25, 2008

We have a grid of 20 computers where each piece of data is written to at least two computers. A failure, or loss of data, would occur when two servers fail at the same time. If each computer has an MTBF of 45k hours, how do I go about calculating the system MTBF, where a system failure only occurs when 2 or more nodes fail at the same time?

We can assume 0 time to repair, because in less than a minute the backup data becomes primary and the primary creates a backup across the remaining members of the grid.

Thanks, Gary

ReliaEng · Dec 15, 2008

I would approach it with Monte Carlo simulation. Model it in the computer. Run a few million simulations of it. Perform necessary statistics.
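A minimal sketch of such a simulation, under assumptions that are not stated in the thread: node failures are exponentially distributed, a failed node is replaced immediately by an identical one, and data is lost whenever two different nodes fail within one replication window of each other. The function names and the `window_hours` parameter are hypothetical.

```python
import random


def simulate_time_to_data_loss(n_nodes, mtbf_hours, window_hours, rng):
    """Simulate the hours until two different nodes fail within window_hours."""
    rate_all = n_nodes / mtbf_hours  # combined failure rate of the whole grid
    t = 0.0
    last_node = None
    while True:
        gap = rng.expovariate(rate_all)  # time to the next failure anywhere in the grid
        t += gap
        node = rng.randrange(n_nodes)  # assumed: every node is equally likely to fail
        # Data loss: a different node failed before re-replication finished.
        if last_node is not None and node != last_node and gap < window_hours:
            return t
        last_node = node


def estimate_system_mtbf(n_nodes, mtbf_hours, window_hours, trials, seed=0):
    """Average the simulated time to data loss over a number of trials."""
    rng = random.Random(seed)
    total = sum(
        simulate_time_to_data_loss(n_nodes, mtbf_hours, window_hours, rng)
        for _ in range(trials)
    )
    return total / trials
```

With a realistic window of about one minute, each trial has to step through on the order of a hundred thousand individual failures before a loss occurs, so a plain Python loop is slow. Trying it first with an exaggerated window (say 100 hours) is a quick way to check the model.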

BigInch · Jan 4, 2009

If you have 20 computers, how is it that you have a system failure if only two go down? Are you referring to the two computers to which you have just written the last data stream? In other words, do you really have 10 data streams and always write each stream to the same two computers, or is each data stream written to at least 2 computers chosen at random from the 20?


GaryRudolph · Jan 4, 2009

Data is federated across all 20 computers, with each piece of data residing on exactly 2 computers. So if I lose 2 computers at exactly the same time (or within the replication time, which is a matter of seconds), I will have data loss. If I lose 1, the backup replicates to the other servers, and then I lose another, no problem. Basically, think of 4M pieces of data federated across 20 computers. The federation scheme guarantees even distribution.

I'm really only concerned with losing 2 computers at the same exact time.
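For a rough feel of the numbers, the scenario above can be approximated in closed form. This is only a sketch under assumptions not confirmed in the thread: exponential failures with rate 1/MTBF per node, and a one-minute replication window (the `window_hours` value and the function name are hypothetical). The loss rate is taken as the rate of a first failure anywhere in the grid, times the probability that any of the other nodes fails inside the window.

```python
import math


def approx_system_mtbf(n_nodes, mtbf_hours, window_hours):
    """Approximate hours between data-loss events for a two-copy grid."""
    lam = 1.0 / mtbf_hours                 # per-node failure rate (assumed exponential)
    first_failure_rate = n_nodes * lam     # any node failing
    # Probability that at least one of the other nodes fails inside the window.
    p_second = -math.expm1(-(n_nodes - 1) * lam * window_hours)
    return 1.0 / (first_failure_rate * p_second)


system_mtbf = approx_system_mtbf(20, 45_000.0, 1.0 / 60.0)
print(f"{system_mtbf:.3g} hours")
```

With these inputs the result is on the order of 3e8 hours. It scales inversely with the window length, so the figure depends heavily on how long re-replication really takes.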

ReliaEng · Jan 5, 2009

Some quick calculations here.

There are 20C2 or 190 different possible computer/computer pairings for a specific data set at a given time. Of the 190 pairings, your definition of success is only interested in 1 of those pairings. Based on that, I would say the reliability is officially "very high".

To put some more meat on this, if you assume the computer failure rates follow an exponential distribution (not sure about how accurate it is for computers), then the reliability of 1 computer is exp(-t/MTBF).

Since the exponential distribution is continuous, the probability of the second computer failing at exactly the same time approaches 0. Let's be conservative and assume it's one hundredth of the first probability. This implies a higher reliability for the second computer.

The two would operate much like a parallel arrangement, giving a 2-computer system reliability of: 1 - (1 - exp(-t/MTBF)) * (1 - exp(-t/MTBF)).
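The two formulas above can be evaluated directly. This is a small sketch: the one-year mission time of 8760 hours is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math


def reliability(t_hours, mtbf_hours):
    """Probability that one computer survives to t_hours (exponential model)."""
    return math.exp(-t_hours / mtbf_hours)


def parallel_pair_reliability(t_hours, mtbf_hours):
    """Probability that at least one of two identical computers survives to t_hours."""
    r = reliability(t_hours, mtbf_hours)
    return 1.0 - (1.0 - r) * (1.0 - r)


single = reliability(8760.0, 45_000.0)
pair = parallel_pair_reliability(8760.0, 45_000.0)
print(f"single: {single:.4f}, pair: {pair:.4f}")
```

Over one year with a 45k-hour MTBF this gives roughly 0.82 for a single computer and roughly 0.97 for the pair, treating both as unrepaired for the whole year.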

Apply this probability to the binomial distribution to see what the probability of occurrence is: 20C2 * [1 - (1 - exp(-t/MTBF)) * (1 - exp(-t/MTBF))]^1 * [(1 - exp(-t/MTBF)) * (1 - exp(-t/MTBF))]^189. This quickly approaches 0.
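To see how small this gets, the expression can be evaluated in log space, which avoids floating-point underflow. This sketch takes the expression at face value, as 190 times the pair reliability to the power 1, times the pair unreliability to the power 189. The function name and the one-year value of t are assumptions for illustration.

```python
import math


def log10_binomial_term(t_hours, mtbf_hours, n_pairs=190):
    """log10 of n_pairs * p_pair^1 * q_pair^(n_pairs - 1), for t_hours > 0."""
    r = math.exp(-t_hours / mtbf_hours)  # single-computer reliability
    q_pair = (1.0 - r) ** 2              # both computers of a pair have failed
    p_pair = 1.0 - q_pair                # parallel-pair reliability
    return (
        math.log10(n_pairs)
        + math.log10(p_pair)
        + (n_pairs - 1) * math.log10(q_pair)
    )


print(log10_binomial_term(8760.0, 45_000.0))
```

For t equal to one year the base-10 logarithm comes out near -282, and it becomes even more negative for shorter times.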

I would never expect to see this failure mode occur within the lifetime of anyone here, based strictly on the computers' reliability performance. Best of luck!
