Eng-Tips is the largest engineering community on the Internet

Intelligent Work Forums for Engineering Professionals


What is the minimum sample size to generate a Weibull distribution?

Status
Not open for further replies.

debun

Mechanical
Jul 29, 2008
US
One of the advantages of the Weibull is that you can form a distribution with a much smaller sample size than, say, a histogram.
I experimented with using only the first 5 data points from a sample set of 26 to do a Weibull analysis. The first 5 points fit a 2-parameter Weibull and gave an R^2 value of 0.99. The second data set (the remaining 21 points) is better fit by a 3-parameter Weibull, with an R^2 value somewhere in the neighborhood of 0.976. So my questions are:

1) What is the minimum sample size you need to accurately represent the population? For example, at sample size X the fitted PDF is within some metric (standard deviations, %, etc.) of the true PDF. Clearly the R^2 value alone isn’t a good metric.

2) I would like to plot the remaining 21 data points over the CDF calculated from my first 5 points. I have used 2 methods. The first simply calculates the median rank (MR) of each point and plots test cycles vs. MR. The second uses Beta and Eta to calculate the CDF percent and plots test cycles vs. CDF percentile. Is there a good way to represent predicted CDF vs. test data?

Here is my data set.
154173
171158
83431
201778
117578

192083
136262
149487
148009
98317
69798
94195
62548
103574
108364
132377
143047
85272
95760
214166
289237
161265
172490
99972
117440
89717
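
For reference, the 2-parameter fit on the first 5 points can be sketched as a median rank regression. This is only a minimal sketch under stated assumptions: the post does not say which ranking or regression method was used, so Benard's approximation for the median ranks and a least-squares regression of Y on X are assumed here, and the names `median_ranks` and `fit_weibull_2p` are made up for illustration.

```python
import math

# First five cycle counts from the post (the subset used for the initial fit).
first_five = [154173, 171158, 83431, 201778, 117578]

def median_ranks(n):
    """Benard's approximation to the median rank of the i-th ordered point (assumed method)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

def fit_weibull_2p(cycles):
    """Least-squares fit of the linearized Weibull CDF:

    ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta)
    """
    t = sorted(cycles)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - f)) for f in median_ranks(len(t))]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                      # slope = shape parameter
    intercept = y_bar - beta * x_bar      # intercept = -beta * ln(eta)
    eta = math.exp(-intercept / beta)     # scale parameter
    r_squared = sxy ** 2 / (sxx * syy)    # squared correlation of x and y
    return beta, eta, r_squared

beta, eta, r2 = fit_weibull_2p(first_five)
print(f"beta = {beta:.3f}, eta = {eta:.0f}, R^2 = {r2:.3f}")
```

Note that the R^2 here measures only how straight these 5 points are on Weibull paper. It says nothing about how well the line describes the other 21 points.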
 

1) 3 unknowns require a minimum of 3 data points, but if the data points are widely dispersed, then your mean squared error should at least be a guide to how many data points are required.

2) Typically, you plot the data over the predicted equation curve.
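
One way to lay the data over the predicted curve, sketched without any plotting library: fit the first 5 points, then tabulate the predicted CDF against the median rank of each of the remaining 21 points. The assumptions are loud ones: Benard's approximation for the median ranks, a Y-on-X least-squares fit, and helper names (`benard`, `fit_2p`, `weibull_cdf`) that are purely illustrative.

```python
import math

first_five = [154173, 171158, 83431, 201778, 117578]
remaining = [192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
             103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
             161265, 172490, 99972, 117440, 89717]

def benard(n):
    """Benard's median rank approximation (assumed ranking method)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

def fit_2p(cycles):
    """2-parameter Weibull fit by least squares on ln(-ln(1-F)) vs ln(t)."""
    t = sorted(cycles)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - f)) for f in benard(len(t))]
    xb = sum(x) / len(x)
    yb = sum(y) / len(y)
    beta = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
            / sum((a - xb) ** 2 for a in x))
    eta = math.exp(xb - yb / beta)
    return beta, eta

def weibull_cdf(t, beta, eta):
    """F(t) = 1 - exp(-(t/eta)^beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

beta, eta = fit_2p(first_five)

# Rank the holdout points among themselves, then compare with the prediction.
holdout = sorted(remaining)
rows = [(t, mr, weibull_cdf(t, beta, eta))
        for t, mr in zip(holdout, benard(len(holdout)))]

for t, mr, pred in rows:
    print(f"{t:>7d}  median rank = {mr:.3f}  predicted CDF = {pred:.3f}")

max_gap = max(abs(mr - pred) for _, mr, pred in rows)
print(f"largest gap between median rank and predicted CDF = {max_gap:.3f}")
```

The two columns are the two curves you would draw on the same axes (cycles on X, cumulative fraction on Y). The largest gap is a rough single-number summary of how far the holdout data sit from the 5-point prediction.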

TTFN
faq731-376
7ofakss

Need help writing a question or understanding a reply? forum1529


Of course I can. I can do anything. I can do absolutely anything. I'm an expert!
There is a homework forum hosted by engineering.com:
 
There is no specific number of data points that I have ever heard of. The number of data points is determined by repeatability, spread and ultimately the design limits. Statistically speaking all of your data points could be the ideal value but the next 100 data points could be scattered. Or your first values may be the 2 sigma values. Do your analysis, calculate the error and review the data. Then decide on the sample size. I have been stuck with sample sizes from 1 (due to cost) to 1,000.
 
"One of the advantages of the Weibull is you can form a distribution with a much smaller sample size than say a histogram."

This practice could be seen as putting the cart before the horse. Be careful not to force inadequate data into reinforcing unsubstantiated foregone conclusions that your model can in fact be described by a Weibull distribution.

 
@IRstuff or anyone that would like to answer. I'm not following how/why you would use your MSE. Isn't that just the same as the r^2? The larger r^2 gets, the smaller your SSE and MSE get; the only difference is that r^2 gives me a relative scale of 0-1.

Using MSE as you suggested, I get 0.009, and I don't think the 5 data points accurately predict the remaining 21.

How do you plot the remaining 21 points over the predictive CDF? What method do you use to calculate the Median ranking?

 
@BigInch How is one to prevent themselves from forcing "inadequate data into reinforcing unsubstantiated foregone conclusions"? I have a good fit with my Weibull, but if you find yourself with only 5 data points, how can you gauge how many more samples you will need?
 
Well, how else would you define a "fit?" If the mean squared error went down substantially, then the initial fit was poor, but that's a trap. The problem with a generalized "fit," particularly with 3 parameters is akin to polynomial fits, i.e., you keep upping the order, and the fit keeps getting better, but all it's doing is fitting the noise. The Weibull function is problematic in that there is no physical basis for the parameters, since they essentially exist simply for the sake of fitting data to a smooth curve.

Offhand, I would argue that 5 data points would be woefully inadequate for fitting 3 parameters; a minimum of 3 times the number of parameters would seem like a good place to start, but even then, their spacing and relative similarity need to be considered. Since "noise" is inherent in any sort of reliability data, more data points are generally better.

If you look at your data more closely, you can see that there is one specific datapoint that is problematic, because it seems to be substantially different from the rest.

 
@IRstuff Good point about the polynomial. Since MSE is SSE/n and SSE is used to calculate your correlation coefficient r^2, is r^2 not a suitable proxy for MSE? I can potentially see that value going up or down as n increases. MSE for the 21 points (0.03) is higher than for the 5 points (0.009). What is a suitable MSE? How do you judge sample size with MSE? It seems that you can have a good fit and still not be representative of the true population.

For the first 5 data points I used a 2-parameter fit, and for the remaining 21 I used a 3-parameter fit, based on my regression values. A lot of this seems to depend on your median ranks.
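
For what it's worth, the link between r^2 and MSE can be written out. This assumes the usual least-squares definitions (a fit with an intercept, SST taken about the mean of the y values, and MSE = SSE/n as used above); the post does not say how the software actually computes either number.

```latex
\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \\
r^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{n} \\
\Rightarrow\quad \mathrm{MSE} &= \left(1 - r^2\right)\frac{\mathrm{SST}}{n}
\end{aligned}
```

So within one data set the two move together, but across data sets they differ by the factor SST/n, the spread of the y values. That is why comparing r^2 between the 5-point and 21-point fits is not the same as comparing their MSE.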
 
Again, given a certain level of variability in the points, "representative" is a very relative term. A better R does mean a better fit, but because the points are never going to fall perfectly on a line, that's as good as it gets. That's why looking at the mean squared error is somewhat more physically meaningful, because you are minimizing the distance of each point to the predicted line. However, since the points are never perfect, getting additional points could still mean that there's a better line to be found. The best you can do is whatever you can do with your given data, but, obviously, if you must predict an equation with only 5 points, you must expect there to be a potentially higher error against the other points.

 
If you know your Weibull curve is good, then why worry about sample size? Simply pick a value and see if it falls on the curve.

If you know your data fits some Weibull curve, plot it. Then see what data values don't land on it.

If you don't know which Weibull curve is the correct one, changing sample size will just keep on giving you different Weibull curves, each with slightly different parameters than the other.

To arrive at a Weibull curve that best describes a set of n values, consider all n values and calculate Weibull parameters.

If adding new values to the sample doesn't introduce disproportionate noise, so that the curve doesn't change much from the previous curve, i.e., the curve's parameters do not change significantly, then you've got enough samples to predict the curve described by all (and presumably future) samples.




 
@BigInch In my first post the fit from the first 5 points does not fit the remaining 21 points. Hence r^2 values, albeit good for both, may not tell you how well the fit represents the true population. The 5 vs 21 split is arbitrary. If we imagine you do a test and have X samples, how do you know that that quantity is enough? That is the question.
 
Exactly. You can also ask, HOW DO YOU KNOW IT'S NOT.

If you have a sample of data and you calculate the best fit curve, it will be the best fit curve for the sample of data you have. If you change the data set and do it again, it will be the best fit curve for that data set.

If you are convinced that the Weibull curve that fits the whole sample population is the "ONE" true curve, take 5 random values from your population, get the parameters for the best Weibull fit. Now take six different, random values and calculate the Weibull parameters for that set. Keep on doing that with ever more and more random values. When you see that Weibull's parameters aren't changing much (within your margin of error), you've discovered how many values you need.
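
The procedure above can be sketched roughly as follows: draw a fresh random subsample of each size from the 26 values, refit, and watch how much the parameters move from one size to the next. Assumptions in this sketch: a 2-parameter fit by median rank regression with Benard's approximation, a fixed random seed chosen arbitrarily so the run is repeatable, and an illustrative helper name `fit_2p`.

```python
import math
import random

cycles = [154173, 171158, 83431, 201778, 117578,
          192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
          103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
          161265, 172490, 99972, 117440, 89717]

def fit_2p(sample):
    """2-parameter Weibull fit by least squares on ln(-ln(1-F)) vs ln(t)."""
    t = sorted(sample)
    n = len(t)
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]  # Benard (assumed)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - f)) for f in ranks]
    xb = sum(x) / n
    yb = sum(y) / n
    beta = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
            / sum((a - xb) ** 2 for a in x))
    eta = math.exp(xb - yb / beta)
    return beta, eta

rng = random.Random(0)  # arbitrary seed, only for repeatability

# One fresh random draw per sample size, from 5 points up to all 26.
history = []
for n in range(5, len(cycles) + 1):
    beta, eta = fit_2p(rng.sample(cycles, n))
    history.append((n, beta, eta))

# Report how far each fit moved relative to the previous sample size.
for (_, b0, e0), (n1, b1, e1) in zip(history, history[1:]):
    print(f"n = {n1:2d}  beta = {b1:5.2f} (change {abs(b1 - b0) / b0:6.1%})  "
          f"eta = {e1:8.0f} (change {abs(e1 - e0) / e0:6.1%})")
```

A single draw per size is noisy, so in practice you would likely repeat each size several times and look at the spread. What counts as "not changing much" is still your own margin of error, not something the code can decide.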

 
Bear in mind, again, there is inherent variability in the data. If you collect another 26 data points, the resultant Weibull curve may also be different. Perfect repeatability only exists in school.

 
Exactly. The number of Weibull curves possible is limited only by the amount of data you can collect and the number of combinations you can make with it. If they all happen to exactly fit one curve, I'd be suspicious. Very, very suspicious.


 