Eng-Tips is the largest engineering community on the Internet

Intelligent Work Forums for Engineering Professionals

  • Congratulations waross on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Standard Deviation question 2

Status
Not open for further replies.

kooko

Geotechnical
Apr 29, 2006
12
0
0
Replies continue below

Recommended for you

There is a tool that gives the vertical error bars at each point if that's what you're looking for.

Create your x-y scatter plot. Right click on the series and select "format data series". Then click on the "Y error bars" tab. Then you have several options for seting up the bar. One is a fixed absolute, one is a fixed percentage, and one is custom which allows you to select a colulmn of your spreadsheet containing the standard deviation amount for each x point (needed if it's different at each point).

=====================================
Eng-tips forums: The best place on the web for engineering discussions.
 
A more usual approach to this sort of problem is to deal in terms of confidence levels for a prediction in the value of the dependent variable (Y). These are often presented graphically, superimposed on your trend line and/or your raw data. They consist of a pair of curved lines, a concave line above the trend line and a convex line below the trend line. The curved lines come closest to the trend line when the independent variable (X) is at the mean value of the data points.

The curved lines can represent either:
(1) the bounds within which there is a 95% (say) probability that the EXPECTED value of Y will lie for a given value of X; or
(2) the bounds within which there is a 95% (say) probability that an INDIVIDUAL value of Y will lie for a given value of X.

Obviously the latter bounds lie further from the trendline than the former.

I do not believe Excel directly offers this sort of plot, or even the underlying calculations to produce the plot. However the necessary formulae can be found in any good Stats reference book, and are easily implemented.
 
Good point. I guess that figuring out the intervals of interest would be the much more challenging part than plotting them.

Excel linest function does return some statistical parameters like the standard deviation of the estimate for the slope and standard deviation of the estimate for the offset.

Let me try a simplistic stab at how we would proceed to use those to build confidenc eintervals.

Let's say we are fitting y = mx+b based on a set {xi,yi}.
Assume the independent variables xi are known exactly and there is an error in the dependent variables with a normal distribution.

The linest function will return estimates of m and b (call them mhat and bhat) along with their standard deviations (call them sm and sb).

We compute our estimate of y (the line) by yhat=mhat x + bhat

What is the standard deviation of yhat (call it sy) ?

I think it is sy = x sm + sb

Knowing the standard deviation, I think we can set up the confidence interval under an assumption of normal distribution. 95% confidence interval is two standard deviations on either side of the line. 99% confidence interval is three standard deviations on either side of the line. The bounds would look like a cone which is tightest around the line where x=0 and gets wider as x increases.

No guarantees I haven't missed something. But it sounds to me like it would get you pretty close if those assumptions are correct. If I remember right, maybe student's t needs to be used instead of a normal distribution since we are using estimated standard deviations rather than known standard deviations. And of course if you have error in your x values, that would make it more complicated as well.

=====================================
Eng-tips forums: The best place on the web for engineering discussions.
 
ElectricPete. I cannot remember the formulae off the top of my head, don't have a text book at hand, and wouldn't normally attempt to derive it from first principles anyway.

Be that as it may, you've made a good start. But I can see three immediate points to be made.

(1) The variable "b" has nothing to do with anything, and is merely a consequence of someone's arbitrary choice of an axis system. You'll get further if you express the regression equation in the form
(Y-Ybar) = m*(X-Xbar)

(2) You have added the standard deviations. If variables are independent, you add variances, not standard deviations. Thus your equation
sy = x sm + sb
if it were relevant (which it isn't) would be
V(y) = x V(m) + V(b)

(3) Your comment about possible inaccuracies in the measurement of the X values opens a whole new can, even a whole new barrel, of worms. It is a fundamental requirement of the so-called "simple linear regression model" that the explanatory variable (stats-speak for "X") is non-stochastic (stats-speak for "known exactly", or at least known to a higher order of accuracy than Y).
 
Didn't need my textbook, as I have some hand notes in my files.

Have a total of n data points.

Model is Y = a + b*X + e where e is the error, and the error is assumed to be distributed normally with mean of zero and variance of s^2.

Let E() represent expected value
Let V() represent variance
Let S() represent summation over the n points
Let Xb represent S(X)/n and Yb represent S(Y)/n

Then
E(b) = [n*S(XY)-S(X)*S(Y)]/[n*S(X^2)-(S(X))^2]
E(a) = Yb - E(b)*Xb
both of which are calculated by Excel.

Also
E(s^2) = [S((Y-Yb)^2)-E(b)*S((X-Xb)*(Y-Yb))]/(n-2)
V(b) = E(s^2)/S((X-Xb)^2)
V(a) = E(s^2)*[1/n + (Xb)^2/S((X-Xb)^2)]

Finally the results we are after
E(Y) = E(a) + X*E(b)
V(E(Y)) = E(s^2)*[1/n + (X-Xb)^2/S((X-Xb)^2)]
V(Y) = V(E(Y)) + E(s^2)

Confidence intervals follow, using Student's-T with (n-2) degrees of freedom.

(These formulae are correct on the page in front of me. At the end of a hard day, they might not be correct as I have typed them. No responsibility accepted. Check with a text book.)
 
The point being, in a handwaving sort of way, that the estimate of the mean, ie the middle of the sample, is more robust than the estimate of the gradient, therefore there is less uncertainity about the true location near the middle than at the ends.

Cheers

Greg Locock

Please see FAQ731-376 for tips on how to make the best use of Eng-Tips.
 
My first try definitely way off-base. I vaguely remembered that E( ) was a linear type functon (E[cX+dY]=cE[x]+dE[Y]), but I misapplied that concept to s instead of E (assumed s was linear). I chalk it up to the time of my last post being past my bedtime ;-) (couldn't have been my fault!)

Julian's looks much more rigorous and correct. If I read it right, it tells us that if we draw lines at two standard deviations out... they will be parallel to the original line, Right? I remember somewhere way back in my stats class a result where the confidence interval was tightest in the middle of the range of x and wider towards the ends which I think is what Greg suggested. That must have been a different type of analysis than a simple fit of a line.





=====================================
Eng-tips forums: The best place on the web for engineering discussions.
 
MMMmm......
Formulae are different for CI's and CL's, CL's (percentiles)are parallel to the regression lines, CL's are tracts of hyperbolic curves (the rationale being that, as Greg says,the uncertainty is materialize by rotating the regression lien about its mean value, so taht teh freedom is greater near the endpoints.

Should have a file built a while back, see if I can retrieve it and post it

Merry Xmas!!

[santa]
 
I'm sorry for the typo, CI's are hyperbolae, CL's (confidence levels) are parallel lines. CI's are percentiles computed on the distribution of the population's mean, CL's percentiles computed on the distribution of the limited sample.

Sorry have been relaxing during the holydays, I'm going to look for that file, and eventual references, tonight!
 
This might help, a website that talks about linear regression and standard deviation.

I am only a little bit familiar with the particulars of the derivation--this site says that "Linear regression assumes that scatter of points around the best-fit line has the same standard deviation all along the curve...." Then uses some term I have not heard before: homoscedasticity.

To calculate the standard deviation for the entire curve, calculate something called the standard deviation of the residuals, sxy:

sxy=Sqrt(SSeq/(N-2)), N is number of pairs. SSeq is the sum of the squares of the vertical distance of the points relative to the line y=mx+b (the result of your curve fit).

This website is put together by a company that sells software to do this curve fitting. I don't know the software personally. I would be curious though why this software doesn't seem to have much interest in 'sxy' to get an idea of error in the curve fit, but seems more interested in confidence and/or prediction intervals. All these quantities are useful of course.
 
Dear friends,
please find attached the .xls file: conf_bands,
where signals from 5 ground penetrometers are analyzed.

Data are taken along a depth interval from 5 different verticals in the same location.

The data display a trend. So I wanted to build, for inference purposes, a linear approximation of the trend of ground resistance (Qc) versus depth for the whole location, with its CI's and CL's.

Please note that the predictive value of the confidence levels (percentiles) tends to be poor: the 5% line, for example, rules out or nearly out all data, not just 95% of them.
This because data are clustered in particular regions of the spae of variables.
I believe a polynomial regression would be far more efficient.
Only, I could not find anything about confidence intervals and levels of a polynimial curve approximation.
Do you have any clues??

P.S.: I usually work on a black screen background, so maybe you'll need to optimize the graph colours in the attached file.
 
 http://files.engineering.com/getfile.aspx?folder=70026b2c-c775-4d5a-8b97-7fd6f181c59f&file=Conf_bands.xls
Status
Not open for further replies.
Back
Top