
Linear regression analysis for y = ax + b

Status
Not open for further replies.

Dandalf

Civil/Environmental
Sep 2, 2003
I need to know whether I can use the linear regression analysis function in Excel to compare two sets of values which, although not related by any functional relationship, should be equally paired.
More specifically, in-situ pH determinations are being compared with lab-determined values from sample analysis. These values should be the same (x = y). Having plotted a scatter graph in Excel of one set of values versus their paired values, and fitted a line of best fit (by least squares): is this applicable when looking for a bias, and also when comparing the R-squared values of the line of best fit as outlier points are removed?
Thanks all,
Dan ;)
 

I would also consider the 95% confidence interval about the line to compare the values. If the line y = x lies in that region then your hypothesis would be supported; if it lies outside the region then you couldn't be confident that it is true. A line of best fit on its own isn't much use really. Sadly, Excel doesn't do confidence intervals and you'd have to calculate them yourself. It's better to use statistical software if you're not familiar with the theory.
I have seen a method for ignoring "rogue" points from your data but I'd be suspicious of it unless there were some reason in the experiment for excluding them.
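The confidence-band check suggested above can be sketched numerically. Here is a minimal Python version using made-up paired pH readings; all names and numbers are illustrative, not from the thread:

```python
# Sketch: 95% confidence band about a least-squares line, then check
# whether the line y = x lies inside it. Data are invented.
import numpy as np
from scipy import stats

def regression_ci(x, y, x_new, alpha=0.05):
    """Fit y = a*x + b by least squares; return fitted values at x_new
    and the half-width of the confidence band for the mean response."""
    n = len(x)
    a, b = np.polyfit(x, y, 1)                  # slope, intercept
    s2 = np.sum((y - (a * x + b)) ** 2) / (n - 2)   # residual variance
    sxx = np.sum((x - x.mean()) ** 2)
    # Standard error of the mean response at each x_new
    se = np.sqrt(s2 * (1.0 / n + (x_new - x.mean()) ** 2 / sxx))
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    return a * x_new + b, t * se

rng = np.random.default_rng(0)
lab = np.linspace(6.0, 8.5, 20)                 # lab-measured pH
insitu = lab + rng.normal(0, 0.05, lab.size)    # paired in-situ pH
fit, half = regression_ci(lab, insitu, lab)
# Does the line y = x lie inside the 95% band at every point?
inside = bool(np.all((lab >= fit - half) & (lab <= fit + half)))
print("y = x inside 95% band everywhere:", inside)
```

Note that the band is narrowest at the mean of x and widens toward the extremes, which matters when judging agreement at the ends of the pH range.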
 
I agree with both of Corus's points.

On the first point, Excel's built-in regression function will not directly give you the confidence bands around your regression line, but it will give you the various statistical measures that you will need to calculate them. (The method is described in most introductory statistics texts.)

On his second point, the one concerning outliers, he is absolutely correct. You should not even start to consider any statistical testing for outliers unless you have strong a priori reasons for suspecting that your data might contain them. Otherwise, based on a whim, you will end up distorting your data.

If you do have genuine reasons, then there are many papers around that will give you a method for detecting an outlier in a single sample. (For example Dixon's test as described in the statistics textbook by Sokal & Rohlf, or Grubbs's method, whose reference I could find if pressed.) But you have two samples, supposedly tightly correlated. An extremely high value of y might look odd if examined only in the context of the set of y-observations, but it might not be odd at all if it was paired with a high value of x. The variable that you test for outliers will have to be some sort of composite variable, probably the deviation of the y variable from the calculated regression line.

But even then you are not out of the woods, because the variance of this variable is not constant. (Horrible sentence, I know, but I cannot think of a better way to express it.) Technically, the variable you are testing is not homoskedastic, and this invalidates the methodology of the test. One way around this is to "normalise" the variable before you test it, by dividing each value by the standard error of its expected value. (This "expected value" is what comes out of the original regression, and the formula for its variance should be given in whatever textbook you use to establish the confidence bands around your regression line, as discussed in my second paragraph.)
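The normalisation described above amounts to computing standardized residuals. A minimal Python sketch, with one deliberately planted suspect point (the data and the |r| > 2 threshold are illustrative only, not a substitute for a proper a priori justification):

```python
# Sketch: standardize each deviation from the fitted line by its own
# standard error, so points at different x are comparable. Data invented.
import numpy as np

def standardized_residuals(x, y):
    n = len(x)
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    s2 = np.sum(resid ** 2) / (n - 2)          # residual variance
    # Leverage: residual variance is smaller at high-leverage points
    h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    return resid / np.sqrt(s2 * (1 - h))

x = np.array([6.1, 6.5, 6.9, 7.2, 7.6, 7.9, 8.3])   # lab pH
y = np.array([6.1, 6.5, 6.9, 7.2, 8.6, 7.9, 8.3])   # in-situ pH, one suspect
r = standardized_residuals(x, y)
print(np.round(r, 2))
# The fifth point stands out with |r| > 2, a common rough flag.
```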

HTH
 
Cheers guys! So basically I need to replace the subjective argument for removing the outliers with a statistical one!
Can I use the equation y = ax + b (or y = mx + c, for engineers) to compare the in-situ pH and the lab-determined pH, given that they aren't really related to each other by y = mx + c?
The intercept was set at zero (the origin).
The line of best fit fell pretty much on a gradient of 1. However, I wish to use the gradient and the R-squared value to illustrate the errors between the two; the points were scattered on both sides of the line, so there is no real bias.
Dan :)
 
Unless you have a REALLY good explanation for why the outliers exist, DO NOT REMOVE THEM!
 
A least-squares line of best fit minimises the sum of the squared vertical distances of the points from the line, and generally you will find roughly the same number of points on either side of it, hence no apparent bias. The 95% CI lines will give you a measure of the error, in that they will say you are 95% confident that the in-situ values are the same as the lab values +/- some amount. This error will be greater at the extreme values of the data, as there is less data there to compare, and hence your confidence in the results will be lower.
 
In reply to Dandalf's second post (4 Sep 03), it all depends what he is trying to do. The traditional use of regression stresses the conceptual relationship between the two variables. X is called the "explanatory variable" and Y is called the "dependent variable": this terminology says it all, and is why it is the sum of the squared Y-"errors" that is minimised.

If Dandalf wants to be able to predict his in-situ pH from his lab-measured pH, then the former is Y and the latter is X. If all he wants to do is see how well the measurement methods compare, then he is not positing a functional relationship, and so perhaps he would be better calculating a correlation coefficient rather than establishing a regression relationship.

Almost as a passing comment, Dandalf states that he forced his intercept to be zero. When I was framing my first post I thought about that issue. If he is trying to establish a functional relationship between the two measurements, then he should NOT force the intercept to be zero. We all know that it should be zero in an ideal world, but that is not the point: he is trying (say) to establish a systematic "error" in the use of the in-situ test as a substitute for a full lab test, and there is absolutely no reason to assume that this error will be so considerate as to have a zero intercept. (If you consider that mandating a zero intercept is acceptable because the ideal relationship would have a zero intercept, then you would have to accept also that a slope of exactly unity should be mandated for similar reasons. Then you would have totally castrated your regression.)
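The zero-intercept point above is easy to see numerically. A quick Python sketch with invented, noise-free data carrying a constant 0.15 pH offset: a free intercept recovers the offset cleanly, while a through-the-origin fit smears it into the slope and hides it.

```python
# Sketch: forcing b = 0 hides a constant offset between the two
# measurements by distorting the slope. Data are invented.
import numpy as np

lab = np.linspace(6.0, 8.5, 15)
insitu = lab + 0.15              # constant 0.15 pH systematic offset

# Free intercept: recovers slope ~1 and the 0.15 offset.
a_free, b_free = np.polyfit(lab, insitu, 1)

# Through-the-origin least squares (intercept forced to zero):
# the offset is absorbed into a slope greater than 1.
a_zero = np.sum(lab * insitu) / np.sum(lab ** 2)

print(round(a_free, 3), round(b_free, 3))   # ~1.0 and ~0.15
print(round(a_zero, 3))                     # > 1, offset hidden
```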
 
Hello Dandalf,
The correct method for comparing two sets of values that should be the same on a point-by-point basis is with a paired T-test. This can be accomplished in Excel through Tools-->Data Analysis. Comparing the two columns on an X-Y plot is very useful for flagging outliers, and for developing a calibration curve once you have proven that the values are not statistically different.
Regards,
Joe
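The paired test Joe describes can also be run outside Excel; here is a minimal SciPy sketch with made-up paired readings (no systematic bias built in):

```python
# Sketch: paired t-test on in-situ vs lab pH values, the same test
# Excel offers under Tools --> Data Analysis. Data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lab = rng.uniform(6.0, 8.5, 25)                   # lab-measured pH
insitu = lab + rng.normal(0.0, 0.08, lab.size)    # paired in-situ pH

t_stat, p_value = stats.ttest_rel(insitu, lab)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A large p-value means no evidence of a systematic difference
# between the paired in-situ and lab determinations.
```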
 