Correlation and Significance

Dave442 · Apr 2, 2013

Hi all,

I have not studied statistics in a long time and could use some advice. I have carried out six numerical analyses to investigate the deformation of six similar devices. Now I wish to determine whether there is a relationship between the geometrical properties of the device, such as its width/thickness, and the stress that is calculated within the device.

I have plotted the device thickness against the maximum stress predicted within the vessel and noticed a linear(ish) relationship. As my sample size is quite small I am not sure which correlation coefficient I should use to quantify the correlation between the two variables. If I have done the maths correctly, I get r = 0.88 and p < 0.05 using Pearson's coefficient and r = 0.97 and p < 0.002 using Spearman's coefficient.

Based upon what I have read I am more inclined to use Spearman's coefficient but I am not very confident.

Any advice would be appreciated!
Dave
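
For anyone wanting to reproduce this kind of calculation, here is a minimal sketch in plain Python. The six data points are hypothetical placeholders, not the poster's results, and the p-value comes from an exact permutation test, which is an assumption of this sketch and not necessarily how the figures quoted above were obtained. With six untied points there are only 6! = 720 possible pairings, so the smallest two-sided p-value such a test can return is 2/720 (about 0.0028).

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def exact_p_value(stat, xs, ys):
    """Two-sided exact permutation p-value: the share of all pairings of ys
    with xs whose |statistic| is at least as large as the observed one."""
    observed = abs(stat(xs, ys))
    perms = list(permutations(ys))
    hits = sum(abs(stat(xs, list(p))) >= observed - 1e-12 for p in perms)
    return hits / len(perms)


# Hypothetical values for illustration only
thickness = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
max_stress = [245.0, 250.0, 330.0, 340.0, 395.0, 410.0]

for name, stat in (("Pearson", pearson), ("Spearman", spearman)):
    r = stat(thickness, max_stress)
    p = exact_p_value(stat, thickness, max_stress)
    print(f"{name}: r = {r:.3f}, exact two-sided p = {p:.4f}")
```

Enumerating all pairings is only practical for very small samples like this one; for larger samples a library routine or a random subset of permutations would be the usual choice.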

IRstuff · Apr 2, 2013

The coefficients, as described, are not relevant to sample size, per se. The Spearman's coefficient is telling you that there IS a positive correlation between the independent and dependent variables. The Pearson's coefficient is telling you that it's NOT a linear relationship. However, a sufficiently noisy dataset can degrade or obscure the results.

http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

TTFN
faq731-376

Need help writing a question or understanding a reply? forum1529

Dave442 · Apr 2, 2013

Hi IRstuff,

Thanks for the quick reply. I understand the coefficients are designed to measure the strength of the linear/monotonic correlation. I'm just not sure when one should be employed over the other. In a number of articles it is stated that Pearson's coefficient assumes bivariate normal distribution (or something relatively close to bivariate normal distribution). This can be hard to verify when the sample size is small.

As such, should the Spearman coefficient be employed to be safe?

Also, you state that my Pearson coefficient implies that there is not a linear relationship. Wouldn't this depend upon the choice of statistical significance? i.e., assuming a 5% level of significance, wouldn't my coefficient indicate a strong, statistically-significant linear correlation?

Thanks for the advice,
Dave

IRstuff · Apr 2, 2013

Both assume large distributions, so I don't see that as a discriminator.

They are not mutually exclusive, i.e., you do not "use" one instead of the other, since they are not equivalent measures. A Spearman score of 1 only tells you that the variables are monotonically correlated, and tells you absolutely nothing about the shape of the curve. A Pearson score of 1 tells you explicitly that you have a straight-line correlation.

Your specific example is telling you that while your variables are correlated, they are NOT obviously "linear(ish)." This could mean that there's too much noise, or that the relationship only vaguely resembles a linear one.

TTFN
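
The distinction between the two measures can be checked numerically. The sketch below uses made-up points on an exponential curve (an arbitrary choice for illustration) to show a case where the ranks agree perfectly, so Spearman is 1, while Pearson stays below 1 because the points do not lie on a straight line.

```python
from math import exp, sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simple_ranks(values):
    """1-based ranks, assuming there are no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Made-up monotonic but strongly curved relationship
x = [1, 2, 3, 4, 5, 6]
y = [exp(v) for v in x]

r_pearson = pearson(x, y)
r_spearman = pearson(simple_ranks(x), simple_ranks(y))

print(f"Pearson  r = {r_pearson:.3f}")
print(f"Spearman r = {r_spearman:.3f}")
```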

Dave442 · Apr 2, 2013

Ok, maybe I am asking the wrong question..

I understand the Pearson and Spearman coefficients are designed to measure different correlations (linear and monotonic, respectively). As I have a small sample size, however, I do not know whether my data is parametric (i.e. normal distribution) or non-parametric (i.e. not normal distribution). The Pearson coefficient assumes the data is parametric whilst the Spearman coefficient assumes the data is non-parametric.

I am trying to figure out whether I should assume my distribution is parametric and use the Pearson coefficient to analyse the correlation, or play it safe, assume my distribution is non-parametric and use the Spearman coefficient to analyse the correlation.

Also, I don't understand how Pearson's r = 0.882 does not indicate a strong linear correlation. A value of 0 indicates no linear correlation and a value of +1 or -1 indicates a perfect positive or perfect inverse linear correlation. Shouldn't a value of 0.88 thus indicate a strong positive linear correlation?

Thanks again,
Dave

IRstuff · Apr 2, 2013

"Shouldn't a value of 0.88 thus indicate a strong positive linear correlation?"

No, a power function, x^2.535, will produce a correlation coefficient of 0.882, and x² produces a correlation coefficient of 0.94.

As for Spearman's, the Wiki article on the subject shows similar examples of clearly non-linear distributions that produce very high (>0.9) coefficients.

TTFN
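
The Pearson coefficient obtained for a curve like y = x^p depends on the exponent and also on which x values are sampled, so figures like those quoted above belong to one particular set of points. A rough sketch for exploring this follows; the two x ranges are arbitrary assumptions and the helper name is made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pearson_for_power(exponent, xs):
    """Pearson r between x and x**exponent over the given sample points."""
    return pearson(xs, [x ** exponent for x in xs])


# Two arbitrary sampling ranges, six points each
narrow = [1, 2, 3, 4, 5, 6]
wide = [1, 20, 40, 60, 80, 100]

for p in (1.0, 2.0, 2.535, 4.0):
    print(f"p = {p}: r = {pearson_for_power(p, narrow):.3f} (narrow), "
          f"{pearson_for_power(p, wide):.3f} (wide)")
```

Even a clearly curved relationship can return a high r, which is why plotting the data alongside the coefficient is worthwhile.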

Dave442 · Apr 2, 2013

Ok,

The Pearson correlation coefficient requires that the underlying relationship is assumed to be linear. Having plotted the variables against each other and noted a distinct linear correlation I have adopted Pearson's correlation to quantify the linear correlation. Based on these assumptions, my Pearson r = 0.88 indicates a strong positive linear correlation.

I have also looked into the alternative scenario of quantifying the monotonic correlation between my variables using the Spearman coefficient. The Spearman coefficient only assumes that the relationship is monotonic so I don't see any problem with nonlinear distributions producing high correlation coefficients. My Spearman r = 0.97 indicates a strong positive monotonic correlation.

My problem is that I do not know if my data is parametric (normally-distributed) or non-parametric (not normally distributed). As such, I am looking for advice on whether to go with the Pearson coefficient (which assumes that the data is parametric) or the Spearman coefficient (which does not assume that the data is parametric)?

I am thinking the Spearman coefficient is the safer bet?

Thanks,
Dave

IRstuff · Apr 2, 2013

Not having seen your data, I can't really comment, except to repeat that they are not mutually exclusive, i.e., you cannot say the Spearman is "better" or "safer" because they don't measure the same things. I suggest you review the Wiki article on Spearman, and note the following graph, which has Spearman = 1, and Pearson at 0.88, BUT THE DATA IS NOT A STRAIGHT LINE AT ALL:

You continue to attempt, in my opinion, to squeeze a square peg down the round hole of "linear fit." Again, not seeing your data, I don't know that's what you're doing, obviously, but the fact that you seem to want a coefficient that supports your conclusion, rather than finding something that demonstrates that the data is indeed linear, raises all sorts of warning flags. Your quest, in my view, should be to find a fitting function that minimizes the mean square error (MSE), and that is not a brute-force spline or polynomial fit that would set the MSE to 0. Perhaps you need to revisit the actual physics of your problem and look at the equations from Roark's.

TTFN

Dave442 · Apr 2, 2013

Sorry IRStuff, I think I am giving you the wrong idea.

I am not trying to prove that there is a linear dependence between my two variables. I simply want to comment on the fact that there appears to be a general dependence between the two variables and to quantify the strength of this dependence.

Having plotted the data, the dependence between the two variables appears to be linear. Unfortunately, I am a bit concerned over the normality of my data (i.e. whether it is parametric or non-parametric). As I see it I now have two options:

1) I can assume that my data is parametric and that the dependence between the two variables is linear. If I make these assumptions I may quantify the strength of the assumed linear dependence using the Pearson coefficient (a parametric measure of linear dependence).

2) I can assume that my data is non-parametric and that the dependence between the two variables is monotonic. If I make these assumptions I may quantify the strength of the assumed monotonic dependence using the Spearman coefficient (a non-parametric measure of monotonic dependence).

I am not asking which coefficient is better or safer. I am asking whether, in my situation, it would be better/safer to simply assume that the data is non-parametric and to adopt option number 2 above.

Sorry if I am not communicating my problem well, I appreciate your time!
Dave

IRstuff · Apr 2, 2013

Again, regardless of the parametric or non-parametric-ness, they don't measure the same things, so I don't understand why you insist on comparing an apple to an orange; they just aren't equivalent. Again, you do not show what the data looks like, so I only have your assessment. If the data were truly linear, the Pearson's coefficient would be substantially higher.

TTFN

Dave442 · Apr 3, 2013

Hi IRStuff,

I am not trying to measure the same thing using the two different coefficients. I am asking, since I don't know whether my data is parametric or non-parametric, if I should make one assumption (data=parametric / dependence=linear) and measure linear dependence or make a completely different assumption (data=non-parametric / dependence=monotonic) and, instead, measure the monotonic dependence. In either case I'm measuring two different things - the strength of either the linear or the monotonic dependence.

My problem is I don't know if I can make the assumption that allows me to measure the linear dependence as I don't know if my data is parametric.

I found an article that suggests that you can evaluate the normality of the data by comparing the skewness of the variables to the standard error. If the skewness is less than twice the standard error it is reasonable to assume that the data is parametric. Otherwise, it is likely that your data is not normally distributed and, as such, it should be assumed that your data is non-parametric (i.e. you can't use Pearson's parametric coefficient to measure linear dependence and should instead use Spearman's non-parametric coefficient to measure monotonic dependence).

Here is the link:

http://www.statstutor.ac.uk/resources/uploaded/spearmans.pdf

I am no expert so I am not sure if this is reliable?
Dave

IRstuff · Apr 3, 2013

It seems to me that you are somewhat misusing the term "nonparametric." Nonparametric means that the method applies regardless of the underlying probability distribution. So, unless you have some insight or data (which still has not been shown) to the contrary, there is no reason not to assume Gaussian statistics. The only question you should be concerning yourself with is whether the data really fits a line or not. For that question, NEITHER Pearson NOR Spearman is giving you a good answer, because both are giving you less than absolute answers. The only approach, barring futzing with second or higher moments, and simple eyeballing, is to compare the MSE of the linear fit with something like a second-order polynomial fit.

To further clarify, neither measure is telling you that you have a proper fitting function, so the question of parametric vs. nonparametric is completely moot.

TTFN
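
The comparison of a straight-line fit with a second-order polynomial fit can be sketched as follows, in plain Python with the least-squares coefficients found from the normal equations. The data points are hypothetical and the function names are made up for this example. Note that on the same data the quadratic fit can never have a larger MSE than the linear one, so what matters is whether the drop is large relative to the noise, not merely that it drops.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first,
    found by solving the normal equations with Gauss-Jordan elimination."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    aug = [
        [float(sum(x ** (i + j) for x in xs)) for j in range(m)]
        + [float(sum(y * x ** i for x, y in zip(xs, ys)))]
        for i in range(m)
    ]
    for col in range(m):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(m):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def mean_squared_error(xs, ys, coeffs):
    """Mean squared residual of the polynomial with the given coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)


# Hypothetical values for illustration only
thickness = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
max_stress = [245.0, 250.0, 330.0, 340.0, 395.0, 410.0]

for degree in (1, 2):
    coeffs = fit_polynomial(thickness, max_stress, degree)
    mse = mean_squared_error(thickness, max_stress, coeffs)
    print(f"degree {degree}: MSE = {mse:.2f}")
```

With only six points, a polynomial of degree five would pass through every point and give an MSE of zero, which is the brute-force fit warned against earlier in the thread.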

Dave442 · Apr 4, 2013

I am not trying to identify a perfect linear/monotonic dependence or identify a perfectly fitting function.
I just want to comment on the strength of the perceived linear/monotonic dependence.

You state that my data does not fit a line as the Pearson's coefficient is not an absolute value. To use Pearson's coefficient I must already assume that there is a linear dependence. The coefficient only measures the strength of the assumed linear dependence. If I assume that there is a linear dependence between my variables - Pearson's r = 0.88 implies that this assumed linear dependence is strong (not perfect).

You also state that my data does not fit a line as the Spearman's coefficient is not an absolute value. To use Spearman's coefficient I only assume that there is a monotonic dependence. The coefficient only measures the strength of the assumed monotonic dependence. If I assume that there is a monotonic dependence between my variables - Spearman's r = 0.97 implies that this assumed monotonic dependence is very strong (not perfect).

Following the guidelines given in the article that I linked (comparing the skewness of my variables to the standard error) I believe that my data may be skewed (non-parametric). As such, there is good reason not to assume Gaussian statistics.

As I believe some of my data may be skewed (non-parametric), I have decided to employ the non-parametric Spearman rank correlation coefficient to measure the strength of the perceived/assumed monotonic dependence between my variables.

IRstuff · Apr 4, 2013

OK, but I would again argue that a monotonicity measure is totally irrelevant and meaningless, since it is almost a given that whatever you are fitting is almost certain to be monotonic, unless you are just plotting random variables. So the fact that you have Spearman at 0.97 is irrelevant, particularly in the case of plotting stress vs. thickness. How could the relationship be OTHER than monotonic? If you had Spearman = 1.00, it tells you absolutely nothing new; you already know that there MUST be a plausibly monotonic relationship between stress and thickness. You can look up the stress equations in Roark, and have 100% confidence that the relationship is monotonic, so Spearman = 1.0 is zero new information, and therefore irrelevant. The only possible new information would be if you managed to collect data where the structure yielded, in which case all bets are off anyway.

TTFN

Dave442 · Apr 4, 2013

I gave the example of thickness and stress in my original post as it was easy to visualise.

I have identified similar dependencies between the geometrical properties of the device and my other variables of interest. The other variables of interest describe the impact of the device upon its surroundings and these are what I am primarily interested in. It is not possible to intuitively identify dependencies between the geometric properties of the device and these other variables. As such, a strong positive/negative monotonic correlation is not irrelevant and may eventually prove useful in optimising the performance of the device.

My problem was that, because I only have a small sample size, I could not tell whether my variables were parametric (normally-distributed) or non-parametric (not normally distributed) by inspection. I was asking advice on whether to assume that the data was parametric or not. As it turns out you can determine whether your data is parametric or not by calculating the skewness and kurtosis ratio as follows:

Skewness ratio = skewness / standard error of skewness
Kurtosis ratio = kurtosis / standard error of kurtosis

If the magnitude of both the skewness ratio and the kurtosis ratio is less than 2, it is safe to assume that your data is parametric (normally distributed). Conversely, if the magnitude of either the skewness ratio or the kurtosis ratio is greater than 2, you should assume that your data is non-parametric (not normally distributed). You can then decide whether or not to use a parametric correlation coefficient.

With a small sample size this is still not really ideal and I will bear that in mind when interpreting any strong correlations but I think it is the best solution to my problem? Sorry again if I wasn't clearly communicating my problem. I really do appreciate you taking the time to respond.

Many thanks,
Dave
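
The two ratios described above can be computed along these lines. This is a sketch under stated assumptions: it uses the bias-corrected sample skewness and excess kurtosis together with the usual large-sample expressions for their standard errors, which may differ from what the linked article uses, and the function names and data are hypothetical. With six points the standard errors are large, so the check has very little power to detect non-normality.

```python
from math import sqrt


def standard_errors(n):
    """Standard errors of sample skewness and excess kurtosis for sample size n."""
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return se_skew, se_kurt


def skewness_and_kurtosis_ratios(values):
    """Return (skewness ratio, kurtosis ratio) for a sample of at least 4 values."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std dev
    z3 = sum(((v - mean) / s) ** 3 for v in values)
    z4 = sum(((v - mean) / s) ** 4 for v in values)
    # Bias-corrected sample skewness and excess kurtosis
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew, se_kurt = standard_errors(n)
    return skew / se_skew, kurt / se_kurt


# Hypothetical values for illustration only
max_stress = [245.0, 250.0, 330.0, 340.0, 395.0, 410.0]
skew_ratio, kurt_ratio = skewness_and_kurtosis_ratios(max_stress)
print(f"skewness ratio = {skew_ratio:.2f}, kurtosis ratio = {kurt_ratio:.2f}")
print("both within +/-2:", abs(skew_ratio) < 2 and abs(kurt_ratio) < 2)
```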

IRstuff · Apr 4, 2013

Again, I think you are obsessing over a minor part of your system. Whether your data is parametric or not is irrelevant compared to whether you have the correct dependency modeled, i.e., it is more important to know whether your data follows a linear or a power relationship than to know whether the noise is parametric or not. You seem to be hung up on this rather than on determining whether your regression model is even correct. I will again point out that choosing the wrong regression model will result in erroneous correlation coefficients that swamp whatever effects there might be from having a nonparametric distribution.

TTFN
