Proving the equivalence of the following two formulas for Spearman correlation


14

From Wikipedia, Spearman's rank correlation is calculated by converting the variables $X_i$ and $Y_i$ into ranked variables $x_i$ and $y_i$, and then calculating the Pearson correlation between the ranked variables:

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

However, the article goes on to state that if there are no ties among the variables $X_i$ and $Y_i$, the above formula is equivalent to

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i = y_i - x_i$ is the difference in ranks.

Can someone give a proof of this, please? I don't have access to the textbooks referenced by the Wikipedia article.

Answers:


14

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

Since there are no ties, the $x$'s and $y$'s both consist of the integers from $1$ to $n$ inclusive.

Hence we can rewrite the denominator:

$$\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

But the denominator is just a function of $n$:

$$\begin{aligned}
\sum_i (x_i - \bar{x})^2 &= \sum_i x_i^2 - n\bar{x}^2 \\
&= \frac{n(n+1)(2n+1)}{6} - n\left(\frac{n+1}{2}\right)^2 \\
&= n(n+1)\left(\frac{2n+1}{6} - \frac{n+1}{4}\right) \\
&= n(n+1)\left(\frac{8n+4-6n-6}{24}\right) \\
&= n(n+1)\left(\frac{n-1}{12}\right) \\
&= \frac{n(n^2-1)}{12}
\end{aligned}$$
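As a quick sanity check on the denominator, the following minimal sketch compares the direct sum of squared deviations of the ranks $1, \dots, n$ with the closed form $n(n^2-1)/12$, using exact rational arithmetic. The helper names `sxx_direct` and `sxx_closed` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction


def sxx_direct(n):
    """Sum of squared deviations of the ranks 1..n from their mean, computed exactly."""
    mean = Fraction(n + 1, 2)
    return sum((Fraction(k) - mean) ** 2 for k in range(1, n + 1))


def sxx_closed(n):
    """Closed form n(n^2 - 1)/12 from the derivation."""
    return Fraction(n * (n * n - 1), 12)


# The two expressions agree for every n tried here.
for n in range(1, 11):
    assert sxx_direct(n) == sxx_closed(n)
```

A check over a finite range is of course not a proof; it only guards against slips in the algebra.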

Now let's look at the numerator:

$$\begin{aligned}
\sum_i (x_i - \bar{x})(y_i - \bar{y}) &= \sum_i x_i (y_i - \bar{y}) - \sum_i \bar{x}(y_i - \bar{y}) \\
&= \sum_i x_i y_i - \bar{y}\sum_i x_i - \bar{x}\sum_i y_i + n\bar{x}\bar{y} \\
&= \sum_i x_i y_i - n\bar{x}\bar{y} \\
&= \sum_i x_i y_i - n\left(\frac{n+1}{2}\right)^2 \\
&= \sum_i x_i y_i - \frac{n(n+1)}{12}\,3(n+1) \\
&= \frac{n(n+1)}{12}\cdot\bigl(-3(n+1)\bigr) + \sum_i x_i y_i \\
&= \frac{n(n+1)}{12}\cdot\bigl[(n-1) - (4n+2)\bigr] + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \frac{n(n+1)(2n+1)}{6} + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i x_i^2 + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i \frac{x_i^2 + y_i^2}{2} + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i \frac{x_i^2 - 2x_i y_i + y_i^2}{2} \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i \frac{(x_i - y_i)^2}{2} \\
&= \frac{n(n^2-1)}{12} - \sum_i \frac{d_i^2}{2}
\end{aligned}$$

Numerator/Denominator

$$\begin{aligned}
&= \frac{n(n+1)(n-1)/12 - \sum_i d_i^2/2}{n(n^2-1)/12} \\
&= \frac{n(n^2-1)/12 - \sum_i d_i^2/2}{n(n^2-1)/12} \\
&= 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}.
\end{aligned}$$

Hence

$$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}.$$
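The result can also be checked numerically. The sketch below, which is only an illustration and not part of the proof, computes the Pearson correlation of two tie-free rank vectors and compares it with the shortcut formula on random permutations. The function names `pearson` and `spearman_shortcut` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_shortcut(x, y):
    """1 - 6 * sum(d_i^2) / (n (n^2 - 1)); valid only for tie-free ranks."""
    n = len(x)
    d2 = sum((b - a) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))


rng = random.Random(0)  # fixed seed so the check is reproducible
n = 10
ranks_x = list(range(1, n + 1))
ranks_y = list(range(1, n + 1))
for _ in range(100):
    rng.shuffle(ranks_x)
    rng.shuffle(ranks_y)
    assert math.isclose(
        pearson(ranks_x, ranks_y),
        spearman_shortcut(ranks_x, ranks_y),
        abs_tol=1e-12,
    )
```

With tied ranks (for example midranks such as `[1, 2.5, 2.5, 4]`) the two functions generally disagree, which is why the no-ties condition matters.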


5
You could eliminate the last 80% of this work by starting with the observation that $\rho$ is invariant under location and scale changes, thereby reducing the problem to expressing $\sum x_i y_i$ in terms of $\sum (x_i - y_i)^2$ when $\sum x_i^2 = \sum y_i^2 = 1$; the formula obviously is $\frac{1}{2}\sum d_i^2 = \frac{1}{2}\sum (x_i - y_i)^2 = 1 - \sum x_i y_i$. Then the only real work to be done is accomplished by your calculation of the denominator.
whuber

@whuber +1, that's a good bit neater. But I think I'll leave it in the longer, less neat, bull-at-a-gate form.
Glen_b -Reinstate Monica

thanks, both answers are good but I have accepted this one as it is the one I started attempting myself.
Alex

I should explain my reasons for going the more prosaic route -- the other answers are neat, illuminating and clever, but require insights that are unlikely to be generated by any but the better students on their own. The advantage of showing it's entirely amenable to straightforward if uninspired manipulation is that it should be within the grasp of even the moderately able if uninspired-to-insight student. Sometimes knowing you don't need any insightful tricks is helpful (to those who don't see them).
Glen_b -Reinstate Monica

I guess it depends on your view of what constitutes a "trick," "manipulation," and "insight." Long batteries of involved algebraic calculations, as you intimate, provide little or no insight (as well as offering many opportunities for mistakes)--and I fear that students may view them as being formidable for their very bulk alone, as well as unmotivated. Other operations, such as a preliminary standardization (which is so helpful here), may initially be viewed as "tricks" but after a few applications should come to be seen as insightful and fundamental tools.
whuber

10

We see that in the second formula there appears the squared Euclidean distance between the two (ranked) variables: $D^2 = \Sigma d_i^2$. The decisive intuition at the start will be how $D^2$ might be related to $r$. It is clearly related via the cosine theorem. If we have the two variables centered, then the cosine in the theorem's formula is equal to $r$ (it can be easily proved; we'll take it here as granted). And $h^2$ (the squared Euclidean norm) is $N\sigma^2$, the sum of squares in a centered variable. So the theorem's formula looks like this: $D_{xy}^2 = N\sigma_x^2 + N\sigma_y^2 - 2N\sigma_x\sigma_y r$. Please note also another important thing (which might have to be proved separately): when data are ranks, $D^2$ is the same for centered and not centered data.

Further, since the two variables were ranked, their variances are the same, $\sigma_x = \sigma_y = \sigma$, so $D^2 = 2N\sigma^2 - 2N\sigma^2 r$.

$r = 1 - \frac{D^2}{2N\sigma^2}$. Recall that ranked data are from a discrete uniform distribution having variance $(N^2-1)/12$. Substituting it into the formula leaves $r = 1 - \frac{6D^2}{N(N^2-1)}$.
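The two facts this argument relies on can be checked on a small example. The sketch below is a hedged illustration: it assumes both inputs are permutations of $1, \dots, N$ (so they share one mean and one variance), uses the population variance (dividing by $N$), and the function name `cosine_relation` is hypothetical.

```python
import math
import random


def cosine_relation(x, y):
    """Return (D^2 on raw ranks, D^2 on centered ranks, 2*N*var*(1 - r)).

    Assumes x and y are both permutations of 1..N, so they share the
    same mean and the same (population) variance.
    """
    n = len(x)
    mean = (n + 1) / 2
    xc = [a - mean for a in x]
    yc = [b - mean for b in y]
    d2_raw = sum((a - b) ** 2 for a, b in zip(x, y))
    d2_centered = sum((a - b) ** 2 for a, b in zip(xc, yc))
    var = sum(a * a for a in xc) / n  # population variance of the ranks
    r = sum(a * b for a, b in zip(xc, yc)) / (n * var)
    return d2_raw, d2_centered, 2 * n * var * (1 - r)


rng = random.Random(1)  # fixed seed for reproducibility
N = 8
x = list(range(1, N + 1))
y = x[:]
rng.shuffle(y)

d2_raw, d2_centered, rhs = cosine_relation(x, y)
assert math.isclose(d2_raw, d2_centered, abs_tol=1e-9)  # centering leaves D^2 unchanged
assert math.isclose(d2_raw, rhs, abs_tol=1e-9)          # D^2 = 2 N sigma^2 (1 - r)
```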


8

The algebra is simpler than it might first appear.

IMHO, there is little profit or insight achieved by belaboring the algebraic manipulations. Instead, a truly simple identity shows why squared differences can be used to express (the usual Pearson) correlation coefficient. Applying this to the special case where the data are ranks produces the result. It exhibits the heretofore mysterious coefficient

$$\frac{6}{n(n^2-1)}$$

as being half the reciprocal of the variance of the ranks $1, 2, \ldots, n$. (When ties are present, this coefficient acquires a more complicated formula, but will still be one-half the reciprocal of the variance of the ranks assigned to the data.)

Once you have seen and understood this, the formula becomes memorable. Comparable (but more complex) formulas that handle ties, show up in nonparametric statistical tests like the Wilcoxon rank sum test, or appear in spatial statistics (like Moran's I, Geary's C, and others) become instantly understandable.


Consider any set of paired data $(X_i, Y_i)$ with means $\bar{X}$ and $\bar{Y}$ and variances $s_X^2$ and $s_Y^2$. By recentering the variables at their means $\bar{X}$ and $\bar{Y}$ and using their standard deviations $s_X$ and $s_Y$ as units of measurement, the data will be re-expressed in terms of the standardized values

$$(x_i, y_i) = \left(\frac{X_i - \bar{X}}{s_X}, \frac{Y_i - \bar{Y}}{s_Y}\right).$$

By definition, the Pearson correlation coefficient of the original data is the average product of the standardized values,

$$\rho = \frac{1}{n}\sum_{i=1}^n x_i y_i.$$

The Polarization Identity relates products to squares. For two numbers $x$ and $y$ it asserts

$$xy = \frac{1}{2}\left(x^2 + y^2 - (x-y)^2\right),$$

which is easily verified. Applying this to each term in the sum gives

$$\rho = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(x_i^2 + y_i^2 - (x_i - y_i)^2\right).$$

Because the $x_i$ and $y_i$ have been standardized, their average squares are both unity, whence

$$\rho = \frac{1}{2}\left(1 + 1 - \frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2\right) = 1 - \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2\right). \tag{1}$$

The correlation coefficient differs from its maximum possible value, $1$, by one-half the mean squared difference of the standardized data.

This is a universal formula for correlation, valid no matter what the original data were (provided only that both variables have nonzero standard deviations). (Faithful readers of this site will recognize this as being closely related to the geometric characterization of covariance described and illustrated at How would you explain covariance to someone who understands only the mean?.)
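Because this formula is claimed for arbitrary data, not just ranks, it is easy to spot-check. The following sketch standardizes two sequences with the population standard deviation (dividing by $n$, which is what makes the average squares equal to one) and compares the mean-product definition with the squared-difference form. The function names are hypothetical and chosen only for this illustration.

```python
import math
import random


def standardize(values):
    """Center at the mean and scale by the population SD (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def corr_mean_product(X, Y):
    """Pearson correlation as the average product of standardized values."""
    x, y = standardize(X), standardize(Y)
    return sum(a * b for a, b in zip(x, y)) / len(x)


def corr_squared_diff(X, Y):
    """One minus half the mean squared difference of standardized values."""
    x, y = standardize(X), standardize(Y)
    return 1 - 0.5 * sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


rng = random.Random(42)  # fixed seed for reproducibility
X = [rng.gauss(0, 1) for _ in range(50)]
Y = [0.5 * a + rng.gauss(0, 1) for a in X]
assert math.isclose(corr_mean_product(X, Y), corr_squared_diff(X, Y), abs_tol=1e-12)
```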


In the special case where the $X_i$ and $Y_i$ are distinct ranks, each is a permutation of the same sequence of numbers $1, 2, \ldots, n$. Thus $\bar{X} = \bar{Y} = (n+1)/2$ and, with a tiny bit of calculation, we find

$$s_X^2 = s_Y^2 = \frac{1}{n}\sum_{i=1}^n \left(i - (n+1)/2\right)^2 = \frac{n^2-1}{12}$$

(which, happily, is nonzero whenever $n > 1$). Therefore

$$(x_i - y_i)^2 = \frac{\left((X_i - (n+1)/2) - (Y_i - (n+1)/2)\right)^2}{(n^2-1)/12} = \frac{12(X_i - Y_i)^2}{n^2-1}.$$

This nice simplification occurred because the $X_i$ and $Y_i$ have the same means and standard deviations: the difference of their means therefore disappeared and the product $s_X s_Y$ became $s_X^2$, which involves no square roots.

Plugging this into the formula $(1)$ for $\rho$ gives

$$\rho = 1 - \frac{6}{n(n^2-1)}\sum_{i=1}^n (X_i - Y_i)^2.$$

2
(+1) The geometric interpretation in terms of your famous "rectangles for covariance" answer is very neat but I wonder if casual readers will see it - perhaps a sketch diagram might help (I was tempted to add one myself!). For the curious: the formula $r = 1 - s_{x-y}^2/2$ is number 9 in the list of Thirteen Ways to Look at the Correlation Coefficient, by Joseph Lee Rodgers and W. Alan Nicewander in The American Statistician, Vol. 42, No. 1 (Feb., 1988), pp. 59-66. stat.berkeley.edu/~rabbee/correlation.pdf
Silverfish

2
@Silver Thank you for the helpful comments. The Rodgers and Nicewander article is summarized on our site at stats.stackexchange.com/a/104577. Someday I might draw the diagram you describe... .
whuber

5

High school students may see the PMCC and Spearman correlation formulae years before they have the algebra skills to manipulate sigma notation, though they may well know the method of finite differences for deducing the polynomial equation for a sequence. So I have tried to write a "high school proof" for the equivalence: finding the denominator using finite differences, and minimising the algebraic manipulation of sums in the numerator. Depending on the students the proof is presented to, you may prefer this approach to the numerator, but combine it with a more conventional method for the denominator.

Denominator, $\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$

With no ties, the data are the ranks $\{1, 2, \ldots, n\}$ in some order, so it is easy to show $\bar{x} = \frac{n+1}{2}$. We can reorder the sum $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2$, though with lower grade students I'd likely write this sum out explicitly rather than in sigma notation. The sum of a quadratic in $k$ will be cubic in $n$, a fact that students familiar with the finite difference method may grasp intuitively: differencing a cubic produces a quadratic, so summing a quadratic produces a cubic. Determining the coefficients of the cubic $f(n)$ is straightforward if students are comfortable manipulating $\Sigma$ notation and know (and remember!) the formulae for $\sum_{k=1}^n k$ and $\sum_{k=1}^n k^2$. But they can also be deduced using finite differences, as follows.

When $n = 1$, the data set is just $\{1\}$, $\bar{x} = 1$, so $f(1) = (1-1)^2 = 0$.

For $n = 2$, the data are $\{1, 2\}$, $\bar{x} = 1.5$, so $f(2) = (1-1.5)^2 + (2-1.5)^2 = 0.5$.

For $n = 3$, the data are $\{1, 2, 3\}$, $\bar{x} = 2$, so $f(3) = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2$.

These computations are fairly brief, and help reinforce what the notation $\sum_{i=1}^n (x_i - \bar{x})^2$ means, and in short order we produce the finite difference table.

[Figure: finite difference table for $S_{xx}$]
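Since the original table is an image, here is a minimal sketch that regenerates a finite difference table for $S_{xx}$ from the definition, using exact fractions. The helper names `sxx` and `difference_table` are hypothetical, and the layout (one printed row per order of difference) is an assumption about how one might display it, not a reproduction of the original figure.

```python
from fractions import Fraction


def sxx(n):
    """f(n): sum of squared deviations of the ranks 1..n from their mean."""
    mean = Fraction(n + 1, 2)
    return sum((Fraction(k) - mean) ** 2 for k in range(1, n + 1))


def difference_table(values):
    """Rows of successive differences; row 0 holds the values themselves."""
    rows = [list(values)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows


table = difference_table([sxx(n) for n in range(1, 6)])
for order, row in enumerate(table):
    print(order, [str(v) for v in row])

# The third differences are constant, as expected for a cubic.
assert table[3] == [Fraction(1, 2), Fraction(1, 2)]
```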

We can obtain the coefficients of $f(n)$ by cranking out the finite difference method as outlined in the links above. For instance, the constant third differences indicate our polynomial is indeed cubic, with leading coefficient $\frac{0.5}{3!} = \frac{1}{12}$. There are a few tricks to minimise drudgery: a well-known one is to use the common differences to extend the sequence back to $n = 0$, as knowing $f(0)$ immediately gives away the constant coefficient. Another is to try extending the sequence to see if $f(n)$ is zero for an integer $n$ - e.g. if the sequence had been positive but decreasing, it would be worth extending rightwards to see if we could "catch a root", as this makes factorisation easier later. In our case, the function seems to hover around low values when $n$ is small, so let's extend even further leftwards.

[Figure: extended finite difference table for $S_{xx}$]
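The leftward extension can be mechanised. The sketch below assumes the sequence comes from a polynomial whose highest-order differences are constant (here the third differences of a cubic) and walks the difference table backwards one step at a time. The function name `extend_left` is hypothetical.

```python
from fractions import Fraction


def extend_left(values, steps):
    """Prepend `steps` terms to a polynomial sequence.

    Assumes the last row of the difference table (a single entry when
    built from `values`) is the constant difference of the polynomial.
    """
    rows = [list(values)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    # Leftmost entry of each row: value, first difference, second difference, ...
    heads = [row[0] for row in rows]
    out = list(values)
    for _ in range(steps):
        # Work upwards from the constant row: each new leftmost entry is
        # the old one minus the new leftmost entry of the row below.
        for j in range(len(heads) - 2, -1, -1):
            heads[j] = heads[j] - heads[j + 1]
        out.insert(0, heads[0])
    return out


# f(1), f(2), f(3), f(4) for S_xx
f = [Fraction(0), Fraction(1, 2), Fraction(2), Fraction(5)]
extended = extend_left(f, 2)  # adds f(0) and f(-1) at the front
assert extended[:2] == [0, 0]
```

Together with $f(1) = 0$ from the direct computation, the two prepended zeros give the three roots used in the factorisation.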

Aha! It turns out we have caught all three roots: $f(-1) = f(0) = f(1) = 0$. So the polynomial has factors of $(n+1)$, $n$, and $(n-1)$. Since it was cubic it must be of the form:

$$f(n) = an(n+1)(n-1)$$

We can see that $a$ must be the coefficient of $n^3$, which we already determined to be $\frac{1}{12}$. Alternatively, since $f(2) = 0.5$ we have $a(2)(3)(1) = 0.5$, which leads to the same conclusion. Expanding the difference of two squares gives:

$$S_{xx} = \frac{n(n^2-1)}{12}$$

Since the same argument applies to $S_{yy}$, the denominator is $\sqrt{S_{xx}S_{yy}} = \sqrt{S_{xx}^2} = S_{xx}$ and we are done. Ignoring my exposition, this method is surprisingly short. If one can spot that the polynomial is cubic, it is necessary only to calculate $S_{xx}$ for the cases $n \in \{1, 2, 3, 4\}$ to establish that the third difference is 0.5. Root-hunters need only extend the sequence leftwards to $n = 0$ and $n = -1$, by when all three roots are found. It took me a couple of minutes to find $S_{xx}$ this way.

Numerator, $\sum_i (x_i - \bar{x})(y_i - \bar{y})$

I note the identity $(b-a)^2 \equiv b^2 - 2ab + a^2$, which can be rearranged to:

$$ab \equiv \frac{1}{2}\left(a^2 + b^2 - (b-a)^2\right)$$

If we let $a = x_i - \bar{x} = x_i - \frac{n+1}{2}$ and $b = y_i - \bar{y} = y_i - \frac{n+1}{2}$ we have the useful result that $b - a = y_i - x_i = d_i$ because the means, being identical, cancel out. That was my intuition for writing the identity in the first place; I wanted to switch from working with the product of the moments to the square of their differences. We now have:

$$(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{2}\left((x_i - \bar{x})^2 + (y_i - \bar{y})^2 - d_i^2\right)$$

Hopefully even students unsure how to manipulate $\Sigma$ notation can see how summing over the data set yields:

$$S_{xy} = \frac{1}{2}\left(S_{xx} + S_{yy} - \sum_{i=1}^n d_i^2\right)$$

We have already established, by reordering the sums, that $S_{yy} = S_{xx}$, leaving us with:

$$S_{xy} = S_{xx} - \frac{1}{2}\sum_{i=1}^n d_i^2$$
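For students who like to see an identity hold before trusting it, the relation between $S_{xy}$, $S_{xx}$ and $\sum d_i^2$ can be checked exhaustively for a small $n$. The sketch below runs over every permutation of the ranks $1, \dots, 5$ with exact fractions; the function name `sums` is a hypothetical helper, and the choice $n = 5$ is arbitrary.

```python
from fractions import Fraction
from itertools import permutations


def sums(x, y):
    """Return (S_xx, S_yy, S_xy, sum of d_i^2) for two rank sequences, exactly."""
    n = len(x)
    mx = Fraction(sum(x), n)
    my = Fraction(sum(y), n)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    d2 = sum((b - a) ** 2 for a, b in zip(x, y))
    return sxx, syy, sxy, d2


n = 5
x = tuple(range(1, n + 1))
for y in permutations(x):
    sxx, syy, sxy, d2 = sums(x, y)
    assert sxx == syy                          # same ranks, reordered
    assert sxy == (sxx + syy - d2) / 2         # summed identity
    assert sxy == sxx - Fraction(d2, 2)        # after using S_yy = S_xx
```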

The formula for Spearman's correlation coefficient is within our grasp!

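$$r_S = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{S_{xx} - \frac{1}{2}\sum_i d_i^2}{S_{xx}} = 1 - \frac{\sum_i d_i^2}{2S_{xx}}$$

Substituting the earlier result that $S_{xx} = \frac{1}{12}n(n^2-1)$ will finish the job.

$$r_S = 1 - \frac{\sum_i d_i^2}{2 \cdot \frac{1}{12}n(n^2-1)} = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}$$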

Licensed under cc by-sa 3.0 with attribution required.