Effect of switching response and explanatory variable in simple linear regression

48

Let's say that there exists some "true" relationship between $y$ and $x$ such that $y = ax + b + \epsilon$, where $a$ and $b$ are constants and $\epsilon$ is normal noise. When I randomly generate data from this R code: x <- 1:100; y <- a*x + b + rnorm(length(x)) and then fit a model like y ~ x, I obviously get reasonably good estimates for $a$ and $b$.

However, if I switch the roles of the variables as in (x ~ y), and then rewrite the result so that $y$ is a function of $x$, the resulting slope is always steeper (either more negative or more positive) than the one estimated by the y ~ x regression. I'm trying to understand exactly why that is, and I would appreciate it if anyone could give me an intuition as to what's going on there.
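The experiment described in the question can be sketched in a few lines of Python (standard library only). The constants `a = 2` and `b = 1`, the seed and the helper name `ols` are illustrative assumptions, not values from the question.

```python
import random


def ols(u, v):
    """Least-squares fit of v = slope * u + intercept; returns (slope, intercept)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    slope = suv / suu
    return slope, mv - slope * mu


random.seed(0)                 # arbitrary seed, for reproducibility only
a, b = 2.0, 1.0                # assumed "true" constants
x = [float(i) for i in range(1, 101)]
y = [a * xi + b + random.gauss(0, 1) for xi in x]

slope_yx, _ = ols(x, y)        # fit y ~ x
slope_xy, _ = ols(y, x)        # fit x ~ y
slope_inverted = 1 / slope_xy  # x ~ y rewritten so that y is a function of x

print(f"y ~ x slope:           {slope_yx:.4f}")
print(f"inverted x ~ y slope:  {slope_inverted:.4f}")
```

Whatever seed is used, the inverted slope should come out at least as steep as the y ~ x slope, which is the behaviour the answers below explain.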

regression

— Greg Aponte
source

1

That's not true in general. Perhaps you're just seeing that in your data. Paste this code: y = rnorm(10); x = rnorm(10); lm(y ~ x); lm(x ~ y); into R several times and you'll see that it goes both ways.

— Macro

That's a bit different from what I was describing. In your example, y wasn't a function of x at all, so there isn't really any "slope" (the "a" in my example).

— Greg Aponte

lm(y ~ x) fits the model $y = \beta_{0} + \beta_{1}x + \varepsilon$ by least squares (equivalent to ML estimation when the errors are normal). There is a slope.

— Macro

2

Your question is asked and answered (sort of) at stats.stackexchange.com/questions/13126 and at stats.stackexchange.com/questions/18434 . However, I believe nobody has yet contributed a simple, clear explanation of the relationships between (a) regression of $Y$ vs $X$, (b) regression of $X$ vs $Y$, (c) analysis of the correlation of $X$ and $Y$, (d) errors-in-variables regression of $X$ and $Y$, and (e) fitting a bivariate normal distribution to $(X,Y)$. This would be a good place for such an exposition :-).

— whuber

2

Of course Macro is correct: because x and y play equivalent roles in the question, which slope is more extreme is a matter of chance. However, geometry suggests (incorrectly) that when we reverse x and y in the regression, we should get the reciprocal of the original slope. That never happens except when x and y are linearly dependent. This question can be interpreted as asking why.

— whuber

23

Given $n$ data points $(x_i,y_i), i = 1,2,\ldots n$, in the plane, let us draw a straight line $y = ax+b$. If we predict $ax_i+b$ as the value $\hat{y}_i$ of $y_i$, then the error is $(y_i-\hat{y}_i) = (y_i-ax_i-b)$, the squared error is $(y_i-ax_i-b)^2$, and the total squared error is $\sum_{i=1}^n (y_i-ax_i-b)^2$. We ask

What choice of $a$ and $b$ minimizes $S =\displaystyle\sum_{i=1}^n (y_i-ax_i-b)^2$?

Since $(y_i-ax_i-b)$ is the vertical distance of $(x_i,y_i)$ from the straight line, we are asking for the line such that the sum of the squares of the vertical distances of the points from the line is as small as possible. Now, $S$ is a quadratic function of both $a$ and $b$ and attains its minimum value when $a$ and $b$ are such that

$\begin{align*} \frac{\partial S}{\partial a} &= 2\sum_{i=1}^n (y_i-ax_i-b)(-x_i) &= 0\\ \frac{\partial S}{\partial b} &= 2\sum_{i=1}^n (y_i-ax_i-b)(-1) &= 0 \end{align*}$

From the second equation, we get

$b = \frac{1}{n}\sum_{i=1}^n (y_i - ax_i) = \mu_y - a\mu_x$

where $\displaystyle \mu_y = \frac{1}{n}\sum_{i=1}^n y_i, ~ \mu_x = \frac{1}{n}\sum_{i=1}^n x_i$ are the arithmetic average values of the $y_i$'s and the $x_i$'s respectively. Substituting into the first equation, we get

$a = \frac{\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y}{ \left( \frac{1}{n}\sum_{i=1}^n x_i^2\right) -\mu_x^2}.$

Thus, the line that minimizes $S$ can be expressed as

$y = ax+b = \mu_y + \left(\frac{\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y}{ \left( \frac{1}{n}\sum_{i=1}^n x_i^2\right) -\mu_x^2}\right) (x - \mu_x),$

and the minimum value of $S$ is

$S_{\min} = n\,\frac{\left[\left(\frac{1}{n}\sum_{i=1}^n y_i^2\right) -\mu_y^2\right] \left[\left(\frac{1}{n}\sum_{i=1}^n x_i^2\right) -\mu_x^2\right] - \left[\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y\right]^2}{\left(\frac{1}{n}\sum_{i=1}^n x_i^2\right) -\mu_x^2}.$

The factor $n$ appears because $S$ is a sum over the $n$ points, while the bracketed quantities are averages.

If we interchange the roles of $x$ and $y$, draw a line $x = \hat{a}y + \hat{b}$, and ask for the values of $\hat{a}$ and $\hat{b}$ that minimize

$T = \sum_{i=1}^n (x_i - \hat{a}y_i - \hat{b})^2,$

that is, we want the line such that the sum of the squares of the horizontal distances of the points from the line is as small as possible, then we get

$x = \hat{a}y+\hat{b} = \mu_x + \left(\frac{\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y}{ \left( \frac{1}{n}\sum_{i=1}^n y_i^2\right) -\mu_y^2}\right) (y - \mu_y)$

and the minimum value of $T$ is

$T_{\min} = n\,\frac{\left[\left(\frac{1}{n}\sum_{i=1}^n y_i^2\right) -\mu_y^2\right] \left[\left(\frac{1}{n}\sum_{i=1}^n x_i^2\right) -\mu_x^2\right] - \left[\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y\right]^2}{\left(\frac{1}{n}\sum_{i=1}^n y_i^2\right) -\mu_y^2}.$

Note that both lines pass through the point $(\mu_x,\mu_y)$ but the slopes

$a = \frac{\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y}{ \left( \frac{1}{n}\sum_{i=1}^n x_i^2\right) -\mu_x^2},~~ \hat{a}^{-1} = \frac{ \left( \frac{1}{n}\sum_{i=1}^n y_i^2\right) -\mu_y^2}{\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y}$

are different in general. Indeed, as @whuber points out in a comment, the slopes are the same when all the points $(x_i,y_i)$ lie on the same straight line. To see this, note that

$\hat{a}^{-1} - a = \frac{S_{\min}}{n\left[\left(\frac{1}{n}\sum_{i=1}^n x_iy_i\right) -\mu_x\mu_y\right]} = 0 \Rightarrow S_{\min} = 0 \Rightarrow y_i=ax_i+b, i=1,2,\ldots, n.$
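As a numerical sanity check of the derivation, here is a small Python sketch (standard library only). The data are hypothetical, and the variable names `var_x`, `var_y` and `cov_xy` are my own shorthand for the averaged moments in the formulas. Because $S$ and $T$ are sums over the $n$ points while those moments are averages, the closed forms carry a factor $n$, and the slope gap is $S_{\min}$ divided by $n$ times the averaged cross moment.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only
n = 50
# Hypothetical data: a noisy straight line (illustrative numbers only)
x = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 * xi - 2.0 + random.gauss(0, 2) for xi in x]

# Averaged moments, written as in the formulas above
mu_x = sum(x) / n
mu_y = sum(y) / n
var_x = sum(xi * xi for xi in x) / n - mu_x ** 2
var_y = sum(yi * yi for yi in y) / n - mu_y ** 2
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mu_x * mu_y

a = cov_xy / var_x        # slope of the line minimising vertical distances
b = mu_y - a * mu_x
a_hat = cov_xy / var_y    # slope (in y) of the line minimising horizontal distances
b_hat = mu_x - a_hat * mu_y

# Minimised sums of squares, computed directly from the residuals
S_min = sum((yi - a * xi - b) ** 2 for xi, yi in zip(x, y))
T_min = sum((xi - a_hat * yi - b_hat) ** 2 for xi, yi in zip(x, y))

# Closed forms: n times the ratio of averaged moments
S_closed = n * (var_y * var_x - cov_xy ** 2) / var_x
T_closed = n * (var_y * var_x - cov_xy ** 2) / var_y

gap = 1 / a_hat - a       # difference of the two slopes in the (x, y) plane

print("S_min:", S_min, "closed form:", S_closed)
print("T_min:", T_min, "closed form:", T_closed)
print("slope gap:", gap, "S_min / (n * cov):", S_min / (n * cov_xy))
```

The gap vanishes only when `S_min` is zero, i.e. when the points are exactly collinear.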

— Dilip Sarwate
source

Thanks! abs(correlation) < 1 accounts for why the slope was systematically steeper in the reversed case.

— Greg Aponte

(+1) but I added an answer with just an illustration of what you just said, as I have a geometric mind :)

— Elvis

Class reply (+1)

— Digio

39

Just to illustrate Dilip’s answer: on the following pictures,

the black dots are data points;
on the left, the black line is the regression line obtained by y ~ x, which minimizes the sum of the squares of the lengths of the red segments;
on the right, the black line is the regression line obtained by x ~ y, which minimizes the sum of the squares of the lengths of the red segments.

[Figure: regression lines]

Edit (least rectangles regression)

If there is no natural way to choose a "response" and a "covariate", but rather the two variables are interdependent, you may wish to keep a symmetrical role for $y$ and $x$; in this case you can use "least rectangles regression."

write $Y = aX + b + \epsilon$, as usual;
denote by $\hat y_i = a x_i + b$ and $\hat x_i = {1\over a} (y_i - b)$ the estimates of $Y_i$ conditional on $X = x_i$ and of $X_i$ conditional on $Y = y_i$;
minimize $\sum_i | x_i - \hat x_i | \cdot | y_i - \hat y_i|$, which leads to $\hat y = \mathrm{sign}\left(\mathrm{cov}(x,y)\right){\hat\sigma_y \over \hat\sigma_x} (x-\overline x) + \overline y.$

Here is an illustration with the same data points: for each point, a "rectangle" is computed as the product of the lengths of the two red segments, and the sum of the rectangles is minimized. I don't know much about the properties of this regression and I haven't found much with Google.

[Figure: least rectangles]
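To make the least rectangles slope concrete, here is a minimal Python sketch (standard library only) on hypothetical data. It computes the slope $\mathrm{sign}(\mathrm{cov}(x,y))\,\hat\sigma_y/\hat\sigma_x$ and compares the sum of rectangle areas with what the two ordinary regression lines give. The helper name `rectangle_sum` is my own, and it assumes the line passes through the point of means, which is the best intercept for any fixed slope.

```python
import math
import random

random.seed(2)  # arbitrary seed, for reproducibility only
n = 200
# Hypothetical data with a weak positive relationship
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope_yx = sxy / sxx              # y ~ x
slope_xy_inverted = syy / sxy     # x ~ y, rewritten as y = slope * x + ...
slope_rect = math.copysign(math.sqrt(syy / sxx), sxy)  # sign(cov) * s_y / s_x


def rectangle_sum(a):
    """Sum of |x_i - x_hat_i| * |y_i - y_hat_i| for the line of slope a through the means."""
    b = my - a * mx
    return sum(abs(xi - (yi - b) / a) * abs(yi - a * xi - b)
               for xi, yi in zip(x, y))


for name, s in [("y ~ x", slope_yx),
                ("least rectangles", slope_rect),
                ("inverted x ~ y", slope_xy_inverted)]:
    print(f"{name:18s} slope = {s:.4f}  rectangle sum = {rectangle_sum(s):.3f}")
```

The least rectangles slope is the geometric mean of the other two slopes (its square equals their product), so it always sits between them.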

— Elvis
source

14

Some notes: (1) Unless I am mistaken, it seems that the "least rectangles regression" is equivalent to the solution obtained from taking the first principal component on the matrix $\mathbf X = (\mathbf y, \mathbf x)$ after centering and rescaling to have unit variance and then backsubstituting. (cont.)

— cardinal

14

(cont.) (2) Viewed this way, it is easy to see that this "least rectangles regression" is equivalent to a form of orthogonal (or total) least squares and, thus, (3) a special case of Deming regression on the centered, rescaled vectors taking $\delta = 1$. Orthogonal least squares can be considered as "least-circles regression".

— cardinal

2

@cardinal Very interesting comments! (+1) I believe major axis regression (minimizing perpendicular distances between the regression line and all the points, à la PCA), reduced major axis regression, and type II regression as exemplified in the lmodel2 R package by P Legendre are also relevant here, since those techniques are used when it's hard to tell which role (response or predictor) each variable plays or when we want to account for measurement errors.

— chl

1

@chl: (+1) Yes, I believe you are right and the Wikipedia page on total least squares lists several other names for the same procedure, not all of which I am familiar with. It appears to go back to at least R. Frisch, Statistical confluence analysis by means of complete regression systems, Universitetets Økonomiske Instituut, 1934 where it was called diagonal regression.

— cardinal

3

@cardinal I should have been more careful when reading the Wikipedia entry... For future reference, here is a picture taken from Biostatistical Design and Analysis Using R, by M. Logan (Wiley, 2010; Fig. 8.4, p. 174), which summarizes the different approaches, much like Elvis's nice illustrations.

— chl

13

Just a brief note on why you see the slope smaller for one regression. Both slopes depend on three numbers: standard deviations of $x$ and $y$ ( $s_{x}$ and $s_{y}$ ), and correlation between $x$ and $y$ ( $r$ ). The regression with $y$ as response has slope $r\frac{s_{y}}{s_{x}}$ and the regression with $x$ as response has slope $r\frac{s_{x}}{s_{y}}$ , hence the ratio of the first slope to the reciprocal of the second is equal to $r^2\leq 1$ .

So the greater the proportion of variance explained, the closer the slopes obtained from each case. Note that the proportion of variance explained is symmetric and equal to the squared correlation in simple linear regression.
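A quick way to convince yourself of this is to compute the three numbers on some data. The Python sketch below (standard library only, hypothetical data with a negative slope) builds both slopes from $s_x$, $s_y$ and $r$ and checks that the ratio of the first slope to the reciprocal of the second is $r^2$.

```python
import math
import random

random.seed(3)  # arbitrary seed, for reproducibility only
n = 100
# Hypothetical data: negative relationship with a fair amount of noise
x = [random.gauss(5, 2) for _ in range(n)]
y = [1.0 - 0.8 * xi + random.gauss(0, 3) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

s_x = math.sqrt(sxx / (n - 1))      # sample standard deviation of x
s_y = math.sqrt(syy / (n - 1))      # sample standard deviation of y
r = sxy / math.sqrt(sxx * syy)      # sample correlation

slope_yx = r * s_y / s_x            # regression with y as response
slope_xy = r * s_x / s_y            # regression with x as response

# Ratio of the first slope to the reciprocal of the second
ratio = slope_yx / (1 / slope_xy)

print("slope y ~ x:", slope_yx)
print("reciprocal of slope x ~ y:", 1 / slope_xy)
print("ratio:", ratio, " r^2:", r ** 2)
```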

— probabilityislogic
source

1

A simple way to look at this is to note that, if for the true model $y=\alpha+\beta x+\epsilon$ , you run two regressions:

$y=a_{y\sim x}+b_{y\sim x} x$
$x=a_{x\sim y}+b_{x\sim y} y$

Then we have, using $b_{y\sim x}=\frac{cov(x,y)}{var(x)}=\frac{cov(x,y)}{var(y)}\frac{var(y)}{var(x)}$:

$b_{y\sim x}=b_{x\sim y}\frac{var(y)}{var(x)}$

So whether you get a steeper slope or not just depends on the ratio $\frac{var(y)}{var(x)}$. Based on the assumed true model, this ratio is equal to:

$\frac{var(y)}{var(x)}=\frac{\beta^2 var(x) + var(\epsilon)}{var(x)}$

Link with other answers

You can connect this result with the answers from others, who said that when $R^2=1$, it should be the reciprocal. Indeed, $R^2=1\Rightarrow var(\epsilon) = 0$, and also $b_{y\sim x}=\beta$ (no estimation error). Hence:

$R^2=1\Rightarrow b_{y\sim x}=b_{x\sim y}\frac{\beta^2 var(x) + 0}{var(x)}=b_{x\sim y}\beta^2$

Substituting $b_{y\sim x}=\beta$ on the left-hand side gives $\beta = b_{x\sim y}\beta^2$, so $b_{x\sim y}=1/\beta$.
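Here is a small simulation of that relation in Python (standard library only). The parameter values `alpha = 1`, `beta = 0.5` and the noise scales are assumptions chosen for illustration. The identity between the two slopes holds exactly in any sample, while the decomposition of the variance ratio holds only approximately, because the sample covariance between $x$ and $\epsilon$ is not exactly zero.

```python
import random

random.seed(4)  # arbitrary seed, for reproducibility only
n = 5000
alpha, beta = 1.0, 0.5   # assumed "true" parameters
x = [random.gauss(0, 2) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
y = [alpha + beta * xi + ei for xi, ei in zip(x, eps)]


def var(u):
    """Population-style variance (divides by n)."""
    m = sum(u) / len(u)
    return sum((ui - m) ** 2 for ui in u) / len(u)


def cov(u, v):
    """Population-style covariance (divides by n)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)


b_yx = cov(x, y) / var(x)   # slope of y ~ x
b_xy = cov(x, y) / var(y)   # slope of x ~ y

ratio = var(y) / var(x)
model_ratio = (beta ** 2 * var(x) + var(eps)) / var(x)

print("b_yx:", b_yx, " b_xy * var(y)/var(x):", b_xy * ratio)
print("var(y)/var(x):", ratio, " model-based value:", model_ratio)
```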

— Matifou
source

0

It becomes interesting when there is also noise on your inputs (which we could argue is always the case, no command or observation is ever perfect).

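I have built some simulations to observe the phenomenon, based on a simple linear relationship $x = y$, with Gaussian noise on both x and y. I generated the observations as follows (Python code):

import numpy as np
n = 100  # number of observations (assumed here; the value is not given in this post)

# noise-free linear relationship y = x
x = np.linspace(0, 1, n)
y = x

# observed values: independent Gaussian noise (sd 0.2) added to both variables
x_o = x + np.random.normal(0, 0.2, n)
y_o = y + np.random.normal(0, 0.2, n)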

See the different results (odr here is orthogonal distance regression, i.e. the same as least rectangles regression):

All the code is available here:

https://gist.github.com/jclevesque/5273ad9077d9ea93994f6d96c20b0ddd
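For readers who cannot open the gist, here is a self-contained sketch of the same kind of experiment in plain Python (standard library only). It is not the code from the gist: the sample size `n = 2000`, the seed, and the use of the closed-form total least squares slope (valid when both noise variances are equal) are my own assumptions.

```python
import math
import random

random.seed(5)   # arbitrary seed, for reproducibility only
n = 2000         # assumed sample size
sigma = 0.2      # noise level used in the snippet above

x = [i / (n - 1) for i in range(n)]   # same grid as np.linspace(0, 1, n)
y = list(x)                           # true relationship y = x
x_o = [xi + random.gauss(0, sigma) for xi in x]
y_o = [yi + random.gauss(0, sigma) for yi in y]

mx, my = sum(x_o) / n, sum(y_o) / n
sxx = sum((xi - mx) ** 2 for xi in x_o)
syy = sum((yi - my) ** 2 for yi in y_o)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x_o, y_o))

slope_yx = sxy / sxx              # y_o ~ x_o (attenuated towards 0)
slope_xy_inverted = syy / sxy     # x_o ~ y_o, rewritten as y = slope * x + ...

# Orthogonal (total least squares) slope, closed form for equal noise variances
slope_odr = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

print("y ~ x slope:          ", slope_yx)
print("orthogonal slope:     ", slope_odr)
print("inverted x ~ y slope: ", slope_xy_inverted)
```

With noise on both variables, neither ordinary regression recovers the true slope of 1: one underestimates it and the inverted one overestimates it, while the orthogonal fit lands in between.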

— levesque
source

0

Regression line is not (always) the same as true relationship

You may have some 'true' causal relationship like

y = a + b x + ϵ

$y = a + bx + \epsilon$

but the fitted regression lines y ~ x or x ~ y do not mean the same thing as that causal relationship (even when, in practice, the expression for one of the regression lines may coincide with the expression for the causal 'true' relationship).

More precise relationship between slopes

For two switched simple linear regressions:

$Y = a_1 + b_1 X\\X = a_2 + b_2 Y$

you can relate the slopes as follows:

$b_1 = \rho^2 \frac{1}{b_2}$, so that $|b_1| \leq \frac{1}{|b_2|}$

So the slopes are not each other's inverse.

Intuition

The reason is that

Regression lines and correlations do not necessarily correspond one-to-one to a causal relationship.
Regression lines relate more directly to a conditional probability or best prediction.

You can imagine that the conditional probability relates to the strength of the relationship. Regression lines reflect this, and the slopes of the lines may both be shallow when the strength of the relationship is small, or both be steep when the strength of the relationship is strong. The slopes are not simply each other's inverse.

Example

If two variables $X$ and $Y$ relate to each other by some (causal) linear relationship

$Y = \text{a little bit of } X + \text{a lot of error}$

then you can imagine that it would not be good to entirely reverse that relationship in case you wish to express $X$ based on a given value of $Y$.

Instead of

$X = \text{a lot of } Y + \text{a little of error}$

it would be better to also use

$X = \text{a little bit of } Y + \text{a lot of error}$

See the following example distributions with their respective regression lines. The distributions are multivariate normal with $\Sigma_{11} = \Sigma_{22}=1$ and $\Sigma_{12} = \Sigma_{21} = \rho$

The conditional expected values (what you would get in a linear regression) are

$\begin{array}{rcl} E(Y|X) &=& \rho X \\ E(X|Y) &=& \rho Y \end{array}$

and in this case, with $X,Y$ following a multivariate normal distribution, the conditional distributions are

$\begin{array}{rcl} Y \mid X & \sim & N(\rho X,1-\rho^2) \\ X \mid Y & \sim & N(\rho Y,1-\rho^2) \end{array}$

So you can see the variable Y as being a part $\rho X$ and a part noise with variance $1-\rho^2$. The same is true the other way around.

The larger the correlation coefficient $\rho$ , the closer the two lines will be. But the lower the correlation, the less strong the relationship, the less steep the lines will be (this is true for both lines Y ~ X and X ~ Y)
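This can be checked with a small simulation. The Python sketch below (standard library only) draws from a bivariate normal with unit variances and correlation `rho = 0.6`, a value I picked for illustration, and fits both regressions. Both fitted slopes should be close to `rho`, and the residual variance close to $1-\rho^2$.

```python
import math
import random

random.seed(6)  # arbitrary seed, for reproducibility only
n = 20000
rho = 0.6       # assumed correlation

# Standard bivariate normal with correlation rho, built from two independent normals
X, Y = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    X.append(z1)
    Y.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)


def ols(u, v):
    """Least-squares fit of v = slope * u + intercept; returns (slope, intercept)."""
    m = len(u)
    mu, mv = sum(u) / m, sum(v) / m
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    slope = suv / suu
    return slope, mv - slope * mu


b1, a1 = ols(X, Y)   # Y ~ X
b2, a2 = ols(Y, X)   # X ~ Y

# Variance of Y around its fitted line
resid_var = sum((yi - a1 - b1 * xi) ** 2 for xi, yi in zip(X, Y)) / n

print("slope of Y ~ X:", b1)
print("slope of X ~ Y:", b2)
print("residual variance:", resid_var, " 1 - rho^2:", 1 - rho ** 2)
```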

— Sextus Empiricus
source

0

The short answer

The goal of a simple linear regression is to come up with the best predictions of the y variable, given values of the x variable. This is a different goal than trying to come up with the best prediction of the x variable, given values of the y variable.

Simple linear regression of y ~ x gives you the 'best' possible model for predicting y given x. Hence, if you fit a model for x ~ y and algebraically inverted it, that model could at its very best do only as well as the model for y ~ x. But inverting a model fit for x ~ y will usually do worse at predicting y given x, compared to the 'optimal' y ~ x model, because the "inverted x ~ y model" was created to fulfill a different objective.

Illustration

Imagine you have the following dataset:

When you run an OLS regression of y ~ x, you come up with the following model

y = 0.167 + 1.5*x

This optimizes predictions of y by making the following predictions, which have associated errors:

The OLS regression's predictions are optimal in the sense that the sum of the values in the rightmost column (i.e. the sum of squares) is as small as can be.

When you run an OLS regression of x ~ y, you come up with a different model:

x = -0.07 + 0.64*y

This optimizes predictions of x by making the following predictions, with associated errors.

Again, this is optimal in the sense that the sum of the values of the rightmost column is as small as possible (equal to 0.071).

Now, imagine you tried to just invert the first model, y = 0.167 + 1.5*x, using algebra, giving you the model x = -0.11 + 0.67*y.

This would give you the following predictions and associated errors:

The sum of the values in the rightmost column is 0.074, which is larger than the corresponding sum from the model you get from regressing x on y, i.e. the x ~ y model. In other words, the "inverted y ~ x model" is doing a worse job at predicting x than the OLS model of x ~ y.
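The same comparison can be reproduced on any dataset. The Python sketch below (standard library only) uses a small hypothetical dataset, not the one from this answer, so the numbers differ from those quoted above. It fits both models and compares how well the proper x ~ y model and the algebraically inverted y ~ x model predict x.

```python
def ols(u, v):
    """Least-squares fit of v = intercept + slope * u; returns (slope, intercept)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    slope = suv / suu
    return slope, mv - slope * mu


def sse(predicted, actual):
    """Sum of squared prediction errors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, actual))


# Hypothetical dataset, chosen only for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.8, 3.1, 5.2, 5.9, 8.3, 8.8]

b_yx, a_yx = ols(x, y)   # y = a_yx + b_yx * x
b_xy, a_xy = ols(y, x)   # x = a_xy + b_xy * y

# Predict x with the model that was fitted for that purpose
sse_direct = sse([a_xy + b_xy * yi for yi in y], x)

# Predict x by inverting the y ~ x model: x = (y - a_yx) / b_yx
sse_inverted = sse([(yi - a_yx) / b_yx for yi in y], x)

print("SSE of x ~ y model:         ", sse_direct)
print("SSE of inverted y ~ x model:", sse_inverted)
```

Because the x ~ y fit minimizes exactly this sum of squares, the inverted model can match it only when the points are perfectly collinear.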

— bschneidr
source