How does the formula for generating correlated random variables work?

19

If we have 2 uncorrelated normal random variables $X_1, X_2$, then we can create 2 correlated random variables with the formula

$Y=\rho X_1+ \sqrt{1-\rho^2} X_2$

and then $Y$ will have correlation $\rho$ with $X_1$.

Can someone explain where this formula comes from?

correlation normal-distribution covariance

— Lanza

1

An extensive discussion of this problem and related ones appears in my answer at stats.stackexchange.com/a/71303 . Among other things, it makes clear that (1) the normality assumption is irrelevant and (2) you need to make additional assumptions: the variances of $X_1$ and $X_2$ must be equal for the correlation of $Y$ with $X_1$ to be $\rho$.

— whuber

Very interesting link. I am not sure I understand what you mean by normality being irrelevant. If $X_1$ or $X_2$ is not normal, it becomes more difficult to control the density of $Y$ via the Kaiser-Dickman algorithm. This is the reason specialized algorithms exist for generating non-normal correlated data (e.g., Headrick, 2002; Ruscio & Kaczetow, 2008; Vale & Maurelli, 1983). For example, imagine that your goal is to generate $X$ ~ normal, $Y$ ~ uniform, with $\rho$ = 0.5. Using $X_2$ ~ uniform yields a $Y$ that is not uniform ($Y$ ends up being a linear combination of a normal and a uniform).

— Anthony

@Anthony The question concerns only the correlation, which is purely a function of the first and second moments. The answer does not depend on any other properties of the distributions. What you are discussing is a different subject altogether.
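The point about first and second moments can be checked numerically. The following is a minimal sketch, not part of the original thread: the helper name `correlated_with`, the seed, the sample size, and the choice of a uniform and a shifted exponential input are all assumptions made for illustration. Both inputs are independent, non-normal, and scaled to unit variance.

```python
import numpy as np

def correlated_with(x1, x2, rho):
    """Hypothetical helper: rho*x1 + sqrt(1 - rho^2)*x2, the formula from the question."""
    return rho * x1 + np.sqrt(1.0 - rho**2) * x2

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 200_000
rho = 0.5

# Uniform on [-sqrt(3), sqrt(3)] has variance 1.
x1 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)
# Exponential(1) shifted by -1 has mean 0 and variance 1.
x2 = rng.exponential(1.0, size=n) - 1.0

y = correlated_with(x1, x2, rho)
sample_corr = np.corrcoef(y, x1)[0, 1]
print(f"target rho = {rho}, sample corr(Y, X1) = {sample_corr:.3f}")
```

The sample correlation should land close to the target even though neither input is normal, because only the variances and the zero covariance enter the calculation.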

— whuber

17

Suppose you want to find a linear combination of $X_1$ and $X_2$ such that

$\text{corr}(\alpha X_1 + \beta X_2, X_1) = \rho$

Note that if you multiply both $\alpha$ and $\beta$ by the same positive constant, the correlation will not change (a negative constant only flips its sign). Thus, we add a condition that preserves the variance: $\text{var}(\alpha X_1 + \beta X_2) = \text{var}(X_1)$

This is equivalent to

$\rho = \frac{\text{cov}(\alpha X_1 + \beta X_2, X_1)}{\sqrt{\text{var}(\alpha X_1 + \beta X_2) \text{var}(X_1)}} = \frac{\alpha \overbrace{\text{cov}(X_1, X_1)}^{=\text{var}(X_1)} + \overbrace{\beta \text{cov}(X_2, X_1)}^{=0}}{\sqrt{\text{var}(\alpha X_1 + \beta X_2) \text{var}(X_1)}} = \alpha \sqrt{\frac{\text{var}(X_1)}{\alpha^2 \text{var}(X_1) + \beta^2 \text{var}(X_2)}}$

Assuming both random variables have the same variance, $\text{var}(X_1) = \text{var}(X_2)$ (this is a crucial assumption!), we get

$\rho \sqrt{\alpha^2 + \beta^2} = \alpha$

There are many solutions to this equation, so it's time to recall the variance-preserving condition:

$\text{var}(X_1) = \text{var}(\alpha X_1 + \beta X_2) = \alpha^2 \text{var}(X_1) + \beta^2 \text{var}(X_2) \Rightarrow \alpha^2 + \beta^2 = 1$

And this leads us to

$\alpha = \rho, \quad \beta = \pm \sqrt{1-\rho^2}$
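The derivation above can be checked by simulation. This is a rough sketch under stated assumptions: the function names `sample_corr_y_x1` and `predicted_corr`, the seed, the sample size, and the standard deviations are illustrative choices, not part of the answer. `predicted_corr` evaluates the last expression of the correlation formula, $\alpha \sqrt{\text{var}(X_1) / (\alpha^2 \text{var}(X_1) + \beta^2 \text{var}(X_2))}$, so it also shows what happens when the equal-variance assumption is dropped.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 200_000
rho = 0.7
alpha, beta = rho, np.sqrt(1.0 - rho**2)

def sample_corr_y_x1(sigma1, sigma2):
    """Simulate independent normals with the given standard deviations
    and return the sample correlation of Y = alpha*X1 + beta*X2 with X1."""
    x1 = rng.normal(0.0, sigma1, size=n)
    x2 = rng.normal(0.0, sigma2, size=n)
    y = alpha * x1 + beta * x2
    return np.corrcoef(y, x1)[0, 1]

def predicted_corr(sigma1, sigma2):
    """alpha * sqrt(var(X1) / (alpha^2 var(X1) + beta^2 var(X2)))."""
    return alpha * np.sqrt(sigma1**2 / (alpha**2 * sigma1**2 + beta**2 * sigma2**2))

equal = sample_corr_y_x1(1.0, 1.0)    # equal variances: should be close to rho
unequal = sample_corr_y_x1(1.0, 3.0)  # unequal variances: no longer rho

print(f"equal variances:   sample {equal:.3f}, predicted {predicted_corr(1.0, 1.0):.3f}")
print(f"unequal variances: sample {unequal:.3f}, predicted {predicted_corr(1.0, 3.0):.3f}")
```

With equal variances the predicted value reduces to $\rho$; with $\text{var}(X_2)$ nine times larger it drops well below $\rho$, which is why the equal-variance assumption matters.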

UPD. Regarding the second question: yes, this is known as whitening.

— Artem Sobolev

9

The equation is a simplified bivariate form of the Cholesky decomposition. This simplified equation is sometimes called the Kaiser-Dickman algorithm (Kaiser & Dickman, 1962).

Note that $X_1$ and $X_2$ must have the same variance for this algorithm to work properly. Also, the algorithm is typically used with normal variables. If $X_1$ or $X_2$ is not normal, $Y$ might not have the same distributional form as $X_2$.
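The link to the Cholesky decomposition can be made concrete with a short sketch. The variable names, seed, and sample size below are assumptions chosen for illustration. For a 2×2 correlation matrix, the lower-triangular Cholesky factor has second row $(\rho, \sqrt{1-\rho^2})$, which gives exactly the coefficients in the formula.

```python
import numpy as np

rho = 0.5
R = np.array([[1.0, rho],
              [rho, 1.0]])  # target correlation matrix

# numpy.linalg.cholesky returns the lower-triangular L with R = L @ L.T
L = np.linalg.cholesky(R)
expected_L = np.array([[1.0, 0.0],
                       [rho, np.sqrt(1.0 - rho**2)]])
print(L)

rng = np.random.default_rng(2)  # arbitrary seed
Z = rng.standard_normal((2, 200_000))  # rows: independent standard normals

# Row 0 of X is Z[0]; row 1 is rho*Z[0] + sqrt(1 - rho^2)*Z[1]
X = L @ Z
sample_corr = np.corrcoef(X)[0, 1]
print(f"sample correlation = {sample_corr:.3f}")
```

The same construction extends to more than two variables by factoring a larger correlation matrix, which is the general case the bivariate formula simplifies.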

References:

Kaiser, H. F., & Dickman, K. (1962). Sample and population score matrices and sample correlation matrices from an arbitrary population correlation matrix. Psychometrika, 27(2), 179-182.

— Anthony

2

I suppose you don't need standardized normal variables, just having the same variance should be enough.

— Artem Sobolev

2

No, the distribution of $Y$ is not a mixture distribution as you claim.

— Dilip Sarwate

Point taken, @Dilip Sarwate. If either $X_1$ or $X_2$ is nonnormal, then $Y$ becomes a linear combination of two variables that might not result in the desired distribution. This is the reason for specialized algorithms (instead of Kaiser-Dickman) for generating non-normal correlated data.
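A small simulation illustrates this, using the normal/uniform example with $\rho$ = 0.5 mentioned earlier in the thread. It is only a sketch: the helper `excess_kurtosis`, the seed, and the sample size are assumptions, and excess kurtosis is just one convenient way to compare shapes (it is 0 for a normal and -1.2 for a uniform distribution).

```python
import numpy as np

def excess_kurtosis(v):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    c = v - v.mean()
    return np.mean(c**4) / np.mean(c**2) ** 2 - 3.0

rng = np.random.default_rng(4)  # arbitrary seed
n = 200_000
rho = 0.5

x1 = rng.standard_normal(n)                            # normal, variance 1
x2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)  # uniform, variance 1
y = rho * x1 + np.sqrt(1.0 - rho**2) * x2

corr_y_x1 = np.corrcoef(y, x1)[0, 1]
kurt_x2 = excess_kurtosis(x2)
kurt_y = excess_kurtosis(y)

print(f"corr(Y, X1)        = {corr_y_x1:.3f}")
print(f"excess kurtosis X2 = {kurt_x2:.3f}")
print(f"excess kurtosis Y  = {kurt_y:.3f}")
```

The correlation comes out near 0.5 as intended, but the excess kurtosis of $Y$ sits between that of a uniform and that of a normal, so $Y$ is not uniform.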

— Anthony

3

The correlation coefficient is the cosine of the angle between two series when they are treated as mean-centered vectors (with the $n^{th}$ data point being the $n^{th}$ dimension of the vector). The above formula simply decomposes a vector into its $\cos\theta$ and $\sin\theta$ components (with respect to $X_1, X_2$): if $\rho = \cos\theta$, then $\sqrt{1-\rho^2} = \pm\sin\theta$.

This works because if $X_1, X_2$ are uncorrelated, the angle between them is a right angle (i.e., they can be considered orthogonal, albeit non-normalized, basis vectors).
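The geometric picture can be sketched numerically. The function name `cosine_of_centered`, the seed, and the sample size are assumptions for illustration; the key detail is that the vectors are mean-centered before the angle is computed.

```python
import numpy as np

def cosine_of_centered(u, v):
    """Cosine of the angle between u and v after subtracting each vector's mean."""
    u = u - u.mean()
    v = v - v.mean()
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(3)  # arbitrary seed
n = 100_000
rho = 0.6

x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = rho * x1 + np.sqrt(1.0 - rho**2) * x2  # cos(theta)*x1 + sin(theta)*x2

cos_x1_x2 = cosine_of_centered(x1, x2)  # near 0: roughly orthogonal
cos_y_x1 = cosine_of_centered(y, x1)    # near rho
pearson = np.corrcoef(y, x1)[0, 1]      # same quantity via NumPy

print(f"cos(X1, X2) = {cos_x1_x2:.3f}")
print(f"cos(Y, X1)  = {cos_y_x1:.3f}, Pearson = {pearson:.3f}")
```

The cosine and the Pearson coefficient agree up to floating-point error, and in a finite sample `x1` and `x2` are only approximately orthogonal, so the angle between `y` and `x1` is only approximately $\arccos\rho$.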

— Dmitry Rubanovich

2

Welcome to our site! I believe your post will get more attention if you mark up the mathematical expressions using $\TeX$: enclose them between dollar signs. There's help available when you're editing.

— whuber