I'm not aware of a universal method to generate correlated random variables with any given marginal distributions. So, I'll propose an ad hoc method to generate pairs of uniformly distributed random variables with a given (Pearson) correlation.
Without loss of generality, I assume that the desired marginal distribution is standard uniform (i.e., the support is [0,1]).
The proposed approach relies on the following:
a) For standard uniform random variables $U_1$ and $U_2$ with respective distribution functions $F_1$ and $F_2$, we have $F_i(U_i) = U_i$ for $i = 1, 2$.
Thus, by definition Spearman's rho is
$$\rho_S(U_1, U_2) = \text{corr}(F_1(U_1), F_2(U_2)) = \text{corr}(U_1, U_2).$$
So, Spearman's rho and Pearson's correlation coefficient are equal (their sample versions may, however, differ).
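As a quick numerical illustration of a), the snippet below compares the sample Pearson and Spearman correlations for a pair of correlated uniform variables; the mixture construction used for u2 is just one hypothetical way to obtain correlated uniforms for this check and is not part of the proposed method.
## Check of a): for uniform margins, Pearson and Spearman correlations (nearly) coincide
set.seed(1)
m <- 1e5
u1 <- runif(m)
b <- rbinom(m, size = 1, prob = 0.7)   # mixing indicator
u2 <- b * u1 + (1 - b) * runif(m)      # uniform margin, correlated with u1
cor(u1, u2, method = "pearson")
cor(u1, u2, method = "spearman")       # both values should be close to 0.7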
b) If $X_1, X_2$ are random variables with continuous margins and a Gaussian copula with (Pearson) correlation coefficient $\rho$, then Spearman's rho is
$$\rho_S(X_1, X_2) = \frac{6}{\pi} \arcsin\left(\frac{\rho}{2}\right).$$
This makes it easy to generate random variables that have a desired value of Spearman's rho.
The approach is thus to generate data from a Gaussian copula whose correlation coefficient $\rho$ is chosen so that the resulting Spearman's rho equals the desired correlation of the uniform random variables.
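As a quick check of b), one can simulate from a bivariate normal distribution (continuous margins, Gaussian copula) and compare the sample Spearman's rho with the formula above; the values of $\rho$ and the sample size below are arbitrary choices for illustration.
## Check of b): empirical vs. theoretical Spearman's rho for a Gaussian copula
set.seed(456)
rho <- 0.4                                       # arbitrary Pearson correlation
Z <- matrix(rnorm(2e5), ncol = 2) %*% chol(toeplitz(c(1, rho)))
cor(Z[, 1], Z[, 2], method = "spearman")         # empirical Spearman's rho
6 / pi * asin(rho / 2)                           # theoretical value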
Simulation algorithm
Let $r$ denote the desired level of correlation and $n$ the number of pairs to be generated.
The algorithm is:
- Compute $\rho = 2 \sin(r \pi / 6)$.
- Generate a pair of random variables from the Gaussian copula with correlation coefficient $\rho$ (e.g., with this approach).
- Repeat step 2 $n$ times.
Example
The following code is an example implementation of this algorithm in R, with target correlation $r = 0.6$ and $n = 500$ pairs.
## Initialization and parameters
set.seed(123)
r <- 0.6 # Target (Spearman) correlation
n <- 500 # Number of samples
## Functions
gen.gauss.cop <- function(r, n){
  rho <- 2 * sin(r * pi/6) # Pearson correlation
  P <- toeplitz(c(1, rho)) # Correlation matrix
  d <- nrow(P)             # Dimension
  ## Generate sample
  U <- pnorm(matrix(rnorm(n*d), ncol = d) %*% chol(P))
  return(U)
}
## Data generation and visualization
U <- gen.gauss.cop(r = r, n = n)
pairs(U, diag.panel = function(x){
  h <- hist(x, plot = FALSE)
  rect(head(h$breaks, -1), 0, tail(h$breaks, -1), h$counts/max(h$counts))
})
In the figure below, the diagonal panels show histograms of the variables $U_1$ and $U_2$, and the off-diagonal panels show scatter plots of $U_1$ against $U_2$.
By construction, the random variables have uniform margins and a correlation coefficient close to $r$. Due to sampling variability, however, the correlation coefficient of the simulated data is not exactly equal to $r$.
cor(U)[1, 2]
# [1] 0.5337697
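Since the target $r$ is defined as a Spearman correlation, one may also want to look at the sample Spearman's rho, which by point a) should be close to the Pearson value above:
cor(U, method = "spearman")[1, 2]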
Note that the gen.gauss.cop function should work with more than two variables simply by specifying a larger correlation matrix.
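One possible way to do this (the function gen.gauss.cop.multi and the three-variable correlation matrix below are my own illustration, not part of the original code) is to pass a full matrix of target Spearman correlations and apply the $2 \sin(r \pi / 6)$ transformation elementwise:
## Hypothetical generalization (sketch): take a matrix R of target Spearman
## correlations, convert it to the Pearson correlation matrix of the Gaussian
## copula, and generate n samples with uniform margins.
gen.gauss.cop.multi <- function(R, n){
  P <- 2 * sin(R * pi / 6) # Pearson correlation matrix of the Gaussian copula
  diag(P) <- 1             # guard against rounding on the diagonal
  d <- nrow(P)
  U <- pnorm(matrix(rnorm(n * d), ncol = d) %*% chol(P))
  return(U)
}
## Example with three variables and target Spearman correlations 0.6, 0.3, 0.1
R <- matrix(c(1, 0.6, 0.3,
              0.6, 1, 0.1,
              0.3, 0.1, 1), nrow = 3)
U3 <- gen.gauss.cop.multi(R, n = 1000)
round(cor(U3, method = "spearman"), 2)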
Simulation study
The following simulation study, repeated for target correlations $r = -0.5, 0.1, 0.6$, suggests that the distribution of the correlation coefficient converges to the desired correlation as the sample size $n$ increases.
## Simulation
set.seed(921)
r <- 0.6 # Target correlation
n <- c(10, 50, 100, 500, 1000, 5000); names(n) <- n # Number of samples
S <- 1000 # Number of simulations
res <- sapply(n,
              function(n, r, S){
                replicate(S, cor(gen.gauss.cop(r, n))[1, 2])
              },
              r = r, S = S)
boxplot(res, xlab = "Sample size", ylab = "Correlation")
abline(h = r, col = "red")
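To complement the boxplots, one could also summarize the simulated correlation coefficients by their mean and standard deviation for each sample size, for instance:
## Mean and standard deviation of the simulated correlations by sample size
round(rbind(mean = colMeans(res), sd = apply(res, 2, sd)), 3)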