One of the easiest tasks in R is to get the correlations between each pair of variables in a dataset. As an example, let’s take the first four columns of the ‘mtcars’ dataset, which is available within R. Getting the variances-covariances and the correlations is straightforward.
data(mtcars)
matr <- mtcars[,1:4]
#Covariances
cov(matr)
## mpg cyl disp hp
## mpg 36.324103 -9.172379 -633.0972 -320.7321
## cyl -9.172379 3.189516 199.6603 101.9315
## disp -633.097208 199.660282 15360.7998 6721.1587
## hp -320.732056 101.931452 6721.1587 4700.8669
#Correlations
cor(matr)
## mpg cyl disp hp
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl -0.8521620 1.0000000 0.9020329 0.8324475
## disp -0.8475514 0.9020329 1.0000000 0.7909486
## hp -0.7761684 0.8324475 0.7909486 1.0000000
It’s really a piece of cake! Unfortunately, a few days ago I had a covariance matrix without the original dataset and I wanted the corresponding correlation matrix. Although this is an easy task as well, at first I was stuck, because I could not find an immediate solution… So I started wondering how I could do it.
Given two variables X and Y, their sample covariance is:
\[cov(X, Y) = \frac{1}{n - 1} \sum\limits_{i=1}^{n} {(X_i - \bar{X})(Y_i - \bar{Y})}\]
where \(\bar{X}\) and \(\bar{Y}\) are the means of the two variables and n is the number of observations. The correlation is:
\[cor(X, Y) = \frac{cov(X, Y)}{\sigma_x \sigma_y} \]
where \(\sigma_x\) and \(\sigma_y\) are the standard deviations for X and Y.
The opposite relationship is clear:
\[ cov(X, Y) = cor(X, Y) \sigma_x \sigma_y\]
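As a quick check of these definitions, the covariance and the correlation between ‘mpg’ and ‘disp’ can be computed by hand and compared with the built-in functions. This is only a sketch: the names x, y, n, cov_xy and cor_xy are mine, and it uses the sample covariance (with the n - 1 divisor), which is what ‘cov()’ returns.

```r
data(mtcars)
x <- mtcars$mpg
y <- mtcars$disp
n <- length(x)

# Sample covariance: sum of cross-products of deviations, divided by n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance scaled by the two standard deviations
cor_xy <- cov_xy / (sd(x) * sd(y))

# Both comparisons should return TRUE
all.equal(cov_xy, cov(x, y))
all.equal(cor_xy, cor(x, y))
```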
Therefore, converting from covariance to correlation is pretty easy. For example, take the covariance between ‘disp’ and ‘mpg’ above (-633.097208); the correlation is:
-633.097208 / (sqrt(36.324103) * sqrt(15360.7998))
## [1] -0.8475514
In reverse, if we have the correlation (-0.8475514), the covariance is:
-0.8475514 * sqrt(36.324103) * sqrt(15360.7998)
## [1] -633.0972
My covariance matrix was pretty large, so I started wondering how I could convert the whole matrix at once. What I had to do was to take each element in the covariance matrix and divide it by the square roots of the two diagonal elements in the same row and in the same column (see below).
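In symbols (the notation is my own), if \(V_{ij}\) is the element in row i and column j of the covariance matrix, the corresponding correlation should be:

\[cor_{ij} = \frac{V_{ij}}{\sqrt{V_{ii}} \sqrt{V_{jj}}}\]

where \(\sqrt{V_{ii}}\) and \(\sqrt{V_{jj}}\) are the standard deviations of the i-th and j-th variables, since the diagonal of a covariance matrix holds the variances.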
This is easily done with element-wise operations on matrices (not with a true matrix product). I need a square matrix where the standard deviation of each variable is repeated along its row:
V <- cov(matr)
SM1 <- matrix(rep(sqrt(diag(V)), 4), 4, 4)
SM1
## [,1] [,2] [,3] [,4]
## [1,] 6.026948 6.026948 6.026948 6.026948
## [2,] 1.785922 1.785922 1.785922 1.785922
## [3,] 123.938694 123.938694 123.938694 123.938694
## [4,] 68.562868 68.562868 68.562868 68.562868
and another one where they are repeated along the columns:
SM2 <- matrix(rep(sqrt(diag(V)), each = 4), 4, 4)
Now I can take my covariance matrix (V) and simply multiply it, element by element, by the reciprocals of the other two matrices:
V * 1/SM1 * 1/SM2
## mpg cyl disp hp
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl -0.8521620 1.0000000 0.9020329 0.8324475
## disp -0.8475514 0.9020329 1.0000000 0.7909486
## hp -0.7761684 0.8324475 0.7909486 1.0000000
In fact, there is no need to use ‘rep’ when we create SM1, as ‘matrix()’ will recycle the elements as needed.
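A more compact sketch of the same idea, assuming the covariance matrix V built as above: the vector of standard deviations (called s here, a name of my own) can be handed to ‘matrix()’ without ‘rep’, or the whole table of products can be built in one step with ‘outer()’.

```r
data(mtcars)
V <- cov(mtcars[, 1:4])

# Standard deviations: square roots of the diagonal of V
s <- sqrt(diag(V))
k <- length(s)

# matrix() recycles s: the default column-wise fill gives constant rows,
# byrow = TRUE gives constant columns
SM1 <- matrix(s, k, k)
SM2 <- matrix(s, k, k, byrow = TRUE)
R1 <- V / (SM1 * SM2)

# outer(s, s) holds the products s[i] * s[j], so one division is enough
R2 <- V / outer(s, s)

# Both should agree with cor() on the original data
all.equal(R1, cor(mtcars[, 1:4]))
all.equal(R2, cor(mtcars[, 1:4]))
```

The ‘outer()’ version also avoids hard-coding the number of variables.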
Going from correlation to covariance can be done similarly:
R <- cor(matr)
R / (1/SM1 * 1/SM2)
## mpg cyl disp hp
## mpg 36.324103 -9.172379 -633.0972 -320.7321
## cyl -9.172379 3.189516 199.6603 101.9315
## disp -633.097208 199.660282 15360.7998 6721.1587
## hp -320.732056 101.931452 6721.1587 4700.8669
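For repeated use, the reverse conversion could be wrapped in a small helper. The function below is a hypothetical sketch: ‘cor2cov’ is a name I made up here (it is not part of base R), and it assumes a correlation matrix and a vector of standard deviations given in the same variable order.

```r
# Hypothetical helper: rebuild a covariance matrix from a correlation
# matrix R and a vector of standard deviations s (same variable order)
cor2cov <- function(R, s) {
  stopifnot(is.matrix(R), nrow(R) == ncol(R), length(s) == nrow(R))
  # Element-wise product with the table of products s[i] * s[j]
  R * outer(s, s)
}

data(mtcars)
matr <- mtcars[, 1:4]
s <- sapply(matr, sd)

# Should return TRUE: the result matches cov() on the original data
all.equal(cor2cov(cor(matr), s), cov(matr))
```

Note that the standard deviations have to come from somewhere: a correlation matrix alone does not carry enough information to recover the covariances.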
This is an easy task, but it got me stuck for a few minutes…
Later, I finally discovered that there is (at least) one function in R taking care of the above task: the ‘cov2cor()’ function in the ‘stats’ package, which is loaded by default, so no ‘library()’ call is needed.
cov2cor(V)
## mpg cyl disp hp
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl -0.8521620 1.0000000 0.9020329 0.8324475
## disp -0.8475514 0.9020329 1.0000000 0.7909486
## hp -0.7761684 0.8324475 0.7909486 1.0000000
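As a sanity check, sketched with V and matr rebuilt as at the start of the post, the result of ‘cov2cor()’ can be compared with the correlations computed directly from the data.

```r
data(mtcars)
matr <- mtcars[, 1:4]
V <- cov(matr)

# Convert the covariance matrix to a correlation matrix
R <- cov2cor(V)

# Should return TRUE: same result as computing correlations from the data
all.equal(R, cor(matr))

# A correlation matrix has ones on its diagonal
diag(R)
```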
It is really easy to drown in a glass of water!