This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.
A correlation is a measure of how strongly two quantitative variables are related. It fully captures only linear relationships, but a linear relationship often serves as a first approximation to other kinds of relationships. You can calculate correlations for categorical variables, and the results will sometimes point you in the right direction, but there are better ways to describe relationships involving categorical variables.
Correlation coefficients range from -1 to 1. A positive correlation coefficient means the two variables tend to move together: an observation which has a high value for one variable is likely to have a high value for the other, and vice versa. A negative correlation coefficient means they tend to move in opposite directions: observations with a high value for one variable are likely to have a low value for the other. The farther the coefficient is from zero in either direction, the stronger the relationship. Variables which are independent will have a correlation of zero (or, in a sample, close to zero), but variables which are related but not in a linear way can also have a correlation of zero.
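To see how two closely related variables can still have a correlation of zero, here is a small sketch using made-up data rather than the GSS sample. The variables x and y are hypothetical and exist only for this illustration. It assumes you run it from a do file; preserve and restore set aside whatever data set is in memory and bring it back afterward.

```stata
* Illustration only: simulated data, not the GSS sample
preserve
clear
set obs 101
generate x = _n - 51      // x runs from -50 to 50
generate y = x^2          // y is completely determined by x
correlate x y
restore
```

Because x is symmetric around zero, high values of y occur equally often with low and high values of x, so the reported correlation should be zero (up to rounding) even though y depends entirely on x.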
Setting Up
If you plan to carry out the examples in this article, make sure you've downloaded the GSS sample to your U:\SFS folder as described in Managing Stata Files. Then create a do file called cor.do in that folder that loads the GSS sample as described in Doing Your Work Using Do Files. If you plan on applying what you learn directly to your homework, create a similar do file but have it load the data set used for your assignment.
Calculating Correlations
The correlate command, often abbreviated cor, calculates correlations. List the variables you want correlations for after the command.
cor sei10 educ height weight
This gives you the correlations between the respondent's socioeconomic status, years of education, height, and weight. They are given in the form of a matrix, but only half of the matrix is shown because it is symmetric:
(obs=114)

             |    sei10     educ   height   weight
-------------+------------------------------------
       sei10 |   1.0000
        educ |   0.6205   1.0000
      height |   0.2466   0.1868   1.0000
      weight |   0.1048  -0.0224   0.5282   1.0000
This shows that the correlation between socioeconomic status and education is .6205, which is fairly high. The correlation between socioeconomic status and height, .2466, is weaker, but it's interesting that it's positive at all. Keep in mind that correlation does not imply causation. We cannot tell from these results whether high socioeconomic status causes people to grow taller or being tall causes people to have higher socioeconomic status (both can be true, and there's evidence for both theories), or if something else causes people to both grow taller and have higher socioeconomic status.
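If you want more help judging correlations like these, one option is the pwcorr command, sketched below as a possible addition to the do file. Its sig option adds a p-value under each correlation, and its obs option shows how many observations each one is based on. Note that pwcorr uses every observation available for each pair of variables, while correlate only uses observations that have values for all the variables listed (114 here), so the two commands may give slightly different numbers.

```stata
* Pairwise correlations with p-values and the number of observations per pair
pwcorr sei10 educ height weight, sig obs
```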
The correlation between weight and education is essentially zero. The negative sign indicates that, in this sample, respondents with more education tend to weigh slightly less, but the association is very weak. On the other hand, education and height are positively correlated, and height and weight are strongly positively correlated. This raises the possibility that education and weight might have a stronger negative relationship if we could control for height. Multivariate regression allows us to explore that possibility.
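As a minimal sketch of that idea, and not a full analysis, the regress command below models weight as a function of both education and height. The coefficient on educ then describes the relationship between education and weight among respondents of the same height. Treating weight as the outcome is an assumption made for this illustration.

```stata
* Relationship between weight and education, holding height constant
regress weight educ height
```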
Calculating Covariances
If you want covariances instead, add the cov option:
cor sei10 educ height weight, cov
(obs=114)

             |    sei10     educ   height   weight
-------------+------------------------------------
       sei10 |  510.103
        educ |  43.4237  9.59983
      height |  22.7511  2.36376  16.6884
      weight |  99.2858 -2.91236  90.4648  1757.94
Covariances are not bound to fall in the range of -1 to 1, and depend on both how much the variables vary together and how much they vary overall. But the interpretations of positive and negative numbers are similar. The diagonal of the matrix gives you the variance of each variable, or its standard deviation squared.
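The two matrices are directly connected: a correlation is the covariance of two variables divided by the product of their standard deviations. As a quick check, you can plug numbers from the covariance matrix into the display command. The two lines below use the values for sei10 and educ, and then for height and weight.

```stata
* Correlation = covariance / sqrt(variance of x * variance of y)
display 43.4237 / sqrt(510.103 * 9.59983)
display 90.4648 / sqrt(16.6884 * 1757.94)
```

The results should match, up to rounding, the correlations of 0.6205 and 0.5282 reported by cor above.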
Complete Do File
capture log close
log using cor.log, replace
clear all
set more off
use gss_sample
cor sei10 educ height weight
cor sei10 educ height weight, cov
log close
Last Revised: 11/17/2016