Correlation analysis and linear regression
Correlation Analysis
Today I learnt correlation analysis. The simplest thing is a table of two variables, income and meat consumption, say. We want to find out how one variable (the independent variable, income) correlates with another (the dependent variable, meat consumption). First, we want a way to explore and visualise the data to see if they are correlated. A scatter plot is one way to visualise such data.
Annual income, $'000 | Annual meat consumption, kg |
---|---|
20 | 8 |
30 | 20 |
100 | 50 |
200 | 75 |
300 | 78 |
500 | 79 |
The strength of the correlation can be tested using the cor and cor.test functions in R. The 0.84 returned by cor is the correlation coefficient, r, and indicates a strong positive correlation; a common rule of thumb is that an |r| of 0.7 and above counts as strong. Squaring it gives r² ≈ 0.71, meaning about 71% of the observed variation in meat consumption can be explained by income. The remaining 29% is not covered by our model.
In addition to the correlation strength, we must also check the correlation p-value. The null hypothesis is that there is no relationship between income and meat consumption. A p-value < 0.05 means we reject the null hypothesis. Our p-value of 0.04 does just that. In other words, we reject the null hypothesis and accept that there is a correlation between rising income and rising meat consumption. Note that strength and significance are separate questions: cor.test can return a low p-value (high confidence that a correlation exists) together with a low cor value (the correlation, though real, is weak).
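A minimal sketch of this workflow (the variable names income and meat are my own):

```r
income <- c(20, 30, 100, 200, 300, 500)  # annual income, $'000
meat <- c(8, 20, 50, 75, 78, 79)         # annual meat consumption, kg
plot(income, meat)      # scatter plot to eyeball the relationship
cor(income, meat)       # Pearson r, about 0.84
cor.test(income, meat)  # same r, plus the p-value of about 0.04
```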
Linear regression
Next, I learnt linear regression. The simplest kind is simply called simple linear regression. It is used to predict how a variable changes (the dependent variable on the y-axis) when another variable changes (the independent variable on the x-axis). An example we used in class is how motor insurance premium (dependent variable) changes with years of driving experience (independent variable) using the following data set.
Years of driving experience | Motor insurance premium, $ |
---|---|
5 | 64 |
2 | 87 |
12 | 56 |
9 | 71 |
15 | 44 |
6 | 56 |
25 | 42 |
16 | 60 |
The following R script produces a scatter plot of the data, fits a regression line and computes the strength of the line.
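The original listing isn't shown; a minimal reconstruction consistent with the surrounding text might be:

```r
exp <- c(5, 2, 12, 9, 15, 6, 25, 16)       # years of driving experience
prem <- c(64, 87, 56, 71, 44, 56, 42, 60)  # motor insurance premium, $
plot(exp, prem)        # scatter plot of the data
reg <- lm(prem ~ exp)  # line 4: fit the linear model
abline(reg)            # draw the regression line over the plot
summary(reg)           # coefficients, adjusted R-squared and p-value
```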
On line 4, reg <- lm(prem ~ exp) fits a linear model, called reg, to our data, and abline simply draws the regression line over a standard scatter plot. A computer program such as R will always be able to fit a line to any data, but whether the resulting model is a good one is a different question. It’s up to us human modellers to answer that question. To do that, we check the adjusted R-squared and the p-value. An accepted cutoff for adjusted R-squared is 0.7, but here we only have 0.55. In other words, only 55% of the variability in insurance premiums can be accounted for by driving experience. The p-value, though, is low at 0.02, so we reject the null hypothesis.
It is worth asking what the null hypothesis is here. For simple linear regression, H0 is m = 0. The regression line has the general form y = mx + c, so m = 0 implies y = c. In other words, whatever value x (eg, driving experience) takes has no influence on y (ie, motor insurance premium) whatsoever: the regression line becomes a flat line. Rejecting H0 means x influences y.
The summary data of the regression model also tells us how to construct the regression line. From the summary data, we know c = 77.28 and m = -1.54. Therefore, y = 77.28 - 1.54x. So, if you just got your driving licence, expect to pay $77.28, on average. For every year of experience you’ve gained, you pay $1.54 less, again on average. Remember, a regression model is a prediction model, and ours only covers 55% of the variation, so we can’t be too sure. We can say, though, that if you sample a large number of drivers with 10 years of driving experience, their average motor insurance premium will be $77.28 - $1.54 * 10 = $61.88.
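The same prediction can be read straight off the fitted model with R's predict function:

```r
# about 61.92; the 61.88 above comes from using rounded coefficients
predict(reg, newdata = data.frame(exp = 10))
```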
Life is often more complicated than one variable correlating with another. We often find multiple factors correlating with an observation to varying degrees of strength. Smoking causes cancer; so does alcohol. And if you both drink and smoke, which one will do you in? That’s what multiple linear regression tries to answer. Instead of a simple y = mx + c, multiple linear regression has the general form y = β0x0 + β1x1 + … + βnxn + c + ϵ, where ϵ is the residual error that can’t be explained by the model, c is the y-intercept, and each βi is the coefficient for its corresponding term, xi. The H0 is β0 = β1 = … = βn = 0. Multiple linear regression is best illustrated with an example.
A supermarket chain wants to launch a new energy bar. It sells the energy bar in 34 branches at different prices and with varying amounts of advertising. We want to find out whether it’s low price that increases sales, or advertising, or both, or neither.
Store | Sales | Price | Advertising |
---|---|---|---|
1 | 4141 | 59 | 200 |
2 | 3842 | 59 | 200 |
3 | 3056 | 59 | 200 |
4 | 3519 | 59 | 200 |
5 | 4226 | 59 | 400 |
6 | 4630 | 59 | 400 |
7 | 3507 | 59 | 400 |
8 | 3754 | 59 | 400 |
9 | 5000 | 59 | 600 |
10 | 5120 | 59 | 600 |
11 | 4011 | 59 | 600 |
12 | 5015 | 59 | 600 |
13 | 1916 | 79 | 200 |
14 | 675 | 79 | 200 |
15 | 3636 | 79 | 200 |
16 | 3224 | 79 | 200 |
17 | 2295 | 79 | 400 |
18 | 2730 | 79 | 400 |
19 | 2618 | 79 | 400 |
20 | 4421 | 79 | 400 |
21 | 4113 | 79 | 600 |
22 | 3746 | 79 | 600 |
23 | 3532 | 79 | 600 |
24 | 3825 | 79 | 600 |
25 | 1096 | 99 | 200 |
26 | 761 | 99 | 200 |
27 | 2088 | 99 | 200 |
28 | 820 | 99 | 200 |
29 | 2114 | 99 | 400 |
30 | 1882 | 99 | 400 |
31 | 2159 | 99 | 400 |
32 | 1602 | 99 | 400 |
33 | 3354 | 99 | 600 |
34 | 2927 | 99 | 600 |
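The original listing isn't shown; a minimal sketch, assuming the table has been read into a data frame called sales with columns Sales, Price and Advertising:

```r
# sales: data frame holding the 34 rows above
fit <- lm(sales$Sales ~ sales$Price + sales$Advertising)
summary(fit)  # overall p-value, per-term p-values with stars, adjusted R-squared
```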
There are a few things to notice here. First, the overall p-value is 2.863e-10, below 0.05, so we reject H0. In other words, price and/or advertising correlate(s) with sales. But which one? We look at the triple stars on the right after sales$Price and sales$Advertising. Reading under the Pr(>|t|) column, the p-value of H0 for βprice = 0 is 9.20e-09, way below 0.05. We reject that H0 too. That means sales$Price has a significant correlation with sales. The same goes for sales$Advertising. Finally, check that the adjusted R-squared is above 0.7 to ensure our model covers a large portion of the variation in sales. The significance of the terms can also be obtained using anova, as in the sketch below:
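```r
anova(fit)  # sequential F-tests for each term, reusing the fit above
```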
Another way to see which independent variables correlate is to run step-wise multiple regression. Forward selection adds each independent variable iteratively to see if it’s significant; a significant independent variable is kept. This process goes on until all the independent variables are tested. Backward elimination subtracts terms from the full model. Bidirectional elimination combines both approaches. This process gets tedious very quickly. Luckily, R does it for us in a few lines of code.
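A sketch using R's built-in step function; note that step selects by AIC rather than by p-values, but it arrives at the same answer here:

```r
full <- lm(sales$Sales ~ sales$Price + sales$Advertising)
step(full, direction = "both")  # bidirectional elimination; prints each step
```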
The point to note here is that we start with an initial model of sales$Sales ~ sales$Price + sales$Advertising and end up with the same final model. Therefore, no term was added or subtracted, and both terms are significantly correlated with sales.
One last interesting thing about our energy bar example is the regression model itself. From the R output, we can construct sales = 3.61 * advertising - 53.22 * price + 5837.52.[1] Take shop 1, for example: the model outputs $3,419.54, pretty close to the real sales value of $4,141. Suppose the company now wants to eliminate advertising because it’s costly. At what price should we sell the energy bar to generate the same sales amount? We can just drop the advertising term in our model to get 4141 = 5837.52 - 53.22 * price. It’s trivial to solve the equation and find the new price, 31.88.
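The arithmetic, as a quick sanity check in R:

```r
3.61 * 200 - 53.22 * 59 + 5837.52  # shop 1 prediction: 3419.54
(5837.52 - 4141) / 53.22           # advertising-free price: about 31.88
```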
The following R code shows a different data set following the same multiple linear regression workflow. This time, we want to find which factors correlate with a cigarette’s nicotine content. At the end, we’ll see that a cigarette’s weight and its CO content don’t correlate with its nicotine content. It is the tar that matters.
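A sketch of that workflow; the data frame cig and its column names Nicotine, Tar, Weight and CO are hypothetical:

```r
# cig: hypothetical data frame of cigarette measurements
cig_fit <- lm(Nicotine ~ Tar + Weight + CO, data = cig)
summary(cig_fit)                   # expect only Tar to come out significant
step(cig_fit, direction = "both")  # step-wise selection should drop Weight and CO
```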
Finally, remember: correlation is not causation.
[1] The equation can be obtained by the following:
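One way to do it, as a sketch that pulls the coefficients out of the fitted model (the coefficient names follow from the sales$ formula used earlier):

```r
co <- coef(fit)  # named vector: (Intercept), sales$Price, sales$Advertising
sprintf("sales = %.2f * advertising %+.2f * price %+.2f",
        co["sales$Advertising"], co["sales$Price"], co["(Intercept)"])
```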
Revisions
7 May 2018: Added code for generating linear regression model as formula.
14 March 2017: Original version published.