Correlation analysis and linear regression
Correlation Analysis
Today I learnt correlation analysis. The simplest thing is a table of two variables, income and meat consumption, say. We want to find out how one variable (the independent variable, income) correlates with another (the dependent variable, meat consumption). First, we want a way to explore and visualise the data to see if they are correlated. A scatter plot is one way to visualise such data.
Annual income, $'000 | Annual meat consumption, kg |
---|---|
20 | 8 |
30 | 20 |
100 | 50 |
200 | 75 |
300 | 78 |
500 | 79 |
The strength of the correlation can be tested using the cor and cor.test functions in R. The 0.84 returned by cor is the correlation coefficient, r, and indicates a strong positive correlation; a common rule of thumb is that an |r| of 0.7 and above counts as strong. Squaring it gives r² ≈ 0.71, meaning about 71% of the observed variation in meat consumption can be explained by income. The remaining 29% is not covered by our model.
In addition to the correlation strength, we must also check the correlation p-value. The null hypothesis is that there is no relationship between income and meat consumption. A p-value < 0.05 means we reject the null hypothesis. Our p-value of 0.04 does just that. In other words, we reject the null hypothesis and accept that there is a correlation between rising income and rising meat consumption. Note that strength and significance are separate questions: cor.test can return a low p-value (high confidence that a correlation exists) together with a low cor value (the correlation, though real, is weak).
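A minimal sketch of this workflow (the variable names income and meat are my own):

```r
income <- c(20, 30, 100, 200, 300, 500)  # annual income, $'000
meat <- c(8, 20, 50, 75, 78, 79)         # annual meat consumption, kg
plot(income, meat)      # scatter plot to eyeball the relationship
cor(income, meat)       # Pearson r, about 0.84
cor.test(income, meat)  # same r, plus the p-value of about 0.04
```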
Linear regression
Next, I learnt linear regression. The simplest kind is simply called simple linear regression. It is used to predict how a variable changes (the dependent variable on the y-axis) when another variable changes (the independent variable on the x-axis). An example we used in class is how motor insurance premium (dependent variable) changes with years of driving experience (independent variable) using the following data set.
Years of driving experience | Motor insurance premium, $ |
---|---|
5 | 64 |
2 | 87 |
12 | 56 |
9 | 71 |
15 | 44 |
6 | 56 |
25 | 42 |
16 | 60 |
The following R script produces a scatter plot of the data, fits a regression line and computes the strength of the line.
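The original listing isn't shown; a minimal reconstruction consistent with the surrounding text might be:

```r
exp <- c(5, 2, 12, 9, 15, 6, 25, 16)       # years of driving experience
prem <- c(64, 87, 56, 71, 44, 56, 42, 60)  # motor insurance premium, $
plot(exp, prem)        # scatter plot of the data
reg <- lm(prem ~ exp)  # line 4: fit the linear model
abline(reg)            # draw the regression line over the plot
summary(reg)           # coefficients, adjusted R-squared and p-value
```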
On line 4, reg <- lm(prem ~ exp) fits a linear model, called reg, to our data, and abline simply draws the regression line over a standard scatter plot. A computer program such as R will always be able to fit a line to any data, but whether the resulting model is a good one is a different question. It’s up to us human modellers to answer that question. To do that, we check the adjusted R-squared and the p-value. An accepted cutoff for adjusted R-squared is 0.7, but here we only have 0.55. In other words, only 55% of the variability in insurance premiums can be accounted for by driving experience. The p-value, though, is low at 0.02, so we reject the null hypothesis.
It is worth asking what the null hypothesis is here. For simple linear regression, H0 is m = 0. The regression line has the general form y = mx + c, so m = 0 implies y = c. In other words, whatever value x (eg, driving experience) takes has no influence on y (ie, motor insurance premium) whatsoever: the regression line becomes a flat line. Rejecting H0 means x influences y.
The summary data of the regression model also tells us how to construct the regression line. From the summary data, we know c = 77.28 and m = -1.54. Therefore, y = 77.28 - 1.54x. So, if you just got your driving licence, expect to pay $77.28, on average. For every year of experience you’ve gained, you pay $1.54 less, again on average. Remember, a regression model is a prediction model, and ours only covers 55% of the variation, so we can’t be too sure. We can say, though, that if you sample a large number of drivers with 10 years of driving experience, their average motor insurance premium will be $77.28 - $1.54 * 10 = $61.88.
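The same prediction can be read straight off the fitted model with R's predict function:

```r
# about 61.92; the 61.88 above comes from using rounded coefficients
predict(reg, newdata = data.frame(exp = 10))
```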
Life is often more complicated than one variable correlating with another. We often find multiple factors correlating with an observation to varying degrees of strength. Smoking causes cancer; so does alcohol. And if you both drink and smoke, which one will do you in? That’s what multiple linear regression tries to answer. Instead of a simple y = mx + c, multiple linear regression has the general form y = β0x0 + β1x1 + … + βnxn + c + ϵ, where ϵ is the residual error that can’t be explained by the model, c is the y-intercept, and each βi is the coefficient for its corresponding term, xi. The H0 is β0 = β1 = … = βn = 0. Multiple linear regression is best illustrated with an example.
A supermarket chain wants to launch a new energy bar. It sells the energy bar in 34 branches at different prices and with varying amounts of advertising. We want to find out whether it’s low price that increases sales, or advertising, or both, or neither.
Store | Sales | Price | Advertising |
---|---|---|---|
1 | 4141 | 59 | 200 |
2 | 3842 | 59 | 200 |
3 | 3056 | 59 | 200 |
4 | 3519 | 59 | 200 |
5 | 4226 | 59 | 400 |
6 | 4630 | 59 | 400 |
7 | 3507 | 59 | 400 |
8 | 3754 | 59 | 400 |
9 | 5000 | 59 | 600 |
10 | 5120 | 59 | 600 |
11 | 4011 | 59 | 600 |
12 | 5015 | 59 | 600 |
13 | 1916 | 79 | 200 |
14 | 675 | 79 | 200 |
15 | 3636 | 79 | 200 |
16 | 3224 | 79 | 200 |
17 | 2295 | 79 | 400 |
18 | 2730 | 79 | 400 |
19 | 2618 | 79 | 400 |
20 | 4421 | 79 | 400 |
21 | 4113 | 79 | 600 |
22 | 3746 | 79 | 600 |
23 | 3532 | 79 | 600 |
24 | 3825 | 79 | 600 |
25 | 1096 | 99 | 200 |
26 | 761 | 99 | 200 |
27 | 2088 | 99 | 200 |
28 | 820 | 99 | 200 |
29 | 2114 | 99 | 400 |
30 | 1882 | 99 | 400 |
31 | 2159 | 99 | 400 |
32 | 1602 | 99 | 400 |
33 | 3354 | 99 | 600 |
34 | 2927 | 99 | 600 |
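The original listing isn't shown; a minimal sketch, assuming the table has been read into a data frame called sales with columns Sales, Price and Advertising:

```r
# sales: data frame holding the 34 rows above
fit <- lm(sales$Sales ~ sales$Price + sales$Advertising)
summary(fit)  # overall p-value, per-term p-values with stars, adjusted R-squared
```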
There are a few things to notice here. First, the overall p-value is 2.863e-10, below 0.05, so we reject H0. In other words, price and/or advertising correlate(s) with sales. But which one? We look at the triple stars on the right after sales$Price and sales$Advertising. Reading under the Pr(>|t|) column, the p-value of H0 for βprice = 0 is 9.20e-09, way below 0.05. We reject that H0 too. That means sales$Price has a significant correlation with sales. The same goes for sales$Advertising. Finally, check that the adjusted R-squared is above 0.7 to ensure our model covers a large portion of the variation in sales. The significance of the terms can also be obtained using anova, as in the sketch below:
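```r
anova(fit)  # sequential F-tests for each term, reusing the fit above
```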
Another way to see which independent variables correlate is to run step-wise multiple regression. Forward selection adds each independent variable iteratively to see if it’s significant; a significant independent variable is kept. This process goes on until all the independent variables are tested. Backward elimination subtracts terms from the full model. Bidirectional elimination combines both approaches. This process gets tedious very quickly. Luckily, R does it for us in a few lines of code.
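A sketch using R's built-in step function; note that step selects by AIC rather than by p-values, but it arrives at the same answer here:

```r
full <- lm(sales$Sales ~ sales$Price + sales$Advertising)
step(full, direction = "both")  # bidirectional elimination; prints each step
```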
The point to note here is that we start with an initial model of sales$Sales ~ sales$Price + sales$Advertising and end up with the same final model. Therefore, no term was added or subtracted, and both terms are significantly correlated with sales.
One last interesting thing about our energy bar example is the regression model itself. From the R output, we can construct sales = 3.61 * advertising - 53.22 * price + 5837.52.[1] Take shop 1, for example: the model outputs $3,419.54, pretty close to the real sales value of $4,141. Suppose the company now wants to eliminate advertising because it’s costly. At what price should we sell the energy bar to generate the same sales amount? We can just drop the advertising term in our model to get 4141 = 5837.52 - 53.22 * price. It’s trivial to solve the equation and find the new price, 31.88.
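The arithmetic, as a quick sanity check in R:

```r
3.61 * 200 - 53.22 * 59 + 5837.52  # shop 1 prediction: 3419.54
(5837.52 - 4141) / 53.22           # advertising-free price: about 31.88
```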
The following R code shows a different data set following the same multiple linear regression workflow. This time, we want to find which factors correlate with a cigarette’s nicotine content. At the end, we’ll see that a cigarette’s weight and its CO content don’t correlate with its nicotine content. It is the tar that matters.
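A sketch of that workflow; the data frame cig and its column names Nicotine, Tar, Weight and CO are hypothetical:

```r
# cig: hypothetical data frame of cigarette measurements
cig_fit <- lm(Nicotine ~ Tar + Weight + CO, data = cig)
summary(cig_fit)                   # expect only Tar to come out significant
step(cig_fit, direction = "both")  # step-wise selection should drop Weight and CO
```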
Finally, remember: correlation is not causation.
[1] The equation can be obtained by the following:
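One way to do it, as a sketch that pulls the coefficients out of the fitted model (the coefficient names follow from the sales$ formula used earlier):

```r
co <- coef(fit)  # named vector: (Intercept), sales$Price, sales$Advertising
sprintf("sales = %.2f * advertising %+.2f * price %+.2f",
        co["sales$Advertising"], co["sales$Price"], co["(Intercept)"])
```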
Revisions
7 May 2018: Added code for generating linear regression model as formula.
14 March 2017: Original version published.