class: split-70 with-border hide-slide-number bg-brand-red background-image: url("images/USydLogo-black.svg") background-size: 200px background-position: 2% 90% .column.white[.content[ <br><br><br> # Multiple Linear Regression Part II ## .black[STAT3022 Applied Linear Models Lecture 7] <br><br><br> ### .black[2020/02/17] ]] .column.bg-brand-charcoal[.content.white[ ## Today 1. More multiple linear regression 1. Multiple correlation coefficient 1. Coefficient of determination `\(R^2\)` 1. Adjusted `\(R^2\)` 1. Modelling categorical variables ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's look at sales vs. youtube ]] .row[.split-two[ .row[.split-two[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ]]] ] .row[.split-two[ .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> ]]] ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's look at sales vs. 
facebook ]] .row[.split-two[ .row[.split-two[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> ]]] ] .row[.split-two[ .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> ]]] ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's look at sales vs. newspaper ]] .row[.split-two[ .row[.split-two[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> ]]] ] .row[.split-two[ .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" /> ]]] ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # What is your guess for the best model? 
]] .row[.content[ ```r fit01 <- lm(sales ~ youtube + facebook + newspaper, data=marketing) fit02 <- lm(sales ~ log(youtube) + facebook + newspaper, data=marketing) fit03 <- lm(sales ~ youtube + log(facebook) + newspaper, data=marketing) fit04 <- lm(sales ~ youtube + facebook + log(newspaper), data=marketing) fit05 <- lm(sales ~ log(youtube) + log(facebook) + newspaper, data=marketing) fit06 <- lm(sales ~ log(youtube) + facebook + log(newspaper), data=marketing) fit07 <- lm(sales ~ youtube + log(facebook) + log(newspaper), data=marketing) fit08 <- lm(sales ~ log(youtube) + log(facebook) + log(newspaper), data=marketing) fit09 <- lm(log(sales) ~ youtube + facebook + newspaper, data=marketing) fit10 <- lm(log(sales) ~ log(youtube) + facebook + newspaper, data=marketing) fit11 <- lm(log(sales) ~ youtube + log(facebook) + newspaper, data=marketing) fit12 <- lm(log(sales) ~ youtube + facebook + log(newspaper), data=marketing) fit13 <- lm(log(sales) ~ log(youtube) + log(facebook) + newspaper, data=marketing) fit14 <- lm(log(sales) ~ log(youtube) + facebook + log(newspaper), data=marketing) fit15 <- lm(log(sales) ~ youtube + log(facebook) + log(newspaper), data=marketing) fit16 <- lm(log(sales) ~ log(youtube) + log(facebook) + log(newspaper), data=marketing) ``` {{content}} ]] -- * Should we remove `newspaper`? * Are there other models that we have not considered? How should we select them? {{content}} -- <div style="width:60%"> <blockquote> All models are wrong, but some are useful. -- George Box </blockquote> </div> --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # How about this model with interaction? ]] .row[.content[ ```r final_fit <- lm(log(sales) ~ log(youtube)*facebook, data=marketing[-c(131, 156),]) summary(final_fit) # residual df = n - 2 - p = 200 - 2 - 4 = 194 ``` ``` Call: lm(formula = log(sales) ~ log(youtube) * facebook, data = marketing[-c(131, 156), ]) Residuals: Min 1Q Median 3Q Max -0.118289 -0.026081 0.004445 0.024301 0.098253 Coefficients: Estimate Std.
Error t value Pr(>|t|) (Intercept) 1.1370471 0.0304426 37.351 <2e-16 *** log(youtube) 0.2676378 0.0061356 43.620 <2e-16 *** facebook -0.0002488 0.0008666 -0.287 0.774 log(youtube):facebook 0.0023674 0.0001743 13.584 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.03962 on 194 degrees of freedom Multiple R-squared: 0.989, Adjusted R-squared: 0.9889 F-statistic: 5830 on 3 and 194 DF, p-value: < 2.2e-16 ``` ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Which plot looks unusual to you? ]] .row[.split-70[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ {{content}} ]] ]] -- * A null generation mechanism similar to that of Lecture 4 was employed here: the errors are sampled from a Normal distribution with mean 0 and variance equal to the estimated error variance from `final_fit`. * The simulated residual plots provide null references to compare with the data residual plot. * The data residual plot is number 18. --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Model diagnostics for the "final" model ]] .row[.content[ <br> <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Dissecting the "final" model ]] .row[.content[ * Observations 131 and 156 were removed as they were deemed outliers (we will cover this in the next lecture).
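As a quick preview of how such observations can be flagged, one common tool is Cook's distance. A minimal sketch on simulated data (not the `marketing` set; the planted outlier at observation 7 is made up for illustration — the same `cooks.distance()` call applies to any `lm` fit):

```r
# Sketch: rank observations by Cook's distance to flag influential points
set.seed(42)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50, sd = 0.3)
d$y[7] <- d$y[7] + 5                 # plant a gross outlier at observation 7
fit_sim <- lm(y ~ x, data = d)
head(sort(cooks.distance(fit_sim), decreasing = TRUE), 3)
```

On this planted example observation 7 should top the ranking; applied to the full marketing fit, an analogous ranking is what singles out 131 and 156.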
```r final_fit <- lm(log(sales) ~ log(youtube)*facebook, data=marketing[-c(131, 156),]) ``` is a shorthand for ```r final_fit <- lm(log(sales) ~ log(youtube) + facebook + log(youtube):facebook, data=marketing[-c(131, 156),]) ``` * The last term, `log(youtube):facebook`, is referred to as the interaction term between `log(youtube)` and `facebook` and is the same as fitting: ```r final_fit2 <- marketing[-c(131,156), ] %>% mutate(logYoutube=log(youtube), logYandF=log(youtube)*facebook) %>% lm(log(sales) ~ logYoutube + facebook + logYandF, data=.) ``` ]] --- ### Indeed they have the same output ```r final_fit %>% broom::tidy() ``` ``` term estimate std.error statistic p.value 1 (Intercept) 1.1370471442 0.0304426181 37.3505045 1.554532e-90 2 log(youtube) 0.2676378453 0.0061356267 43.6202951 3.202419e-102 3 facebook -0.0002487674 0.0008665749 -0.2870697 7.743653e-01 4 log(youtube):facebook 0.0023673639 0.0001742763 13.5839689 5.660864e-30 ``` ```r final_fit2 %>% broom::tidy() ``` ``` term estimate std.error statistic p.value 1 (Intercept) 1.1370471442 0.0304426181 37.3505045 1.554532e-90 2 logYoutube 0.2676378453 0.0061356267 43.6202951 3.202419e-102 3 facebook -0.0002487674 0.0008665749 -0.2870697 7.743653e-01 4 logYandF 0.0023673639 0.0001742763 13.5839689 5.660864e-30 ``` --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Dissecting the "final" model ]] .row[.content[ `$$\hat{\log(\texttt{sales}_i) }= \color{blue}{1.14} + \color{red}{0.268}\times \log(\texttt{youtube}_i) \color{blue}{- 0.000249\times \texttt{facebook}_i} \color{red}{+ 0.00237 \times} \log(\texttt{youtube}_i)\times \color{red}{\texttt{facebook}_i}$$` `$$\hat{\texttt{sales}_i }= \texttt{youtube}_i^{\color{red}{0.268 + 0.00237 \times \texttt{facebook}_i}}\exp\left(\color{blue}{1.14 - 0.000249\times \texttt{facebook}_i}\right)$$` since `\(a \log b=\log b^a\)`. The model is additive on the log scale and multiplicative on the original scale.
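This back-transformation can be sanity-checked numerically. A self-contained sketch on simulated data (`x1` and `x2` are stand-ins for **youtube** and **facebook**, not the real fit): exponentiating the fitted values of the additive log-scale model reproduces the power-times-exponential form exactly, up to floating-point error.

```r
# Sketch: check that the additive log-scale fit equals the multiplicative form
# x1^(b1 + b3 * x2) * exp(b0 + b2 * x2), using simulated data
set.seed(1)
x1 <- runif(100, 1, 300)   # plays the role of youtube
x2 <- runif(100, 0, 60)    # plays the role of facebook
y  <- exp(1.1 + 0.27 * log(x1) + 0.002 * log(x1) * x2 + rnorm(100, sd = 0.04))
fit_log <- lm(log(y) ~ log(x1) * x2)
b <- coef(fit_log)
manual <- x1^(b["log(x1)"] + b["log(x1):x2"] * x2) *
  exp(b["(Intercept)"] + b["x2"] * x2)
all.equal(unname(manual), unname(exp(fitted(fit_log))))  # TRUE
```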
.pull-left[ ```r linedat <- expand.grid(facebook=seq(0, 59.52, 2), youtube=seq(0.84, 355.68, 1)) linedat$sales <- exp(predict(final_fit, linedat)) ggplot(marketing, aes(youtube, sales)) + geom_point(aes(color=facebook)) + theme_bw(base_size=18) + geom_line(data=linedat, aes(youtube, sales, group=facebook, color=facebook), alpha=0.5) + ggtitle("Plotting prediction lines") ``` ] .pull-right[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Explaining the code ]] .row[.split-50[ .column[.content[ ```r linedat<-expand.grid(facebook=seq(0,59.52,2),youtube=seq(0.84,355.68,1)) linedat$sales <- exp(predict(final_fit, linedat)) ``` Set up a grid and calculate **sales** by exponentiating the model predictions at every combination of the **youtube** and **facebook** levels. .brand-blue[linedat] contains two columns, **facebook** and **youtube**, holding all combinations, and a third column **sales** evaluated at these combinations. ]] .column.bg-brand-gray[.content[ ```r ggplot(marketing,aes(youtube,sales))+geom_point(aes(color=facebook)) ``` Map (x, y) to (**youtube**, **sales**), colouring the points by the value of **facebook**. ```r geom_line(data=linedat,aes(youtube,sales,group=facebook,color=facebook),alpha=0.5) ``` Draw lines of **sales** against **youtube** using .brand-blue[linedat], one line for each level of **facebook**, with line colour given by **facebook**. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Got another model. How does this one look?
]] .row[.content[ ```r skimr::skim(mystery_data) ``` ``` Skim summary statistics n obs: 8406 n variables: 5 -- Variable type:numeric ------------------------------------------------------------------------------------ variable missing n mean sd p0 p25 p50 p75 p100 hist x1 0 8406 0.016 0.99 -4.48 -0.65 0.014 0.69 3.75 ▁▁▂▆▇▅▁▁ x2 0 8406 0.017 1 -3.54 -0.64 0.0076 0.68 3.57 ▁▁▃▇▇▃▁▁ x3 0 8406 0.00066 1 -3.71 -0.68 -0.0048 0.69 3.87 ▁▁▃▇▇▃▁▁ x4 0 8406 0.014 1.99 -7.11 -1.39 0.032 1.39 6.86 ▁▁▃▇▇▅▁▁ y 0 8406 0.045 1.41 -4.67 -0.89 0.052 1.09 3.8 ▁▁▂▆▇▆▃▁ ``` .pull-left[ ![](lecture07_2020JC_files/figure-html/unnamed-chunk-30-1.png)<!-- --> ] .pull-right[ ```r fit0<-lm(y ~ ., data=mystery_data); print(broom::tidy(fit0), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) -0.002044 0.01092 -0.1871 0.8516 2 x1 0.979051 0.01510 64.8375 0.0000 3 x2 0.997928 0.01550 64.3711 0.0000 4 x3 0.972790 0.01543 63.0656 0.0000 5 x4 0.994709 0.01086 91.5907 0.0000 ``` ] ]] -- <div class="bg-brand-red white" style="position:absolute;top:45%;left:30%;padding-left:20px;padding-right:20px"> `$$\hat{Y} = -0.002 + 0.979x_1 + 0.998x_2 + 0.973x_3 + 0.995x_4$$` </div> --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Ha! Jokes - My residual plot is Homer Simpson ]] .row[.split-two[ .column[.content[ ```r lm(y ~ ., data=mystery_data) %>% broom::augment() %>% ggplot(aes(.fitted, .resid)) + geom_point() + labs(x="Fitted Values", y="Residual") + theme(axis.title=element_text(size=20)) ``` .content-box-blue[ ## Moral of the story: ### Don't forget to check model diagnostics.
] ]] .column[.content[ ![](lecture07_2020JC_files/figure-html/unnamed-chunk-33-1.png)<!-- --> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Goodness of Fit (GoF) criteria ]] .row[.split-two[ .column[.content[ A measure of the goodness of fit of the multiple regression model is `$$R^2 = \frac{S_{yy} - \text{RSS}}{S_{yy}}$$` the proportion of total variability in `\(Y\)` explained by the linear model. * `\(R = \sqrt{R^2}\)` is called the .brand-blue[multiple correlation coefficient], `\(0\leq R \leq 1\)`. * `\(R\)` is in fact the Pearson correlation coefficient between the observed and fitted values, `\(\{Y_i\}\)` and `\(\{\hat{Y}_i\}\)`. ]] .column.bg-brand-gray[.content[ * `\(R^2\)` never decreases as predictors are added, so it is always at least as high for more complex models. * A modified version of `\(R^2\)` that takes model complexity into account when assessing GoF, by adjusting for the degrees of freedom, is the *adjusted* `\(R^2\)`: `\begin{equation*} R_a^2 = 1-\frac{\text{RSS}/(n-p)}{S_{yy}/(n-1)} \end{equation*}` * More parameters increase `\(p\)` and hence the subtracted term. * You can use the function `glance` from `broom` to calculate these values: ```r broom::glance(final_fit)[,1:2] ``` ``` r.squared adj.r.squared 1 0.9890294 0.9888597 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Find `\(R^2\)` and `\(R^2_a\)` in `summary` output ]] .row[.content[ ``` Call: lm(formula = log(sales) ~ log(youtube) * facebook, data = marketing[-c(131, 156), ]) Residuals: Min 1Q Median 3Q Max -0.118289 -0.026081 0.004445 0.024301 0.098253 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.1370471 0.0304426 37.351 <2e-16 *** log(youtube) 0.2676378 0.0061356 43.620 <2e-16 *** facebook -0.0002488 0.0008666 -0.287 0.774 log(youtube):facebook 0.0023674 0.0001743 13.584 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 Residual standard error: 0.03962 on 194 degrees of freedom Multiple R-squared: 0.989, Adjusted R-squared: 0.9889 F-statistic: 5830 on 3 and 194 DF, p-value: < 2.2e-16 ``` ```r c(summary(final_fit)$r.squared,summary(final_fit)$adj.r.squared) #alternative way ``` ``` [1] 0.9890294 0.9888597 ``` ]] --- class: bg-brand-red middle center white # Modelling categorical variables --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: weight by gender ]] .row[.split-50[ .column[.content[ ```r data("genderweight", package="datarium") ggplot(genderweight, aes(group, weight)) + geom_point() + geom_jitter() + theme_bw(base_size=18) ``` ![](lecture07_2020JC_files/figure-html/unnamed-chunk-37-1.png)<!-- --> ]] .column.bg-brand-gray[.content[ ```r head(genderweight,5) ``` ``` # A tibble: 5 x 3 id group weight <fct> <fct> <dbl> 1 1 F 61.6 2 2 F 64.6 3 3 F 66.2 4 4 F 59.3 5 5 F 64.9 ``` ```r tail(genderweight,5) ``` ``` # A tibble: 5 x 3 id group weight <fct> <fct> <dbl> 1 36 M 82.6 2 37 M 77.0 3 38 M 81.6 4 39 M 87.4 5 40 M 86.4 ``` ]]]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Data summary ]] .row[.content[ n obs: 40, n variables: 3 ``` missing mean sd p0 p25 p50 p75 p100 weight 0 74.66 11.79 59.31 62.96 72.92 62.96 95.06 weightF 0 63.50 2.03 59.31 62.30 62.94 62.30 68.83 weightM 0 85.83 4.35 77.01 83.17 86.27 83.17 95.06 ``` ![](lecture07_2020JC_files/figure-html/unnamed-chunk-40-1.png)<!-- --> ```r #skimr::skim(genderweight) genderweight %>% group_by(group) %>% summarise(n=n(), mean=mean(weight)) ``` ``` # A tibble: 2 x 3 group n mean <fct> <int> <dbl> 1 F 20 63.5 2 M 20 85.8 ``` ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: modelling weight by gender ]] .row[.split-40[ .column[.content[ We consider the following model: `$$\texttt{weight}_i = \begin{cases} \mu_F + \epsilon_i, \quad \text{if female}\\ \mu_M + \epsilon_i, \quad \text{if male} \end{cases}$$` which is equivalent to: 
`$$\texttt{weight}_i = \beta_1x_{1i}+\beta_2x_{2i} + \epsilon_i$$` where * `\(x_{1i}\)` is 1 for female, 0 otherwise; * `\(x_{2i}\)` is 1 for male and 0 otherwise; * `\(\beta_1=\mu_F\)` and `\(\beta_2=\mu_M\)`; * assuming `\(\epsilon_i \sim NID(0, \sigma^2)\)`. Note there is no intercept in this model. ]] .column.bg-brand-gray[.content[ * In R, include `-1` or `0` in the model formula to remove the intercept from the model; otherwise it is included by default. * If a term in the model formula is a `factor`, R expands it into dummy variables ( `\(\boldsymbol{x}_{1}\)` and `\(\boldsymbol{x}_{2}\)` ) for you. ```r fitw <- lm(weight ~ -1 + group, data=genderweight) print(broom::tidy(fitw), digits = 4) ``` ``` term estimate std.error statistic p.value 1 groupF 63.50 0.7593 83.62 1.079e-44 2 groupM 85.83 0.7593 113.03 1.203e-49 ``` .blockquote.bg-white[ What do you notice about the estimate here? Why? ] ]]]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Including intercept with categorical variable ]] .row[.split-50[ .column[.content[ Now consider the following model `$$\texttt{weight}_i = \beta_0 + \beta_1x_{1i}+\beta_2x_{2i} + \epsilon_i$$` * `\(x_{1i}\)` is 1 for female, 0 otherwise; * `\(x_{2i}\)` is 1 for male and 0 otherwise; and * assuming `\(\epsilon_i \sim NID(0, \sigma^2)\)`. ```r fitw1 <- lm(weight ~ group, data=genderweight) print(broom::tidy(fitw1), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) 63.50 0.7593 83.62 1.079e-44 2 groupM 22.33 1.0739 20.79 2.331e-22 ``` ```r c(sum(fitw1$coeff),fitw$coeff[2]) #mu_m=b0+b2 in M2 ``` ``` groupM 85.82612 85.82612 ``` ]] .column.bg-brand-gray[.content[ * Notice there is no result for `\(\hat{\beta}_1\)` and now `\(\mu_F=\beta_0\)` and `\(\mu_M=\beta_0 + \beta_2\)`! * This is because we have the constraint `\(\beta_1=0\)` here while before the constraint was `\(\beta_0=0\)`.
* Recall that the LSE is given by `\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{Y}\)` *if* `\((\mathbf{X}^\top\mathbf{X})^{-1}\)` exists. * If the model includes an intercept and all levels of a categorical variable, `\((\mathbf{X}^\top\mathbf{X})^{-1}\)` does not exist and, to obtain a unique solution, a constraint must be applied. * Consequently, you should be careful in your interpretation if your data contain categorical variables. * If the model includes an intercept and one categorical variable then R will constrain the coefficient of the first level of the categorical variable to zero by default. ]]]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Design matrix for categorical variable ]] .row[.content[ Just some additional information. * The data and design matrices `\(\boldsymbol{X}\)` for the model without an intercept (M1) and the model with an intercept (M2) are `$$\text{data}= \begin{pmatrix} F & 61.6\\ F & 64.6\\ \vdots & \vdots\\ M & 87.4\\ M & 86.4\\ \end{pmatrix}, \boldsymbol{X}_{M1}= \begin{pmatrix} 1 & 0\\ 1 & 0\\ \vdots & \vdots \\ 0 & 1\\ 0 & 1 \\ \end{pmatrix}, \boldsymbol{X}_{M2}= \begin{pmatrix} 1 & 0\\ 1 & 0\\ \vdots & \vdots \\ 1 & 1\\ 1 & 1\\ \end{pmatrix}$$` * This explains why `\(se_{\tiny M1}(\hat \beta_1)=se_{\tiny M1}(\hat \beta_2)\)` (M1 has no intercept), as the second column of `\(\boldsymbol{X}_{M1}\)` for `\(\beta_2\)` is 1 minus the first column for `\(\beta_1\)`. Hence their variabilities are the same. * Also, `\(se_{\tiny M2}(\hat \beta_0)=se_{\tiny M1}(\hat \beta_1)\)` and `\(se_{\tiny M2}(\hat \beta_2)=\sqrt{Var_{\tiny M1}(\hat \beta_1)+Var_{\tiny M1}(\hat \beta_2)}=\sqrt{2}se_{\tiny M2}(\hat \beta_0)\)`. ]]
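---

### Seeing the design matrices in R

The matrices `\(\boldsymbol{X}_{M1}\)` and `\(\boldsymbol{X}_{M2}\)` can be inspected directly with `model.matrix()`. A minimal sketch with toy data (the weights below are made up for illustration, not the `genderweight` values):

```r
# Toy two-group data mimicking the genderweight structure (made-up weights)
g <- factor(rep(c("F", "M"), each = 3))
w <- c(61, 64, 66, 82, 87, 86)
X1 <- model.matrix(w ~ -1 + g)   # M1: one dummy column per level, no intercept
X2 <- model.matrix(w ~ g)        # M2: intercept plus a dummy for level "M" only
qr(cbind(X2[, 1], X1))$rank      # rank 2, not 3: the three columns are dependent
```

Note that M2's intercept column equals the sum of M1's two dummy columns, which is exactly why including an intercept together with dummies for all levels makes `\(\mathbf{X}^\top\mathbf{X}\)` singular.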