class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# Simple Linear Regression: <br>Maximum Likelihood Estimation

## .black[STAT3022 Applied Linear Models Lecture 3]

<br><br><br>

### .black[2020/02/15]
]]
.column.bg-brand-charcoal[.content.white[
## Today

1. Revising the least squares estimates for simple linear regression.
1. Maximum likelihood estimates for the regression parameters.
1. Moment estimate for the error variance.
1. Some properties of linear models.
1. `\(R^2\)` - the coefficient of determination.
1. Using the .yellow[tidy] function from .yellow[broom] to turn model objects into tidy data frames (tibbles).
]]

---

layout: true
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Simple Linear Regression
]]
.row[.split-two[
.column[.content[
Simple linear regression seeks to model the relationship between

* the mean of a .brand-blue[response variable], `\(Y\)`, and
* a single .brand-blue[explanatory variable] (or .brand-blue[predictor]/.brand-blue[covariate]) `\(x\)`.

For data `\((x_1, Y_1), \ldots , (x_n, Y_n)\)`, where `\(x_1,\ldots,x_n\)` are known constants and `\(Y_1,\ldots,Y_n\)` are random responses with observed values `\(y_1,\ldots,y_n\)`, we formulate the .brand-blue[simple linear regression model] as

`$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$`

.blockquote[
Why is a capital letter used for the response but a lower case letter for the explanatory variable?
]
]]
.column.bg-brand-gray[.content[
#

* `\(\beta_0, \beta_1\)` are .brand-blue[regression parameters].
* `\(\epsilon_1, \ldots ,\epsilon_n\)` are .brand-blue[error terms], satisfying the following assumptions.

.padleft50px.padright20px[
(A1) `\(E[\epsilon_i] = 0\)` for `\(i=1,\ldots,n\)`.

(A2) `\(\epsilon_1, \ldots ,\epsilon_n\)` are .brand-blue[independent].

(A3) `\(Var(\epsilon_i) = \sigma^2\)` for `\(i=1,\ldots,n\)`, which is called the .brand-blue[homoscedasticity assumption].

(A4) `\(\epsilon_1, \ldots ,\epsilon_n\)` are .brand-blue[normally distributed].

.content-box-blue[
The concise version of (A1)-(A4) is: `\(\epsilon_i \sim NID(0,\sigma^2)\)`, `\(i=1,\ldots,n\)`.
]]
]]
]]

---

class: show-10

---

count: false

---

layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Example: Body weight versus brain weight
]]
.row[.content[
* It is of interest to know whether brain weight for different mammal species truly depends on body weight.
* View brain weight as the .brand-blue[response] variable and body weight as the .brand-blue[predictor] variable.

```r
dat <- read.csv("data/sleep.csv")
str(dat) # or dplyr::glimpse(dat)
```

```
'data.frame':   62 obs. of  11 variables:
 $ Species: Factor w/ 62 levels "African_elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SWS    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
 $ PS     : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
 $ TS     : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
 $ BodyWt : num  6654 1 3.38 0.92 2547 ...
 $ BrainWt: num  5712 6.6 44.5 5.7 4603 ...
 $ Life   : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
 $ GP     : num  645 42 60 25 624 180 35 392 63 230 ...
 $ P      : int  3 3 1 5 3 4 1 4 1 1 ...
 $ SE     : int  5 1 1 2 5 4 1 5 2 1 ...
 $ D      : int  3 3 1 3 4 4 1 4 1 1 ...
```

.bottom_abs.width100[
Data from Allison, T. and Cicchetti, D. V. (1976). Sleep in mammals: ecological and constitutional correlates. *Science* 194:732-734.
]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Check your data first - Part I
]]
.row[.content[

```r
dat %>% select(BodyWt, BrainWt) %>% skimr::skim()
```

<img src="images/L3Skim.png" width="80%" height="80%">

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-5-1.png" width="1152" />

]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Check your data first - Part II
]]
.row[.split-two.with-border[
.column[.content[

```r
p <- dat %>% ggplot(aes(BodyWt, BrainWt)) +
  geom_point() +
  theme_bw(base_size=20) +
  labs(x="Body Weight (kg)", y="Brain Weight (g)")
p
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />
]]
.column[.content[
Note that the log-transformed data result in a more homogeneous scatterplot.

```r
p + coord_trans(x = "log10", y = "log10")
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-7-1.png" width="360" style="display: block; margin: auto;" />
]]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Which line looks best to you?
]]
.row[.content[
<br>
<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-8-1.png" width="1080" style="display: block; margin: auto;" />
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Parameter Estimation: Least Squares Estimates
]]
.row[
.split-two[
.column[.content[

```
`geom_smooth()` using formula 'y ~ x'
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-9-1.png" width="360" style="display: block; margin: auto;" />

* **Predicted/Fitted value**: the output of the model function, which gives the typical value of the response variable conditional on the explanatory variables.
* **Residual** = Observed value - Predicted value.
]]
.column.bg-brand-gray[.content[
]]
]
]]

---

count: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Parameter Estimation: Least Squares Estimates
]]
.row[
.split-two[
.column[.content[

```
`geom_smooth()` using formula 'y ~ x'
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-10-1.png" width="360" style="display: block; margin: auto;" />

* **Predicted/Fitted value**: the output of the model function, which gives the typical value of the response variable conditional on the explanatory variables.
* **Residual** = Observed value - Predicted value.
]]
.column.bg-brand-gray[.content[
The least squares estimates (.brand-blue[LSEs]), `\(\hat \beta_0\)` and `\(\hat \beta_1\)`, are the values that minimize the sum of squares

`\begin{eqnarray*}
S(\beta_0, \beta_1) &=& \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2. \label{eq:ss}
\end{eqnarray*}`

The least squares estimators `\(\hat \beta_0\)` and `\(\hat \beta_1\)` are given by

`\begin{equation}
\hat \beta_0 = \bar Y - \hat \beta_1 \bar x \quad \text{and}\quad \hat \beta_1 = \frac{S_{xy}}{S_{xx}}, \label{eq:beta}
\end{equation}`

where

`\begin{eqnarray*}
S_{xy} &=& \sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y) \\
&=& \sum x_i Y_i - n\bar{Y}\bar{x} \\
S_{xx} &=& \sum_{i=1}^n (x_i - \bar x)^2 = \sum x^2_i - n\bar{x}^2.
\end{eqnarray*}`
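
As a quick numerical check (a sketch, assuming numeric vectors `x` and `y` hold the predictor and the response), a general-purpose optimiser lands on the same values as the closed-form solution:

```r
# Sum of squares S(beta0, beta1) for data x, y
S <- function(b) sum((y - b[1] - b[2] * x)^2)
# Nelder-Mead minimisation; agrees with the closed-form LSEs above
optim(c(0, 0), S)$par
```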
]]
]
]]

---

layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to do this in R
]]
.row[.content[
To fit the model

`$$\log(\texttt{BrainWt}_i)=\beta_0 + \beta_1\log(\texttt{BodyWt}_i) + \epsilon_i$$`

we use `lm()`, and extract the regression parameter estimates, `\(\hat\beta_0\)` and `\(\hat\beta_1\)`, with `coef()`:

```r
fit <- lm(log(BrainWt) ~ log(BodyWt), data=dat)
coef(fit) # or fit$coef
```

```
(Intercept) log(BodyWt) 
  2.1347887   0.7516859 
```

The same values follow from the closed-form formulas:

```r
y <- log(dat$BrainWt); x <- log(dat$BodyWt)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(a, b)
```

```
[1] 2.1347887 0.7516859
```

The fitted model, with parameter estimates rounded to 3 decimal places, is given by

`$$\hat{\log(\texttt{BrainWt})}=2.135 + 0.752\log(\texttt{BodyWt}).$$`
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to do this in R
]]
.row[.split-50[
.column[.content[

```r
summary(fit)
```

```
Call:
lm(formula = log(BrainWt) ~ log(BodyWt), data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71550 -0.49228 -0.06162  0.43597  1.94829 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.13479    0.09604   22.23   <2e-16 ***
log(BodyWt)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared:  0.9208,  Adjusted R-squared:  0.9195 
F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16
```
]]
.column.bg-brand-gray[.content[

```r
broom::tidy(fit)
```

```
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    2.13     0.0960      22.2 1.18e-30
2 log(BodyWt)    0.752    0.0285      26.4 9.84e-35
```
]]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Parameter Estimation: Maximum Likelihood Estimate
]]
.row[.content[
* Consider a simple linear regression model `\(Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)`, assuming errors `\(\epsilon_i\sim NID(0,\sigma^2)\)`.
* Under these assumptions, the .brand-blue[joint density] of the independent random responses `\(Y_1,\ldots,Y_n\)`, evaluated at (the observed values) `\(\boldsymbol{y}^\top=(y_1,\ldots,y_n)\)`, is
`$$\begin{eqnarray}
f(\boldsymbol{y};\beta_0,\beta_1,\sigma) &=& \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_1-\beta_0-\beta_1 x_1)^2}{2\sigma^2}}\times\cdots\times \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_n-\beta_0-\beta_1 x_n)^2}{2\sigma^2}}\nonumber\\
&=& \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2}\label{eqn:MLE}
\end{eqnarray}$$`
* The method of .brand-blue[maximum likelihood] is so called because it finds the parameter values `\(\hat\beta_0\)`, `\(\hat\beta_1\)` and `\(\hat\sigma\)` that .brand-blue[maximise] the joint density (likelihood).
* One can show (Week 2 Tutorial) that maximising the likelihood over `\(\beta_0\)` and `\(\beta_1\)` does not involve `\(\sigma\)` and is achieved by minimising `\(\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2\)`, because the exponent of the likelihood contains `\(-\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2\)` (scaled by the constant `\(1/2\sigma^2\)`).
* In this (special) case the method of maximum likelihood gives the same parameter estimates as the method of least squares. How about `\(\sigma^2\)`?
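
A numerical check of this (a sketch, reusing `x` and `y` defined earlier): maximising the likelihood over all three parameters recovers the least squares line, and the implied estimate of `\(\sigma^2\)` turns out to be `\(\text{RSS}/n\)` rather than the unbiased `\(\text{RSS}/(n-2)\)` seen shortly.

```r
# Negative log-likelihood in (beta0, beta1, log(sigma));
# parameterising by log(sigma) keeps the search unconstrained.
negloglik <- function(theta) {
  -sum(dnorm(y, mean = theta[1] + theta[2] * x, sd = exp(theta[3]), log = TRUE))
}
mle <- optim(c(0, 0, 0), negloglik)$par
mle[1:2]         # approximately coef(fit)
exp(2 * mle[3])  # MLE of sigma^2: RSS/n, slightly smaller than s^2 = RSS/(n - 2)
```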
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Sampling distribution of `\(\hat \beta_0\)` and `\(\hat \beta_1\)`
]]
.row[.content[
.split-30[
.row[.content[
From the error assumptions (A1)-(A4) it follows that
`$$\begin{equation}
\hat\beta_0 \sim N \left(\beta_0 , \, \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right ] \right ) \quad\text{and} \quad \hat \beta_1 \sim N \left(\beta_1 , \, \frac{\sigma^2}{S_{xx}} \right ). \label{eq:b1samp-sig}
\end{equation}$$`
We will go through the derivation in a future lecture, but we can already form standard error estimates from these distributions.
]]
.row[
.split-two[
.column.bg-gray[.content[
]]
.column.bg-brand-gray[.content[
]]
]]]
]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Fitted regression line
]]
.row[.content[
Since `\(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x}\)`, we have `\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = \bar{Y} + \hat{\beta}_1(x-\bar{x}).\)`<br>
Thus, the regression line passes through the component-wise mean `\((\bar{x},\bar{Y}).\)`

# Correlation coefficient and regression slope

Recall that the .brand-blue[Pearson correlation coefficient] between vectors `\(x\)` and `\(y\)` is

`$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \;\in\; [-1,1],\;\; \text{where } S_{yy} = \sum_i (y_i-\bar{y})^2.$$`

Therefore,

`$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}=r \sqrt{S_{yy}/S_{xx}},$$`

and `\(\hat{\beta}_1\)` has the .brand-blue[same sign] as `\(r\)`, i.e. both positive or both negative.

```r
cor(log(dat$BrainWt), log(dat$BodyWt))
```

```
[1] 0.9595748
```
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Estimating the error variance `\(\sigma^2\)`
]]
.row[.content[
* The error variance, `\(\sigma^2\)`, is also an unknown parameter.
* An estimator can be obtained by using the .brand-blue[residual sum of squares (RSS)]:
`$$\begin{equation}
\text{RSS} = \sum_{i=1}^n (Y_i - \widehat Y_i)^2 = \sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 x_i )^2 \; , \label{eq:rss}
\end{equation}$$`
where
  * `\(\widehat Y_i = \hat \beta_0 + \hat \beta_1 x_i\)` is the `\(i\)`th .brand-blue[fitted value] and `\(p=2\)` is the .brand-blue[number of regression parameters to estimate];
  * `\(R_i = Y_i - \widehat Y_i\)` is the `\(i\)`th .brand-blue[residual].

The expected value of a `\(\chi^2\)` random variable equals its degrees of freedom. We have

`$$\frac{\text{RSS}}{\sigma^2}=\sum_{i=1}^n \left(\frac{Y_i - \widehat Y_i}{\sigma}\right)^2\sim \chi^2_{n-p} \quad \Rightarrow \quad E(\text{RSS})=(n-p)\sigma^2.$$`

* Thus, an *unbiased estimate* of `\(\sigma^2\)` is given by
`$$\begin{equation}
s^2 = \frac{1}{(n-p)} \text{RSS}. \label{eq:sighat}
\end{equation}$$`
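
With `\(s\)` in hand, we can assemble the standard errors of `\(\hat\beta_0\)` and `\(\hat\beta_1\)` from the sampling distributions seen a few slides back (a sketch, reusing `x`, `y` and `fit` from earlier):

```r
n   <- length(x)
Sxx <- sum((x - mean(x))^2)
s   <- sqrt(sum(residuals(fit)^2) / (n - 2))  # s = sqrt(RSS / (n - p))
c(s * sqrt(1/n + mean(x)^2 / Sxx),            # standard error of beta0-hat
  s / sqrt(Sxx))                              # standard error of beta1-hat
# should match the Std. Error column of summary(fit): 0.0960 and 0.0285
```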
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to do this in R
]]
.row[.content[
Extracting .brand-blue[residuals]:

```r
residuals(fit) %>% # or fit$residuals
  head()
```

```
         1          2          3          4          5          6 
-0.1015355 -0.2477190  0.7441293 -0.3316457  0.4044490  1.2843199 
```

Extracting the .brand-blue[residual sum of squares] the easy way:

```r
c(deviance(fit), deviance(fit)/(length(dat$BodyWt)-2)) # or sum(residuals(fit)^2)/(n-2)
```

```
[1] 28.9227104  0.4820452
```

Extracting the .brand-blue[estimated variance of the errors] the easy way:

```r
sigma(fit)^2 # remove the square to get the standard deviation
```

```
[1] 0.4820452
```

Thus `\(s^2 = \hat{\sigma}^2 = 0.482 \Rightarrow s = \hat\sigma = 0.694.\)`
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Residuals as error estimators
]]
.row[.content[
* The residuals `\(R_i\)` can be thought of as estimates, `\(\hat\epsilon_i\)`, of the error terms `\(\epsilon_i\)` in the model.
* **The residuals sum up to 0**:
`\begin{eqnarray*}
\sum_{i=1}^n R_i &=& \sum_{i=1}^n (Y_i -\hat{Y}_i) = \sum_{i=1}^n (Y_i -\bar{Y} - \hat{\beta}_1(x_i-\bar{x})) \\
&=& \sum_{i=1}^n (Y_i -\bar{Y}) - \hat{\beta}_1\sum_{i=1}^n(x_i-\bar{x}) =0
\end{eqnarray*}`

```r
mean(fit$res)
```

```
[1] 1.457552e-18
```

```r
all.equal(mean(fit$res), 0)
```

```
[1] TRUE
```

{{content}}
]]

--

* Note: random variables that sum to a constant cannot be independent!

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Explaining the variability in the `\(Y\)`'s
]]
.row[.content[
* If we ignore `\(x\)`, the total variability in the `\(Y\)` sample is `\(S_{yy} = \sum (Y_i-\bar{Y})^2\)`.
* But `\(S_{yy} = \text{RSS} + \hat{\beta}_1^2 S_{xx}\)`. .brand-blue[Reason]: the alternative representation of `\(R_i\)` above gives
`\begin{eqnarray*}
\text{RSS} &=& \sum_{i=1}^n (Y_i -\hat{Y}_i)^2 = \sum_{i=1}^n (Y_i -\bar{Y} - \hat{\beta}_1(x_i-\bar{x}))^2 = \ldots \\
&=& S_{yy} + \hat{\beta}_1^2 S_{xx} - 2\hat{\beta}_1S_{xy} = S_{yy} + \hat{\beta}_1^2 S_{xx} - 2\hat{\beta}_1 \hat{\beta}_1 S_{xx} = S_{yy} - \hat{\beta}_1^2 S_{xx},
\end{eqnarray*}`
since `\(\hat{\beta}_1 = S_{xy}/S_{xx}\)`.
* The component `\(\hat{\beta}_1^2 S_{xx}\)` is called the .brand-blue[linear regression sum of squares], and it explains a proportion `\(R^2\)` of the total variability in `\(Y\)`.
* `\(R^2\)` is called the .brand-blue[coefficient of determination]:
`$$R^2 = \frac{\hat{\beta}_1^2 S_{xx}}{S_{yy}} = \frac{S^2_{xy}}{S_{xx}S_{yy}} = r^2$$`
]]

---

layout: true
class: split-60

.column.bg-brand-red.white[.content[
# Summary

* Check your data first before fitting a model.
* The **maximum likelihood estimates** and **least squares estimates** of the regression parameters in the model `\(Y_i = \beta_0 + \beta_1x_i + \epsilon_i\)`, `\(\epsilon_i\sim NID(0, \sigma^2)\)`, are the same.
* Computing standard errors of the regression parameters.
* The regression line passes through the component-wise mean.
* The correlation coefficient and the slope have the same sign.
* Estimating the error variance.
* Residuals as error estimates: the errors are assumed independent, but the residuals are not.
* Coefficient of determination: the proportion of variability explained by the model.
]]
.column[.content[
# Next lesson

* Model diagnostics
]]

---

class: show-10

---

count: false
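
---

layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Appendix: checking the decomposition in R
]]
.row[.content[
A minimal numerical check of `\(S_{yy} = \text{RSS} + \hat{\beta}_1^2 S_{xx}\)` and of `\(R^2 = r^2\)` (a sketch, reusing `x`, `y` and `fit` from earlier):

```r
Syy <- sum((y - mean(y))^2)
Sxx <- sum((x - mean(x))^2)
b1  <- unname(coef(fit)[2])
all.equal(Syy, deviance(fit) + b1^2 * Sxx)  # variance decomposition holds
c(b1^2 * Sxx / Syy, cor(x, y)^2)            # both give R^2 = 0.9208
```
]]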