class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# Simple Linear Regression: <br>Maximum Likelihood Estimation

## .black[STAT3022 Applied Linear Models Lecture 3]

<br><br><br>

### .black[2020/02/15]
]]
.column.bg-brand-charcoal[.content.white[
## Today

1. Revising the least squares estimates for simple linear regression.
1. Maximum likelihood estimates for the regression parameters.
1. Moment estimate for the error variance.
1. Some properties of linear models.
1. `\(R^2\)` - the coefficient of determination.
1. Using the .yellow[tidy] function from .yellow[broom] to turn model objects into tidy data frames (tibbles).
]]

---

layout: true
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Simple Linear Regression
]]
.row[.split-two[
.column[.content[
Simple linear regression seeks to model the relationship between

* the mean of a .brand-blue[response variable], `\(Y\)`, and
* a single .brand-blue[explanatory variable] (or .brand-blue[predictor]/.brand-blue[covariate]) `\(x\)`.

For data `\((x_1, Y_1), \ldots , (x_n, Y_n)\)`, where `\(x_1,\ldots,x_n\)` are known constants and `\(Y_1,\ldots,Y_n\)` are random responses with observed values `\(y_1,\ldots,y_n\)`, we formulate the .brand-blue[simple linear regression model] as

`$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$`

.blockquote[
Why is a capital letter used for the response but a lower case letter for the explanatory variable?
]
]]
.column.bg-brand-gray[.content[
#

* `\(\beta_0, \beta_1\)` are .brand-blue[regression parameters].
* `\(\epsilon_1, \ldots ,\epsilon_n\)` are .brand-blue[error terms], satisfying the following assumptions.

.padleft50px.padright20px[
(A1) `\(E[\epsilon_i] = 0\)` for `\(i=1,\ldots,n\)`.

(A2) `\(\epsilon_1, \ldots ,\epsilon_n\)` are .brand-blue[independent].

(A3) `\(Var(\epsilon_i) = \sigma^2\)` for `\(i=1,\ldots,n\)`, which is called the .brand-blue[homoscedasticity assumption].

(A4) `\(\epsilon_1, \ldots ,\epsilon_n\)` are .brand-blue[normally distributed].

.content-box-blue[
The concise version of (A1)-(A4) is: `\(\epsilon_i \sim NID(0,\sigma^2)\)`, `\(i=1,\ldots,n\)`.
]]
]]
]]

---

class: show-10

---

count: false

---

layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Example: Body weight versus brain weight
]]
.row[.content[
* It is of interest to know whether brain weight for different mammal species truly depends on body weight.
* View brain weight as the .brand-blue[response] variable and body weight as the .brand-blue[predictor] variable.

```r
dat <- read.csv("data/sleep.csv")
str(dat) # or dplyr::glimpse(dat)
```

```
'data.frame':   62 obs. of  11 variables:
 $ Species: Factor w/ 62 levels "African_elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SWS    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
 $ PS     : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
 $ TS     : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
 $ BodyWt : num  6654 1 3.38 0.92 2547 ...
 $ BrainWt: num  5712 6.6 44.5 5.7 4603 ...
 $ Life   : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
 $ GP     : num  645 42 60 25 624 180 35 392 63 230 ...
 $ P      : int  3 3 1 5 3 4 1 4 1 1 ...
 $ SE     : int  5 1 1 2 5 4 1 5 2 1 ...
 $ D      : int  3 3 1 3 4 4 1 4 1 1 ...
```

.bottom_abs.width100[
Data from Allison, T. and Cicchetti, D. V. (1976). Sleep in mammals: ecological and constitutional correlates. *Science* 194:732-734.
]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Check your data first - Part I
]]
.row[.content[

```r
dat %>% select(BodyWt, BrainWt) %>% skimr::skim()
```

<img src="images/L3Skim.png" width="80%" height="80%">

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-5-1.png" width="1152" />

]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Check your data first - Part II
]]
.row[.split-two.with-border[
.column[.content[

```r
p <- dat %>% ggplot(aes(BodyWt, BrainWt)) +
  geom_point() +
  theme_bw(base_size=20) +
  labs(x="Body Weight (kg)", y="Brain Weight (g)")
p
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />
]]
.column[.content[
Note that the log-transformed data result in a more homogeneous scatterplot.

```r
p + coord_trans(x = "log10", y = "log10")
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-7-1.png" width="360" style="display: block; margin: auto;" />
]]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Which line looks best to you?
]]
.row[.content[
<br>
<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-8-1.png" width="1080" style="display: block; margin: auto;" />
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Parameter Estimation: Least Squares Estimates
]]
.row[
.split-two[
.column[.content[

```
`geom_smooth()` using formula 'y ~ x'
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-9-1.png" width="360" style="display: block; margin: auto;" />

* **Predicted/Fitted value**: the output of the model function, which gives the typical value of the response variable conditional on the explanatory variables.
* **Residual** = Observed value - Predicted value.
]]
.column.bg-brand-gray[.content[
]]
]
]]

---

count: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Parameter Estimation: Least Squares Estimates
]]
.row[
.split-two[
.column[.content[

```
`geom_smooth()` using formula 'y ~ x'
```

<img src="lecture03_2020JC_files/figure-html/unnamed-chunk-10-1.png" width="360" style="display: block; margin: auto;" />

* **Predicted/Fitted value**: the output of the model function, which gives the typical value of the response variable conditional on the explanatory variables.
* **Residual** = Observed value - Predicted value.
]]
.column.bg-brand-gray[.content[
The least squares estimates (.brand-blue[LSEs]), `\(\hat \beta_0\)` and `\(\hat \beta_1\)`, are the values that minimize the sum of squares

`\begin{eqnarray*}
S(\beta_0, \beta_1) &=& \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2. \label{eq:ss}
\end{eqnarray*}`

The least squares estimators `\(\hat \beta_0\)` and `\(\hat \beta_1\)` are given by

`\begin{equation}
\hat \beta_0 = \bar Y - \hat \beta_1 \bar x \quad \text{and}\quad \hat \beta_1 = \frac{S_{xy}}{S_{xx}}, \label{eq:beta}
\end{equation}`

where

`\begin{eqnarray*}
S_{xy} &=& \sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y) \\
&=& \sum x_i Y_i - n\bar{Y}\bar{x} \\
S_{xx} &=& \sum_{i=1}^n (x_i - \bar x)^2 = \sum x^2_i - n\bar{x}^2.
\end{eqnarray*}`
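
As a quick numerical check (a sketch, assuming numeric vectors `x` and `y` hold the predictor and the response), a general-purpose optimiser lands on the same values as the closed-form solution:

```r
# Sum of squares S(beta0, beta1) for data x, y
S <- function(b) sum((y - b[1] - b[2] * x)^2)
# Nelder-Mead minimisation; agrees with the closed-form LSEs above
optim(c(0, 0), S)$par
```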
]]
]
]]

---

layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to do this in R
]]
.row[.content[
To fit the model

`$$\log(\texttt{BrainWt}_i)=\beta_0 + \beta_1\log(\texttt{BodyWt}_i) + \epsilon_i$$`

we use `lm()`, and extract the regression parameter estimates, `\(\hat\beta_0\)` and `\(\hat\beta_1\)`, with `coef()`:

```r
fit <- lm(log(BrainWt) ~ log(BodyWt), data=dat)
coef(fit) # or fit$coef
```

```
(Intercept) log(BodyWt) 
  2.1347887   0.7516859 
```

The same values follow from the closed-form formulas:

```r
y <- log(dat$BrainWt); x <- log(dat$BodyWt)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(a, b)
```

```
[1] 2.1347887 0.7516859
```

The fitted model, with parameter estimates rounded to 3 decimal places, is given by

`$$\hat{\log(\texttt{BrainWt})}=2.135 + 0.752\log(\texttt{BodyWt}).$$`
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to do this in R
]]
.row[.split-50[
.column[.content[

```r
summary(fit)
```

```
Call:
lm(formula = log(BrainWt) ~ log(BodyWt), data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71550 -0.49228 -0.06162  0.43597  1.94829 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.13479    0.09604   22.23   <2e-16 ***
log(BodyWt)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared:  0.9208,  Adjusted R-squared:  0.9195 
F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16
```
]]
.column.bg-brand-gray[.content[

```r
broom::tidy(fit)
```

```
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    2.13     0.0960      22.2 1.18e-30
2 log(BodyWt)    0.752    0.0285      26.4 9.84e-35
```
]]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Parameter Estimation: Maximum Likelihood Estimate
]]
.row[.content[
* Consider a simple linear regression model `\(Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)`, assuming errors `\(\epsilon_i\sim NID(0,\sigma^2)\)`.
* Under these assumptions, the .brand-blue[joint density] of the independent random responses `\(Y_1,\ldots,Y_n\)`, evaluated at (the observed values) `\(\boldsymbol{y}^\top=(y_1,\ldots,y_n)\)`, is
`$$\begin{eqnarray}
f(\boldsymbol{y};\beta_0,\beta_1,\sigma) &=& \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_1-\beta_0-\beta_1 x_1)^2}{2\sigma^2}}\times\cdots\times \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_n-\beta_0-\beta_1 x_n)^2}{2\sigma^2}}\nonumber\\
&=& \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2}\label{eqn:MLE}
\end{eqnarray}$$`
* The method of .brand-blue[maximum likelihood] is so called because it finds the parameter values `\(\hat\beta_0\)`, `\(\hat\beta_1\)` and `\(\hat\sigma\)` that .brand-blue[maximise] the joint density (likelihood).
* One can show (Week 2 Tutorial) that maximising the likelihood over `\(\beta_0\)` and `\(\beta_1\)` does not involve `\(\sigma\)` and is achieved by minimising `\(\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2\)`, because the exponent of the likelihood contains `\(-\sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2\)` (scaled by the constant `\(1/2\sigma^2\)`).
* In this (special) case the method of maximum likelihood gives the same parameter estimates as the method of least squares. How about `\(\sigma^2\)`?
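
A numerical check of this (a sketch, reusing `x` and `y` defined earlier): maximising the likelihood over all three parameters recovers the least squares line, and the implied estimate of `\(\sigma^2\)` turns out to be `\(\text{RSS}/n\)` rather than the unbiased `\(\text{RSS}/(n-2)\)` seen shortly.

```r
# Negative log-likelihood in (beta0, beta1, log(sigma));
# parameterising by log(sigma) keeps the search unconstrained.
negloglik <- function(theta) {
  -sum(dnorm(y, mean = theta[1] + theta[2] * x, sd = exp(theta[3]), log = TRUE))
}
mle <- optim(c(0, 0, 0), negloglik)$par
mle[1:2]         # approximately coef(fit)
exp(2 * mle[3])  # MLE of sigma^2: RSS/n, slightly smaller than s^2 = RSS/(n - 2)
```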
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Sampling distribution of `\(\hat \beta_0\)` and `\(\hat \beta_1\)`
]]
.row[.content[
.split-30[
.row[.content[
From the error assumptions (A1)-(A4) it follows that
`$$\begin{equation}
\hat\beta_0 \sim N \left(\beta_0 , \, \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right ] \right ) \quad\text{and} \quad \hat \beta_1 \sim N \left(\beta_1 , \, \frac{\sigma^2}{S_{xx}} \right ). \label{eq:b1samp-sig}
\end{equation}$$`
We will go through the derivation in a future lecture, but we can already form standard error estimates from these distributions.
]]
.row[
.split-two[
.column.bg-gray[.content[
]]
.column.bg-brand-gray[.content[
]]
]]]
]
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Fitted regression line
]]
.row[.content[
Since `\(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x}\)`, we have `\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = \bar{Y} + \hat{\beta}_1(x-\bar{x}).\)`<br>
Thus, the regression line passes through the component-wise mean `\((\bar{x},\bar{Y}).\)`

# Correlation coefficient and regression slope

Recall that the .brand-blue[Pearson correlation coefficient] between vectors `\(x\)` and `\(y\)` is

`$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \;\in\; [-1,1],\;\; \text{where } S_{yy} = \sum_i (y_i-\bar{y})^2.$$`

Therefore,

`$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}=r \sqrt{S_{yy}/S_{xx}},$$`

and `\(\hat{\beta}_1\)` has the .brand-blue[same sign] as `\(r\)`, i.e. both positive or both negative.

```r
cor(log(dat$BrainWt), log(dat$BodyWt))
```

```
[1] 0.9595748
```
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Estimating the error variance `\(\sigma^2\)`
]]
.row[.content[
* The error variance, `\(\sigma^2\)`, is also an unknown parameter.
* An estimator can be obtained by using the .brand-blue[residual sum of squares (RSS)]:
`$$\begin{equation}
\text{RSS} = \sum_{i=1}^n (Y_i - \widehat Y_i)^2 = \sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 x_i )^2 \; , \label{eq:rss}
\end{equation}$$`
where
  * `\(\widehat Y_i = \hat \beta_0 + \hat \beta_1 x_i\)` is the `\(i\)`th .brand-blue[fitted value] and `\(p=2\)` is the .brand-blue[number of regression parameters to estimate];
  * `\(R_i = Y_i - \widehat Y_i\)` is the `\(i\)`th .brand-blue[residual].

The expected value of a `\(\chi^2\)` random variable equals its degrees of freedom. We have

`$$\frac{\text{RSS}}{\sigma^2}=\sum_{i=1}^n \left(\frac{Y_i - \widehat Y_i}{\sigma}\right)^2\sim \chi^2_{n-p} \quad \Rightarrow \quad E(\text{RSS})=(n-p)\sigma^2.$$`

* Thus, an *unbiased estimate* of `\(\sigma^2\)` is given by
`$$\begin{equation}
s^2 = \frac{1}{(n-p)} \text{RSS}. \label{eq:sighat}
\end{equation}$$`
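
With `\(s\)` in hand, we can assemble the standard errors of `\(\hat\beta_0\)` and `\(\hat\beta_1\)` from the sampling distributions seen a few slides back (a sketch, reusing `x`, `y` and `fit` from earlier):

```r
n   <- length(x)
Sxx <- sum((x - mean(x))^2)
s   <- sqrt(sum(residuals(fit)^2) / (n - 2))  # s = sqrt(RSS / (n - p))
c(s * sqrt(1/n + mean(x)^2 / Sxx),            # standard error of beta0-hat
  s / sqrt(Sxx))                              # standard error of beta1-hat
# should match the Std. Error column of summary(fit): 0.0960 and 0.0285
```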
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to do this in R
]]
.row[.content[
Extracting .brand-blue[residuals]:

```r
residuals(fit) %>% # or fit$residuals
  head()
```

```
         1          2          3          4          5          6 
-0.1015355 -0.2477190  0.7441293 -0.3316457  0.4044490  1.2843199 
```

Extracting the .brand-blue[residual sum of squares] the easy way:

```r
c(deviance(fit), deviance(fit)/(length(dat$BodyWt)-2)) # or sum(residuals(fit)^2)/(n-2)
```

```
[1] 28.9227104  0.4820452
```

Extracting the .brand-blue[estimated variance of the errors] the easy way:

```r
sigma(fit)^2 # remove the square to get the standard deviation
```

```
[1] 0.4820452
```

Thus `\(s^2 = \hat{\sigma}^2 = 0.482 \Rightarrow s = \hat\sigma = 0.694.\)`
]]

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Residuals as error estimators
]]
.row[.content[
* The residuals `\(R_i\)` can be thought of as estimates, `\(\hat\epsilon_i\)`, of the error terms `\(\epsilon_i\)` in the model.
* **The residuals sum up to 0**:
`\begin{eqnarray*}
\sum_{i=1}^n R_i &=& \sum_{i=1}^n (Y_i -\hat{Y}_i) = \sum_{i=1}^n (Y_i -\bar{Y} - \hat{\beta}_1(x_i-\bar{x})) \\
&=& \sum_{i=1}^n (Y_i -\bar{Y}) - \hat{\beta}_1\sum_{i=1}^n(x_i-\bar{x}) =0
\end{eqnarray*}`

```r
mean(fit$res)
```

```
[1] 1.457552e-18
```

```r
all.equal(mean(fit$res), 0)
```

```
[1] TRUE
```

{{content}}
]]

--

* Note: random variables that sum to a constant cannot be independent!

---

class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Explaining the variability in the `\(Y\)`'s
]]
.row[.content[
* If we ignore `\(x\)`, the total variability in the `\(Y\)` sample is `\(S_{yy} = \sum (Y_i-\bar{Y})^2\)`.
* But `\(S_{yy} = \text{RSS} + \hat{\beta}_1^2 S_{xx}\)`. .brand-blue[Reason]: the alternative representation of `\(R_i\)` above gives
`\begin{eqnarray*}
\text{RSS} &=& \sum_{i=1}^n (Y_i -\hat{Y}_i)^2 = \sum_{i=1}^n (Y_i -\bar{Y} - \hat{\beta}_1(x_i-\bar{x}))^2 = \ldots \\
&=& S_{yy} + \hat{\beta}_1^2 S_{xx} - 2\hat{\beta}_1S_{xy} = S_{yy} + \hat{\beta}_1^2 S_{xx} - 2\hat{\beta}_1 \hat{\beta}_1 S_{xx} = S_{yy} - \hat{\beta}_1^2 S_{xx},
\end{eqnarray*}`
since `\(\hat{\beta}_1 = S_{xy}/S_{xx}\)`.
* The component `\(\hat{\beta}_1^2 S_{xx}\)` is called the .brand-blue[linear regression sum of squares], and it explains a proportion `\(R^2\)` of the total variability in `\(Y\)`.
* `\(R^2\)` is called the .brand-blue[coefficient of determination]:
`$$R^2 = \frac{\hat{\beta}_1^2 S_{xx}}{S_{yy}} = \frac{S^2_{xy}}{S_{xx}S_{yy}} = r^2$$`
]]

---

layout: true
class: split-60

.column.bg-brand-red.white[.content[
# Summary

* Check your data first before fitting a model.
* The **maximum likelihood estimates** and **least squares estimates** of the regression parameters in the model `\(Y_i = \beta_0 + \beta_1x_i + \epsilon_i\)`, `\(\epsilon_i\sim NID(0, \sigma^2)\)`, are the same.
* Computing standard errors of the regression parameters.
* The regression line passes through the component-wise mean.
* The correlation coefficient and the slope have the same sign.
* Estimating the error variance.
* Residuals as error estimates: the errors are assumed independent, but the residuals are not.
* Coefficient of determination: the proportion of variability explained by the model.
]]
.column[.content[
# Next lesson

* Model diagnostics
]]

---

class: show-10

---

count: false
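
---

layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Appendix: checking the decomposition in R
]]
.row[.content[
A minimal numerical check of `\(S_{yy} = \text{RSS} + \hat{\beta}_1^2 S_{xx}\)` and of `\(R^2 = r^2\)` (a sketch, reusing `x`, `y` and `fit` from earlier):

```r
Syy <- sum((y - mean(y))^2)
Sxx <- sum((x - mean(x))^2)
b1  <- unname(coef(fit)[2])
all.equal(Syy, deviance(fit) + b1^2 * Sxx)  # variance decomposition holds
c(b1^2 * Sxx / Syy, cor(x, y)^2)            # both give R^2 = 0.9208
```
]]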