class: split-70 with-border hide-slide-number bg-brand-red background-image: url("images/USydLogo-black.svg") background-size: 200px background-position: 2% 90% .column.white[.content[ <br><br><br> # Multiple Linear Regression Part II ## .black[STAT3022 Applied Linear Models Lecture 7] <br><br><br> ### .black[2020/02/17] ]] .column.bg-brand-charcoal[.content.white[ ## Today 1. More multiple linear regression 1. Multiple correlation coefficient 1. Coefficient of determination `\(R^2\)` 1. Adjusted `\(R^2\)` 1. Modelling categorical variables ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's look at sales vs. youtube ]] .row[.split-two[ .row[.split-two[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ]]] ] .row[.split-two[ .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> ]]] ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's look at sales vs. 
facebook ]] .row[.split-two[ .row[.split-two[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> ]]] ] .row[.split-two[ .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> ]]] ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Let's look at sales vs. newspaper ]] .row[.split-two[ .row[.split-two[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> ]]] ] .row[.split-two[ .column.bg-brand-gray[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> ]] .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" /> ]]] ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # What is your guess for the best model? 
]] .row[.content[ ```r fit01 <- lm(sales ~ youtube + facebook + newspaper, data=marketing) fit02 <- lm(sales ~ log(youtube) + facebook + newspaper, data=marketing) fit03 <- lm(sales ~ youtube + log(facebook) + newspaper, data=marketing) fit04 <- lm(sales ~ youtube + facebook + log(newspaper), data=marketing) fit05 <- lm(sales ~ log(youtube) + log(facebook) + newspaper, data=marketing) fit06 <- lm(sales ~ log(youtube) + facebook + log(newspaper), data=marketing) fit07 <- lm(sales ~ youtube + log(facebook) + log(newspaper), data=marketing) fit08 <- lm(sales ~ log(youtube) + log(facebook) + log(newspaper), data=marketing) fit09 <- lm(log(sales) ~ youtube + facebook + newspaper, data=marketing) fit10 <- lm(log(sales) ~ log(youtube) + facebook + newspaper, data=marketing) fit11 <- lm(log(sales) ~ youtube + log(facebook) + newspaper, data=marketing) fit12 <- lm(log(sales) ~ youtube + facebook + log(newspaper), data=marketing) fit13 <- lm(log(sales) ~ log(youtube) + log(facebook) + newspaper, data=marketing) fit14 <- lm(log(sales) ~ log(youtube) + facebook + log(newspaper), data=marketing) fit15 <- lm(log(sales) ~ youtube + log(facebook) + log(newspaper), data=marketing) fit16 <- lm(log(sales) ~ log(youtube) + log(facebook) + log(newspaper), data=marketing) ``` {{content}} ]] -- * Should we remove `newspaper`? * Are there other models that we have not considered? How should we select them? {{content}} -- <div style="width:60%"> <blockquote> All models are wrong, but some are useful. -- George Box </blockquote> </div> --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # How about this model with interaction? ]] .row[.content[ ```r final_fit <- lm(log(sales) ~ log(youtube)*facebook, data=marketing[-c(131, 156),]) summary(final_fit) # residual df = n - 2 - p = 200 - 2 - 4 = 194 ``` ``` Call: lm(formula = log(sales) ~ log(youtube) * facebook, data = marketing[-c(131, 156), ]) Residuals: Min 1Q Median 3Q Max -0.118289 -0.026081 0.004445 0.024301 0.098253 Coefficients: Estimate Std.
Error t value Pr(>|t|) (Intercept) 1.1370471 0.0304426 37.351 <2e-16 *** log(youtube) 0.2676378 0.0061356 43.620 <2e-16 *** facebook -0.0002488 0.0008666 -0.287 0.774 log(youtube):facebook 0.0023674 0.0001743 13.584 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.03962 on 194 degrees of freedom Multiple R-squared: 0.989, Adjusted R-squared: 0.9889 F-statistic: 5830 on 3 and 194 DF, p-value: < 2.2e-16 ``` ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Which plot looks unusual to you? ]] .row[.split-70[ .column[.content[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" /> ]] .column.bg-brand-gray[.content[ {{content}} ]] ]] -- * A null generation mechanism similar to that of Lecture 4 was employed here: the errors are sampled from a Normal distribution with mean 0 and variance equal to the estimated error variance from `final_fit`. * The simulated residual plots provide null references to compare with the data residual plot. * The data residual plot is number 18. --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Model diagnostics for the "final" model ]] .row[.content[ <br> <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Dissecting the "final" model ]] .row[.content[ * Observations 131 and 156 were removed as they were deemed outliers (we will cover this in the next lecture).
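As a quick preview of how such observations can be flagged, one common tool is Cook's distance. A minimal sketch on simulated data (not the `marketing` set; the planted outlier at observation 7 is made up for illustration — the same `cooks.distance()` call applies to any `lm` fit):

```r
# Sketch: rank observations by Cook's distance to flag influential points
set.seed(42)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50, sd = 0.3)
d$y[7] <- d$y[7] + 5                 # plant a gross outlier at observation 7
fit_sim <- lm(y ~ x, data = d)
head(sort(cooks.distance(fit_sim), decreasing = TRUE), 3)
```

On this planted example observation 7 should top the ranking; applied to the full marketing fit, an analogous ranking is what singles out 131 and 156.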
```r final_fit <- lm(log(sales) ~ log(youtube)*facebook, data=marketing[-c(131, 156),]) ``` is a shorthand for ```r final_fit <- lm(log(sales) ~ log(youtube) + facebook + log(youtube):facebook, data=marketing[-c(131, 156),]) ``` * The last term, `log(youtube):facebook`, is referred to as the interaction term between `log(youtube)` and `facebook` and is the same as fitting: ```r final_fit2 <- marketing[-c(131,156), ] %>% mutate(logYoutube=log(youtube), logYandF=log(youtube)*facebook) %>% lm(log(sales) ~ logYoutube + facebook + logYandF, data=.) ``` ]] --- ### Indeed they have the same output ```r final_fit %>% broom::tidy() ``` ``` term estimate std.error statistic p.value 1 (Intercept) 1.1370471442 0.0304426181 37.3505045 1.554532e-90 2 log(youtube) 0.2676378453 0.0061356267 43.6202951 3.202419e-102 3 facebook -0.0002487674 0.0008665749 -0.2870697 7.743653e-01 4 log(youtube):facebook 0.0023673639 0.0001742763 13.5839689 5.660864e-30 ``` ```r final_fit2 %>% broom::tidy() ``` ``` term estimate std.error statistic p.value 1 (Intercept) 1.1370471442 0.0304426181 37.3505045 1.554532e-90 2 logYoutube 0.2676378453 0.0061356267 43.6202951 3.202419e-102 3 facebook -0.0002487674 0.0008665749 -0.2870697 7.743653e-01 4 logYandF 0.0023673639 0.0001742763 13.5839689 5.660864e-30 ``` --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Dissecting the "final" model ]] .row[.content[ `$$\hat{\log(\texttt{sales}_i) }= \color{blue}{1.14} + \color{red}{0.268}\times \log(\texttt{youtube}_i) \color{blue}{- 0.000249\times \texttt{facebook}_i} \color{red}{+ 0.00237 \times} \log(\texttt{youtube}_i)\times \color{red}{\texttt{facebook}_i}$$` `$$\hat{\texttt{sales}_i }= \texttt{youtube}_i^{\color{red}{0.268 + 0.00237 \times \texttt{facebook}_i}}\exp\left(\color{blue}{1.14 - 0.000249\times \texttt{facebook}_i}\right)$$` since `\(a \log b=\log b^a\)`. The model is additive on the log scale and multiplicative on the original scale.
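This back-transformation can be sanity-checked numerically. A self-contained sketch on simulated data (`x1` and `x2` are stand-ins for **youtube** and **facebook**, not the real fit): exponentiating the fitted values of the additive log-scale model reproduces the power-times-exponential form exactly, up to floating-point error.

```r
# Sketch: check that the additive log-scale fit equals the multiplicative form
# x1^(b1 + b3 * x2) * exp(b0 + b2 * x2), using simulated data
set.seed(1)
x1 <- runif(100, 1, 300)   # plays the role of youtube
x2 <- runif(100, 0, 60)    # plays the role of facebook
y  <- exp(1.1 + 0.27 * log(x1) + 0.002 * log(x1) * x2 + rnorm(100, sd = 0.04))
fit_log <- lm(log(y) ~ log(x1) * x2)
b <- coef(fit_log)
manual <- x1^(b["log(x1)"] + b["log(x1):x2"] * x2) *
  exp(b["(Intercept)"] + b["x2"] * x2)
all.equal(unname(manual), unname(exp(fitted(fit_log))))  # TRUE
```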
.pull-left[ ```r linedat <- expand.grid(facebook=seq(0, 59.52, 2), youtube=seq(0.84, 355.68, 1)) linedat$sales <- exp(predict(final_fit, linedat)) ggplot(marketing, aes(youtube, sales)) + geom_point(aes(color=facebook)) + theme_bw(base_size=18) + geom_line(data=linedat, aes(youtube, sales, group=facebook, color=facebook), alpha=0.5) + ggtitle("Plotting prediction lines") ``` ] .pull-right[ <img src="lecture07_2020JC_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Explaining the code ]] .row[.split-50[ .column[.content[ ```r linedat<-expand.grid(facebook=seq(0,59.52,2),youtube=seq(0.84,355.68,1)) linedat$sales <- exp(predict(final_fit, linedat)) ``` Set up a grid and calculate **sales** by exponentiating the model predictions at every combination of the **youtube** and **facebook** levels. .brand-blue[linedat] contains two columns, **facebook** and **youtube**, holding all combinations, and a third column **sales** evaluated at these combinations. ]] .column.bg-brand-gray[.content[ ```r ggplot(marketing,aes(youtube,sales))+geom_point(aes(color=facebook)) ``` Map (x, y) to (**youtube**, **sales**), colouring the points by the value of **facebook**. ```r geom_line(data=linedat,aes(youtube,sales,group=facebook,color=facebook),alpha=0.5) ``` Draw lines of **sales** against **youtube** using .brand-blue[linedat], one line for each level of **facebook**, with line colour given by **facebook**. ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Got another model. How does this one look?
]] .row[.content[ ```r skimr::skim(mystery_data) ``` ``` Skim summary statistics n obs: 8406 n variables: 5 -- Variable type:numeric ------------------------------------------------------------------------------------ variable missing n mean sd p0 p25 p50 p75 p100 hist x1 0 8406 0.016 0.99 -4.48 -0.65 0.014 0.69 3.75 ▁▁▂▆▇▅▁▁ x2 0 8406 0.017 1 -3.54 -0.64 0.0076 0.68 3.57 ▁▁▃▇▇▃▁▁ x3 0 8406 0.00066 1 -3.71 -0.68 -0.0048 0.69 3.87 ▁▁▃▇▇▃▁▁ x4 0 8406 0.014 1.99 -7.11 -1.39 0.032 1.39 6.86 ▁▁▃▇▇▅▁▁ y 0 8406 0.045 1.41 -4.67 -0.89 0.052 1.09 3.8 ▁▁▂▆▇▆▃▁ ``` .pull-left[ ![](lecture07_2020JC_files/figure-html/unnamed-chunk-30-1.png)<!-- --> ] .pull-right[ ```r fit0<-lm(y ~ ., data=mystery_data); print(broom::tidy(fit0), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) -0.002044 0.01092 -0.1871 0.8516 2 x1 0.979051 0.01510 64.8375 0.0000 3 x2 0.997928 0.01550 64.3711 0.0000 4 x3 0.972790 0.01543 63.0656 0.0000 5 x4 0.994709 0.01086 91.5907 0.0000 ``` ] ]] -- <div class="bg-brand-red white" style="position:absolute;top:45%;left:30%;padding-left:20px;padding-right:20px"> `$$\hat{Y} = -0.002 + 0.979x_1 + 0.998x_2 + 0.973x_3 + 0.995x_4$$` </div> --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Ha! Jokes - My residual plot is Homer Simpson ]] .row[.split-two[ .column[.content[ ```r lm(y ~ ., data=mystery_data) %>% broom::augment() %>% ggplot(aes(.fitted, .resid)) + geom_point() + labs(x="Fitted Values", y="Residual") + theme(axis.title=element_text(size=20)) ``` .content-box-blue[ ## Moral of the story: ### Don't forget to check model diagnostics.
] ]] .column[.content[ ![](lecture07_2020JC_files/figure-html/unnamed-chunk-33-1.png)<!-- --> ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Goodness of Fit (GoF) criteria ]] .row[.split-two[ .column[.content[ A measure of the goodness of fit of the multiple regression model is `$$R^2 = \frac{S_{yy} - \text{RSS}}{S_{yy}}$$` the proportion of total variability in `\(Y\)` explained by the linear model. * `\(R = \sqrt{R^2}\)` is called the .brand-blue[multiple correlation coefficient], `\(0\leq R \leq 1\)`. * `\(R\)` is in fact the Pearson correlation coefficient between the observed and fitted values, `\(\{Y_i\}\)` and `\(\{\hat{Y}_i\}\)`. ]] .column.bg-brand-gray[.content[ * `\(R^2\)` never decreases as predictors are added, so it is always at least as high for more complex models. * A modified version of `\(R^2\)` that takes model complexity into account when assessing GoF, by adjusting for the degrees of freedom, is the *adjusted* `\(R^2\)`: `\begin{equation*} R_a^2 = 1-\frac{\text{RSS}/(n-p)}{S_{yy}/(n-1)} \end{equation*}` * More parameters increase `\(p\)` and hence the subtracted term. * You can use the function `glance` from `broom` to calculate these values: ```r broom::glance(final_fit)[,1:2] ``` ``` r.squared adj.r.squared 1 0.9890294 0.9888597 ``` ]] ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Find `\(R^2\)` and `\(R^2_a\)` in `summary` output ]] .row[.content[ ``` Call: lm(formula = log(sales) ~ log(youtube) * facebook, data = marketing[-c(131, 156), ]) Residuals: Min 1Q Median 3Q Max -0.118289 -0.026081 0.004445 0.024301 0.098253 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.1370471 0.0304426 37.351 <2e-16 *** log(youtube) 0.2676378 0.0061356 43.620 <2e-16 *** facebook -0.0002488 0.0008666 -0.287 0.774 log(youtube):facebook 0.0023674 0.0001743 13.584 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 Residual standard error: 0.03962 on 194 degrees of freedom Multiple R-squared: 0.989, Adjusted R-squared: 0.9889 F-statistic: 5830 on 3 and 194 DF, p-value: < 2.2e-16 ``` ```r c(summary(final_fit)$r.squared,summary(final_fit)$adj.r.squared) #alternative way ``` ``` [1] 0.9890294 0.9888597 ``` ]] --- class: bg-brand-red middle center white # Modelling categorical variables --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: weight by gender ]] .row[.split-50[ .column[.content[ ```r data("genderweight", package="datarium") ggplot(genderweight, aes(group, weight)) + geom_point() + geom_jitter() + theme_bw(base_size=18) ``` ![](lecture07_2020JC_files/figure-html/unnamed-chunk-37-1.png)<!-- --> ]] .column.bg-brand-gray[.content[ ```r head(genderweight,5) ``` ``` # A tibble: 5 x 3 id group weight <fct> <fct> <dbl> 1 1 F 61.6 2 2 F 64.6 3 3 F 66.2 4 4 F 59.3 5 5 F 64.9 ``` ```r tail(genderweight,5) ``` ``` # A tibble: 5 x 3 id group weight <fct> <fct> <dbl> 1 36 M 82.6 2 37 M 77.0 3 38 M 81.6 4 39 M 87.4 5 40 M 86.4 ``` ]]]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Data summary ]] .row[.content[ n obs: 40, n variables: 3 ``` missing mean sd p0 p25 p50 p75 p100 weight 0 74.66 11.79 59.31 62.96 72.92 62.96 95.06 weightF 0 63.50 2.03 59.31 62.30 62.94 62.30 68.83 weightM 0 85.83 4.35 77.01 83.17 86.27 83.17 95.06 ``` ![](lecture07_2020JC_files/figure-html/unnamed-chunk-40-1.png)<!-- --> ```r #skimr::skim(genderweight) genderweight %>% group_by(group) %>% summarise(n=n(), mean=mean(weight)) ``` ``` # A tibble: 2 x 3 group n mean <fct> <int> <dbl> 1 F 20 63.5 2 M 20 85.8 ``` ]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Example: modelling weight by gender ]] .row[.split-40[ .column[.content[ We consider the following model: `$$\texttt{weight}_i = \begin{cases} \mu_F + \epsilon_i, \quad \text{if female}\\ \mu_M + \epsilon_i, \quad \text{if male} \end{cases}$$` which is equivalent to: 
`$$\texttt{weight}_i = \beta_1x_{1i}+\beta_2x_{2i} + \epsilon_i$$` where * `\(x_{1i}\)` is 1 for female, 0 otherwise; * `\(x_{2i}\)` is 1 for male and 0 otherwise; * `\(\beta_1=\mu_F\)` and `\(\beta_2=\mu_M\)`; * assuming `\(\epsilon_i \sim NID(0, \sigma^2)\)`. Note there is no intercept in this model. ]] .column.bg-brand-gray[.content[ * In R, include `-1` or `0` in the model formula to remove the intercept from the model; otherwise it is included by default. * If a term in the model formula is a `factor`, R expands it into dummy variables ( `\(\boldsymbol{x}_{1}\)` and `\(\boldsymbol{x}_{2}\)` ) for you. ```r fitw <- lm(weight ~ -1 + group, data=genderweight) print(broom::tidy(fitw), digits = 4) ``` ``` term estimate std.error statistic p.value 1 groupF 63.50 0.7593 83.62 1.079e-44 2 groupM 85.83 0.7593 113.03 1.203e-49 ``` .blockquote.bg-white[ What do you notice about the estimate here? Why? ] ]]]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Including intercept with categorical variable ]] .row[.split-50[ .column[.content[ Now consider the following model `$$\texttt{weight}_i = \beta_0 + \beta_1x_{1i}+\beta_2x_{2i} + \epsilon_i$$` * `\(x_{1i}\)` is 1 for female, 0 otherwise; * `\(x_{2i}\)` is 1 for male and 0 otherwise; and * assuming `\(\epsilon_i \sim NID(0, \sigma^2)\)`. ```r fitw1 <- lm(weight ~ group, data=genderweight) print(broom::tidy(fitw1), digits = 4) ``` ``` term estimate std.error statistic p.value 1 (Intercept) 63.50 0.7593 83.62 1.079e-44 2 groupM 22.33 1.0739 20.79 2.331e-22 ``` ```r c(sum(fitw1$coeff),fitw$coeff[2]) #mu_m=b0+b2 in M2 ``` ``` groupM 85.82612 85.82612 ``` ]] .column.bg-brand-gray[.content[ * Notice there is no result for `\(\hat{\beta}_1\)` and now `\(\mu_F=\beta_0\)` and `\(\mu_M=\beta_0 + \beta_2\)`! * This is because we have the constraint `\(\beta_1=0\)` here while before the constraint was `\(\beta_0=0\)`.
* Recall that the LSE is given by `\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{Y}\)` *if* `\((\mathbf{X}^\top\mathbf{X})^{-1}\)` exists. * If the model includes an intercept and all levels of a categorical variable, `\((\mathbf{X}^\top\mathbf{X})^{-1}\)` does not exist and, to obtain a unique solution, a constraint must be applied. * Consequently, you should be careful in your interpretation if your data contain categorical variables. * If the model includes an intercept and one categorical variable then R will constrain the coefficient of the first level of the categorical variable to zero by default. ]]]] --- class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Design matrix for categorical variable ]] .row[.content[ Just some additional information. * The data and design matrices `\(\boldsymbol{X}\)` for the model without an intercept (M1) and the model with an intercept (M2) are `$$\text{data}= \begin{pmatrix} F & 61.6\\ F & 64.6\\ \vdots & \vdots\\ M & 87.4\\ M & 86.4\\ \end{pmatrix}, \boldsymbol{X}_{M1}= \begin{pmatrix} 1 & 0\\ 1 & 0\\ \vdots & \vdots \\ 0 & 1\\ 0 & 1 \\ \end{pmatrix}, \boldsymbol{X}_{M2}= \begin{pmatrix} 1 & 0\\ 1 & 0\\ \vdots & \vdots \\ 1 & 1\\ 1 & 1\\ \end{pmatrix}$$` * This explains why `\(se_{\tiny M1}(\hat \beta_1)=se_{\tiny M1}(\hat \beta_2)\)` (M1 has no intercept), as the second column of `\(\boldsymbol{X}_{M1}\)` for `\(\beta_2\)` is 1 minus the first column for `\(\beta_1\)`. Hence their variabilities are the same. * Also, `\(se_{\tiny M2}(\hat \beta_0)=se_{\tiny M1}(\hat \beta_1)\)` and `\(se_{\tiny M2}(\hat \beta_2)=\sqrt{Var_{\tiny M1}(\hat \beta_1)+Var_{\tiny M1}(\hat \beta_2)}=\sqrt{2}se_{\tiny M2}(\hat \beta_0)\)`. ]]
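---

### Seeing the design matrices in R

The matrices `\(\boldsymbol{X}_{M1}\)` and `\(\boldsymbol{X}_{M2}\)` can be inspected directly with `model.matrix()`. A minimal sketch with toy data (the weights below are made up for illustration, not the `genderweight` values):

```r
# Toy two-group data mimicking the genderweight structure (made-up weights)
g <- factor(rep(c("F", "M"), each = 3))
w <- c(61, 64, 66, 82, 87, 86)
X1 <- model.matrix(w ~ -1 + g)   # M1: one dummy column per level, no intercept
X2 <- model.matrix(w ~ g)        # M2: intercept plus a dummy for level "M" only
qr(cbind(X2[, 1], X1))$rank      # rank 2, not 3: the three columns are dependent
```

Note that M2's intercept column equals the sum of M1's two dummy columns, which is exactly why including an intercept together with dummies for all levels makes `\(\mathbf{X}^\top\mathbf{X}\)` singular.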