class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# Multiple Linear Regression Part I
## .black[STAT3022 Applied Linear Models Lecture 6]
<br><br><br>
### .black[2020/02/17]
]]
.column.bg-brand-charcoal[.content.white[
## Today
1. Multiple regression
1. Adjusted `\(R^2\)`
1. Single parameter `\(t\)`-tests
1. Predicted values and confidence intervals in multiple regression
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Exploring multivariate or multivariable data
]]
.row[.split-70[
.column[.content[
### Installing from GitHub

```r
install.packages("remotes")
remotes::install_github("kassambara/datarium")
```

```r
data("marketing", package = "datarium")
str(marketing)
```

```
'data.frame':	200 obs. of  4 variables:
 $ youtube  : num  276.1 53.4 20.6 181.8 217 ...
 $ facebook : num  45.4 47.2 55.1 49.6 13 ...
 $ newspaper: num  83 54.1 83.2 70.2 70.1 ...
 $ sales    : num  26.5 12.5 11.2 22.2 15.5 ...
```

* The code above loads data on the advertising budgets (in $1000s) for youtube, facebook and newspaper, together with the corresponding sales.
* The advertising experiment was repeated 200 times.
]]
.column.bg-brand-gray[.content[
Slides made with the `xaringan` R package using [ninja-themes](https://github.com/emitanaka/ninja-theme).
<br><br><br><br>
<center>
<i class="fab fa-facebook fa-3x faa-burst animated "></i>
<i class="fab fa-youtube fa-3x faa-burst animated "></i>
<i class="fas fa-newspaper fa-3x faa-burst animated "></i>
</center>
]
]
]
]]

---
layout: false
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Exploring multivariable or multivariate data
]]
.row[.split-60[
.column.bg-brand-gray[.content[
n obs: 200, n variables: 4

```
           missing   mean     sd    p0   p25    p50    p75   p100
youtube          0 176.45 103.03  0.84 89.25 179.70 262.59 355.68
facebook         0  27.92  17.82  0.00 11.97  27.48  43.83  59.52
newspaper        0  36.66  26.13  0.36 15.30  30.90  54.12 136.80
sales            0  16.83   6.26  1.92 12.45  15.48  20.88  32.40
```

![](lecture06_2020JC_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]]
.column[.content[

```r
round(cor(marketing), 3)
```

```
          youtube facebook newspaper sales
youtube     1.000    0.055     0.057 0.782
facebook    0.055    1.000     0.354 0.576
newspaper   0.057    0.354     1.000 0.228
sales       0.782    0.576     0.228 1.000
```

* We want high correlation between `\((Y,x_i)\)` and low correlation between `\((x_i,x_j)\)`.
* That is, each predictor should be highly correlated with `\(Y\)`, while the predictors should be close to independent of each other, so that they do not overlap in their relationship with `\(Y\)`.
]]]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Exploring multivariate or multivariable data
]]
.row[.split-two[
.column[.content[

```r
pairs(marketing)
```

![](lecture06_2020JC_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]]
.column.bg-brand-gray[.content[

```r
gg <- GGally::ggpairs(marketing, progress = F)
```

![](lecture06_2020JC_files/figure-html/unnamed-chunk-11-1.svg)<!-- -->
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Why have a ggplot output instead?
]]
.row[.split-two[
.column[.content[
Because the `ggplot2` framework from Lecture 2 still applies, so you can customise it as easily as any other `ggplot` object!

```r
gg + theme_bw(base_size = 20)
```

![](lecture06_2020JC_files/figure-html/unnamed-chunk-12-1.png)<!-- -->
]]
.column.bg-brand-gray[.content[
You can also plot the correlation matrix as a heat map.
```r
GGally::ggcorr(marketing)
```

![](lecture06_2020JC_files/figure-html/unnamed-chunk-14-1.svg)<!-- -->
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Multiple linear regression
]]
.row[.split-50[
.column[.content[
* Suppose we have `\(n\)` independent observations `$$(\boldsymbol{x}_1^\top, Y_1),\ldots,(\boldsymbol{x}_n^\top, Y_n),$$` where `\(\boldsymbol{x}_i\)` is a `\((p-1)\)`-vector of known (explanatory) values.
* A natural extension of simple linear regression is to consider the model with more than one predictor variable `$$Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i,\;\; i=1,\ldots,n,$$` where `\(\epsilon_i \sim NID(0,\sigma^2)\)` and `\(p\)` is the number of regression parameters to estimate.
]]
.column.bg-brand-gray[.content[
* Equivalently, we can write in matrix notation `$$\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim N(\boldsymbol{0}, \sigma^2\textbf{I}_n),$$` where `$$\mathbf{X}=\begin{pmatrix}1 & \boldsymbol{x}_1^\top \\1 & \boldsymbol{x}_2^\top\\ \vdots & \vdots\\ 1 & \boldsymbol{x}_n^\top\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\beta_0\\\beta_1\\\vdots\\\beta_{p-1}\end{pmatrix},\quad \boldsymbol{\epsilon}=\begin{pmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{pmatrix}.$$`

### Regression parameter estimation?

Still minimise the residual sum of squares to get `$$\hat{\boldsymbol{\beta}}= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{Y},$$` provided `\((\mathbf{X}^\top\mathbf{X})^{-1}\)` exists.
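The estimator can also be checked numerically against `lm()`. A minimal sketch, assuming the `marketing` data is loaded as before:

```r
# Build the design matrix X (first column of 1s plus the three predictors)
X <- model.matrix(~ youtube + facebook + newspaper, data = marketing)
# beta-hat = (X'X)^{-1} X'Y, computed via solve() for numerical stability
beta_hat <- solve(t(X) %*% X, t(X) %*% marketing$sales)
# should agree with coef(lm(sales ~ ., data = marketing))
```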
]]
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Multiple linear regression
]]
.row[.content[
To fit `$$\texttt{sales}_i=\beta_0 + \beta_1 \, \texttt{youtube}_i + \beta_2 \, \texttt{facebook}_i + \beta_3 \, \texttt{newspaper}_i + \epsilon_i,\quad \epsilon_i\sim NID(0, \sigma^2)$$`

```r
fit <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
# or use the shorthand below
fit <- lm(sales ~ ., data = marketing)
fit %>% broom::tidy()
```

```
         term     estimate   std.error  statistic      p.value
1 (Intercept)  3.526667243 0.374289884  9.4222884 1.267295e-17
2     youtube  0.045764645 0.001394897 32.8086244 1.509960e-81
3    facebook  0.188530017 0.008611234 21.8934961 1.505339e-54
4   newspaper -0.001037493 0.005871010 -0.1767146 8.599151e-01
```

So the fitted model is approximately `$$\hat{\texttt{sales}}_i=3.53 + 0.0458\times\texttt{youtube}_i + 0.189\times\texttt{facebook}_i -0.001\times\texttt{newspaper}_i.$$`

Note the unexpected negative coefficient for newspaper, although it is not significant.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Multiple linear regression
]]
.row[.content[

```r
summary(fit)
```

```
Call:
lm(formula = sales ~ ., data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5932  -1.0690   0.2902   1.4272   3.3951 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.526667   0.374290   9.422   <2e-16 ***
youtube      0.045765   0.001395  32.809   <2e-16 ***
facebook     0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
```

Consider dropping newspaper. Note: `\(F=\frac{SS_{\text{reg}}/(p-1)}{SS_{\text{res}}/(n-p)}\)` tests the significance of the whole regression model.
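As a sketch, this overall `\(F\)` can be reproduced by comparing the full model with the intercept-only model (assuming `fit` from the previous slide):

```r
# Intercept-only (null) model: predicts the mean of sales for every observation
fit0 <- lm(sales ~ 1, data = marketing)
# The F-test comparing the two models is the overall F from summary(fit):
# 570.3 on 3 and 196 df
anova(fit0, fit)
```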
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Covariance matrix and CI of estimates, prediction
]]
.row[.content[

```r
round(vcov(fit), 8)  # covariance matrix of the beta estimates, not cor(marketing) of X & Y
```

```
            (Intercept)     youtube    facebook   newspaper
(Intercept)  0.14009292 -0.00031887 -0.00133859 -0.00070923
youtube     -0.00031887  0.00000195 -0.00000045 -0.00000033
facebook    -0.00133859 -0.00000045  0.00007415 -0.00001780
newspaper   -0.00070923 -0.00000033 -0.00001780  0.00003447
```

```r
confint(fit, level = 0.95)  # CI for newspaper includes 0, not significant; df for t is 196
```

```
                  2.5 %     97.5 %
(Intercept)  2.78851474 4.26481975
youtube      0.04301371 0.04851558
facebook     0.17154745 0.20551259
newspaper   -0.01261595 0.01054097
```

```r
predict.at <- data.frame(youtube = 100, facebook = 30, newspaper = 40)  # x0 = (100, 30, 40)
predict(fit, newdata = predict.at, interval = "prediction", level = 0.95)  # prediction interval for y0 given x0
```

```
       fit      lwr      upr
1 13.71753 9.712796 17.72227
```
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Pairwise fitted line plots
]]
.row[.content[
<img src="lecture06_2020JC_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

* In multiple linear regression, the fitted line for each predictor depends on the levels of the other two variables, so a single line can't be plotted.
* Here we instead fit a simple linear regression for one pair at a time, i.e. `\(Y\)` on `\(x_1\)`, `\(Y\)` on `\(x_2\)`, `\(Y\)` on `\(x_3\)`.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Model diagnostics
]]
.row[.content[
How does it look to you?

<img src="lecture06_2020JC_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

Both residual plots show a pattern. Both the boxplot and the QQ plot show negatively skewed residuals.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Birds of the High Paramo data
]]
.row[.content[
* A paramo is an exposed, high plateau in the tropical parts of South America.
* For each of the n = 14 islands of vegetation the following variables were recorded:
    + number of species of bird present (N),
    + area of the island in square kilometers (AR),
    + elevation in thousands of meters (EL),
    + distance from Ecuador in kilometers (DEc), and
    + distance to the nearest other island in kilometers (DNI).
* The response variable `\(Y\)` is the number of species (N).
* The `\(k = 4\)` explanatory variables are AR, EL, DEc and DNI.
]]

---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Birds of the High Paramo data
]]
.row[.content[

```r
dat <- read.table("data/paramo.txt", header = TRUE, row.names = 1)
dim(dat)
```

```
[1] 14  5
```

```r
head(dat)
```

```
                    N   AR   EL DEc DNI
Chiles             36 0.33 1.26  36  14
Las Papas-Coconuco 30 0.50 1.17 234  13
Sumapaz            37 2.03 1.06 543  83
Tolima-Quindio     35 0.99 1.90 551  23
Paramillo          11 0.03 0.46 773  45
Cocuy              21 2.17 2.00 801  14
```

Have a go.
]]
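
---
class: split-10
.row.bg-brand-blue.white[.content.vmiddle[
# Birds of the High Paramo data
]]
.row[.content[
One way to have a go, sketched below (assuming `dat` has been read in as above; output not shown):

```r
# Fit the multiple regression of species count on all four explanatory variables
fit_paramo <- lm(N ~ AR + EL + DEc + DNI, data = dat)
summary(fit_paramo)   # which predictors look important?
confint(fit_paramo)   # 95% confidence intervals for the coefficients
```
]]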