class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# Simple Linear Regression: Diagnostics, Inference and Prediction

## .black[STAT3022 Applied Linear Models Lecture 4]

<br><br><br>

### .black[2020/02/16]
]]
.column.bg-brand-charcoal[.content.white[
## Today

1. Q-Q plots for residuals
1. Sampling distribution of least squares estimators
1. Hypothesis testing for `\(\beta_0\)` and `\(\beta_1\)`
1. Confidence intervals for `\(\beta_0\)` and `\(\beta_1\)`
1. Confidence interval for `\(\sigma^2\)`
]]

---

class: split-10
layout: true

.row.bg-brand-blue.white[.content.vmiddle[
# Model Diagnostics
]]
.row[
.split-two[
.column[
.split-two[
.row[.content[
Plot `\(Y_i\)` vs `\(x_i\)` to see if there is `\(\approx\)` a linear relationship between `\(Y\)` and `\(x\)`.

<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]]
.row.bg-brand-gray[.content[
A boxplot of the residuals `\(R_i\)` to check for symmetry.

<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]]
]
]
.column[
.split-two[
.row.bg-brand-gray[.content[
To check the homoscedasticity assumption, plot `\(R_i\)` vs `\(x_i\)`. There should be no obvious patterns.

<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]]
.row[.content[
A normal Q-Q plot, i.e. a plot of the ordered residuals vs `\(\Phi^{-1}(\frac{i}{n+1})\)`.

<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
]]
]
]
]]

---
class: hide-row2
---
count: false
---
layout: false
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Assessing (A1) `\(E[\epsilon_i] = 0\)` for `\(i=1,\ldots,n\)`
]]
.row[.content[

* It is a property of the least squares method that
$$\sum_{i=1}^n R_i = 0,\quad \text{so}\quad \bar R = 0$$
where `\(R_i=\tilde \epsilon_i =Y_i -\tilde Y_i\)`, and hence (A1) will always appear valid "overall".
* A trend in the residuals plotted against the fitted values or the covariate can indicate a "local" failure of (A1).
* What do you conclude from the following plots?

![](lecture04_2020JC_files/figure-html/unnamed-chunk-7-1.svg)<!-- -->
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Assessing (A2)-(A3)
]]
.row[.split-two[
.column.bg-brand-gray[.content[
### (A2) `\(\epsilon_1, \ldots ,\epsilon_n\)` are independent

* If (A2) is correct, then the residuals should appear randomly scattered about zero when plotted against the fitted values or the covariate.
* Long sequences of positive residuals followed by sequences of negative residuals in the `\(R_i\)` vs `\(x_i\)` plot suggest that the error terms are not independent.

![](lecture04_2020JC_files/figure-html/unnamed-chunk-8-1.svg)<!-- -->
]]
.column[.content[
### (A3) `\(Var(\epsilon_i) = \sigma^2\)` for `\(i=1,\ldots,n\)`

* If (A3) holds then the spread of the residuals should be roughly the same across the fitted values or the covariate, as in the sketch below.
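A quick way to eyeball this in R (a minimal sketch, using the simulated model `M1` defined on the next slide; `which = 3` is the scale-location plot, which should look roughly flat under (A3)):

```r
# Scale-location plot: sqrt(|standardised residuals|) vs fitted values
plot(M1, which = 3)
# or by hand: spread of the residuals across the fitted values
plot(fitted(M1), sqrt(abs(resid(M1))))
```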
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Examining the simulated data further
]]
.row[.split-two[
.column[.split-two[
.row[.content[
<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]]
.row.bg-brand-gray[.content[
<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]]]]
.column[.split-two[
.row.bg-brand-gray[.content[
{{content}}
]]
.row[.content[
<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]]]]
]]
]]

--

Simulation scheme

```r
n <- 100
x <- seq(0, 1, length.out = n)
y1 <- x + rnorm(n) / 3                                          # Linear
y2 <- 3 * (x - 0.5) ^ 2 + c(rnorm(n / 2) / 3, rnorm(n / 2) / 6) # Quadratic, changing spread
y3 <- -0.25 * sin(20 * x - 0.2) + x + rnorm(n, 0, 2) / 3        # Non-linear
M1 <- lm(y1 ~ x); M2 <- lm(y2 ~ x); M3 <- lm(y3 ~ x)
```

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Assessing (A4) `\(\epsilon_1, \ldots ,\epsilon_n\)` are normally distributed
]]
.row[.split-two[
.column[.content[
### Q-Q Plots

* The function `qqnorm(x)` produces a Q-Q plot of the ordered vector `x` against quantiles of the normal distribution, i.e. plotting the ordered residuals (optionally standardised as `\(\frac{R_{(i)}}{\hat \sigma}\)`) against `\(\Phi^{-1}(\frac{i}{n+1})\)`.
* The `\(n\)` chosen normal quantiles `\(\Phi^{-1}(\frac{i}{n+1})\)` are easy to calculate but more sophisticated ways exist:
  * `\(\frac{i}{n+1} \mapsto \frac{i-3/8}{n+1/4}\)`, the default in `qqnorm` for `\(n \le 10\)`.
  * `\(\frac{i}{n+1} \mapsto \frac{i-1/3}{n+1/3}\)`, recommended by Hyndman and Fan (1996).
* Symmetry, heavy or light tails, location and spread can be "easily" seen in Q-Q plots. How?
]]
.column.bg-brand-gray[.content[
### In R

By "hand"

```r
plot(qnorm((1:n)/(n+1)), sort(resid(M1))) # or divide by hat(sigma) to standardise
```

By `base`

```r
qqnorm(resid(M1))
qqline(resid(M1))
```

By `ggplot2`

```r
data.frame(residual=resid(M1)) %>%
  ggplot(aes(sample=residual)) +
  stat_qq() + stat_qq_line(color="blue")
```
]]
.bottom_abs.width100[
<span style="font-size:16pt">Reference: Hyndman and Fan (1996). *Sample quantiles in statistical packages*, American Statistician, 50, 361--365.</span>
]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Does any Q-Q plot look unusual?
]]
.row[.content[
<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" />
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# How to interpret a Q-Q plot
]]
.row[.content[
<img src="images/InterpretQQPlot.PNG" width="100%" height="100%">
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Making inferences based on a single data plot
]]
.row[.split-two[
.column[
<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-17-1.png" width="504" />
]
.column[.content[
* Did you think plot 20 was unusual?
* Plots 1--19 were generated as follows:
  1. Simulate a new response, `\(\log(\texttt{BrainWt})\)`, from `$$\hat{\beta}_0 + \hat{\beta}_1 \log(\texttt{BodyWt}) + \epsilon_i$$` <span class="brand-blue">where `\(\epsilon_i\)` is sampled independently from `\(N(0, \hat\sigma^2)\)`</span>.
  2. Fit the following linear model to the new response `$$\log(\texttt{BrainWt}) = \beta_0 + \beta_1 \log(\texttt{BodyWt}) + \epsilon_i$$` where `\(\epsilon_i\sim NID(0, \sigma^2)\)`.
  3. Make a Q-Q plot from the residuals of the above fit.
]]
]]
.bottom_abs.width100[
<span style="font-size:16pt">For more details see Wickham, Chowdhury, Cook and Hofmann (2018).
`nullabor`: Tools for Graphical Inference.</span> ] --- exclude: true class: split-10 .row.bg-brand-blue.white[.content.vmiddle[ # Using .yellow[augment] from .yellow[broom] ]] .row[.content[ ```r dat <- read.csv("data/sleep.csv") fit <- lm(log(BrainWt) ~ log(BodyWt), data=dat) fit %>% broom::augment() ``` ``` log.BrainWt. log.BodyWt. .fitted .se.fit .resid .hat 1 8.6503245 8.80297346 8.75186002 0.23006114 -0.101535517 0.10979911 2 1.8870696 0.00000000 2.13478868 0.09604339 -0.247719028 0.01913583 3 3.7954892 1.21935391 3.05135986 0.08823966 0.744129330 0.01615251 4 1.7404662 -0.08338161 2.07211189 0.09700864 -0.331645719 0.01952240 5 8.4344635 7.84267147 8.03001453 0.20508257 0.404449017 0.08725087 6 5.1901752 2.35612586 3.90585535 0.09281968 1.284319858 0.01787279 7 -1.2039728 -3.77226106 -0.70076691 0.17008415 -0.503205892 0.06001225 8 5.1298987 5.07517382 5.94972546 0.13817737 -0.819826743 0.03960829 9 3.2425924 1.19392247 3.03224341 0.08827022 0.210348946 0.01616370 10 6.0867747 3.95431592 5.10719234 0.11542360 0.979582387 0.02763767 11 1.8562980 -0.85566611 1.49159650 0.10803694 0.364701495 0.02421346 12 6.0473722 6.14203741 6.75167181 0.16271552 -0.704299635 0.05492502 13 0.8754687 -0.59783700 1.68540301 0.10396911 -0.809934274 0.02242440 14 6.0378709 5.23164323 6.06734132 0.14163489 -0.029470399 0.04161528 15 0.1823216 -2.59026717 0.18772128 0.14238685 -0.005399721 0.04205833 16 3.2188758 1.09861229 2.96060008 0.08843739 0.258275741 0.01622498 17 1.2527630 -0.24207156 1.95282689 0.09897699 -0.700063920 0.02032267 18 1.6094379 -1.60943791 0.92499683 0.12170054 0.684441080 0.03072538 19 2.8622009 0.34358970 2.39306023 0.09260303 0.469140656 0.01778946 20 4.3944492 4.09434456 5.21244990 0.11803486 -0.818000748 0.02890233 21 6.5220928 6.27098843 6.84860249 0.16581222 -0.326509689 0.05703551 22 4.7449321 3.31998733 4.63037646 0.10468519 0.114555670 0.02273436 23 0.0000000 -2.12026354 0.54101640 0.13214268 -0.541016395 0.03622417 24 6.0063532 5.33271879 6.14331840 0.14389749 -0.136965236 0.04295549 25 5.7838252 4.44265126 5.47426715 0.12484549 0.309558036 0.03233389 26 4.7833164 3.59264385 4.83532853 0.10906448 -0.052012159 0.02467624 27 1.3862944 -2.29263476 0.41144737 0.13583641 0.974846992 0.03827759 28 1.7047481 0.03922071 2.16427034 0.09560636 -0.459522243 0.01896207 29 6.4846352 6.25575004 6.83714800 0.16544505 -0.352512768 0.05678319 30 5.0562458 4.60517019 5.59643034 0.12816196 -0.540184534 0.03407458 31 4.0253517 3.55534806 4.80729381 0.10844311 -0.781942122 0.02439586 32 -1.9661129 -5.29831737 -1.84788197 0.20844810 -0.118230883 0.09013805 33 -1.3862944 -4.60517019 -1.32685299 0.19075342 -0.059441375 0.07548435 34 7.1853870 4.12713439 5.23709755 0.11865737 1.948289465 0.02920799 35 1.0986123 -2.10373423 0.55344124 0.13179263 0.545171049 0.03603251 36 2.0918641 0.30010459 2.36037308 0.09298865 -0.268509017 0.01793792 37 -0.9162907 -3.77226106 -0.70076691 0.17008415 -0.215523820 0.06001225 38 -1.1086626 -3.03655427 -0.14774646 0.15256388 -0.960916163 0.04828538 39 1.8405496 0.53062825 2.53365447 0.09111767 -0.693104837 0.01722335 40 2.3795461 1.25276297 3.07647298 0.08820853 -0.696926847 0.01614111 41 6.1944054 5.52146092 6.28519320 0.14817958 -0.090787805 0.04555006 42 2.7408400 -0.73396918 1.58307437 0.10607308 1.157765654 0.02334117 43 4.7449321 2.30258509 3.86560951 0.09235501 0.879322620 0.01769429 44 2.4336134 0.48242615 2.49742163 0.09147314 -0.063808273 0.01735799 45 5.1929569 5.25749537 6.08677401 0.14221149 -0.893817157 0.04195480 46 2.4932055 0.91629073 2.82355153 
0.08898701 -0.330346081 0.01642727 47 3.6686767 1.45582042 3.22910842 0.08823977 0.439568332 0.01615255 48 0.6418539 -1.27296568 1.17791828 0.11530848 -0.536064395 0.02758257 49 3.9199912 1.44338333 3.21975963 0.08822697 0.700231550 0.01614786 50 5.1873858 1.91692261 3.57571245 0.08970444 1.611673361 0.01669322 51 2.5095993 -0.28768207 1.91854211 0.09957344 0.591057154 0.02056834 52 3.0445224 1.28093385 3.09764863 0.08819024 -0.053126196 0.01613442 53 4.5870062 2.69665216 4.16182418 0.09628846 0.425182038 0.01923361 54 5.1647860 4.01638302 5.15384731 0.11657143 0.010938666 0.02819009 55 2.5257286 0.33647224 2.38771012 0.09266513 0.138018519 0.01781332 56 0.0000000 -2.81341072 0.01998741 0.14742637 -0.019987408 0.04508817 57 0.9555114 -0.10536052 2.05559066 0.09727112 -1.100079214 0.01962818 58 2.5095993 0.69314718 2.65581766 0.09006298 -0.146218402 0.01682693 59 0.9162907 -2.26336438 0.43344950 0.13520374 0.482841228 0.03792186 60 4.0604430 1.43270073 3.21172967 0.08821711 0.848713341 0.01614425 61 1.3609766 1.25276297 3.07647298 0.08820853 -1.715496428 0.01614111 62 2.8332133 1.39871688 3.18618449 0.08819271 -0.352971141 0.01613532 .sigma .cooksd .std.resid 1 0.7000137 1.481634e-03 -0.154999431 2 0.6993962 1.265991e-03 -0.360255866 3 0.6933081 9.584369e-03 1.080539535 4 0.6987947 2.316801e-03 -0.482404874 5 0.6979813 1.776956e-02 0.609739319 6 0.6795215 3.170192e-02 1.866575053 7 0.6968857 1.783895e-02 -0.747550871 8 0.6916313 2.993752e-02 -1.204908341 9 0.6996093 7.664044e-04 0.305446427 10 0.6881054 2.909440e-02 1.430813525 11 0.6985021 3.508363e-03 0.531760730 12 0.6937719 3.163988e-02 -1.043471186 13 0.6919840 1.596622e-02 -1.179860691 14 0.7001429 4.081571e-05 -0.043358296 15 0.7001535 1.386113e-06 -0.007946171 16 0.6993327 1.159957e-03 0.375052316 17 0.6940724 1.076396e-02 -1.018714051 18 0.6942793 1.589125e-02 1.001310516 19 0.6974364 4.209610e-03 0.681799856 20 0.6917635 2.127143e-02 -1.195579266 21 0.6987841 7.092985e-03 -0.484288921 22 0.6999913 3.240208e-04 0.166903870 23 0.6964682 1.183994e-02 -0.793740504 24 0.6999166 9.125515e-04 -0.201651015 25 0.6989542 3.432206e-03 0.453247533 26 0.7001203 7.279028e-05 -0.075855418 27 0.6880895 4.079438e-02 1.431751738 28 0.6975437 4.315286e-03 -0.668220462 29 0.6985574 8.226758e-03 -0.522787527 30 0.6964878 1.105376e-02 -0.791637720 31 0.6925265 1.625550e-02 -1.140233831 32 0.6999679 1.578706e-03 -0.178525111 33 0.7001076 3.236595e-04 -0.089040552 34 0.6511095 1.220221e-01 2.848042829 35 0.6964120 1.195412e-02 0.799756396 36 0.6992647 1.390895e-03 -0.390252356 37 0.6995555 3.272415e-03 -0.320177132 38 0.6883104 5.105702e-02 -1.418692456 39 0.6942121 8.885625e-03 -1.006995766 40 0.6941528 8.400874e-03 -1.011991626 41 0.7000493 4.274841e-04 -0.133846516 42 0.6833399 3.402205e-02 1.687350683 43 0.6905607 1.470682e-02 1.277853428 44 0.7001037 7.591805e-05 -0.092711895 45 0.6899867 3.787819e-02 -1.315260618 46 0.6988096 1.922087e-03 -0.479757807 47 0.6977727 3.344418e-03 0.638290887 48 0.6965678 8.694517e-03 -0.782972891 49 0.6940954 8.484396e-03 1.016793797 50 0.6674150 4.651557e-02 2.340931152 51 0.6958232 7.769485e-03 0.860198221 52 0.7001191 4.879569e-05 -0.077143078 53 0.6979192 3.749399e-03 0.618369772 54 0.7001524 3.704630e-06 0.015981952 55 0.6999191 3.648496e-04 0.200584086 56 0.7001488 2.048950e-05 -0.029459880 57 0.6850499 2.563473e-02 -1.600238374 58 0.6998906 3.860390e-04 -0.212394456 59 0.6972146 9.907405e-03 0.709014804 60 0.6912354 1.246119e-02 1.232399309 61 0.6629608 5.090140e-02 -2.491033350 62 0.6986194 2.154109e-03 
-0.512539859
```
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Inference for the linear regression model
]]
.row[.content[

* Consider the simple linear regression model `$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i=1,\ldots,n$$` and `\(\epsilon_i \sim NID(0,\sigma^2)\)`.
* Let `\(p\)` be the number of regression parameters to be estimated. So for a simple linear regression model, `\(p=2\)`.
* The following is an important known result from distribution theory (use moment generating functions to prove it):

## Theorem

Let `\(Y_1,Y_2,\ldots,Y_n\)` be independent normal random variables with `\(E (Y_i) = \mu_i\)` and `\(Var(Y_i) = \sigma^2_i\)`. Then for real constants `\(c_1,\ldots,c_n\)` we have that
`\begin{equation}
\mathbf{c}^\top \boldsymbol{Y} = \sum_{i=1}^n c_iY_i \sim N\left(\sum_{i=1}^n c_i \mu_i, \sum_{i=1}^n c_i^2 \sigma_i^2\right).
\label{eqn:vecindnor2nor}
\end{equation}`
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Distribution of the regression parameter estimates
]]
.row[.split-two[
.column[.content[
* Write the LSEs as linear combinations of the `\(Y\)`'s:
`\begin{eqnarray*}
\hat{\beta}_1 &=& \frac{S_{xy}}{S_{xx}} = \frac{1}{S_{xx}} \sum_{i=1}^{n}(x_i-\bar{x})Y_i,\\
\hat{\beta}_0 &=& \bar{Y} - \hat{\beta}_1\bar{x},
\end{eqnarray*}`
where the second expression for `\(\hat{\beta}_1\)` holds since `\(\bar Y \sum_{i=1}^{n}(x_i-\bar{x})=0\)`.
* If `\(\epsilon_i \sim NID(0,\sigma^2)\)`, we have that `$$Y_i \sim N(\beta_0+\beta_1 x_i,\sigma^2).$$`
* From (A2) the errors are independent `\(\Rightarrow\)` the `\(Y_i\)`'s are independent. Hence we conclude that the .brand-blue[sampling distributions] of `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` are also normal.
]]
.column[.content[
* The distribution of any random variable `\(Y \sim N(\mu,\sigma^2)\)` is uniquely defined by its first two moments: location and spread.
* It can be shown that `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` are *unbiased*, i.e.
  * `\(E(\hat{\beta}_0) = \beta_0\)` and `\(E(\hat{\beta}_1) = \beta_1\)`.
* Further,
  * `\(Var(\hat{\beta}_0) = \sigma^2[\frac{1}{n}+\frac{\bar x^2}{S_{xx}}]=\sigma^2[\frac{S_{xx}+n\bar x^2}{nS_{xx}}]=\sigma^2\frac{\sum x_i^2}{nS_{xx}}\)` (depends on the location of the `\(x\)`'s; `\(S_{xx}=\sum x_i^2-n \bar x^2\)`),
  * `\(Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}.\)`

.content-box-blue[
`\begin{eqnarray*}
\hat\beta_0 &\sim& N\left(\beta_0, \sigma^2 \dfrac{(\sum x_i^2)}{nS_{xx}}\right)\\
\hat\beta_1 &\sim& N\left(\beta_1, \dfrac{\sigma^2}{S_{xx}}\right)\\
\end{eqnarray*}`
]
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Sampling distribution
]]
.row[.split-two[
.column[.content[
* Standardizing the normal sampling distribution of `\(\hat \beta_1\)` gives
`\begin{equation}
\frac{\hat \beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0,1).
\end{equation}`
* Replacing `\(\sigma^2\)` with the estimate `\(s^2 = \text{RSS}/(n-p)\)` changes the sampling distribution of `\(\hat \beta_1\)`:
`\begin{equation}
\frac{\hat \beta_1 - \beta_1}{s/\sqrt{S_{xx}}} \sim t_{n-p}.
\label{eq:b1samp}
\end{equation}`
* This forms the basis for testing hypotheses and constructing confidence intervals for `\(\beta_1\)`.
]]
.column.content-box-yellow[.content[
## Hypothesis testing for LSEs

* For testing `\(H_0: \beta = \beta^0\)` against `\(H_1: \beta \ne \beta^0\)`, use the `\(t\)`-test statistic
`\begin{equation}
t_{obs} = \frac{\hat \beta - \beta^0}{\hat{SE}(\hat \beta)},
\label{eq:b1test}
\end{equation}`
where `\(\beta\)` is `\(\beta_0\)` or `\(\beta_1\)` and `\(\beta^0\)` is its hypothesised value; under `\(H_0\)`, `\(t_{obs} \sim t_{n-p}\)`.
* The `\(p\)`-value is given as `\(P(|t_{n-p}| > |t_{obs}|)\)`.
* As usual, a small `\(p\)`-value indicates evidence against `\(H_0\)`.
* You can obtain these `\(t\)`-test statistics and `\(p\)`-values from the `summary` or `broom::tidy` functions, as shown on the next slide; a by-hand check is sketched below.
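For instance (a minimal by-hand sketch using the `fit` object for the sleep data; the result should match the `p.value` column from `broom::tidy(fit)`):

```r
# t-statistic and two-sided p-value for H0: beta_1 = 0
est <- coef(summary(fit))["log(BodyWt)", "Estimate"]
se  <- coef(summary(fit))["log(BodyWt)", "Std. Error"]
t_obs <- est / se
2 * pt(abs(t_obs), df = fit$df.residual, lower.tail = FALSE)
```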
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Sampling distribution of `\(\hat \beta_0\)` and `\(\hat \beta_1\)`
]]
.row[
.split-two[
.column.bg-gray[.content[

```r
print(summary(fit), digits = 3)
```

```
Call:
lm(formula = log(BrainWt) ~ log(BodyWt), data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7155 -0.4923 -0.0616  0.4360  1.9483 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1348     0.0960    22.2   <2e-16 ***
log(BodyWt)   0.7517     0.0285    26.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.694 on 60 degrees of freedom
Multiple R-squared:  0.921,	Adjusted R-squared:  0.919 
F-statistic: 697 on 1 and 60 DF,  p-value: <2e-16
```
]]
.column.bg-brand-gray[.content[

```r
print(broom::tidy(fit), digits = 3)
```

```
         term estimate std.error statistic  p.value
1 (Intercept)    2.135    0.0960      22.2 1.18e-30
2 log(BodyWt)    0.752    0.0285      26.4 9.84e-35
```

A 95% confidence interval for `\(\beta_0\)` and `\(\beta_1\)` from `fit`:

```r
confint(fit, level=0.95)
```

```
                2.5 %    97.5 %
(Intercept) 1.9426733 2.3269041
log(BodyWt) 0.6947503 0.8086215
```

and similarly a 98% confidence interval is given by

```r
confint(fit, level=0.98)
```

```
                  1 %      99 %
(Intercept) 1.9052335 2.3643438
log(BodyWt) 0.6836546 0.8197172
```
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Confidence intervals for regression parameters
]]
.row[.content[

A `\((1-\alpha)100\%\)` confidence interval for a regression parameter, `\(\beta\)`, is
`$$\hat\beta \pm t_{n-p}(1-\alpha/2) \times \hat{SE}(\hat \beta).$$`

A 95% confidence interval for the regression parameters from `out` (here `out` is the `broom::tidy(fit)` output shown earlier) can be found in R by

```r
alpha <- 0.05
out %>% mutate(
  lwr=estimate-qt(1-alpha/2,fit$df.residual)*std.error,
  upr=estimate+qt(1-alpha/2,fit$df.residual)*std.error) %>%
  select(term,lwr, upr)
```

```
         term       lwr       upr
1 (Intercept) 1.9426733 2.3269041
2 log(BodyWt) 0.6947503 0.8086215
```

```r
y <- log(dat$BrainWt); x <- log(dat$BodyWt); n <- length(y)
s2 <- deviance(fit)/(n-2); Sxx <- sum((x-mean(x))^2)
round(c(sqrt(s2*sum(x^2)/(n*Sxx)), sqrt(s2/Sxx)), 4) # checking the standard errors by hand
```

```
[1] 0.0960 0.0285
```
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Inference for the error variance
]]
.row[.content[

Recall that an estimator of the error variance is `\(\hat{\sigma}^2 = s^2 = \dfrac{\text{RSS}}{n-p}\)`. Because of the distributional result `\(\text{RSS} \sim \sigma^2 \chi^2_{n-p}\)` we know the sampling distribution of `\(\hat{\sigma}^2\)`,
`$$\hat{\sigma}^2 \sim \left(\frac{\sigma^2}{n-p}\right) \;\chi^2_{n-p}.$$`

## Confidence interval for `\(\sigma^2\)`

Choose `\(a\)` and `\(b\)` such that
`$$1-\alpha = P(a\leq \chi_{n-p}^2 \leq b).$$`

<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Confidence interval for `\(\sigma^2\)`
]]
.row[.content[

Then from `\(\text{RSS} \sim \sigma^2 \chi^2_{n-p}\)`,
`\begin{eqnarray*}
1-\alpha &=& P\left(a\leq \frac{\text{RSS}}{\sigma^2} \leq b\right) = P\left(\frac{\text{RSS}}{b}\leq \sigma^2 \leq \frac{\text{RSS}}{a}\right).
\end{eqnarray*}`

A `\((1 - \alpha)100\%\)` confidence interval for `\(\sigma^2\)` can be given as
`$$\left(\frac{\text{RSS}}{b}, \frac{\text{RSS}}{a}\right)$$`
where `\(a\)` and `\(b\)` satisfy `\(P(\chi_{n-p}^2 < a)=\alpha/2\)` and `\(P(\chi_{n-p}^2 <b)=1-\alpha/2\)`.

```r
RSS <- deviance(fit); alpha <- 0.05
n <- nrow(dat); p <- length(coef(fit))
a <- qchisq(alpha / 2, n - p); b <- qchisq(1 - alpha / 2, n - p)
c(RSS, RSS / b, RSS / a)
```

```
[1] 28.9227104  0.3472211  0.7144630
```
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Prediction and estimation in linear regression
]]
.row[.split-two[
.column[.content[
### Motivation

* Simple linear regression models can be used to predict the response at any given value of the predictor.
* .brand-red[Beware predicting far beyond the range of the data.]
* Point predictions should be accompanied by corresponding prediction intervals, providing a range of plausible values.
* Suppose that we want to predict `\(Y_0=Y|x_0\)` when the predictor `\(x\)` takes the value `\(x_0\)`.

.content-box-blue[
Note that predicting a response is about estimating the value of a random variable, say `\(Y_0\)`, and not the value of a parameter, say `\(\mu_0\)`.
]
]]
.column[.content[
### Prediction of `\(Y|x_0\)`

* Denote the prediction by `\(\tilde Y_0\)`.
* Our model for the corresponding response, `\(Y_0\)`, is `\(Y_0 = E(Y_0) + \epsilon_0.\)`
* It is natural to estimate the parameter `\(E(Y_0)=\mu_0\)` by `\(\hat \beta_0 + \hat \beta_1 x_0\)`.
* We have no information about the location of `\(\epsilon_0\)`, so we predict it by its expected value, `\(\tilde \epsilon_0 = E(\epsilon_0) = 0\)`.
* But we know that the error variance is `\(\sigma^2\)`, so to avoid understating the variability of the prediction we retain `\(Var(\tilde{\epsilon}_0) = \sigma^2\)`.
* Then for simple linear regression `$$\tilde Y_0 = \hat \beta_0 + \hat \beta_1 x_0 + \tilde \epsilon_0.$$`
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Sampling distribution for the mean response
]]
.row[.content[

For a simple linear regression:

* The standard deviation of the sampling distribution of `\(\hat{\mu}_0 = \hat\beta_0 + \hat \beta_1 x_0\)` is
`$$\sigma_{\hat\mu_0} = \sigma \sqrt{ \frac1n + \frac{ (x_0 - \bar x)^2} {S_{xx}} }$$`
* The standard error estimate for `\(\hat{\mu}_0\)` is
`$$\hat{SE}(\hat \mu_0) = s \sqrt{ \frac1n + \frac{(x_0 - \bar x)^2}{S_{xx}} }$$`
* It can be shown that the sampling distribution of `\(\hat{\mu}_0\)` satisfies
`$$\frac{ \hat \mu_0 - \mu_0 }{\hat{SE}(\hat \mu_0)} \sim t_{n-2}$$`
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Prediction versus estimation of mean response
]]
.row[.content[

* The prediction was `$$\tilde Y_0 = \color{blue}{\hat \beta_0 + \hat \beta_1 x_0} + \color{red}{\tilde \epsilon_0}.$$`
* But `\(\hat \beta_0 + \hat \beta_1 x_0\)` is also the natural estimate of `\(E(Y_0)=\mu_0\)`, and `\(E(\tilde \epsilon_0)=0\)`.
* Hence `\(\tilde{Y}_0\)`, the predicted value of `\(Y_0\)`, is numerically the same as `\(\widehat{E(Y_0)}=\hat\mu_0\)`, the estimated mean response.
* However, differences arise if we want to construct corresponding confidence intervals!

### Confidence interval for the mean response

* A `\((1-\alpha)100\%\)` confidence interval for `\(\mu_0\)` is given by `$$\hat\mu_0 \pm t_{n-p}(1-\alpha/2) \hat{SE}(\hat\mu_0).$$`
* This confidence interval provides a range of plausible values for the mean response `\(\mu_0\)`, i.e. plausible values for the location of the regression line; the coverage sketch below illustrates this.
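A quick simulation can verify the coverage of this interval (a minimal sketch; the "true" parameter values and `x0` below are arbitrary choices for illustration, not taken from the sleep data):

```r
set.seed(1)
b0 <- 2; b1 <- 0.75; sigma <- 0.7     # assumed "true" values
x <- seq(0, 5, length.out = 40); x0 <- 2.5
mu0 <- b0 + b1 * x0                   # true mean response at x0
covered <- replicate(2000, {
  y <- b0 + b1 * x + rnorm(length(x), 0, sigma)
  ci <- predict(lm(y ~ x), newdata = data.frame(x = x0),
                interval = "confidence", level = 0.95)
  ci[, "lwr"] <= mu0 && mu0 <= ci[, "upr"]
})
mean(covered)                         # should be close to 0.95
```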
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Prediction intervals for simple linear regression
]]
.row[.content[

* When it comes to finding the .brand-blue[prediction error] for `\(Y_0\)`, we must take into account the `\(\epsilon_0\)` term:
`$$Var(\tilde Y_0) = \color{blue}{Var(\hat \mu_0)} + \color{red}{Var(\tilde\epsilon_0)} = [SE(\hat \mu_0)]^2 + \sigma^2.$$`
* Replacing `\(\sigma^2\)` by its estimate `\(s^2\)`, we get the prediction error estimate of `\(\tilde Y_0\)` as
`$$\hat{PE}(\tilde Y_0) = s \sqrt{ \frac1n + \frac{(x_0 - \bar x)^2}{S_{xx} }+1}.$$`
* A `\(100(1-\alpha)\%\)` prediction interval for `\(Y_0\)` is then
`$$\tilde{Y}_0 \pm t_{n-p}(1-\alpha/2) \times \hat{PE}(\tilde Y_0),$$`
or equivalently
`$$\tilde{Y}_0 \pm t_{n-p}(1-\alpha/2) \sqrt{\hat{SE}^2(\hat\mu_0) + \hat\sigma^2}.$$`
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Confidence and Prediction Interval in R
]]
.row[.split-60[
.column[.content[
The long way in R

```r
uxnew <- c(27, 32); xnew <- log(uxnew)
xbar <- mean(log(dat$BodyWt))
mu_hat <- coef(fit)[1] + coef(fit)[2] * xnew
s <- sigma(fit); n <- nrow(dat); p <- length(coef(fit))
Sxx <- (n - 1) * var(log(dat$BodyWt))
SE_mu <- s * sqrt(1 / n + (xnew - xbar) ^ 2 / Sxx)
PE_mu <- s * sqrt(1 + 1 / n + (xnew - xbar) ^ 2 / Sxx)
cbind(fit=mu_hat,
      lwr=mu_hat - qt(0.975, n - p) * SE_mu,
      upr=mu_hat + qt(0.975, n - p) * SE_mu)
```

```
          fit      lwr      upr
[1,] 4.612223 4.403559 4.820887
[2,] 4.739934 4.525945 4.953922
```

```r
cbind(fit=mu_hat,
      lwr=mu_hat - qt(0.975, n - p) * PE_mu,
      upr=mu_hat + qt(0.975, n - p) * PE_mu)
```

```
          fit      lwr      upr
[1,] 4.612223 3.207839 6.016607
[2,] 4.739934 3.334748 6.145119
```
]]
.column.bg-brand-gray[.content[
The short way in R

```r
predict.at <- data.frame(BodyWt = uxnew)
predict(fit, newdata = predict.at,
        interval = "confidence", level = 0.95)
```

```
       fit      lwr      upr
1 4.612223 4.403559 4.820887
2 4.739934 4.525945 4.953922
```

```r
predict(fit, newdata = predict.at,
        interval = "prediction", level = 0.95)
```

```
       fit      lwr      upr
1 4.612223 3.207839 6.016607
2 4.739934 3.334748 6.145119
```
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Conf intervals for estimation `\(\mu_0\)` and prediction `\(Y_0\)`
]]
.row[.split-50[
.column[.content[
<img src="images/3DRegressionLine.PNG" width="80%" height="80%">

* `\(\color{blue}{\mu_0}=E(Y_0)=\beta_0+\beta_1 x_0\)` and `\(Y_0=\color{blue}{\beta_0+\beta_1 x_0} +\color{red}{\epsilon_0}\)`.
* `\(\color{blue}{\mu_0}\)` on the regression line is the fixed mean parameter of the normal curve when `\(x=x_0\)`.
* `\(Y_0\)` is a random variable distributed according to the normal curve with mean `\(\mu_0\)`, i.e. `\(Y_0 \sim N(\mu_0, \color{red}{\sigma^2})\)`, with variability due to `\(\sigma^2\)`.
]]
.column.bg-brand-gray[.content[
* `\(\mu_0\)` and `\(Y_0\)` are estimated from a sample.
* Different samples give different regression lines, and hence slightly different values of `\(\hat \mu_0\)`.
* That is, `\(\color{blue}{\hat \mu_0}\)` has variability due to its sampling distribution.
* Then `\(\tilde Y_0\)` has two sources of variability: the model distribution `\(N(\mu_0, \color{red}{\sigma^2})\)` and the sampling distribution of `\(\color{blue}{\hat \mu_0}\)`.
* This explains why the prediction interval for `\(Y_0\)` is wider than the confidence interval for the mean `\(\mu_0\)`, as shown in the next plot; a code sketch for producing such a plot follows.
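One way to draw a plot like the one on the next slide (a minimal sketch, not necessarily how the original figure was produced; `geom_smooth(method = "lm")` shades the pointwise CI for the mean response, and the red lines are the prediction limits):

```r
library(ggplot2)
# predict() on the original data returns fit/lwr/upr prediction limits
# (R warns these are in-sample predictions; that is fine for plotting)
pred <- cbind(dat, predict(fit, interval = "prediction"))
ggplot(pred, aes(x = log(BodyWt), y = log(BrainWt))) +
  geom_point() +
  geom_smooth(method = "lm") +               # CI band for the mean response
  geom_line(aes(y = lwr), colour = "red") +  # lower prediction limit
  geom_line(aes(y = upr), colour = "red")    # upper prediction limit
```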
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Confidence interval plot for simple linear regression
]]
.row[.content[

Confidence interval for the <span class="brand-blue">mean response</span> and <span class="brand-red">prediction</span>.

<img src="lecture04_2020JC_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

Note that the CI for the mean response is narrower in the middle (why?) and is much narrower than the interval for prediction.
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Why are these CIs narrower in the middle?
]]
.row[.content[
<img src="images/CentreEstInterval.PNG" width="30%" height="30%">

* The error increases away from `\(\bar x\)`: if you wiggle the regression line, the effect grows the further you move from the mean `\(\bar x\)`.
]]

---
layout: true
class: split-60

.column.bg-brand-red.white[.content[
# Summary

* The sampling distribution of the least squares estimators is normal under assumptions (A1)-(A4).
* For `\(Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i\sim NID(0, \sigma^2)\)`
`$$\hat\beta_0 \sim N\left(\beta_0, \sigma^2 \dfrac{(\sum x_i^2)}{nS_{xx}}\right)\qquad \hat\beta_1 \sim N\left(\beta_1, \dfrac{\sigma^2}{S_{xx}}\right)$$`
* A `\((1-\alpha)100\%\)` confidence interval
  * for a regression parameter, `\(\beta\)`, is `\(\hat\beta \pm t_{n-p}(1-\alpha/2) \times \hat{SE}(\hat \beta)\)`;
  * for `\(\sigma^2\)` is `\(\left(\frac{\text{RSS}}{b}, \frac{\text{RSS}}{a}\right)\)` where `\(a\)` and `\(b\)` satisfy `\(1-\alpha = P(a\leq \chi_{n-p}^2 \leq b)\)`; and
  * for `\(\mu_0\)` is `\(\hat\mu_0 \pm t_{n-p}(1-\alpha/2) \hat{SE}(\hat\mu_0).\)`
* A `\((1-\alpha)100\%\)` prediction interval for `\(Y_0\)` is
`$$\tilde Y_0 \pm t_{n-p}(1-\alpha/2) \sqrt{\hat{SE}^2(\hat\mu_0) + \hat\sigma^2}.$$`
]]
.column[.content[
# Next lesson

* Matrix algebra for linear regression
]]

---
class: show-10
---
count: false