class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# Robust Regression
## .black[STAT3022 Applied Linear Models Lecture 13]
<br><br><br>
### .black[2020/02/17]
]]
.column.bg-brand-charcoal[.content.white[
## Today

1. Robustness and regression
1. Efficient and resistant regression
1. L1 regression
1. M estimation, MM estimation
1. Least-median-squares and least-trimmed-squares
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Robust regression
]]
.row[.split-two[
.column[.content[
*"The method of Least Squares is seen to be our best course when we have thrown overboard a certain portion of our data -- a sort of sacrifice which has often to be made by those who sail upon the stormy seas of Probability"*
<span style="float:right">Edgeworth (1887)</span>

.bottom_abs.width100[
Further references:

* Rousseeuw and Leroy (2005, Chapters 1-3). Robust Regression and Outlier Detection. New York: Wiley.
* Venables and Ripley (2003, Chapter 6). Modern Applied Statistics with S (4e). New York: Springer.
]
]]
.column.bg-brand-gray[.content[
### `\(x\)`-outliers and `\(y\)`-outliers

* Outliers `\(\neq\)` extreme points: every univariate data set has a smallest and a largest observation.
* Outliers are observations that follow a different underlying distribution/model from the bulk of the data (recall Lecture 8).
* For regression data `\((\boldsymbol{Y},\mathbf{X})\)` the different model can be in the `\(x\)`-space (e.g. high leverage points) or in the `\(y\)`-space.
* Cook's distance measures "outlyingness" based on both `\(\boldsymbol{Y}\)` and `\(\mathbf{X}\)`, but is only powerful when a *single outlier* is present (illustrated on the phone-calls example below).
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Example: Phone calls
]]
.row[.split-50[
.column[.content[
* Data on the number of international phone `calls` from Belgium in the `year`s 1950-1973.

```r
library(MASS) # load R-package MASS
# the data set phones is within the MASS package
dat <- as.data.frame(phones)
skimr::skim(dat)
```

```
Skim summary statistics
 n obs: 24 
 n variables: 2 

-- Variable type:numeric ---------------------------------------------------------
 variable missing  mean    sd  p0  p50 p100     hist
    calls       0 49.99 65.53 4.4 15.5  212 ▇▁▁▁▁▁▁▁
     year       0 61.5   7.07  50 61.5   73 ▇▇▇▇▇▇▇▇
```

```r
cor(dat$year, dat$calls)
```

```
[1] 0.5439873
```
]]
.column.bg-brand-gray[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

* Heavy contamination from 1964--1969. Why?
* From 1964 to 1969 a different recording system was used, which recorded the total number of minutes of these calls instead of the total number of calls.
* .brand-blue[Outliers in] `\(\color{blue}{y}\)`.brand-blue[-direction].
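
The masking effect mentioned earlier (Cook's distance is only powerful for a *single* outlier) can be checked directly on these data. A minimal sketch, not part of the original analysis; `fit_ls` is just a working name for the LS fit (refitted as `M1_LS` later):

```r
# Classical single-case diagnostics on the phones data: with six consecutive
# y-outliers (1964-1969) the cases can mask one another, so the contaminated
# years are not necessarily all flagged by the usual Cook's distance cut-off.
fit_ls <- lm(calls ~ year, data = dat)
round(cooks.distance(fit_ls), 3)
which(cooks.distance(fit_ls) > qf(0.5, df1 = 2, df2 = nrow(dat) - 2))
```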
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Example: Star cluster CYG OB1
]]
.row[.split-two[
.column[.content[

```r
dat2 <- read.table("data/star.txt", header = TRUE, row.names = NULL)
skimr::skim(dat2)
```

```
Skim summary statistics
 n obs: 47 
 n variables: 3 

-- Variable type:integer ---------------------------------------------------------
 variable missing  n mean    sd p0  p25 p50  p75 p100     hist
    index       0 47   24 13.71  1 12.5  24 35.5   47 ▇▇▇▇▇▇▇▇

-- Variable type:numeric ---------------------------------------------------------
 variable missing mean   sd   p0  p50 p100     hist
    llint       0 5.01 0.57 3.94 5.1  6.29 ▅▆▅▇▇▆▃▂
    ltemp       0 4.31 0.29 3.48 4.42 4.62 ▂▁▁▁▁▅▇▅
```

```r
cor(dat2$ltemp, dat2$llint)
```

```
[1] -0.2104133
```
]]
.column.bg-brand-gray[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

* Data on 47 stars in the direction of Cygnus.
* `\(x=\)` logarithm of the effective temperature at the surface of the star.
* `\(y=\)` logarithm of the light intensity.
* .brand-blue[Outliers in] `\(\color{blue}{x}\)`.brand-blue[-direction but no reporting errors.]
* Most stars (43) lie on the "main sequence"; 4 stars are giants that are high leverage points.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Least Squares (LS) Estimate
]]
.row[.split-two[
.column[.content[
Given a (candidate) regression model
`\begin{eqnarray*}
Y_i &=& \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{(p-1)} x_{i(p-1)} + \epsilon_{i}.
\end{eqnarray*}`

* What if `\(\epsilon_i \not\sim NID(0,\sigma^2)\)`?
* Under (A1)-(A4), we used LS to estimate the `\(\beta\)`-parameters and `\(\sigma^2\)`, and to select the "best" model.
* The information is stored in `\(\text{RSS}(\mathbf{m})\)`!
* Is it possible to make better use of all, or only part, of the data?
* Aim: .brand-blue[Robustness. M-estimation]. But robust against what?
]]
.column.bg-brand-gray[.content[
Given a model `\(\mathbf{m}\)`,

* Residual sum of squares
`$$\text{RSS}(\mathbf{m}) = \sum_{i=1}^n (Y_i - \hat{Y}_{\mathbf{m} i})^2 = \sum_{i=1}^n R_{i}^2(\hat{\boldsymbol{\beta}};\mathbf{m})$$`
* To measure the average (expected) squared error of fit we simply use
$$\frac{1}{n}\text{RSS}(\mathbf{m}). $$
* Thus choose `\(\boldsymbol{\beta}_{\mathbf{m}}\)` such that
`$$\hat{\boldsymbol{\beta}}_{\text{LS}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \big(\operatorname{mean}\{R^2_{1}(\boldsymbol{\beta};\mathbf{m}),\ldots,R^2_{n}(\boldsymbol{\beta}; \mathbf{m})\}\big).$$`
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Use more robust distance measures: L1 regression
]]
.row[.split-two[
.column[.content[
* Use absolute values instead of squares,
`$$\hat{\boldsymbol{\beta}}_{\text{L1}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \big(\operatorname{mean}\{|R_{1}(\boldsymbol{\beta};\mathbf{m})|,\ldots,|R_{n}(\boldsymbol{\beta};\mathbf{m})|\}\big).$$`
* For example, `\(Y_i = \beta_0 + \beta_1 x_{i 1} + \epsilon_i\)`
`$$\Rightarrow \hat{\boldsymbol{\beta}}_{\text{L1}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{2}} \sum_{i=1}^n|Y_i - \beta_0 - \beta_1 x_{i 1}|.$$`
* `\(\hat{\boldsymbol{\beta}}_{\text{L1}}\)` can be obtained by .brand-blue[linear programming] (not covered in this course); a brute-force numerical check is sketched below.
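
As a sanity check on this definition (not how `rq` computes it, which is via linear programming), the L1 objective can be minimised directly with a general-purpose optimiser. A rough sketch, assuming the `phones` data frame `dat` from earlier; `l1_obj` and `fit_l1` are made-up names:

```r
# Brute-force minimisation of the mean absolute residual for calls ~ year.
# rq() solves the same problem far more efficiently and reliably.
l1_obj <- function(beta, y, x) mean(abs(y - beta[1] - beta[2] * x))
fit_l1 <- optim(par = coef(lm(calls ~ year, data = dat)),  # LS starting values
                fn = l1_obj, y = dat$calls, x = dat$year)
fit_l1$par  # should be close to coef(rq(calls ~ year, data = dat))
```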
]]
.column.bg-brand-gray[.content[
### Least absolute regression (L1)

* The idea goes back to Laplace (1749--1827), but credit is usually given to Edgeworth (1887).
* Robust against `\(y\)`-outliers, unlike LS regression.
* .brand-blue[Not robust against] `\(\color{blue}{x}\)`.brand-blue[-outliers.]
* In R use the `rq` function in `library(quantreg)`.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Loss function
]]
.row[.split-40[
.column[.content[
<img src="images/LossFunction.png" width="90%" height="90%">
]]
.column.bg-brand-gray[.content[
* The figure shows that large residuals are penalised much more heavily under the squared loss `\(R^2\)` (dashed red) than under the absolute loss `\(|R|\)` (solid red).
* Regression with the absolute loss is called .brand-blue[median regression], whereas regression with the squared loss is .brand-blue[mean regression].
* Relative to the usual mean regression, median regression is more robust but also harder to estimate: the loss function is not smooth at 0, so the usual optimisation by differentiation fails and linear programming is used instead.
* The LS and L1 loss functions are symmetric. The loss functions in blue are asymmetric and give .brand-blue[quantile regression], i.e. the regression is carried out at a quantile other than the median, e.g. the 95% quantile of a distribution (the slope is steeper to the right) (not covered).
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Robust regression in practice
]]
.row[.split-two[
.column[.content[

```r
library(quantreg) # quantile regression
M1_L1 <- rq(calls ~ year, data=dat)
M1_LS <- lm(calls ~ year, data=dat)
ggplot(dat, aes(year, calls)) + geom_point() +
  geom_abline(slope=coef(M1_LS)[2], intercept=coef(M1_LS)[1], color="red") +
  geom_abline(slope=coef(M1_L1)[2], intercept=coef(M1_L1)[1], color="blue") +
  labs(x="Year", y="Calls (in millions)")
```

<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
]]
.column.bg-brand-gray[.content[

```r
M2_L1 <- rq(llint ~ ltemp, data=dat2)
M2_LS <- lm(llint ~ ltemp, data=dat2)
ggplot(dat2, aes(ltemp, llint)) + geom_point() +
  geom_abline(slope=coef(M2_LS)[2], intercept=coef(M2_LS)[1], color="red") +
  geom_abline(slope=coef(M2_L1)[2], intercept=coef(M2_L1)[1], color="blue") +
  labs(y="Log light intensity", x="Log of temperature")
```

<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Diagnostic of Star data
]]
.row[.split-50[
.column[.content[

```r
library(ggfortify)
autoplot(M2_LS)
```

![](lecture13_2020JC_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]]
.column.bg-brand-gray[.content[
![](lecture13_2020JC_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

* The L1 line is even closer to the four giant stars.
* The plots show that these stars are `\(x\)`-outliers with high leverage ( `\(>2p/n=2(2)/47=0.085\)` ), but they are not flagged as influential since their Cook's distances are `\(<F^{-1}(0.5; p, n-p)=0.704\)`. Hence L1 is not robust to `\(x\)`-outliers. The thresholds can be checked directly, as in the sketch below.
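
The leverage and Cook's distance cut-offs quoted above can be computed directly. A minimal sketch using the fitted `M2_LS`; the object names `lev` and `cd` are mine:

```r
# Which stars exceed the rule-of-thumb thresholds for the star data?
p <- 2                         # number of regression coefficients
n <- nrow(dat2)                # 47 stars
lev <- hatvalues(M2_LS)        # leverages h_ii
cd  <- cooks.distance(M2_LS)   # Cook's distances
which(lev > 2 * p / n)         # high leverage: h_ii > 2p/n = 0.085
which(cd > qf(0.5, p, n - p))  # influential: D_i > F^{-1}(0.5; p, n-p) = 0.704
```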
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Alternatives to LS and L1: Robust M estimation
]]
.row[.split-two[
.column.bg-brand-gray[.content[
* Use M (maximum likelihood type) estimation (Huber, 1973, Annals of Statistics 1:799-821).
* Replace the squared residuals `\(R^2_i(\boldsymbol{\beta};\mathbf{m})\)` in the RSS, or the absolute residuals `\(|R_i(\boldsymbol{\beta};\mathbf{m})|\)`, by another function of the residuals
`$$\hat{\boldsymbol{\beta}}_{\text{M}} = \text{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^n \rho(R_i(\boldsymbol{\beta};\mathbf{m})/\hat{\sigma}),$$`
* where `\(\rho\)` is a .brand-blue[symmetric] function that ideally bounds the influence of `\(y\)`-outliers,
* and `\(\hat{\sigma}^2\)` is an estimator of the error variance.
* Examples: Huber, Hampel and Tukey's bisquare.
* In R use the `rlm` function in `library(MASS)`.
]]
.column[.content[
* These functions (next slide) grow more slowly than the squared loss beyond a threshold `\(c\)`: Huber's `\(\rho(R)\)` becomes linear for `\(|R|>c\)`, while Hampel's and Tukey's bisquare level off completely.
* It can be shown that the weight of each observation in the parameter estimation is given by the .brand-blue[Psi function] `\(\psi(R)=\rho'(R)/R\)`, which decreases with the magnitude of `\(R\)` for Huber, Hampel and Tukey's bisquare.
* For LS the weights are all equal, since `\(\psi(R)=\rho'(R)/R=2R/R=2\)`, and so LS is not robust to outliers.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Loss function and Psi function
]]
.row[.split-50[
.column[.content[
### Loss function and Psi function
<img src="images/RhoFunction.png" width="30%" height="30%"><img src="images/PsiFunction.png" width="29.5%" height="29.5%">
]]
.column.bg-brand-gray[.content[
### Psi function

```r
library(tidyverse) # gather() and the pipe; the psi.* functions come from MASS
x <- seq(-10, 10, 0.01)
data.frame(x=x, huber=psi.huber(x), hampel=psi.hampel(x), bisquare=psi.bisquare(x)) %>%
  gather("psi_function", "value", huber:bisquare) %>%
  ggplot(aes(x, value, color=psi_function)) +
  geom_line(size=1.25) +
  theme_bw(base_size=18)
```

<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# M estimation in practice
]]
.row[.split-two[
.column[.content[

```r
M2_M <- rlm(llint ~ ltemp, data=dat2)
ggplot(dat2, aes(ltemp, llint)) + geom_point() +
  geom_abline(slope=coef(M2_LS)[2], intercept=coef(M2_LS)[1], color="red") +
  geom_abline(slope=coef(M2_L1)[2], intercept=coef(M2_L1)[1], color="blue") +
  geom_abline(slope=coef(M2_M)[2], intercept=coef(M2_M)[1], color="green") +
  labs(y="Log light intensity", x="Log of temperature")
```

* By default `rlm` uses Huber's function with `\(c = 1.345\)`.
* The fits using .brand-red[LS] and .brand-green[M] are very similar because the LS (quadratic) and Huber (linear beyond `\(c\)`) loss functions both remain large for large residuals.
]]
.column[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Estimation to downweight high leverage points
]]
.row[.split-40[
.column[.content[
### Robust Generalised M (GM) estimation

* The previous results show that M estimators, like L1 regression, are not robust against high leverage points ( `\(x\)`-outliers).
* Idea: down-weight problematic points using a .brand-blue[weight] `\(w(\boldsymbol{x}_i)\)` in GM estimation.
* .brand-blue[Mallows type weights] (a rough illustration follows below)
`$$\hat{\boldsymbol{\beta}}_{\text{GM}} = \text{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^n w(\boldsymbol{x}_i)\, \rho\Big( R_i(\boldsymbol{\beta};\mathbf{m})/\hat{\sigma}\Big)$$`
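
Mallows-type weighting can be mimicked in R by supplying leverage-based case weights to `rlm`. This is only a rough sketch of the idea, not the exact GM estimator; the weight choice `\(w(\boldsymbol{x}_i)=\sqrt{1-h_{ii}}\)` and the names `w_x`, `M2_GM` are assumptions of mine:

```r
# Downweight high-leverage stars via case weights built from the LS leverages.
w_x <- sqrt(1 - hatvalues(M2_LS))   # small weight for x-outliers
M2_GM <- rlm(llint ~ ltemp, data = dat2,
             weights = w_x, wt.method = "case")
coef(M2_GM)                         # compare with coef(M2_M)
```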
]]
.column.bg-brand-gray[.content[
### LMS and LTS regression

* Summing over all observations is problematic when `\(x\)`-outliers are present.
* Alternative: estimate the centre of the distribution of the squared residuals .brand-blue[robustly].
* .brand-blue[Least median squares (LMS)] is defined as
`$$\hat{\boldsymbol{\beta}}_{\text{LMS}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \big(\operatorname{median}\{R^2_{1}(\boldsymbol{\beta};\mathbf{m}),\ldots,R^2_{n}(\boldsymbol{\beta};\mathbf{m})\}\big).$$`
* .brand-blue[Least trimmed squares (LTS)] is defined as
`$$\hat{\boldsymbol{\beta}}_{\text{LTS}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \sum_{i=1}^{n-k}R^2_{(i,n)}(\boldsymbol{\beta}; \mathbf{m}),$$`
where `\(R^2_{(1,n)} \leq R^2_{(2,n)} \leq \ldots \leq R^2_{(n,n)}\)` are the ordered squared residuals. The best robustness properties are obtained when roughly half the observations are trimmed ( `\(k \approx n/2\)` ); the recommended choice retains `\(n-k=\lfloor (n+p_{\mathbf{m}})/2 \rfloor\)` observations.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Efficient regression: MM (modified M) estimator
]]
.row[.split-70[
.column[.content[
* Best of all worlds (Yohai, Stahel & Zamar, 1991, Directions in Robust Statistics and Diagnostics II, New York: Springer) in .brand-blue[hybrid methods]:
* Use a robust (median-based) measure of the centre of the (e.g. absolute) residual distribution (the idea of the median in LMS).
* .brand-blue[Minimise] a scale estimator by choosing a .brand-blue[subset of sample points] (the ideas behind L1, the median in LMS and trimming in LTS) to estimate `\(\sigma\)`, e.g.
`$$\hat{\sigma} = \text{MAD} = 1.483 \cdot \operatorname{median}\Big\{|R_i-\operatorname{median}\{R_i\}_{1\leq i \leq n} | \Big\}_{1\leq i \leq n}$$`
* Use a robust `\(\rho\)` function (the idea of the .brand-blue[M-estimator]) and choose `\(w(\boldsymbol{x}_i)\)` (the idea of the GM-estimator) to measure goodness of fit at location `\(\boldsymbol{x}_i\)`.
]]
.column.bg-brand-gray[.content[
* In R: use `rlm` in `library(MASS)` and specify the option `method="MM"`.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# MM-estimate in practice
]]
.row[.split-60[
.column[.content[

```r
M2_MM <- rlm(llint ~ ltemp, data=dat2, method="MM")
ggplot(dat2, aes(ltemp, llint)) + geom_point() +
  geom_abline(slope=coef(M2_LS)[2], intercept=coef(M2_LS)[1], color="red") +
  geom_abline(slope=coef(M2_L1)[2], intercept=coef(M2_L1)[1], color="blue") +
  geom_abline(slope=coef(M2_M)[2], intercept=coef(M2_M)[1], color="green") +
  geom_abline(slope=coef(M2_MM)[2], intercept=coef(M2_MM)[1], color="black") +
  labs(y="Log light intensity", x="Log of temperature")
```

The LS, L1 and M estimators are all sensitive to `\(x\)`-outliers. The MM estimator gives a much better fit by downweighting (effectively excluding) the high leverage points.
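
Two of the quantities above can be computed directly for the star data: the MAD scale estimate (by hand versus `mad()`, which uses the constant 1.4826), and the LMS/LTS fits via `MASS::lqs`. A brief sketch; `M2_LTS` and `M2_LMS` are names introduced here:

```r
# MAD of the LS residuals, computed from the definition and via mad().
r <- residuals(M2_LS)
c(by_hand = 1.483 * median(abs(r - median(r))), builtin = mad(r))

# LMS and LTS fits for comparison with the MM fit above.
M2_LTS <- lqs(llint ~ ltemp, data = dat2, method = "lts")
M2_LMS <- lqs(llint ~ ltemp, data = dat2, method = "lms")
rbind(LTS = coef(M2_LTS), LMS = coef(M2_LMS), MM = coef(M2_MM))
```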
]]
.column[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# MM-estimate in practice
]]
.row[.split-60[
.column[.content[

```r
M1_M <- rlm(calls ~ year, data=dat)
M1_MM <- rlm(calls ~ year, data=dat, method="MM")
ggplot(dat, aes(year, calls)) + geom_point() +
  geom_abline(slope=coef(M1_LS)[2], intercept=coef(M1_LS)[1], color="red") +
  geom_abline(slope=coef(M1_L1)[2], intercept=coef(M1_L1)[1], color="blue") +
  geom_abline(slope=coef(M1_M)[2], intercept=coef(M1_M)[1], color="green") +
  geom_abline(slope=coef(M1_MM)[2], intercept=coef(M1_MM)[1], color="black") +
  labs(x="Year", y="Calls (in millions)")
```

There is a gradual improvement across the LS, M, L1 and MM estimators. The MM estimator gives the best fit, as it caters for both `\(x\)`- and `\(y\)`-outliers.
]]
.column[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />
]]
]]
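
To see the improvement numerically rather than graphically, the fitted intercepts and slopes can be tabulated side by side. A short sketch (not in the original slides), using the model objects defined above:

```r
# Intercept and slope for each fit of calls ~ year, as a 2 x 4 matrix.
sapply(list(LS = M1_LS, L1 = M1_L1, M = M1_M, MM = M1_MM), coef)
```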