class: split-70 with-border hide-slide-number bg-brand-red
background-image: url("images/USydLogo-black.svg")
background-size: 200px
background-position: 2% 90%

.column.white[.content[
<br><br><br>
# Robust Regression
## .black[STAT3022 Applied Linear Models Lecture 13]
<br><br><br>
### .black[2020/02/17]
]]
.column.bg-brand-charcoal[.content.white[
## Today

1. Robustness and regression
1. Efficient and resistant regression
1. L1 regression
1. M estimation, MM estimation
1. Least-median-squares and least-trimmed-squares
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Robust regression
]]
.row[.split-two[
.column[.content[
*"The method of Least Squares is seen to be our best course when we have thrown overboard a certain portion of our data -- a sort of sacrifice which has often to be made by those who sail upon the stormy seas of Probability"*
<span style="float:right">Edgeworth (1887)</span>

.bottom_abs.width100[
Further references:

* Rousseeuw and Leroy (2005, Chapters 1-3). Robust Regression and Outlier Detection. New York: Wiley.
* Venables and Ripley (2003, Chapter 6). Modern Applied Statistics with S (4e). New York: Springer.
]
]]
.column.bg-brand-gray[.content[
### `\(x\)`-outliers and `\(y\)`-outliers

* Outliers `\(\neq\)` extreme points: every univariate data set has a smallest and a largest observation.
* Outliers are observations that follow a different underlying distribution/model from the bulk of the data (recall Lecture 8).
* For regression data `\((\boldsymbol{Y},\mathbf{X})\)` the different model can be in the `\(x\)`-space (e.g. high leverage points) or in the `\(y\)`-space.
* Cook's distance measures "outlyingness" based on both `\(\boldsymbol{Y}\)` and `\(\mathbf{X}\)`, but is only powerful when a *single outlier* is present (illustrated on the phone-calls example below).
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Example: Phone calls
]]
.row[.split-50[
.column[.content[
* Data on the number of international phone `calls` from Belgium in the `year`s 1950-1973.

```r
library(MASS) # load R-package MASS
# the data set phones is within the MASS package
dat <- as.data.frame(phones)
skimr::skim(dat)
```

```
Skim summary statistics
 n obs: 24 
 n variables: 2 

-- Variable type:numeric ---------------------------------------------------------
 variable missing  mean    sd  p0  p50 p100     hist
    calls       0 49.99 65.53 4.4 15.5  212 ▇▁▁▁▁▁▁▁
     year       0 61.5   7.07  50 61.5   73 ▇▇▇▇▇▇▇▇
```

```r
cor(dat$year, dat$calls)
```

```
[1] 0.5439873
```
]]
.column.bg-brand-gray[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

* Heavy contamination from 1964--1969. Why?
* From 1964 to 1969 a different recording system was used, which recorded the total number of minutes of these calls instead of the total number of calls.
* .brand-blue[Outliers in] `\(\color{blue}{y}\)`.brand-blue[-direction].
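
The masking effect mentioned earlier (Cook's distance is only powerful for a *single* outlier) can be checked directly on these data. A minimal sketch, not part of the original analysis; `fit_ls` is just a working name for the LS fit (refitted as `M1_LS` later):

```r
# Classical single-case diagnostics on the phones data: with six consecutive
# y-outliers (1964-1969) the cases can mask one another, so the contaminated
# years are not necessarily all flagged by the usual Cook's distance cut-off.
fit_ls <- lm(calls ~ year, data = dat)
round(cooks.distance(fit_ls), 3)
which(cooks.distance(fit_ls) > qf(0.5, df1 = 2, df2 = nrow(dat) - 2))
```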
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Example: Star cluster CYG OB1
]]
.row[.split-two[
.column[.content[

```r
dat2 <- read.table("data/star.txt", header = TRUE, row.names = NULL)
skimr::skim(dat2)
```

```
Skim summary statistics
 n obs: 47 
 n variables: 3 

-- Variable type:integer ---------------------------------------------------------
 variable missing  n mean    sd p0  p25 p50  p75 p100     hist
    index       0 47   24 13.71  1 12.5  24 35.5   47 ▇▇▇▇▇▇▇▇

-- Variable type:numeric ---------------------------------------------------------
 variable missing mean   sd   p0  p50 p100     hist
    llint       0 5.01 0.57 3.94 5.1  6.29 ▅▆▅▇▇▆▃▂
    ltemp       0 4.31 0.29 3.48 4.42 4.62 ▂▁▁▁▁▅▇▅
```

```r
cor(dat2$ltemp, dat2$llint)
```

```
[1] -0.2104133
```
]]
.column.bg-brand-gray[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

* Data on 47 stars in the direction of Cygnus.
* `\(x=\)` logarithm of the effective temperature at the surface of the star.
* `\(y=\)` logarithm of the light intensity.
* .brand-blue[Outliers in] `\(\color{blue}{x}\)`.brand-blue[-direction but no reporting errors.]
* Most stars (43) lie on the "main sequence"; 4 stars are giants that are high leverage points.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Least Squares (LS) Estimate
]]
.row[.split-two[
.column[.content[
Given a (candidate) regression model
`\begin{eqnarray*}
Y_i &=& \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{(p-1)} x_{i(p-1)} + \epsilon_{i}.
\end{eqnarray*}`

* What if `\(\epsilon_i \not\sim NID(0,\sigma^2)\)`?
* Under (A1)-(A4), we used LS to estimate the `\(\beta\)`-parameters and `\(\sigma^2\)`, and to select the "best" model.
* The information is stored in `\(\text{RSS}(\mathbf{m})\)`!
* Is it possible to make better use of all, or only part, of the data?
* Aim: .brand-blue[Robustness. M-estimation]. But robust against what?
]]
.column.bg-brand-gray[.content[
Given a model `\(\mathbf{m}\)`,

* Residual sum of squares
`$$\text{RSS}(\mathbf{m}) = \sum_{i=1}^n (Y_i - \hat{Y}_{\mathbf{m} i})^2 = \sum_{i=1}^n R_{i}^2(\hat{\boldsymbol{\beta}};\mathbf{m})$$`
* To measure the average (expected) squared error of fit we simply use
$$\frac{1}{n}\text{RSS}(\mathbf{m}). $$
* Thus choose `\(\boldsymbol{\beta}_{\mathbf{m}}\)` such that
`$$\hat{\boldsymbol{\beta}}_{\text{LS}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \big(\operatorname{mean}\{R^2_{1}(\boldsymbol{\beta};\mathbf{m}),\ldots,R^2_{n}(\boldsymbol{\beta}; \mathbf{m})\}\big).$$`
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Use more robust distance measures: L1 regression
]]
.row[.split-two[
.column[.content[
* Use absolute values instead of squares,
`$$\hat{\boldsymbol{\beta}}_{\text{L1}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \big(\operatorname{mean}\{|R_{1}(\boldsymbol{\beta};\mathbf{m})|,\ldots,|R_{n}(\boldsymbol{\beta};\mathbf{m})|\}\big).$$`
* For example, `\(Y_i = \beta_0 + \beta_1 x_{i 1} + \epsilon_i\)`
`$$\Rightarrow \hat{\boldsymbol{\beta}}_{\text{L1}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{2}} \sum_{i=1}^n|Y_i - \beta_0 - \beta_1 x_{i 1}|.$$`
* `\(\hat{\boldsymbol{\beta}}_{\text{L1}}\)` can be obtained by .brand-blue[linear programming] (not covered in this course); a brute-force numerical check is sketched below.
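
As a sanity check on this definition (not how `rq` computes it, which is via linear programming), the L1 objective can be minimised directly with a general-purpose optimiser. A rough sketch, assuming the `phones` data frame `dat` from earlier; `l1_obj` and `fit_l1` are made-up names:

```r
# Brute-force minimisation of the mean absolute residual for calls ~ year.
# rq() solves the same problem far more efficiently and reliably.
l1_obj <- function(beta, y, x) mean(abs(y - beta[1] - beta[2] * x))
fit_l1 <- optim(par = coef(lm(calls ~ year, data = dat)),  # LS starting values
                fn = l1_obj, y = dat$calls, x = dat$year)
fit_l1$par  # should be close to coef(rq(calls ~ year, data = dat))
```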
]]
.column.bg-brand-gray[.content[
### Least absolute regression (L1)

* The idea goes back to Laplace (1749--1827), but credit is usually given to Edgeworth (1887).
* Robust against `\(y\)`-outliers, unlike LS regression.
* .brand-blue[Not robust against] `\(\color{blue}{x}\)`.brand-blue[-outliers.]
* In R use the `rq` function in `library(quantreg)`.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Loss function
]]
.row[.split-40[
.column[.content[
<img src="images/LossFunction.png" width="90%" height="90%">
]]
.column.bg-brand-gray[.content[
* The figure shows that large residuals are penalised much more heavily under the squared loss `\(R^2\)` (dashed red) than under the absolute loss `\(|R|\)` (solid red).
* Regression with the absolute loss is called .brand-blue[median regression], whereas regression with the squared loss is .brand-blue[mean regression].
* Relative to the usual mean regression, median regression is more robust but also harder to estimate: the loss function is not smooth at 0, so the usual optimisation by differentiation fails and linear programming is used instead.
* The LS and L1 loss functions are symmetric. The loss functions in blue are asymmetric and give .brand-blue[quantile regression], i.e. the regression is carried out at a quantile other than the median, e.g. the 95% quantile of a distribution (the slope is steeper to the right) (not covered).
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Robust regression in practice
]]
.row[.split-two[
.column[.content[

```r
library(quantreg) # quantile regression
M1_L1 <- rq(calls ~ year, data=dat)
M1_LS <- lm(calls ~ year, data=dat)
ggplot(dat, aes(year, calls)) + geom_point() +
  geom_abline(slope=coef(M1_LS)[2], intercept=coef(M1_LS)[1], color="red") +
  geom_abline(slope=coef(M1_L1)[2], intercept=coef(M1_L1)[1], color="blue") +
  labs(x="Year", y="Calls (in millions)")
```

<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
]]
.column.bg-brand-gray[.content[

```r
M2_L1 <- rq(llint ~ ltemp, data=dat2)
M2_LS <- lm(llint ~ ltemp, data=dat2)
ggplot(dat2, aes(ltemp, llint)) + geom_point() +
  geom_abline(slope=coef(M2_LS)[2], intercept=coef(M2_LS)[1], color="red") +
  geom_abline(slope=coef(M2_L1)[2], intercept=coef(M2_L1)[1], color="blue") +
  labs(y="Log light intensity", x="Log of temperature")
```

<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Diagnostic of Star data
]]
.row[.split-50[
.column[.content[

```r
library(ggfortify)
autoplot(M2_LS)
```

![](lecture13_2020JC_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]]
.column.bg-brand-gray[.content[
![](lecture13_2020JC_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

* The L1 line is even closer to the four giant stars.
* The plots show that these stars are `\(x\)`-outliers with high leverage ( `\(>2p/n=2(2)/47=0.085\)` ), but they are not flagged as influential since their Cook's distances are `\(<F^{-1}(0.5; p, n-p)=0.704\)`. Hence L1 is not robust to `\(x\)`-outliers. The thresholds can be checked directly, as in the sketch below.
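
The leverage and Cook's distance cut-offs quoted above can be computed directly. A minimal sketch using the fitted `M2_LS`; the object names `lev` and `cd` are mine:

```r
# Which stars exceed the rule-of-thumb thresholds for the star data?
p <- 2                         # number of regression coefficients
n <- nrow(dat2)                # 47 stars
lev <- hatvalues(M2_LS)        # leverages h_ii
cd  <- cooks.distance(M2_LS)   # Cook's distances
which(lev > 2 * p / n)         # high leverage: h_ii > 2p/n = 0.085
which(cd > qf(0.5, p, n - p))  # influential: D_i > F^{-1}(0.5; p, n-p) = 0.704
```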
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Alternatives to LS and L1: Robust M estimation
]]
.row[.split-two[
.column.bg-brand-gray[.content[
* Use M (maximum likelihood type) estimation (Huber, 1973, Annals of Statistics 1:799-821).
* Replace the squared residuals `\(R^2_i(\boldsymbol{\beta};\mathbf{m})\)` in the RSS, or the absolute residuals `\(|R_i(\boldsymbol{\beta};\mathbf{m})|\)`, by another function of the residuals
`$$\hat{\boldsymbol{\beta}}_{\text{M}} = \text{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^n \rho(R_i(\boldsymbol{\beta};\mathbf{m})/\hat{\sigma}),$$`
* where `\(\rho\)` is a .brand-blue[symmetric] function that ideally bounds the influence of `\(y\)`-outliers,
* and `\(\hat{\sigma}^2\)` is an estimator of the error variance.
* Examples: Huber, Hampel and Tukey's bisquare.
* In R use the `rlm` function in `library(MASS)`.
]]
.column[.content[
* These functions (next slide) grow more slowly than the squared loss beyond a threshold `\(c\)`: Huber's `\(\rho(R)\)` becomes linear for `\(|R|>c\)`, while Hampel's and Tukey's bisquare level off completely.
* It can be shown that the weight of each observation in the parameter estimation is given by the .brand-blue[Psi function] `\(\psi(R)=\rho'(R)/R\)`, which decreases with the magnitude of `\(R\)` for Huber, Hampel and Tukey's bisquare.
* For LS the weights are all equal, since `\(\psi(R)=\rho'(R)/R=2R/R=2\)`, and so LS is not robust to outliers.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Loss function and Psi function
]]
.row[.split-50[
.column[.content[
### Loss function and Psi function
<img src="images/RhoFunction.png" width="30%" height="30%"><img src="images/PsiFunction.png" width="29.5%" height="29.5%">
]]
.column.bg-brand-gray[.content[
### Psi function

```r
library(tidyverse) # gather() and the pipe; the psi.* functions come from MASS
x <- seq(-10, 10, 0.01)
data.frame(x=x, huber=psi.huber(x), hampel=psi.hampel(x), bisquare=psi.bisquare(x)) %>%
  gather("psi_function", "value", huber:bisquare) %>%
  ggplot(aes(x, value, color=psi_function)) +
  geom_line(size=1.25) +
  theme_bw(base_size=18)
```

<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# M estimation in practice
]]
.row[.split-two[
.column[.content[

```r
M2_M <- rlm(llint ~ ltemp, data=dat2)
ggplot(dat2, aes(ltemp, llint)) + geom_point() +
  geom_abline(slope=coef(M2_LS)[2], intercept=coef(M2_LS)[1], color="red") +
  geom_abline(slope=coef(M2_L1)[2], intercept=coef(M2_L1)[1], color="blue") +
  geom_abline(slope=coef(M2_M)[2], intercept=coef(M2_M)[1], color="green") +
  labs(y="Log light intensity", x="Log of temperature")
```

* By default `rlm` uses Huber's function with `\(c = 1.345\)`.
* The fits using .brand-red[LS] and .brand-green[M] are very similar because the LS (quadratic) and Huber (linear beyond `\(c\)`) loss functions both remain large for large residuals.
]]
.column[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Estimation to downweight high leverage points
]]
.row[.split-40[
.column[.content[
### Robust Generalised M (GM) estimation

* The previous results show that M estimators, like L1 regression, are not robust against high leverage points ( `\(x\)`-outliers).
* Idea: down-weight problematic points using a .brand-blue[weight] `\(w(\boldsymbol{x}_i)\)` in GM estimation.
* .brand-blue[Mallows type weights] (a rough illustration follows below)
`$$\hat{\boldsymbol{\beta}}_{\text{GM}} = \text{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^n w(\boldsymbol{x}_i)\, \rho\Big( R_i(\boldsymbol{\beta};\mathbf{m})/\hat{\sigma}\Big)$$`
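
Mallows-type weighting can be mimicked in R by supplying leverage-based case weights to `rlm`. This is only a rough sketch of the idea, not the exact GM estimator; the weight choice `\(w(\boldsymbol{x}_i)=\sqrt{1-h_{ii}}\)` and the names `w_x`, `M2_GM` are assumptions of mine:

```r
# Downweight high-leverage stars via case weights built from the LS leverages.
w_x <- sqrt(1 - hatvalues(M2_LS))   # small weight for x-outliers
M2_GM <- rlm(llint ~ ltemp, data = dat2,
             weights = w_x, wt.method = "case")
coef(M2_GM)                         # compare with coef(M2_M)
```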
]]
.column.bg-brand-gray[.content[
### LMS and LTS regression

* Summing over all observations is problematic when `\(x\)`-outliers are present.
* Alternative: estimate the centre of the distribution of the squared residuals .brand-blue[robustly].
* .brand-blue[Least median squares (LMS)] is defined as
`$$\hat{\boldsymbol{\beta}}_{\text{LMS}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \big(\operatorname{median}\{R^2_{1}(\boldsymbol{\beta};\mathbf{m}),\ldots,R^2_{n}(\boldsymbol{\beta};\mathbf{m})\}\big).$$`
* .brand-blue[Least trimmed squares (LTS)] is defined as
`$$\hat{\boldsymbol{\beta}}_{\text{LTS}} = \text{argmin}_{\boldsymbol{\beta}\in\Bbb{R}^{p_{\mathbf{m}}}} \sum_{i=1}^{n-k}R^2_{(i,n)}(\boldsymbol{\beta}; \mathbf{m}),$$`
where `\(R^2_{(1,n)} \leq R^2_{(2,n)} \leq \ldots \leq R^2_{(n,n)}\)` are the ordered squared residuals. The best robustness properties are obtained when roughly half the observations are trimmed ( `\(k \approx n/2\)` ); the recommended choice retains `\(n-k=\lfloor (n+p_{\mathbf{m}})/2 \rfloor\)` observations.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# Efficient regression: MM (modified M) estimator
]]
.row[.split-70[
.column[.content[
* Best of all worlds (Yohai, Stahel & Zamar, 1991, Directions in Robust Statistics and Diagnostics II, New York: Springer) in .brand-blue[hybrid methods]:
* Use a robust (median-based) measure of the centre of the (e.g. absolute) residual distribution (the idea of the median in LMS).
* .brand-blue[Minimise] a scale estimator by choosing a .brand-blue[subset of sample points] (the ideas behind L1, the median in LMS and trimming in LTS) to estimate `\(\sigma\)`, e.g.
`$$\hat{\sigma} = \text{MAD} = 1.483 \cdot \operatorname{median}\Big\{|R_i-\operatorname{median}\{R_i\}_{1\leq i \leq n} | \Big\}_{1\leq i \leq n}$$`
* Use a robust `\(\rho\)` function (the idea of the .brand-blue[M-estimator]) and choose `\(w(\boldsymbol{x}_i)\)` (the idea of the GM-estimator) to measure goodness of fit at location `\(\boldsymbol{x}_i\)`.
]]
.column.bg-brand-gray[.content[
* In R: use `rlm` in `library(MASS)` and specify the option `method="MM"`.
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# MM-estimate in practice
]]
.row[.split-60[
.column[.content[

```r
M2_MM <- rlm(llint ~ ltemp, data=dat2, method="MM")
ggplot(dat2, aes(ltemp, llint)) + geom_point() +
  geom_abline(slope=coef(M2_LS)[2], intercept=coef(M2_LS)[1], color="red") +
  geom_abline(slope=coef(M2_L1)[2], intercept=coef(M2_L1)[1], color="blue") +
  geom_abline(slope=coef(M2_M)[2], intercept=coef(M2_M)[1], color="green") +
  geom_abline(slope=coef(M2_MM)[2], intercept=coef(M2_MM)[1], color="black") +
  labs(y="Log light intensity", x="Log of temperature")
```

The LS, L1 and M estimators are all sensitive to `\(x\)`-outliers. The MM estimator gives a much better fit by downweighting (effectively excluding) the high leverage points.
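
Two of the quantities above can be computed directly for the star data: the MAD scale estimate (by hand versus `mad()`, which uses the constant 1.4826), and the LMS/LTS fits via `MASS::lqs`. A brief sketch; `M2_LTS` and `M2_LMS` are names introduced here:

```r
# MAD of the LS residuals, computed from the definition and via mad().
r <- residuals(M2_LS)
c(by_hand = 1.483 * median(abs(r - median(r))), builtin = mad(r))

# LMS and LTS fits for comparison with the MM fit above.
M2_LTS <- lqs(llint ~ ltemp, data = dat2, method = "lts")
M2_LMS <- lqs(llint ~ ltemp, data = dat2, method = "lms")
rbind(LTS = coef(M2_LTS), LMS = coef(M2_LMS), MM = coef(M2_MM))
```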
]]
.column[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />
]]
]]

---
class: split-10

.row.bg-brand-blue.white[.content.vmiddle[
# MM-estimate in practice
]]
.row[.split-60[
.column[.content[

```r
M1_M <- rlm(calls ~ year, data=dat)
M1_MM <- rlm(calls ~ year, data=dat, method="MM")
ggplot(dat, aes(year, calls)) + geom_point() +
  geom_abline(slope=coef(M1_LS)[2], intercept=coef(M1_LS)[1], color="red") +
  geom_abline(slope=coef(M1_L1)[2], intercept=coef(M1_L1)[1], color="blue") +
  geom_abline(slope=coef(M1_M)[2], intercept=coef(M1_M)[1], color="green") +
  geom_abline(slope=coef(M1_MM)[2], intercept=coef(M1_MM)[1], color="black") +
  labs(x="Year", y="Calls (in millions)")
```

There is a gradual improvement across the LS, M, L1 and MM estimators. The MM estimator gives the best fit, as it caters for both `\(x\)`- and `\(y\)`-outliers.
]]
.column[.content[
<img src="lecture13_2020JC_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />
]]
]]
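
To see the improvement numerically rather than graphically, the fitted intercepts and slopes can be tabulated side by side. A short sketch (not in the original slides), using the model objects defined above:

```r
# Intercept and slope for each fit of calls ~ year, as a 2 x 4 matrix.
sapply(list(LS = M1_LS, L1 = M1_L1, M = M1_M, MM = M1_MM), coef)
```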