robustlm: Robust Variable Selection with Exponential Squared Loss

The goal of the robustlm package is to carry out robust variable selection through exponential squared loss (Wang et al. 2013). Specifically, it solves the following optimization problem:

$$ \arg\min_{\beta} \sum_{i=1}^n(1-\exp\{-(y_i-x_i^T\beta)^2/\gamma_n\})+n\sum_{i=1}^d \lambda_{n j}|\beta_{j}|, $$ where penalty function is the well-known adaptive LASSO (Zou 2006). yis are responses and xi are predictors. γn is the tuning parameter controlling the degree of robustness and efficiency of the regression estimators. Block coordinate gradient descent algorithm (Tseng and Yun 2009) is used to efficiently solve the optimization problem. The tuning parameter γn and regularization parameter λnj are chosen adaptively by default, while they can be supplied by the user.

Quick Start

This is a basic example which shows you how to use this package. First, we generate data which contain influential points in the response. Let covariate xi follows a multivariate normal distribution N(0, Σ) where Σij = 0.5|i − j|, and the error term ϵ follows a mixture normal distribution 0.8N(0, 1) + 0.2N(10, 62). The response is generated by Y = XTβ + ϵ, where β = (1, 1.5, 2, 1, 0, 0, 0, 0)T.

set.seed(1)
library(MASS)
N <- 1000
p <- 8
rho <- 0.5
beta_true <- c(1, 1.5, 2, 1, 0, 0, 0, 0)
H <- abs(outer(1:p, 1:p, "-"))
V <- rho^H
X <- mvrnorm(N, rep(0, p), V)

# generate error term from a mixture normal distribution
components <- sample(1:2, prob=c(0.8, 0.2), size=N, replace=TRUE)
mus <- c(0, 10)
sds <- c(1, 6)
err <- rnorm(n=N,mean=mus[components],sd=sds[components])

Y <- X %*% beta_true + err

Regression diagnostics for simple linear regression provide some intuition about the data.

plot(lm(Y ~ X))

The Q-Q plot suggests the normal assumption is not valid. Thus, a robust variable selection procedure is needed. We apply robustlm function to select important variables:

library(robustlm)
robustlm1 <- robustlm(X, Y)
robustlm1
#> $beta
#> [1] 0.9795562 1.5606554 2.0355552 0.9774514 0.0000000 0.0000000 0.0000000
#> [8] 0.0000000
#> 
#> $alpha
#> [1] 0
#> 
#> $gamma
#> [1] 8.3
#> 
#> $weight
#> [1]  87.140346   6.787335   4.404298   3.396719   6.648872 379.818663 101.841412
#> [8] 221.226545 212.512461
#> 
#> $loss
#> [1] 250.5937

The estimated regression coefficients (0.94, 1.58, 2.07, 0.95, 0.00, 0.00, 0.00, 0.00) are close to the true values(1, 1.5, 2, 1, 0, 0, 0, 0). There is no mistaken selection or discard.

robustlm package also provides generic coef function to extract estimated coefficients and predict function to make a prediction:

coef(robustlm1)
#> $alpha
#> [1] 0
#> 
#> $beta
#> [1] 0.9795562 1.5606554 2.0355552 0.9774514 0.0000000 0.0000000 0.0000000
#> [8] 0.0000000
Y_pred <- predict(robustlm1, X)
head(Y_pred)
#>            [,1]
#> [1,]  3.9500529
#> [2,]  0.6414250
#> [3,]  3.2478387
#> [4,] -1.6781785
#> [5,]  0.1788287
#> [6,] -0.2711302

Boston Housing Price Dataset

We apply this package to analysis the Boston Housing Price Dataset, which is available in MASS package. The data contain 14 variables and 506 observations. The response variable is medv (median value of owner-occupied homes in thousand dollars), and the rest are the predictors. Here the predictors are scaled to have zero mean and unit variance. The responses are centerized.

data(Boston, package = "MASS")
head(Boston)
#>      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
#> 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
#> 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
#> 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
#> 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
#> 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
#> 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
#>   medv
#> 1 24.0
#> 2 21.6
#> 3 34.7
#> 4 33.4
#> 5 36.2
#> 6 28.7
# scaling and centering
Boston[, -14] <- scale(Boston[, -14])
Boston[, 14] <- scale(Boston[, 14], scale = FALSE)

# diagnostic
set.seed(1)
x <- as.matrix(Boston[, -14])
y <- Boston[, 14]
lm_OLS <- lm(y ~ x - 1)
plot(lm_OLS)

The diagnostic plots suggest the residuals may not follow normal distribution, we use robustlm to carry out variable selection with robustness.

# robustlm
robustlm2 <- robustlm(x, y)
coef(robustlm2)
#> $alpha
#> [1] -0.8696961
#> 
#> $beta
#>  [1]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  4.4744083
#>  [7] -0.7759438 -0.6542423  0.0000000 -0.6474789 -1.2535192  1.3645150
#> [13] -1.6753801

In this example, robustlm selected seven of the predictors while discarding six of them. Take a look at the correlations, it appears robustlm tends to select variables which have higher correlations with the response.

cor(x, y)
#>               [,1]
#> crim    -0.3883046
#> zn       0.3604453
#> indus   -0.4837252
#> chas     0.1752602
#> nox     -0.4273208
#> rm       0.6953599
#> age     -0.3769546
#> dis      0.2499287
#> rad     -0.3816262
#> tax     -0.4685359
#> ptratio -0.5077867
#> black    0.3334608
#> lstat   -0.7376627

The following procedure roughly illustrates the robustness. Take the predictor rm (average number of rooms per dwelling) as an example. We draw a 2-dimensional scatter plot with the fitted regression line on it. As shown below, the fitted line is not heavily affected by the outliers.

References

Tseng, Paul, and Sangwoon Yun. 2009. “A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization.” Mathematical Programming 117 (1): 387–423. https://doi.org/10.1007/s10107-007-0170-0.
Wang, Xueqin, Yunlu Jiang, Mian Huang, and Heping Zhang. 2013. “Robust Variable Selection with Exponential Squared Loss.” Journal of the American Statistical Association 108 (502): 632–43. https://doi.org/10.1080/01621459.2013.766613.
Zou, Hui. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29. https://doi.org/10.1198/016214506000000735.