--- title: "robustlm: Robust Variable Selection with Exponential Squared Loss" author: "Borui Tang, Jin Zhu" output: rmarkdown::html_vignette bibliography: ref.bib vignette: > %\VignetteIndexEntry{Robust Variable Selection with Exponential Squared Loss} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE ) ``` The goal of the **robustlm** package is to carry out robust variable selection through exponential squared loss [@wangRobustVariableSelection2013]. Specifically, it solves the following optimization problem: \[ \arg\min_{\beta} \sum_{i=1}^n(1-\exp\{-(y_i-x_i^T\beta)^2/\gamma_n\})+n\sum_{i=1}^d \lambda_{n j}|\beta_{j}|, \] where penalty function is the well-known adaptive LASSO [@zou2006alasso]. $y_i$s are responses and $x_i$ are predictors. $\gamma_n$ is the tuning parameter controlling the degree of robustness and efficiency of the regression estimators. Block coordinate gradient descent algorithm [@tsengCoordinateGradientDescent2009] is used to efficiently solve the optimization problem. The tuning parameter $\gamma_n$ and regularization parameter $\lambda_{nj}$ are chosen adaptively by default, while they can be supplied by the user. ## Quick Start This is a basic example which shows you how to use this package. First, we generate data which contain influential points in the response. Let covariate $x_i$ follows a multivariate normal distribution $N(0, \Sigma)$ where $\Sigma_{ij}=0.5^{|i-j|}$, and the error term $\epsilon$ follows a mixture normal distribution $0.8N(0,1)+0.2N(10,6^2)$. The response is generated by $Y=X^T\beta+\epsilon$, where $\beta=(1, 1.5, 2, 1, 0, 0, 0, 0)^T$. ```{r} set.seed(1) library(MASS) N <- 1000 p <- 8 rho <- 0.5 beta_true <- c(1, 1.5, 2, 1, 0, 0, 0, 0) H <- abs(outer(1:p, 1:p, "-")) V <- rho^H X <- mvrnorm(N, rep(0, p), V) # generate error term from a mixture normal distribution components <- sample(1:2, prob=c(0.8, 0.2), size=N, replace=TRUE) mus <- c(0, 10) sds <- c(1, 6) err <- rnorm(n=N,mean=mus[components],sd=sds[components]) Y <- X %*% beta_true + err ``` Regression diagnostics for simple linear regression provide some intuition about the data. ```{r} plot(lm(Y ~ X)) ``` The Q-Q plot suggests the normal assumption is not valid. Thus, a robust variable selection procedure is needed. We apply *robustlm* function to select important variables: ```{r} library(robustlm) robustlm1 <- robustlm(X, Y) robustlm1 ``` The estimated regression coefficients $(0.94, 1.58, 2.07, 0.95, 0.00, 0.00, 0.00, 0.00)$ are close to the true values$(1, 1.5, 2, 1, 0, 0, 0, 0)$. There is no mistaken selection or discard. **robustlm** package also provides generic *coef* function to extract estimated coefficients and *predict* function to make a prediction: ```{r} coef(robustlm1) Y_pred <- predict(robustlm1, X) head(Y_pred) ``` ## Boston Housing Price Dataset We apply this package to analysis the Boston Housing Price Dataset, which is available in **MASS** package. The data contain 14 variables and 506 observations. The response variable is medv (median value of owner-occupied homes in thousand dollars), and the rest are the predictors. Here the predictors are scaled to have zero mean and unit variance. The responses are centerized. 
## Boston Housing Price Dataset

We apply this package to analyze the Boston housing price dataset, which is available in the **MASS** package. The data contain 14 variables and 506 observations. The response variable is medv (the median value of owner-occupied homes, in thousands of dollars), and the rest are predictors. Here the predictors are scaled to have zero mean and unit variance, and the response is centered.

```{r}
data(Boston, package = "MASS")
head(Boston)
```

```{r}
# scaling and centering
Boston[, -14] <- scale(Boston[, -14])
Boston[, 14] <- scale(Boston[, 14], scale = FALSE)

# diagnostics
set.seed(1)
x <- as.matrix(Boston[, -14])
y <- Boston[, 14]
lm_OLS <- lm(y ~ x - 1)
plot(lm_OLS)
```

The diagnostic plots suggest that the residuals may not follow a normal distribution, so we use **robustlm** to carry out robust variable selection.

```{r}
# robust variable selection
robustlm2 <- robustlm(x, y)
coef(robustlm2)
```

In this example, **robustlm** selects seven of the predictors and discards six. Looking at the correlations below, it appears that **robustlm** tends to select variables that are more strongly correlated with the response.

```{r}
cor(x, y)
```
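It is also instructive to place the robust slope estimates next to the ordinary least squares coefficients from `lm_OLS` fitted above. The `beta` component of the fitted object stores the slopes (it is used by the plotting code below as well):

```{r}
# robust slope estimates next to the ordinary least squares coefficients
cbind(robust = robustlm2$beta, OLS = coef(lm_OLS))
```

Large discrepancies between the two columns flag coefficients where the outlying responses pull the least squares fit away from the robust one.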
The following procedure roughly illustrates the robustness. Take the predictor rm (the average number of rooms per dwelling) as an example. We draw a two-dimensional scatter plot with the fitted regression line overlaid. As shown below, the fitted line is not heavily affected by the outliers.

```{r, fig.align='center', fig.keep='all', echo=FALSE, eval=FALSE}
# scatter plot of medv against rm with the robust fitted line overlaid
library(ggplot2)
beta <- robustlm2$beta
fun6 <- function(z) beta[6] * z
data6 <- data.frame(medv = y, rm = x[, 6])
p <- ggplot(data = data6, aes(x = rm, y = medv)) +
  geom_function(fun = fun6) +
  geom_point() +
  theme_bw()
p
ggsave(p, filename = "boston.jpg", height = 3.6, width = 3.6)
```

```{r, eval=FALSE, echo=FALSE}
# the analogous plot for each of the 13 predictors
library(ggplot2)
beta <- robustlm2$beta
for (i in 1:13) {
  fun <- function(z) beta[i] * z
  dat <- data.frame(y = y, x = x[, i])
  pl <- ggplot(data = dat, aes(x = x, y = y)) +
    geom_function(fun = fun) +
    geom_point()
  print(pl)
}
```

# References