Robust Bayesian method to sparse high-dimensional regression
Abstract
In high-dimensional regression problems, the demand for robust variable selection arises from the widely observed outliers and heavy-tailed distributions of the response variable, as well as from model misspecification when structured sparsity is not properly accounted for. Although frequentist robust regularization methods have been developed extensively, the understanding of robust analysis within the Bayesian framework is still limited. In this dissertation, we systematically develop a suite of robust Bayesian penalization methods that can efficiently accommodate data contamination while identifying complicated underlying sparsity patterns. In the first project, we propose a marginal robust Bayesian variable selection method for gene-environment (G×E) studies. Our study has been motivated by the difficulty of choosing sensible tuning parameters and the lack of inferential power in existing robust marginal penalization methods. In particular, the Laplacian likelihood has been adopted in the Bayesian hierarchical model to accommodate data contamination and outliers. With the incorporation of spike-and-slab priors, we have implemented a Gibbs sampler, a Markov chain Monte Carlo (MCMC) algorithm, for posterior sampling. The proposed method outperforms a number of alternatives in extensive simulation studies. The utility of the proposed method has been further demonstrated using data from the Nurses' Health Study (NHS). In the second project, we develop a novel Bayesian quantile elastic net with spike-and-slab priors, which significantly improves over multiple versions of elastic net regularization methods. The Bayesian formulation of quantile regression distinguishes the proposed method from the least-squares-based elastic net in the presence of long-tailed distributions of the disease phenotype and heteroscedasticity in the regression error.
Incorporation of the spike-and-slab priors leads to higher variable selection accuracy than Bayesian methods that rely on the credible interval criterion or the scaled neighborhood criterion. In particular, the proposed method enables exact statistical inference by providing Bayesian credible intervals with nominal coverage probabilities. The advantage of the proposed method has been demonstrated in both the simulation study and the analysis of a biomedical dataset with high-dimensional genomic features. In the third project, we consider robust Bayesian subgroup analysis of samples drawn from a population with an underlying grouping structure, which is an important step toward developing individualized treatment strategies for personalized medicine. Our approach can successfully recover group membership while retaining the robustness property in the Bayesian framework. Numerical studies have shown the superiority of the developed method over alternatives when subgroup structure detection is conducted in the presence of data contamination. Promising results have also been obtained in a case study using the breast cancer data from The Cancer Genome Atlas (TCGA). An important component of this dissertation is the development of user-friendly, publicly available software packages for reproducible research and broad dissemination of my work to the scientific community. To facilitate reproducible results and fast computation, we have developed high-speed R packages with C++ implementations for my dissertation projects. The R package marble for my first project and the R package Bayenet for my second project are available on CRAN as well as on my GitHub pages. Currently, we are working on the R package for the third project.
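The robustness of the Laplacian and quantile-based likelihoods mentioned above can be illustrated with a small sketch. It relies on a standard fact rather than on anything specific to this dissertation: maximizing an asymmetric Laplace likelihood is equivalent to minimizing the quantile check loss, and the case tau = 0.5 reduces to the Laplace (least absolute deviation) case. The function names and the toy data below are hypothetical, chosen only for illustration; this is not the implementation used in marble or Bayenet, and it omits the spike-and-slab priors and the Gibbs sampler entirely.

```python
import math


def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def ald_neg_log_lik(y, mu, tau, sigma=1.0):
    """Negative log-likelihood of the sample y under an asymmetric Laplace
    density f(y) = tau * (1 - tau) / sigma * exp(-rho_tau((y - mu) / sigma)).
    """
    n = len(y)
    const = -n * math.log(tau * (1.0 - tau) / sigma)
    return const + sum(check_loss((yi - mu) / sigma, tau) for yi in y)


def best_location(y, tau):
    """Location estimate under the asymmetric Laplace likelihood.

    The objective is piecewise linear in mu, so a minimiser can always be
    found among the observed values; we simply search over them.
    """
    return min(y, key=lambda m: ald_neg_log_lik(y, m, tau))


# Toy response with one gross outlier (made-up numbers, illustration only).
y = [1.0, 2.0, 3.0, 4.0, 100.0]

median_fit = best_location(y, tau=0.5)  # Laplace likelihood: sample median
mean_fit = sum(y) / len(y)              # Gaussian likelihood: sample mean

print("Laplace-based location:", median_fit)
print("Least-squares location:", mean_fit)
```

In this toy example the single outlier pulls the least-squares estimate far from the bulk of the data, while the Laplace-based estimate stays at the sample median. This is the intuition behind using such likelihoods for contaminated or heavy-tailed responses.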