High-dimensional variable selection in longitudinal and nonlinear gene-environment interaction studies




Variable selection in both frequentist and Bayesian frameworks has gained increasing popularity in the analysis of high-dimensional genomic data. Despite the success of existing studies, challenges remain: tailored methods for sparse interaction structures are not available when the response variables are repeatedly measured and/or follow heavy-tailed distributions. These challenges have motivated the novel variable selection methods developed in the following projects. Software packages from these projects are publicly available to facilitate fast and reliable computation, as well as reproducible research.

In the first project, we have developed a novel penalized variable selection method to identify important lipid–environment interactions in a longitudinal lipidomics study, where the environment factors are a group of dummy variables corresponding to a four-level treatment factor. An efficient Newton–Raphson-based algorithm was proposed within the generalized estimating equation (GEE) framework. Simulation studies have demonstrated the superior performance of our method over alternatives in terms of both identification accuracy and prediction performance. Analysis of the high-dimensional lipid datasets collected from mice in a skin cancer prevention study identified meaningful markers that provide fresh insight into the mechanism underlying cancer-preventive effects.

In the second project, we have proposed a sparse group penalization method for bi-level gene-environment (GxE) interaction studies with repeatedly measured phenotypes, which accommodates more general environment factors. Within the quadratic inference function (QIF) framework, the proposed method achieves simultaneous identification of main and interaction effects at both the group and individual levels. We conducted simulation studies to establish the advantage of the proposed regularization method. In the case study, the environment factors include age, gender and treatment, which are either continuous or categorical. Our method leads to improved prediction and identification of main and interaction effects with important implications.
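
As a rough illustration of what bi-level selection means, a generic sparse group penalty can be written as below. The notation is assumed for this sketch and is not taken from the dissertation: \beta_j collects the main effect of the j-th genetic factor together with its interactions with q environment factors, \beta_{jk} is the k-th entry of that group, and \rho(\cdot;\lambda) stands for any sparsity-inducing penalty function (the abstract does not say which one is used).

```latex
\[
\mathrm{pen}(\beta)
  = \sum_{j=1}^{p} \rho\bigl(\lVert \beta_j \rVert_2 ;\, \lambda_1\bigr)
  + \sum_{j=1}^{p} \sum_{k=1}^{q+1} \rho\bigl(\lvert \beta_{jk} \rvert ;\, \lambda_2\bigr)
\]
```

In a penalty of this form, the first term can shrink an entire group to zero, while the second selects individual effects within the groups that remain.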

In the third project, a sparse Bayesian quantile varying coefficient model has been developed for non-linear GxE studies. The proposed model can accommodate heavy-tailed errors and outliers in the disease phenotypes while pinpointing important non-linear interactions through Bayesian variable selection based on spike-and-slab priors. Fast computation has been facilitated by an efficient Gibbs sampler. Simulation studies and real data analysis with age as the univariate environment factor have been performed to show the superiority of the proposed method over multiple competing alternatives.
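
The robustness of quantile-based modeling to outliers can be seen in a minimal sketch of the standard check loss, \rho_\tau(u) = u(\tau - 1[u < 0]), on which quantile regression is built. This is a toy illustration only, not the dissertation's Bayesian model or its Gibbs sampler; the function names `check_loss` and `empirical_quantile_fit`, the toy data and the search grid are all assumptions introduced here.

```python
import numpy as np


def check_loss(u, tau):
    """Standard quantile check loss: u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    # Negative residuals are weighted by (tau - 1), non-negative ones by tau.
    return u * np.where(u < 0, tau - 1.0, tau)


def empirical_quantile_fit(y, tau, grid):
    """Hypothetical helper: grid value minimizing the total check loss."""
    losses = [check_loss(y - c, tau).sum() for c in grid]
    return grid[int(np.argmin(losses))]


# Toy data (assumed): four ordinary values and one gross outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0.0, 100.0, 1001)

median_fit = empirical_quantile_fit(y, 0.5, grid)  # location at tau = 0.5
mean_fit = y.mean()                                # least-squares location
```

With these toy values, the fit at tau = 0.5 stays at the sample median, while the mean is pulled toward the outlier. The same loss evaluated at other values of tau targets other quantiles of the response.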

Open-source R packages with C++ implementations of all the methods under comparison have been provided along with this dissertation. The R packages interep and springer, for the first two projects respectively, are available on CRAN. The R package for the last project, on the Bayesian regularized quantile varying coefficient model, will be released to the public soon.



Gene-environment interaction, Regularization, Longitudinal, Bayesian variable selection, Quantile regression, Non-parametric modeling



Doctor of Philosophy


Department of Statistics

Major Professor

Cen Wu