Variable selection for longitudinal and irregular high-dimensional data

Date

2024

Publisher

Kansas State University

Abstract

Variable selection is a commonly used method for analyzing high-dimensional genomic data. It has been designed to handle complicated data structures and to facilitate the identification of crucial genomic features by creating sparse models. Despite the success of existing studies, challenges remain when the response variables are repeatedly measured or have heavy-tailed distributions. These challenges have motivated the development of the novel variable selection methods proposed in the following projects, from both frequentist and Bayesian perspectives. In longitudinal studies, regularized variable selection methods have been extensively developed to accommodate the correlation among repeated measurements. Despite this success, they remain limited, especially in accommodating structured sparsity. For example, strong correlations generally exist among omics features. Ignoring such correlations while conducting variable selection in longitudinal studies results in false identification and biased estimation. In the first project, we have proposed a network-based variable selection method for repeatedly measured disease phenotypes. The strong interconnections among the omics predictors are efficiently accommodated while performing variable selection. An efficient Newton-Raphson-based algorithm was adopted within the generalized estimating equation (GEE) framework. The advantage of the proposed method has been demonstrated in extensive simulations and in a study of the Childhood Asthma Management Program (CAMP) with high-dimensional single nucleotide polymorphism (SNP) data. With the rapid development of biotechnologies, high-dimensional omics data from different biological systems have been collected and analyzed in order to detect the genes associated with certain diseases.
However, this also brings challenges to the analysis, since the number of candidate genetic features is far greater than the sample size and phenotype data can follow heavy-tailed distributions. In the second project, we proposed the spike-and-slab quantile LASSO for prediction and variable selection on large-scale genotype data. Our model uses a spike-and-slab prior formed as a mixture of two double-exponential distributions. This allows us to keep the important effects while shrinking the coefficients of irrelevant features to exactly zero. An expectation-maximization (EM) algorithm based on cyclic coordinate descent was used to fit the spike-and-slab quantile regression. The performance of the proposed method was compared with that of other commonly used approaches, including LASSO, quantile LASSO, and spike-and-slab LASSO, in terms of identification and estimation accuracy in extensive simulation studies. The results indicate that the proposed approach delivers more accurate variable selection and prediction. We also applied the proposed procedure to the analysis of the lung adenocarcinoma (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA). The analysis shows that the proposed method selects disease-related genes and yields more accurate predictions. In the third project, we have developed a nonparametric spike-and-slab quantile group LASSO-based mixed model for high-dimensional time-varying SNP effects under longitudinal traits. The nonparametric time-varying effects are modeled through varying coefficient models. After basis expansion, the splines for the varying coefficients are selected through the spike-and-slab quantile group LASSO, which integrates the merits of Bayesian and frequentist group LASSO while circumventing their individual drawbacks. In addition, the intra-cluster correlations are accommodated through random effects.
We have established the advantage of the proposed model over Bayesian and frequentist alternatives across a variety of settings involving data contamination and model misspecification. On longitudinal GWAS data of asthma, the proposed method has yielded better prediction results in the time-varying heritability plot and has identified SNPs with important biological implications. Overall, for repeated measurement data, we have developed a network-based penalized variable selection method under the GEE framework for analyzing high-dimensional single nucleotide polymorphism (SNP) data combined with longitudinal features and responses. To accommodate nonlinear trends, we have also developed a nonparametric spike-and-slab quantile group LASSO-based mixed model for high-dimensional time-varying SNP effects under longitudinal traits. Under the univariate setting, we have developed a novel variable selection method by applying the spike-and-slab LASSO prior to Bayesian quantile regression, constructing the spike-and-slab quantile LASSO (ssQLASSO) to analyze clinical trial data sets containing high-dimensional genotype data and heavy-tailed phenotype data. An open-source R package has been developed for the projects to enable reproducible research and fast computation. It offers efficient C++ implementations of all the proposed methods and alternative approaches. The package is available on CRAN.
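
As an illustration of two ingredients named in the abstract, the following is a minimal Python sketch of the quantile (check) loss and of a spike-and-slab prior built from two double-exponential distributions. It is not the dissertation's implementation or the API of its R package. The function names, the mixing weight `theta`, and the scales `s0` (spike) and `s1` (slab) are hypothetical, and the weighting formula follows the general form commonly used in EM algorithms for spike-and-slab LASSO models, which is assumed here rather than taken from the source.

```python
import math


def check_loss(residual, tau):
    """Quantile (check) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def double_exponential_pdf(beta, scale):
    """Density of a double-exponential (Laplace) distribution centered at 0."""
    return math.exp(-abs(beta) / scale) / (2.0 * scale)


def slab_probability(beta, theta, s0, s1):
    """Conditional probability that beta was drawn from the slab component.

    theta is the prior slab probability (hypothetical name); s0 < s1 are the
    spike and slab scales (hypothetical names).
    """
    slab = theta * double_exponential_pdf(beta, s1)
    spike = (1.0 - theta) * double_exponential_pdf(beta, s0)
    return slab / (slab + spike)


def penalty_weight(beta, theta, s0, s1):
    """Effective L1 penalty weight: the expected inverse scale under the mixture.

    Coefficients near zero are attributed to the spike and penalized heavily;
    large coefficients are attributed to the slab and penalized lightly.
    """
    p = slab_probability(beta, theta, s0, s1)
    return (1.0 - p) / s0 + p / s1


if __name__ == "__main__":
    # Assumed illustrative settings, not values from the dissertation.
    for beta in (0.01, 0.5, 2.0):
        print(beta, slab_probability(beta, 0.5, 0.05, 1.0),
              penalty_weight(beta, 0.5, 0.05, 1.0))
```

In this sketch a small coefficient receives a weight close to `1 / s0` and a large one a weight close to `1 / s1`, which is the adaptive shrinkage behavior the abstract describes: important effects are kept while irrelevant ones are pushed toward zero. The check loss replaces squared error so that heavy-tailed responses have less influence on the fit.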

Keywords

High-dimensional, Longitudinal, Variable selection, Expectation–maximization algorithm, Generalized estimating equation, Mixed model

Graduation Month

August

Degree

Doctor of Philosophy

Department

Department of Statistics

Major Professor

Cen Wu

Type

Dissertation
