Statistics Faculty Research and Publications

Permanent URI for this collection: https://hdl.handle.net/2097/12343

Open Access
Package 'mixedBayes': Bayesian Longitudinal Regularized Quantile Mixed Model
(CRAN, 2025-05-01) Fan, Kun; Wu, Cen
With high-dimensional omics features, repeated measure ANOVA leads to longitudinal gene-environment interaction studies that have intra-cluster correlations, outlying observations and structured sparsity arising from the ANOVA design. In this package, we have developed robust sparse Bayesian mixed effect models tailored for the above studies (Fan et al. (2025)). An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in 'C++'. The development of this software package and the associated statistical methods has been partially supported by an Innovative Research Award from Johnson Cancer Research Center, Kansas State University.
Open Access
Is Seeing Believing? A Practitioner’s Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies
(2024-09-16) Fan, Kun; Subedi, Srijana; Yang, Gongshun; Lu, Xi; Ren, Jie; Wu, Cen
Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the “unpenalized loss function + penalty term” formulation for regularization methods and the “likelihood function × shrinkage prior” framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
Open Access
Package ‘springer’
(2023-09-19) Zhou, Fei; Liu, Yuwen; Lu, Xi; Ren, Jie; Wu, Cen
Recently, regularized variable selection has emerged as a powerful tool to identify and dissect gene-environment interactions. Nevertheless, in longitudinal studies with high-dimensional genetic factors, regularization methods for G×E interactions have not been systematically developed. In this package, we provide the implementation of sparse group variable selection, based on both the quadratic inference function (QIF) and generalized estimating equation (GEE), to accommodate the bi-level selection for longitudinal G×E studies with high-dimensional genomic features. Alternative methods conducting only the group or individual level selection have also been included. The core modules of the package have been developed in C++.
Open Access
Package ‘regnet’
(2022-08-18) Ren, Jie; Jung, Luann C.; Du, Yinhao; Wu, Cen; Jiang, Yu; Liu, Junhao
Network-based regularization has achieved success in variable selection for high-dimensional biological data due to its ability to incorporate correlations among genomic features. This package provides procedures of network-based variable selection for generalized linear models (Ren et al. (2017) and Ren et al. (2019)). Continuous, binary, and survival responses are supported. Robust network-based methods are available for continuous and survival responses.
Open Access
Supporting Information for “Robust Bayesian variable selection for gene-environment interactions”
(2022) Ren, Jie; Zhou, Fei; Li, Xiaoxi; Ma, Shuangge; Jiang, Yu; Wu, Cen
Open Access
Identification of Prognostic Genes and Pathways in Lung Adenocarcinoma Using a Bayesian Approach
(2017) Jiang, Yu; Huang, Yuan; Du, Yinhao; Zhao, Yinjun; Ren, Jie; Ma, Shuangge; Wu, Cen
Lung cancer is the leading cause of cancer-associated mortality in the United States and the world. Adenocarcinoma, the most common subtype of lung cancer, is generally diagnosed at the late stage with poor prognosis. In the past, extensive effort has been devoted to elucidating lung cancer pathogenesis and pinpointing genes associated with survival outcomes. As the progression of lung cancer is a complex process that involves coordinated actions of functionally associated genes from cancer-related pathways, there is a growing interest in simultaneous identification of both prognostic pathways and important genes within those pathways. In this study, we analyse The Cancer Genome Atlas lung adenocarcinoma data using a Bayesian approach incorporating the pathway information as well as the interconnections among genes. The top 11 pathways have been found to play significant roles in lung adenocarcinoma prognosis, including pathways in mitogen-activated protein kinase signalling, cytokine-cytokine receptor interaction, and ubiquitin-mediated proteolysis. We have also located key gene signatures such as RELB, MAP4K1, and UBE2C. These results indicate that the Bayesian approach may facilitate discovery of important genes and pathways that are tightly associated with the survival of patients with lung adenocarcinoma.
Open Access
Package ‘spinBayes’
(2020-06-22) Ren, Jie; Zhou, Fei; Li, Xiaoxi; Wu, Cen; Jiang, Yu
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020)). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
Open Access
Package 'roben'
(2020) Ren, Jie; Zhou, Fei; Li, Xiaoxi; Wu, Cen
Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust penalization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
Open Access
Package ‘interep’
(2020) Zhou, Fei; Ren, Jie; Li, Xiaoxi; Wu, Cen; Jiang, Yu
Extensive penalized variable selection methods have been developed in the past two decades for analyzing high dimensional omics data, such as gene expressions, single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and others. However, lipidomics data have been rarely investigated by using high dimensional variable selection methods. This package incorporates our recently developed penalization procedures to conduct interaction analysis for high dimensional lipidomics data with repeated measurements. The core module of this package is developed in C++. The development of this software package and the associated statistical methods have been partially supported by an Innovative Research Award from Johnson Cancer Research Center, Kansas State University.
Open Access
Genomic Prediction Accounting for Residual Heteroskedasticity
Ou, Z. N.; Tempelman, R. J.; Steibel, J. P.; Ernst, C. W.; Bates, R. O.; Bello, Nora M.
Whole-genome prediction (WGP) models that use single-nucleotide polymorphism marker information to predict genetic merit of animals and plants typically assume homogeneous residual variance. However, variability is often heterogeneous across agricultural production systems and may subsequently bias WGP-based inferences. This study extends classical WGP models based on normality, heavy-tailed specifications and variable selection to explicitly account for environmentally-driven residual heteroskedasticity under a hierarchical Bayesian mixed-models framework. WGP models assuming homogeneous or heterogeneous residual variances were fitted to training data generated under simulation scenarios reflecting a gradient of increasing heteroskedasticity. Model fit was based on pseudo-Bayes factors and also on prediction accuracy of genomic breeding values computed on a validation data subset one generation removed from the simulated training dataset. Homogeneous vs. heterogeneous residual variance WGP models were also fitted to two quantitative traits, namely 45-min postmortem carcass temperature and loin muscle pH, recorded in a swine resource population dataset prescreened for high and mild residual heteroskedasticity, respectively. Fit of competing WGP models was compared using pseudo-Bayes factors. Predictive ability, defined as the correlation between predicted and observed phenotypes in validation sets of a five-fold cross-validation, was also computed. Heteroskedastic error WGP models showed improved model fit and enhanced prediction accuracy compared to homoskedastic error WGP models, although the magnitude of the improvement was small (less than two percentage points net gain in prediction accuracy). Nevertheless, accounting for residual heteroskedasticity did improve accuracy of selection, especially on individuals of extreme genetic merit.
Open Access
In utero exposure to polychlorinated biphenyls is associated with decreased fecundability in daughters of Michigan female fisheaters: a cohort study
Han, L.; Hsu, Wei-Wen; Todem, D.; Osuch, J.; Hungerink, A.; Karmaus, W.
Background: Multiple studies have suggested a relationship between adult exposures to environmental organochlorines and fecundability. There is a paucity of data, however, regarding fetal exposure to organochlorines via the mother's blood and fecundability of adult female offspring. Methods: Data from a two-generation cohort of maternal fisheaters were investigated to assess female offspring fecundability. Serum concentrations of polychlorinated biphenyls (PCBs) and 1,1-bis-(4-chlorophenyl)-2,2-dichloroethene (DDE) in Michigan female anglers were serially measured between 1973 and 1991 and used to estimate in utero exposure in their female offspring using two different methods. The angler cohort included 391 women, of whom 259 provided offspring information. Of 213 daughters aged 20-50, 151 participated (71%) and provided information for time intervals of unprotected intercourse (TUI). The daughters reported 308 TUIs (repeated observations), of which 288 ended in pregnancy. We estimated the fecundability ratio (FR) for serum-PCB and serum-DDE adjusting for confounders and accounting for repeated measurements. An FR below one indicates a longer time to pregnancy. Results: Compared to serum-PCB of <2.5 µg/L, the FR was 0.60 for serum-PCB between 2.5-7.4 µg/L [95% confidence interval (CI) 0.36, 0.99], and 0.42 [95% CI 0.20, 0.88] for serum-PCB > 7.4 µg/L. Similar results were obtained using the alternative statistical method to estimate in utero serum-PCB. The association was stronger for TUIs when women planned a baby; FR = 0.50 for serum-PCB between 2.5-7.4 µg/L [95% CI 0.29, 0.89], and 0.30 [95% CI 0.13, 0.68] for serum-PCB > 7.4 µg/L. There was no relationship between in utero exposure to DDE and fecundability in daughters. Conclusions: Decreased fecundability in female offspring of fisheaters was found to be associated with PCB exposure in utero, possibly related to endocrine disruption in the oocyte and/or other developing organs influencing reproductive capacity in adulthood.
Open Access
Inferential considerations for low-count RNA-seq transcripts: a case study on the dominant prairie grass Andropogon gerardii
Raithel, S.; Johnson, Loretta C.; Galliart, M.; Brown, Susan J.; Shelton, J.; Herndon, N.; Bello, Nora M.
Background: Differential expression (DE) analysis of RNA-seq data still poses inferential challenges, such as handling of transcripts characterized by low expression levels. In this study, we use a plasmode-based approach to assess the relative performance of alternative inferential strategies on RNA-seq transcripts, with special emphasis on transcripts characterized by a small number of read counts, so-called low-count transcripts, as motivated by an ecological application in prairie grasses. Big bluestem (Andropogon gerardii) is a wide-ranging dominant prairie grass of ecological and agricultural importance to the US Midwest while edaphic subspecies sand bluestem (A. gerardii ssp. Hallii) grows exclusively on sand dunes. Relative to big bluestem, sand bluestem exhibits qualitative phenotypic divergence consistent with enhanced drought tolerance, plausibly associated with transcripts of low expression levels. Our dataset consists of RNA-seq read counts for 25,582 transcripts (60% of which are classified as low-count) collected from leaf tissue of individual plants of big bluestem (n = 4) and sand bluestem (n = 4). Focused on low-count transcripts, we compare alternative ad-hoc data filtering techniques commonly used in RNA-seq pipelines and assess the inferential performance of recently developed statistical methods for DE analysis, namely DESeq2 and edgeR robust. These methods attempt to overcome the inherently noisy behavior of low-count transcripts by either shrinkage or differential weighting of observations, respectively. Results: Both DE methods seemed to properly control family-wise type 1 error on low-count transcripts, whereas edgeR robust showed greater power and DESeq2 showed greater precision and accuracy. However, specification of the degree of freedom parameter under edgeR robust had a non-trivial impact on inference and should be handled carefully. When properly specified, both DE methods showed overall promising inferential performance on low-count transcripts, suggesting that ad-hoc data filtering steps at arbitrary expression thresholds may be unnecessary. A note of caution is in order regarding the approximate nature of DE tests under both methods. Conclusions: Practical recommendations for DE inference are provided when low-count RNA-seq transcripts are of interest, as is the case in the comparison of subspecies of bluestem grasses. Insights from this study may also be relevant to other applications focused on transcripts of low expression levels.
Open Access
SPATIO-TEMPORAL MODELS FOR SOME DATA SETS IN CONTINUOUS SPACE AND DISCRETE TIME
Demel, S. S.; Du, Juan
Open Access
To See or Not to See: Do Front of Pack Nutrition Labels Affect Attention to Overall Nutrition Information?
Bix, L.; Sundar, R. P.; Bello, Nora M.; Peltier, C.; Weatherspoon, L. J.; Becker, M. W.
Open Access
Cell Based Drug Delivery: Micrococcus luteus Loaded Neutrophils as Chlorhexidine Delivery Vehicles in a Mouse Model of Liver Abscesses in Cattle
Wendel, S. O.; Menon, S.; Alshetaiwi, H.; Shrestha, Tej Bahadur; Chlebanowski, L.; Hsu, Wei-Wen; Bossmann, Stefan H.; Narayanan, Sanjeev K.; Troyer, Deryl L.
Open Access
Examining multiracial youth in context: ethnic identity development and mental health outcomes
(2015-04-06) Fisher, Sycarah; Reynolds, Jennifer L.; Hsu, Wei-Wen; Barnes, Jessica; Tyler, Kenneth
Although multiracial individuals are the fastest growing population in the United States, research on the identity development of multiracial adolescents remains scant. This study explores the relationship between ethnic identity, its components (affirmation, exploration), and mental health outcomes (anxiety, depression) within the contexts of schools for multiracial adolescents. Participants were multiracial and monoracial minority and majority high school students (n=4,766). Using Analysis of Variance and Multiple Indicators Multiple Causes (MIMIC) models, results indicated that multiracial youth experience more exploration and less affirmation than African Americans, but more than Caucasians. In addition, multiracial youth were found to have higher levels of mental health issues than their monoracial minority and majority peers. Specifically, multiracial youth had higher levels of depression than their African American and Caucasian counterparts. Multiracial and Caucasian youth had similar levels of anxiety, but these levels were significantly higher than those of African Americans. Results also show that school diversity can mitigate mental health outcomes: multiracial youth in more diverse schools are at lower risk for mental health issues.
Open Access
A sequential naïve Bayes classifier for DNA barcodes
(2015-03-04) Anderson, Michael P.; Dubnicka, Suzanne R.
DNA barcodes are short strands of 255–700 nucleotide bases taken from the cytochrome c oxidase subunit 1 (COI) region of the mitochondrial DNA. It has been proposed that these barcodes may be used as a method of differentiating between biological species. Current methods of species classification utilize distance measures that are heavily dependent on both evolutionary model assumptions as well as a clearly defined “gap” between intra- and interspecies variation. Such distance measures fail to measure classification uncertainty or to indicate how much of the barcode is necessary for classification. We propose a sequential naïve Bayes classifier for species classification to address these limitations. The proposed method is shown to provide accurate species-level classification on real and simulated data. The method proposed here quantifies the uncertainty of each classification and addresses how much of the barcode is necessary.
Open Access
Estimating Mixture of Gaussian Processes by Kernel Smoothing
(2014-12-03) Huang, Mian; Li, Runze; Wang, Hansheng; Yao, Weixin
When functional data are not homogeneous, for example, when there are multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this article, we propose a new estimation procedure for the mixture of Gaussian processes, to incorporate both functional and inhomogeneous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions. The model is shown to be identifiable, and can be estimated efficiently by a combination of the ideas from the expectation-maximization (EM) algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset.
Open Access
Minimum profile Hellinger distance estimation for a semiparametric mixture model
(2014-12-03) Xiang, Sijia; Yao, Weixin; Wu, Jingjing
In this paper, we propose a new effective estimator for a class of semiparametric mixture models where one component has a known distribution with possibly unknown parameters while the other component density and the mixing proportion are unknown. Such semiparametric mixture models have often been used in multiple hypothesis testing and the sequential clustering algorithm. The proposed estimator is based on the minimum profile Hellinger distance (MPHD), and its theoretical properties are investigated. In addition, we use simulation studies to illustrate the finite sample performance of the MPHD estimator and compare it with some other existing approaches. The empirical studies demonstrate that the new method outperforms existing estimators when data are generated under contamination and works comparably to existing estimators when data are not contaminated. Applications to two real data sets are also provided to illustrate the effectiveness of the new methodology.
Open Access
The role of packaging size on contamination rates during simulated presentation to a sterile field
(2014-10-01) Trier, Tony; Bello, Nora M.; Bush, Tamara Reid; Bix, Laura
Objective: The objective of this study was to assess the impact of package size on the contact between medical devices and non-sterile surfaces (i.e. the hands of the practitioner and the outside of the package) during aseptic presentation to a simulated sterile field. Rationale for this objective stems from the decades-long problem of hospital-acquired infections. This work approaches the problem from a unique perspective, namely packaging size. Design: Randomized complete block design with subsampling. Setting: Research study conducted at professional conferences for surgical technologists and nursing professionals. Participants: Ninety-seven healthcare providers, primarily surgical technologists and nurses. Methods: Participants were gloved and asked to present the contents of six pouches of three different sizes to a simulated sterile field. The exterior of pouches and gloves of participants were coated with a simulated contaminant prior to each opening trial. After presentation to the simulated sterile field, the presence of the contaminant on package contents was recorded as indicative of contact with non-sterile surfaces and analyzed in a binary fashion using a generalized linear mixed model. Results: Recruited subjects were 26–64 years of age (81 females, 16 males), with 2.5–44 years of professional experience. Results indicated a significant main effect of pouch size on contact rate of package contents (P = 0.0108), whereby larger pouches induced greater rates of contact than smaller pouches (estimates ± SEM: 14.7 ± 2.9% vs. 6.0 ± 1.7%, respectively). Discussion and Conclusion: This study utilized novel methodologies which simulate contamination in aseptic presentation. Results of this work indicate that increased contamination rates are associated with larger pouches when compared to smaller pouches. The results add to a growing body of research which investigates packaging's role in serving as a pathway for product contamination during aseptic presentation. Future work should investigate other packaging design factors (e.g. material, rigidity, and closure systems) and their role in contamination.