Statistics Faculty Research and Publications

Recent Submissions

Now showing 1 - 20 of 49
  • ItemOpen Access
    Package ‘springer’
    (2023-09-19) Zhou, Fei; Liu, Yuwen; Lu, Xi; Ren, Jie; Wu, Cen
    Recently, regularized variable selection has emerged as a powerful tool to identify and dissect gene-environment interactions. Nevertheless, in longitudinal studies with high dimensional genetic factors, regularization methods for G×E interactions have not been systematically developed. In this package, we provide the implementation of sparse group variable selection, based on both the quadratic inference function (QIF) and generalized estimating equation (GEE), to accommodate the bi-level selection for longitudinal G×E studies with high dimensional genomic features. Alternative methods conducting only the group or individual level selection have also been included. The core modules of the package have been developed in C++.
  • ItemOpen Access
    Package ‘regnet’
    (2022-08-18) Ren, Jie; Jung, Luann C.; Du, Yinhao; Wu, Cen; Jiang, Yu; Liu, Junhao
    Network-based regularization has achieved success in variable selection for high-dimensional biological data due to its ability to incorporate correlations among genomic features. This package provides procedures of network-based variable selection for generalized linear models (Ren et al. (2017) and Ren et al. (2019)). Continuous, binary, and survival responses are supported. Robust network-based methods are available for continuous and survival responses.
  • ItemOpen Access
    Supporting Information for “Robust Bayesian variable selection for gene-environment interactions”
    (2022) Ren, Jie; Zhou, Fei; Li, Xiaoxi; Ma, Shuangge; Jiang, Yu; Wu, Cen
  • ItemOpen Access
    Identification of Prognostic Genes and Pathways in Lung Adenocarcinoma Using a Bayesian Approach
    (2017) Jiang, Yu; Huang, Yuan; Du, Yinhao; Zhao, Yinjun; Ren, Jie; Ma, Shuangge; Wu, Cen
    Lung cancer is the leading cause of cancer-associated mortality in the United States and the world. Adenocarcinoma, the most common subtype of lung cancer, is generally diagnosed at the late stage with poor prognosis. In the past, extensive effort has been devoted to elucidating lung cancer pathogenesis and pinpointing genes associated with survival outcomes. As the progression of lung cancer is a complex process that involves coordinated actions of functionally associated genes from cancer-related pathways, there is a growing interest in simultaneous identification of both prognostic pathways and important genes within those pathways. In this study, we analyse The Cancer Genome Atlas lung adenocarcinoma data using a Bayesian approach incorporating the pathway information as well as the interconnections among genes. The top 11 pathways have been found to play significant roles in lung adenocarcinoma prognosis, including pathways in mitogen-activated protein kinase signalling, cytokine-cytokine receptor interaction, and ubiquitin-mediated proteolysis. We have also located key gene signatures such as RELB, MAP4K1, and UBE2C. These results indicate that the Bayesian approach may facilitate discovery of important genes and pathways that are tightly associated with the survival of patients with lung adenocarcinoma.
  • ItemOpen Access
    Package ‘spinBayes’
    (2020-06-22) Ren, Jie; Zhou, Fei; Li, Xiaoxi; Wu, Cen; Jiang, Yu
    Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020)). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
  • ItemOpen Access
    Package ‘roben’
    (2020) Ren, Jie; Zhou, Fei; Li, Xiaoxi; Wu, Cen
    Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust penalization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
  • ItemOpen Access
    Package ‘interep’
    (2020) Zhou, Fei; Ren, Jie; Li, Xiaoxi; Wu, Cen; Jiang, Yu
    Extensive penalized variable selection methods have been developed in the past two decades for analyzing high dimensional omics data, such as gene expressions, single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and others. However, lipidomics data have been rarely investigated by using high dimensional variable selection methods. This package incorporates our recently developed penalization procedures to conduct interaction analysis for high dimensional lipidomics data with repeated measurements. The core module of this package is developed in C++. The development of this software package and the associated statistical methods have been partially supported by an Innovative Research Award from Johnson Cancer Research Center, Kansas State University.
  • ItemOpen Access
    Genomic Prediction Accounting for Residual Heteroskedasticity
    Ou, Z. N.; Tempelman, R. J.; Steibel, J. P.; Ernst, C. W.; Bates, R. O.; Bello, Nora M.
    Whole-genome prediction (WGP) models that use single-nucleotide polymorphism marker information to predict genetic merit of animals and plants typically assume homogeneous residual variance. However, variability is often heterogeneous across agricultural production systems and may subsequently bias WGP-based inferences. This study extends classical WGP models based on normality, heavy-tailed specifications and variable selection to explicitly account for environmentally-driven residual heteroskedasticity under a hierarchical Bayesian mixed-models framework. WGP models assuming homogeneous or heterogeneous residual variances were fitted to training data generated under simulation scenarios reflecting a gradient of increasing heteroskedasticity. Model fit was based on pseudo-Bayes factors and also on prediction accuracy of genomic breeding values computed on a validation data subset one generation removed from the simulated training dataset. Homogeneous vs. heterogeneous residual variance WGP models were also fitted to two quantitative traits, namely 45-min postmortem carcass temperature and loin muscle pH, recorded in a swine resource population dataset prescreened for high and mild residual heteroskedasticity, respectively. Fit of competing WGP models was compared using pseudo-Bayes factors. Predictive ability, defined as the correlation between predicted and observed phenotypes in validation sets of a five-fold cross-validation was also computed. Heteroskedastic error WGP models showed improved model fit and enhanced prediction accuracy compared to homoskedastic error WGP models although the magnitude of the improvement was small (less than two percentage points net gain in prediction accuracy). Nevertheless, accounting for residual heteroskedasticity did improve accuracy of selection, especially on individuals of extreme genetic merit.
  • ItemOpen Access
    In utero exposure to polychlorinated biphenyls is associated with decreased fecundability in daughters of Michigan female fisheaters: a cohort study
    Han, L.; Hsu, Wei-Wen; Todem, D.; Osuch, J.; Hungerink, A.; Karmaus, W.
    Background: Multiple studies have suggested a relationship between adult exposures to environmental organochlorines and fecundability. There is a paucity of data, however, regarding fetal exposure to organochlorines via the mother's blood and fecundability of adult female offspring. Methods: Data from a two-generation cohort of maternal fisheaters was investigated to assess female offspring fecundability. Serum concentrations of polychlorinated biphenyls (PCBs) and 1,1-bis-(4-chlorophenyl)-2,2-dichloroethene (DDE) in Michigan female anglers were serially measured between 1973 and 1991 and used to estimate in utero exposure in their female offspring using two different methods. The angler cohort included 391 women of whom 259 provided offspring information. Of 213 daughters aged 20-50, 151 participated (71 %) and provided information for time intervals of unprotected intercourse (TUI). The daughters reported 308 TUIs (repeated observations), of which 288 ended in pregnancy. We estimated the fecundability ratio (FR) for serum-PCB and serum-DDE adjusting for confounders and accounting for repeated measurements. An FR below one indicates a longer time to pregnancy. Results: Compared to serum-PCB of <2.5 μg/L, the FR was 0.60 for serum-PCB between 2.5-7.4 μg/L [95 % confidence intervals (CI) 0.36, 0.99], and 0.42 [95 % CI 0.20, 0.88] for serum-PCB > 7.4 μg/L. Similar results were obtained using the alternative statistical method to estimate in utero serum-PCB. The association was stronger for TUIs when women planned a baby; FR = 0.50 for serum-PCB between 2.5-7.4 μg/L, [95 % CI 0.29, 0.89], and 0.30 [95 % CI 0.13, 0.68] for serum-PCB > 7.4 μg/L. There was no relationship between in utero exposure to DDE and fecundability in daughters.
    Conclusions: Decreased fecundability in female offspring of fisheaters was found to be associated with PCB exposure in utero, possibly related to endocrine disruption in the oocyte and/or other developing organs influencing reproductive capacity in adulthood.
  • ItemOpen Access
    Inferential considerations for low-count RNA-seq transcripts: a case study on the dominant prairie grass Andropogon gerardii
    Raithel, S.; Johnson, Loretta C.; Galliart, M.; Brown, Susan J.; Shelton, J.; Herndon, N.; Bello, Nora M.
    Background: Differential expression (DE) analysis of RNA-seq data still poses inferential challenges, such as handling of transcripts characterized by low expression levels. In this study, we use a plasmode-based approach to assess the relative performance of alternative inferential strategies on RNA-seq transcripts, with special emphasis on transcripts characterized by a small number of read counts, so-called low-count transcripts, as motivated by an ecological application in prairie grasses. Big bluestem (Andropogon gerardii) is a wide-ranging dominant prairie grass of ecological and agricultural importance to the US Midwest while edaphic subspecies sand bluestem (A. gerardii ssp. Hallii) grows exclusively on sand dunes. Relative to big bluestem, sand bluestem exhibits qualitative phenotypic divergence consistent with enhanced drought tolerance, plausibly associated with transcripts of low expression levels. Our dataset consists of RNA-seq read counts for 25,582 transcripts (60 % of which are classified as low-count) collected from leaf tissue of individual plants of big bluestem (n = 4) and sand bluestem (n = 4). Focused on low-count transcripts, we compare alternative ad-hoc data filtering techniques commonly used in RNA-seq pipelines and assess the inferential performance of recently developed statistical methods for DE analysis, namely DESeq2 and edgeR robust. These methods attempt to overcome the inherently noisy behavior of low-count transcripts by either shrinkage or differential weighting of observations, respectively. Results: Both DE methods seemed to properly control family-wise type 1 error on low-count transcripts, whereas edgeR robust showed greater power and DESeq2 showed greater precision and accuracy. However, specification of the degree of freedom parameter under edgeR robust had a non-trivial impact on inference and should be handled carefully. 
    When properly specified, both DE methods showed overall promising inferential performance on low-count transcripts, suggesting that ad-hoc data filtering steps at arbitrary expression thresholds may be unnecessary. A note of caution is in order regarding the approximate nature of DE tests under both methods. Conclusions: Practical recommendations for DE inference are provided when low-count RNA-seq transcripts are of interest, as is the case in the comparison of subspecies of bluestem grasses. Insights from this study may also be relevant to other applications focused on transcripts of low expression levels.
  • ItemOpen Access
    To See or Not to See: Do Front of Pack Nutrition Labels Affect Attention to Overall Nutrition Information?
    Bix, L.; Sundar, R. P.; Bello, Nora M.; Peltier, C.; Weatherspoon, L. J.; Becker, M. W.
  • ItemOpen Access
    Cell Based Drug Delivery: Micrococcus luteus Loaded Neutrophils as Chlorhexidine Delivery Vehicles in a Mouse Model of Liver Abscesses in Cattle
    Wendel, S. O.; Menon, S.; Alshetaiwi, H.; Shrestha, Tej Bahadur; Chlebanowski, L.; Hsu, Wei-Wen; Bossmann, Stefan H.; Narayanan, Sanjeev K.; Troyer, Deryl L.
  • ItemOpen Access
    Examining multiracial youth in context: ethnic identity development and mental health outcomes
    (2015-04-06) Fisher, Sycarah; Reynolds, Jennifer L.; Hsu, Wei-Wen; Barnes, Jessica; Tyler, Kenneth
    Although multiracial individuals are the fastest growing population in the United States, research on the identity development of multiracial adolescents remains scant. This study explores the relationship between ethnic identity, its components (affirmation, exploration), and mental health outcomes (anxiety, depression) within the contexts of schools for multiracial adolescents. Participants were multiracial and monoracial minority and majority high school students (n=4,766). Using Analysis of Variance and Multiple Indicators Multiple Causes (MIMIC) models, results indicated that multiracial youth experience more exploration and less affirmation than African Americans, but more than Caucasians. In addition, multiracial youth were found to have higher levels of mental health issues than their monoracial minority and majority peers. Specifically, multiracial youth had higher levels of depression than their African American and Caucasian counterparts. Multiracial and Caucasian youth had similar levels of anxiety but these levels were significantly higher than African Americans. Results also show that school diversity can mitigate mental health outcomes finding that multiracial youth in more diverse schools are at lower risk for mental health issues.
  • ItemOpen Access
    A sequential naïve Bayes classifier for DNA barcodes
    (2015-03-04) Anderson, Michael P.; Dubnicka, Suzanne R.
    DNA barcodes are short strands of 255–700 nucleotide bases taken from the cytochrome c oxidase subunit 1 (COI) region of the mitochondrial DNA. It has been proposed that these barcodes may be used as a method of differentiating between biological species. Current methods of species classification utilize distance measures that are heavily dependent on both evolutionary model assumptions as well as a clearly defined “gap” between intra- and interspecies variation. Such distance measures fail to measure classification uncertainty or to indicate how much of the barcode is necessary for classification. We propose a sequential naïve Bayes classifier for species classification to address these limitations. The proposed method is shown to provide accurate species-level classification on real and simulated data. The method proposed here quantifies the uncertainty of each classification and addresses how much of the barcode is necessary.
  • ItemOpen Access
    Estimating Mixture of Gaussian Processes by Kernel Smoothing
    (2014-12-03) Huang, Mian; Li, Runze; Wang, Hansheng; Yao, Weixin
    When functional data are not homogeneous, for example, when there are multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this article, we propose a new estimation procedure for the mixture of Gaussian processes, to incorporate both functional and inhomogeneous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions. The model is shown to be identifiable, and can be estimated efficiently by a combination of the ideas from expectation-maximization (EM) algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset.
  • ItemOpen Access
    Minimum profile Hellinger distance estimation for a semiparametric mixture model
    (2014-12-03) Xiang, Sijia; Yao, Weixin; Wu, Jingjing
    In this paper, we propose a new effective estimator for a class of semiparametric mixture models where one component has known distribution with possibly unknown parameters while the other component density and the mixing proportion are unknown. Such semiparametric mixture models have been often used in multiple hypothesis testing and the sequential clustering algorithm. The proposed estimator is based on the minimum profile Hellinger distance (MPHD), and its theoretical properties are investigated. In addition, we use simulation studies to illustrate the finite sample performance of the MPHD estimator and compare it with some other existing approaches. The empirical studies demonstrate that the new method outperforms existing estimators when data are generated under contamination and works comparably to existing estimators when data are not contaminated. Applications to two real data sets are also provided to illustrate the effectiveness of the new methodology.
  • ItemOpen Access
    The role of packaging size on contamination rates during simulated presentation to a sterile field
    (2014-10-01) Trier, Tony; Bello, Nora M.; Bush, Tamara Reid; Bix, Laura
    Objective: The objective of this study was to assess the impact of package size on the contact between medical devices and non-sterile surfaces (i.e. the hands of the practitioner and the outside of the package) during aseptic presentation to a simulated sterile field. Rationale for this objective stems from the decades-long problem of hospital-acquired infections. This work approaches the problem from a unique perspective, namely packaging size. Design: Randomized complete block design with subsampling. Setting: Research study conducted at professional conferences for surgical technologists and nursing professionals. Participants: Ninety-seven healthcare providers, primarily surgical technologists and nurses. Methods: Participants were gloved and asked to present the contents of six pouches of three different sizes to a simulated sterile field. The exterior of pouches and gloves of participants were coated with a simulated contaminant prior to each opening trial. After presentation to the simulated sterile field, the presence of the contaminant on package contents was recorded as indicative of contact with non-sterile surfaces and analyzed in a binary fashion using a generalized linear mixed model. Results: Recruited subjects were 26–64 years of age (81 females, 16 males), with 2.5–44 years of professional experience. Results indicated a significant main effect of pouch size on contact rate of package contents (P = 0.0108), whereby larger pouches induced greater rates of contact than smaller pouches (estimates±SEM: 14.7±2.9% vs. 6.0±1.7%, respectively). Discussion and Conclusion: This study utilized novel methodologies which simulate contamination in aseptic presentation. Results of this work indicate that increased contamination rates are associated with larger pouches when compared to smaller pouches. The results add to a growing body of research which investigate packaging's role in serving as a pathway for product contamination during aseptic presentation. 
    Future work should investigate other packaging design factors (e.g. material, rigidity, and closure systems) and their role in contamination.
  • ItemOpen Access
    DDE and PCB serum concentration in maternal blood and their adult female offspring
    (2014-07-16) Hsu, Wei-Wen; Osuch, Janet Rose; Todem, David; Taffe, Bonita; O’Keefe, Michael; Adera, Selamawit; Karmaus, Wilfried
    Background: Dichlorodiphenyl dichloroethylene (DDE) and polychlorinated biphenyls (PCBs) can be passed from mother to offspring through placental transfer or breast feeding. Unknown is whether maternal levels can predict concentrations in adult offspring. Objectives: To test the association between maternal blood levels of DDE and PCBs and adult female offspring levels of these compounds using data from the Michigan Fisheaters’ Cohort. Methods: DDE and PCB concentrations were determined in 132 adult daughters from 84 mothers. Prenatal exposures were estimated based on maternal DDE and PCB serum levels measured between 1973 and 1991. Levels in adult daughters were regressed on maternal and estimated prenatal exposure levels, adjusting for potential confounders using linear mixed models. Confounders included daughter’s age, birth order, birth weight, number of pregnancies, the length of time the daughter was breast-fed, the length of time the daughter breast-fed her own children, last year fish-eating status, body mass index, and lipid weight. Results: The median age of the participants was 40.4 years (range 18.4 to 65.4, 5–95 percentiles 22.5-54.6%, respectively). Controlling for confounders and intra-familial associations, DDE and PCB concentrations in adult daughters were significantly positively associated with estimated prenatal levels and with maternal concentrations. The proportion of variance in the adult daughters’ organochlorine concentrations explained by the maternal exposure levels is approximately 23% for DDE and 43% for PCBs. The equivalent of a median of 3.67 μg/L prenatal DDE and a median of 2.56 μg/L PCBs were 15.64 and 10.49 years of fish consumption, respectively. When controlling for effects of the shared environment (e.g., fish diet) by using a subsample of paternal levels measured during the same time frames (n=53 and n=37), we determined that the direct maternal transfer remains important.
    Conclusions: Estimated intrauterine DDE and PCB levels predicted concentrations in adult female offspring 40 years later. Interpretation of adverse health effects from intrauterine exposures of persistent pollutants may need to consider the sustained impact of maternal DDE and PCB levels found in their offspring.
  • ItemOpen Access
    A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction
    (2014-06-24) Dai, Zhijun; Wang, Lifeng; Chen, Yuan; Wang, Haiyan; Bai, Lianyang; Yuan, Zheming
    In this paper, we present a pipeline to perform improved QSAR analysis of peptides. The modeling involves a double selection procedure that first performs feature selection and then conducts sample selection before the final regression analysis. Five hundred and thirty-one physicochemical property parameters of amino acids were used as descriptors to characterize the structure of peptides. These high-dimensional descriptors then go through a feature selection process given by the Binary Matrix Shuffling Filter (BMSF) to obtain a set of important low dimensional features. Each descriptor that passed the BMSF filtering also receives a weight defined through its contribution to reducing the estimation error. These selected features served as the predictors for subsequent sample selection and modeling. Based on the weighted Euclidean distances between samples, a common range was determined with a high-dimensional semivariogram and then used as a threshold to select the near-neighbor samples from the training set. For each sample to be predicted, the QSAR model was established using SVR with the weighted, selected features based on the exclusive set of near-neighbor training samples. Prediction was conducted for each test sample accordingly. The performance of this pipeline is tested with the QSAR analysis of angiotensin-converting enzyme (ACE) inhibitors and HLA-A*0201 data sets. Improved prediction accuracy was obtained in both applications. This pipeline can optimize the QSAR modeling from both the feature selection and sample selection perspectives. This leads to improved accuracy over single selection methods. We expect this pipeline to have extensive application prospects in the field of regression prediction.
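
The near-neighbor sample selection step described in the last abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the feature weights, and the `max_range` threshold are all hypothetical, and a plain local mean stands in for the SVR model that the abstract actually uses. It only shows how a weighted Euclidean distance and a range threshold restrict each prediction to nearby training samples.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def near_neighbors(train_X, query, weights, max_range):
    """Indices of training samples within max_range of the query sample."""
    return [
        i
        for i, row in enumerate(train_X)
        if weighted_distance(row, query, weights) <= max_range
    ]


def predict_local(train_X, train_y, query, weights, max_range):
    """Predict from near neighbors only.

    Assumption: the local mean is a simple stand-in for the SVR fit
    named in the abstract; it is not the published method.
    """
    idx = near_neighbors(train_X, query, weights, max_range)
    if not idx:
        return None  # no training sample falls inside the range
    return sum(train_y[i] for i in idx) / len(idx)


# Hypothetical toy data: four training samples with two selected features.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
train_y = [1.0, 2.0, 3.0, 10.0]
weights = [1.0, 0.25]  # hypothetical per-feature weights
max_range = 2.0        # hypothetical threshold; the abstract derives it from a semivariogram

print(near_neighbors(train_X, [0.0, 0.0], weights, max_range))
print(predict_local(train_X, train_y, [0.0, 0.0], weights, max_range))
```

In this toy setup the distant fourth sample is excluded from the local fit, which is the point of the sample selection step: each test sample is modeled only from training samples that lie within the common range.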