Statistics Faculty Research and Publications
Permanent URI for this collection: https://hdl.handle.net/2097/12343
Recent Submissions
Item Open Access Is Seeing Believing? A Practitioner’s Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies (2024-09-16)
Fan, Kun; Subedi, Srijana; Yang, Gongshun; Lu, Xi; Ren, Jie; Wu, Cen
Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the “unpenalized loss function + penalty term” formulation for regularization methods and the “likelihood function × shrinkage prior” framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
Item Open Access Package ‘springer’ (2023-09-19)
Zhou, Fei; Liu, Yuwen; Lu, Xi; Ren, Jie; Wu, Cen
Recently, regularized variable selection has emerged as a powerful tool to identify and dissect gene-environment interactions.
Nevertheless, in longitudinal studies with high dimensional genetic factors, regularization methods for G×E interactions have not been systematically developed. In this package, we provide the implementation of sparse group variable selection, based on both the quadratic inference function (QIF) and generalized estimating equation (GEE), to accommodate the bi-level selection for longitudinal G×E studies with high dimensional genomic features. Alternative methods conducting only the group or individual level selection have also been included. The core modules of the package have been developed in C++.
Item Open Access Package ‘regnet’ (2022-08-18)
Ren, Jie; Jung, Luann C.; Du, Yinhao; Wu, Cen; Jiang, Yu; Liu, Junhao
Network-based regularization has achieved success in variable selection for high-dimensional biological data due to its ability to incorporate correlations among genomic features. This package provides procedures of network-based variable selection for generalized linear models (Ren et al. (2017) and Ren et al. (2019)). Continuous, binary, and survival responses are supported. Robust network-based methods are available for continuous and survival responses.
Item Open Access Supporting Information for “Robust Bayesian variable selection for gene-environment interactions” (2022)
Ren, Jie; Zhou, Fei; Li, Xiaoxi; Ma, Shuangge; Jiang, Yu; Wu, Cen
Item Open Access Identification of Prognostic Genes and Pathways in Lung Adenocarcinoma Using a Bayesian Approach (2017)
Jiang, Yu; Huang, Yuan; Du, Yinhao; Zhao, Yinjun; Ren, Jie; Ma, Shuangge; Wu, Cen
Lung cancer is the leading cause of cancer-associated mortality in the United States and the world. Adenocarcinoma, the most common subtype of lung cancer, is generally diagnosed at the late stage with poor prognosis. In the past, extensive effort has been devoted to elucidating lung cancer pathogenesis and pinpointing genes associated with survival outcomes.
As the progression of lung cancer is a complex process that involves coordinated actions of functionally associated genes from cancer-related pathways, there is a growing interest in simultaneous identification of both prognostic pathways and important genes within those pathways. In this study, we analyse The Cancer Genome Atlas lung adenocarcinoma data using a Bayesian approach incorporating the pathway information as well as the interconnections among genes. The top 11 pathways have been found to play significant roles in lung adenocarcinoma prognosis, including pathways in mitogen-activated protein kinase signalling, cytokine-cytokine receptor interaction, and ubiquitin-mediated proteolysis. We have also located key gene signatures such as RELB, MAP4K1, and UBE2C. These results indicate that the Bayesian approach may facilitate discovery of important genes and pathways that are tightly associated with the survival of patients with lung adenocarcinoma.
Item Open Access Package ‘spinBayes’ (2020-06-22)
Ren, Jie; Zhou, Fei; Li, Xiaoxi; Wu, Cen; Jiang, Yu; jieren@ksu.edu; wucen@ksu.edu
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020)). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group level to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly.
The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
Item Open Access Package ‘roben’ (2020)
Ren, Jie; Zhou, Fei; Li, Xiaoxi; Wu, Cen
Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust penalization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.
Item Open Access Package ‘interep’ (2020)
Zhou, Fei; Ren, Jie; Li, Xiaoxi; Wu, Cen; Jiang, Yu
Extensive penalized variable selection methods have been developed in the past two decades for analyzing high dimensional omics data, such as gene expressions, single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and others. However, lipidomics data have been rarely investigated by using high dimensional variable selection methods. This package incorporates our recently developed penalization procedures to conduct interaction analysis for high dimensional lipidomics data with repeated measurements. The core module of this package is developed in C++.
The development of this software package and the associated statistical methods have been partially supported by an Innovative Research Award from Johnson Cancer Research Center, Kansas State University.
Item Open Access Genomic Prediction Accounting for Residual Heteroskedasticity
Ou, Z. N.; Tempelman, R. J.; Steibel, J. P.; Ernst, C. W.; Bates, R. O.; Bello, Nora M.
Whole-genome prediction (WGP) models that use single-nucleotide polymorphism marker information to predict genetic merit of animals and plants typically assume homogeneous residual variance. However, variability is often heterogeneous across agricultural production systems and may subsequently bias WGP-based inferences. This study extends classical WGP models based on normality, heavy-tailed specifications and variable selection to explicitly account for environmentally-driven residual heteroskedasticity under a hierarchical Bayesian mixed-models framework. WGP models assuming homogeneous or heterogeneous residual variances were fitted to training data generated under simulation scenarios reflecting a gradient of increasing heteroskedasticity. Model fit was based on pseudo-Bayes factors and also on prediction accuracy of genomic breeding values computed on a validation data subset one generation removed from the simulated training dataset. Homogeneous vs. heterogeneous residual variance WGP models were also fitted to two quantitative traits, namely 45-min postmortem carcass temperature and loin muscle pH, recorded in a swine resource population dataset prescreened for high and mild residual heteroskedasticity, respectively. Fit of competing WGP models was compared using pseudo-Bayes factors. Predictive ability, defined as the correlation between predicted and observed phenotypes in validation sets of a five-fold cross-validation was also computed.
Heteroskedastic error WGP models showed improved model fit and enhanced prediction accuracy compared to homoskedastic error WGP models although the magnitude of the improvement was small (less than two percentage points net gain in prediction accuracy). Nevertheless, accounting for residual heteroskedasticity did improve accuracy of selection, especially on individuals of extreme genetic merit.
Item Open Access In utero exposure to polychlorinated biphenyls is associated with decreased fecundability in daughters of Michigan female fisheaters: a cohort study
Han, L.; Hsu, Wei-Wen; Todem, D.; Osuch, J.; Hungerink, A.; Karmaus, W.
Background: Multiple studies have suggested a relationship between adult exposures to environmental organochlorines and fecundability. There is a paucity of data, however, regarding fetal exposure to organochlorines via the mother's blood and fecundability of adult female offspring. Methods: Data from a two-generation cohort of maternal fisheaters was investigated to assess female offspring fecundability. Serum concentrations of polychlorinated biphenyls (PCBs) and 1,1-bis-(4-chlorophenyl)-2,2-dichloroethene (DDE) in Michigan female anglers were serially measured between 1973 and 1991 and used to estimate in utero exposure in their female offspring using two different methods. The angler cohort included 391 women of whom 259 provided offspring information. Of 213 daughters aged 20-50, 151 participated (71 %) and provided information for time intervals of unprotected intercourse (TUI). The daughters reported 308 TUIs (repeated observations), of which 288 ended in pregnancy. We estimated the fecundability ratio (FR) for serum-PCB and serum-DDE adjusting for confounders and accounting for repeated measurements. An FR below one indicates a longer time to pregnancy.
Results: Compared to serum-PCB of <2.5 μg/L, the FR was 0.60 for serum-PCB between 2.5-7.4 μg/L [95 % confidence intervals (CI) 0.36, 0.99], and 0.42 [95 % CI 0.20, 0.88] for serum-PCB > 7.4 μg/L. Similar results were obtained using the alternative statistical method to estimate in utero serum-PCB. The association was stronger for TUIs when women planned a baby; FR = 0.50 for serum-PCB between 2.5-7.4 μg/L, [95 % CI 0.29, 0.89], and 0.30 [95 % CI 0.13, 0.68] for serum-PCB > 7.4 μg/L. There was no relationship between in utero exposure to DDE and fecundability in daughters. Conclusions: Decreased fecundability in female offspring of fisheaters was found to be associated with PCB exposure in utero, possibly related to endocrine disruption in the oocyte and/or other developing organs influencing reproductive capacity in adulthood.
Item Open Access Inferential considerations for low-count RNA-seq transcripts: a case study on the dominant prairie grass Andropogon gerardii
Raithel, S.; Johnson, Loretta C.; Galliart, M.; Brown, Susan J.; Shelton, J.; Herndon, N.; Bello, Nora M.
Background: Differential expression (DE) analysis of RNA-seq data still poses inferential challenges, such as handling of transcripts characterized by low expression levels. In this study, we use a plasmode-based approach to assess the relative performance of alternative inferential strategies on RNA-seq transcripts, with special emphasis on transcripts characterized by a small number of read counts, so-called low-count transcripts, as motivated by an ecological application in prairie grasses. Big bluestem (Andropogon gerardii) is a wide-ranging dominant prairie grass of ecological and agricultural importance to the US Midwest while edaphic subspecies sand bluestem (A. gerardii ssp. Hallii) grows exclusively on sand dunes.
Relative to big bluestem, sand bluestem exhibits qualitative phenotypic divergence consistent with enhanced drought tolerance, plausibly associated with transcripts of low expression levels. Our dataset consists of RNA-seq read counts for 25,582 transcripts (60 % of which are classified as low-count) collected from leaf tissue of individual plants of big bluestem (n = 4) and sand bluestem (n = 4). Focused on low-count transcripts, we compare alternative ad-hoc data filtering techniques commonly used in RNA-seq pipelines and assess the inferential performance of recently developed statistical methods for DE analysis, namely DESeq2 and edgeR robust. These methods attempt to overcome the inherently noisy behavior of low-count transcripts by either shrinkage or differential weighting of observations, respectively. Results: Both DE methods seemed to properly control family-wise type 1 error on low-count transcripts, whereas edgeR robust showed greater power and DESeq2 showed greater precision and accuracy. However, specification of the degree of freedom parameter under edgeR robust had a non-trivial impact on inference and should be handled carefully. When properly specified, both DE methods showed overall promising inferential performance on low-count transcripts, suggesting that ad-hoc data filtering steps at arbitrary expression thresholds may be unnecessary. A note of caution is in order regarding the approximate nature of DE tests under both methods. Conclusions: Practical recommendations for DE inference are provided when low-count RNA-seq transcripts are of interest, as is the case in the comparison of subspecies of bluestem grasses. Insights from this study may also be relevant to other applications focused on transcripts of low expression levels.
Item Open Access SPATIO-TEMPORAL MODELS FOR SOME DATA SETS IN CONTINUOUS SPACE AND DISCRETE TIME
Demel, S. S.; Du, Juan
Item Open Access To See or Not to See: Do Front of Pack Nutrition Labels Affect Attention to Overall Nutrition Information?
Bix, L.; Sundar, R. P.; Bello, Nora M.; Peltier, C.; Weatherspoon, L. J.; Becker, M. W.
Item Open Access Cell Based Drug Delivery: Micrococcus luteus Loaded Neutrophils as Chlorhexidine Delivery Vehicles in a Mouse Model of Liver Abscesses in Cattle
Wendel, S. O.; Menon, S.; Alshetaiwi, H.; Shrestha, Tej Bahadur; Chlebanowski, L.; Hsu, Wei-Wen; Bossmann, Stefan H.; Narayanan, Sanjeev K.; Troyer, Deryl L.
Item Open Access Examining multiracial youth in context: ethnic identity development and mental health outcomes (2015-04-06)
Fisher, Sycarah; Reynolds, Jennifer L.; Hsu, Wei-Wen; Barnes, Jessica; Tyler, Kenneth
Although multiracial individuals are the fastest growing population in the United States, research on the identity development of multiracial adolescents remains scant. This study explores the relationship between ethnic identity, its components (affirmation, exploration), and mental health outcomes (anxiety, depression) within the contexts of schools for multiracial adolescents. Participants were multiracial and monoracial minority and majority high school students (n=4,766). Using Analysis of Variance and Multiple Indicators Multiple Causes (MIMIC) models, results indicated that multiracial youth experience more exploration and less affirmation than African Americans, but more than Caucasians. In addition, multiracial youth were found to have higher levels of mental health issues than their monoracial minority and majority peers. Specifically, multiracial youth had higher levels of depression than their African American and Caucasian counterparts. Multiracial and Caucasian youth had similar levels of anxiety but these levels were significantly higher than African Americans.
Results also show that school diversity can mitigate mental health outcomes, finding that multiracial youth in more diverse schools are at lower risk for mental health issues.
Item Open Access A sequential naïve Bayes classifier for DNA barcodes (2015-03-04)
Anderson, Michael P.; Dubnicka, Suzanne R.
DNA barcodes are short strands of 255–700 nucleotide bases taken from the cytochrome c oxidase subunit 1 (COI) region of the mitochondrial DNA. It has been proposed that these barcodes may be used as a method of differentiating between biological species. Current methods of species classification utilize distance measures that are heavily dependent on both evolutionary model assumptions as well as a clearly defined “gap” between intra- and interspecies variation. Such distance measures fail to measure classification uncertainty or to indicate how much of the barcode is necessary for classification. We propose a sequential naïve Bayes classifier for species classification to address these limitations. The proposed method is shown to provide accurate species-level classification on real and simulated data. The method proposed here quantifies the uncertainty of each classification and addresses how much of the barcode is necessary.
Item Open Access Estimating Mixture of Gaussian Processes by Kernel Smoothing (2014-12-03)
Huang, Mian; Li, Runze; Wang, Hansheng; Yao, Weixin
When functional data are not homogenous, for example, when there are multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this article, we propose a new estimation procedure for the mixture of Gaussian processes, to incorporate both functional and inhomogenous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions.
The model is shown to be identifiable, and can be estimated efficiently by a combination of the ideas from expectation-maximization (EM) algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset.
Item Open Access Minimum profile Hellinger distance estimation for a semiparametric mixture model (2014-12-03)
Xiang, Sijia; Yao, Weixin; Wu, Jingjing
In this paper, we propose a new effective estimator for a class of semiparametric mixture models where one component has known distribution with possibly unknown parameters while the other component density and the mixing proportion are unknown. Such semiparametric mixture models have been often used in multiple hypothesis testing and the sequential clustering algorithm. The proposed estimator is based on the minimum profile Hellinger distance (MPHD), and its theoretical properties are investigated. In addition, we use simulation studies to illustrate the finite sample performance of the MPHD estimator and compare it with some other existing approaches. The empirical studies demonstrate that the new method outperforms existing estimators when data are generated under contamination and works comparably to existing estimators when data are not contaminated. Applications to two real data sets are also provided to illustrate the effectiveness of the new methodology.
Item Open Access The role of packaging size on contamination rates during simulated presentation to a sterile field (2014-10-01)
Trier, Tony; Bello, Nora M.; Bush, Tamara Reid; Bix, Laura
Objective: The objective of this study was to assess the impact of package size on the contact between medical devices and non-sterile surfaces (i.e. the hands of the practitioner and the outside of the package) during aseptic presentation to a simulated sterile field.
Rationale for this objective stems from the decades-long problem of hospital-acquired infections. This work approaches the problem from a unique perspective, namely packaging size. Design: Randomized complete block design with subsampling. Setting: Research study conducted at professional conferences for surgical technologists and nursing professionals. Participants: Ninety-seven healthcare providers, primarily surgical technologists and nurses. Methods: Participants were gloved and asked to present the contents of six pouches of three different sizes to a simulated sterile field. The exterior of pouches and gloves of participants were coated with a simulated contaminant prior to each opening trial. After presentation to the simulated sterile field, the presence of the contaminant on package contents was recorded as indicative of contact with non-sterile surfaces and analyzed in a binary fashion using a generalized linear mixed model. Results: Recruited subjects were 26–64 years of age (81 females, 16 males), with 2.5–44 years of professional experience. Results indicated a significant main effect of pouch size on contact rate of package contents (P = 0.0108), whereby larger pouches induced greater rates of contact than smaller pouches (estimates±SEM: 14.7±2.9% vs. 6.0±1.7%, respectively). Discussion and Conclusion: This study utilized novel methodologies which simulate contamination in aseptic presentation. Results of this work indicate that increased contamination rates are associated with larger pouches when compared to smaller pouches. The results add to a growing body of research which investigate packaging's role in serving as a pathway for product contamination during aseptic presentation. Future work should investigate other packaging design factors (e.g. 
material, rigidity, and closure systems) and their role in contamination.
Item Open Access DDE and PCB serum concentration in maternal blood and their adult female offspring (2014-07-16)
Hsu, Wei-Wen; Osuch, Janet Rose; Todem, David; Taffe, Bonita; O’Keefe, Michael; Adera, Selamawit; Karmaus, Wilfried
Background: Dichlorodiphenyl dichloroethylene (DDE) and polychlorinated biphenyls (PCBs) can be passed from mother to offspring through placental transfer or breast feeding. Unknown is whether maternal levels can predict concentrations in adult offspring. Objectives: To test the association between maternal blood levels of DDE and PCBs and adult female offspring levels of these compounds using data from the Michigan Fisheaters’ Cohort. Methods: DDE and PCB concentrations were determined in 132 adult daughters from 84 mothers. Prenatal exposures were estimated based on maternal DDE and PCB serum levels measured between 1973 and 1991. Levels in adult daughters were regressed on maternal and estimated prenatal exposure levels, adjusting for potential confounders using linear mixed models. Confounders included daughter’s age, birth order, birth weight, number of pregnancies, the length of time the daughter was breast-fed, the length of time the daughter breast-fed her own children, last year fish-eating status, body mass index, and lipid weight. Results: The median age of the participants was 40.4 years (range 18.4 to 65.4, 5–95 percentiles 22.5-54.6%, respectively). Controlling for confounders and intra-familial associations, DDE and PCB concentrations in adult daughters were significantly positively associated with estimated prenatal levels and with maternal concentrations. The proportion of variance in the adult daughters’ organochlorine concentrations explained by the maternal exposure levels is approximately 23% for DDE and 43% for PCBs.
The equivalent of a median of 3.67 μg/L prenatal DDE and a median of 2.56 μg/L PCBs were 15.64 and 10.49 years of fish consumption, respectively. When controlling for effects of the shared environment (e.g., fish diet) by using a subsample of paternal levels measured during the same time frames (n=53 and n=37), we determined that the direct maternal transfer remains important. Conclusions: Estimated intrauterine DDE and PCB levels predicted concentrations in adult female offspring 40 years later. Interpretation of adverse health effects from intrauterine exposures of persistent pollutants may need to consider the sustained impact of maternal DDE and PCB levels found in their offspring.