Sample size calculation for competing risk models and design considerations for agricultural experiments
Abstract
This dissertation develops methods for the design and analysis of experiments in agricultural settings. First, we develop methods for determining the sample size required to attain a pre-specified power in competing risk models, a class of multistate models for survival analysis. Typically, before conducting an "official" elaborate and expensive study, a small pilot study is performed to better understand the target population and the differences between treatments. For classic first-event survival analysis, a constant hazard rate is often assumed when computing the sample size for the official experiment. For competing risk models, however, the hazard for each type of event must be specified. Additionally, the number of occurrences of each event type may be small, especially once the data are further stratified by treatment or covariates. Commonly used approaches for modeling pilot data, such as fitting a parametric survival model with an exponential distribution, are therefore unlikely to extract useful information from the pilot study. We instead propose a flexible parametric survival model, the generalized log-gamma survival model, to extract information from the pilot study. We then simulate data from the resulting competing-risks data-generating mechanism and apply the statistical test pre-specified for the official study. Repeating this step a large number of times yields an estimate of power, and we adjust the size of the simulated dataset until the estimated power reaches the pre-specified level. The minimal sample size that attains the pre-specified power is the one we recommend for the official experiment.

Next, cluster-randomized experiments, in which units are grouped together and treatment is assigned to groups of units rather than to individual units, are common in agricultural settings. For example, a study assessing the efficacy of a diet on the weight gain of pigs may assign the diet to pens rather than to individual pigs, as administering the diet at the pen level is much less costly. In many settings, however, researchers have some ability to form the clusters of units before randomizing treatment. In this study, we determine best practices for forming clusters of units when such an option is available. Under the Neyman-Rubin Causal Model (NRCM), we derive expressions for the efficiency loss of cluster randomization relative to complete randomization in a simplified setting with equal cluster sizes. We show that the efficiency loss is a function of the intra-cluster correlation coefficient and the correlation between potential outcomes. The efficiency loss is minimized, and cluster-randomized experiments can in fact outperform completely randomized experiments, when the "discrepancy" between clusters is small, that is, when the distribution of units within each cluster is as similar as possible across clusters. We then introduce a heuristic algorithm for assigning units to clusters so that the between-cluster discrepancy is small, and we verify our results with thorough simulation studies. We conclude by applying our heuristic algorithm to improve divide-and-conquer methods for performing kernel ridge regression in large-to-massive data settings.
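To make the simulation-based sample-size search concrete, the sketch below is a minimal illustration, not the dissertation's exact procedure: it assumes cause-specific exponential hazards in place of the fitted generalized log-gamma model, and uses a two-proportion z-test on cause-1 event indicators as a stand-in for the pre-specified competing-risks test. All hazard rates, the censoring time, and the effect size are illustrative assumptions.

```python
# Hypothetical sketch of the simulation-based sample-size search.
# Hazards, effect size, and the z-test stand-in are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_arm(n, h1, h2, tau):
    """Simulate n subjects with competing exponential hazards h1, h2,
    administratively censored at time tau; return indicator of a
    cause-1 event observed before tau."""
    t1 = rng.exponential(1 / h1, n)   # latent time to cause-1 event
    t2 = rng.exponential(1 / h2, n)   # latent time to cause-2 event
    return (t1 < t2) & (t1 < tau)     # cause 1 occurs first, before censoring

def estimated_power(n_per_arm, n_sim=2000, tau=2.0,
                    h1_ctrl=0.5, h1_trt=0.8, h2=0.3, alpha=0.05):
    """Monte Carlo power of a two-proportion z-test comparing cause-1
    cumulative incidence between treatment arms."""
    rejections = 0
    for _ in range(n_sim):
        e_ctrl = simulate_arm(n_per_arm, h1_ctrl, h2, tau)
        e_trt = simulate_arm(n_per_arm, h1_trt, h2, tau)
        p_pool = (e_ctrl.sum() + e_trt.sum()) / (2 * n_per_arm)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
        if se == 0:
            continue
        z = (e_trt.mean() - e_ctrl.mean()) / se
        rejections += abs(z) > stats.norm.ppf(1 - alpha / 2)
    return rejections / n_sim

# Increase n until the estimated power reaches the pre-specified level.
target = 0.80
n = 25
while estimated_power(n) < target:
    n += 25
print(f"recommended sample size per arm: about {n}")
```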
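For intuition on the efficiency loss from cluster randomization, the display below gives the classical model-based design effect; it is a standard illustration under stated assumptions, not the NRCM expression derived in the dissertation, which additionally involves the correlation between potential outcomes.

```latex
% Classical design-effect illustration: J clusters of equal size m,
% intra-cluster correlation coefficient rho. Model-based, not the
% dissertation's NRCM derivation.
\[
  \frac{\operatorname{Var}_{\text{cluster}}(\hat{\tau})}
       {\operatorname{Var}_{\text{complete}}(\hat{\tau})}
  = 1 + (m - 1)\rho
\]
% The inflation vanishes as rho -> 0, i.e., when units within a cluster
% are no more alike than units drawn from different clusters.
```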
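The following is one natural heuristic consistent with the goal of small between-cluster discrepancy: sort units on a prognostic covariate and deal them out round-robin, so every cluster receives a similar spread of units. This is an illustrative stand-in, not the dissertation's algorithm, and the covariate and cluster count are hypothetical.

```python
# Illustrative stand-in for a discrepancy-minimizing clustering heuristic.
import numpy as np

def round_robin_clusters(x, n_clusters):
    """Assign units with covariate values x to n_clusters clusters of
    (nearly) equal size and small between-cluster discrepancy."""
    order = np.argsort(x)                            # smallest to largest
    labels = np.empty(len(x), dtype=int)
    labels[order] = np.arange(len(x)) % n_clusters   # deal out in sorted order
    return labels

x = np.random.default_rng(1).normal(size=120)        # e.g., baseline pig weights
labels = round_robin_clusters(x, n_clusters=12)
# Cluster means are nearly identical, i.e., the discrepancy is small:
print(np.round([x[labels == k].mean() for k in range(12)], 2))
```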
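Finally, the sketch below shows baseline divide-and-conquer kernel ridge regression in the style of Zhang, Duchi, and Wainwright: randomly partition the data, fit KRR on each block, and average the block predictions. The kernel, regularization, and block count are illustrative; the dissertation's improvement, replacing the random partition with low-discrepancy clusters, is not shown.

```python
# Baseline divide-and-conquer KRR with a random partition (the step the
# dissertation improves by forming low-discrepancy blocks instead).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1e-2, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return X, alpha

def krr_predict(model, X_new, gamma=1.0):
    X, alpha = model
    return rbf_kernel(X_new, X, gamma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(3000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=3000)

# Each block is fit independently (so the fits can run in parallel),
# then the block predictions are averaged.
blocks = np.array_split(rng.permutation(3000), 10)
models = [krr_fit(X[b], y[b]) for b in blocks]
X_test = np.linspace(-3, 3, 5)[:, None]
y_hat = np.mean([krr_predict(m, X_test) for m in models], axis=0)
print(np.round(y_hat, 2), np.round(np.sin(X_test[:, 0]), 2))
```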