Statistical analysis of pyrosequence data

dc.contributor.author: Keating, Karen
dc.date.accessioned: 2012-07-13T14:46:36Z
dc.date.available: 2012-07-13T14:46:36Z
dc.date.graduationmonth: August [en_US]
dc.date.issued: 2012-07-13
dc.date.published: 2012 [en_US]
dc.description.abstract: Since their commercial introduction in 2005, DNA sequencing technologies have become widely available and are now cost-effective tools for determining the genetic characteristics of organisms. While the biomedical applications of DNA sequencing are apparent, these technologies have been applied to many other research areas. One such area is community ecology, in which DNA sequence data are used to identify the presence and abundance of microscopic organisms that inhabit an environment. This is currently an active area of research, since it is generally believed that a change in the composition of microscopic species in a geographic area may signal a change in the overall health of the environment. An overview of DNA pyrosequencing, as implemented by the Roche/Life Science 454 platform, is presented and aspects of the process that can introduce variability in data are identified. Four ecological data sets that were generated by the 454 platform are used for illustration. Characteristics of these data include high dimensionality, a large proportion of zeros (usually in excess of 90%), and nonzero values that are strongly right-skewed. A nonparametric method to standardize these data is presented and effects of standardization on outliers and skewness are examined. Traditional statistical methods for analyzing macroscopic species abundance data are discussed, and the applicability of these methods to microscopic species data is examined. One objective that receives focus is the classification of microscopic species as either rare or common species. This is an important distinction since there is much evidence to suggest that the biological and environmental mechanisms that govern common species are distinctly different than the mechanisms that govern rare species.
This indicates that the abundance patterns for common and rare species may follow different probability models, and the suitability of the Pareto distribution for rare species is examined. Techniques for classifying macroscopic species are shown to be ill-suited for microscopic species, and an alternative technique is presented. Recognizing that the structure of the data is similar to that of financial applications (such as insurance claims and the distribution of wealth), the Gini index and other statistics based on the Lorenz curve are explored as potential test statistics for distinguishing rare versus common species. [en_US]
dc.description.advisor: Gary L. Gadbury [en_US]
dc.description.degree: Doctor of Philosophy [en_US]
dc.description.department: Department of Statistics [en_US]
dc.description.level: Doctoral [en_US]
dc.identifier.uri: http://hdl.handle.net/2097/14026
dc.language.iso: en_US [en_US]
dc.publisher: Kansas State University [en]
dc.subject: Statistics [en_US]
dc.subject: Ecology [en_US]
dc.subject: Standardization [en_US]
dc.subject: Gini Index [en_US]
dc.subject: Pareto distribution [en_US]
dc.subject.umi: Ecology (0329) [en_US]
dc.subject.umi: Statistics (0463) [en_US]
dc.title: Statistical analysis of pyrosequence data [en_US]
dc.type: Dissertation [en_US]
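
Illustrative note: the abstract mentions the Gini index and other statistics based on the Lorenz curve as candidate test statistics for abundance data with many zeros and right-skewed nonzero values. The sketch below shows the textbook definitions only, not the dissertation's procedure. The function names `lorenz_curve` and `gini_index` and the sample counts are assumptions made up for this example.

```python
from typing import List, Sequence, Tuple


def lorenz_curve(counts: Sequence[float]) -> List[Tuple[float, float]]:
    """Return Lorenz curve points (cumulative share of items,
    cumulative share of total abundance), starting at (0, 0)."""
    ordered = sorted(counts)
    if not ordered:
        raise ValueError("counts must not be empty")
    if ordered[0] < 0:
        raise ValueError("counts must be non-negative")
    total = sum(ordered)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")

    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / n, running / total))
    return points


def gini_index(counts: Sequence[float]) -> float:
    """Gini index as 1 minus twice the area under the Lorenz curve,
    with the area computed by the trapezoid rule."""
    points = lorenz_curve(counts)
    area = 0.0
    for (p0, l0), (p1, l1) in zip(points, points[1:]):
        area += (p1 - p0) * (l0 + l1) / 2.0
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    # Hypothetical abundance counts: mostly zeros, a few large values.
    sample = [0, 0, 0, 0, 0, 0, 0, 1, 3, 40]
    print(round(gini_index(sample), 4))
```

A value near 0 indicates abundance spread evenly across items, while a value near 1 indicates abundance concentrated in a few items, which is the pattern expected for data dominated by zeros.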

Files

Original bundle
Name: KarenKeating2012.pdf
Size: 3.47 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission