Partition clustering of High Dimensional Low Sample Size data based on P-Values

dc.contributor.author: Von Borries, George Freitas
dc.date.accessioned: 2008-03-28T13:55:43Z
dc.date.available: 2008-03-28T13:55:43Z
dc.date.graduationmonth: May
dc.date.issued: 2008-03-28T13:55:43Z
dc.date.published: 2008
dc.description.abstract: This thesis introduces a new partitioning algorithm to cluster variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable, and HDLLSS data are HDLSS data observed over time. Clustering techniques play an important role in analyzing HDLSS data, which arise commonly in microarray experiments, mass spectrometry, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. Current algorithms perform poorly when applied to high dimensional data, especially when sample sizes are small. In addition, available algorithms often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used in high dimensional data are not robust to monotone transformations. The proposed clustering algorithm, PPCLUST, is a powerful tool for clustering HDLSS data. It uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. PPCLUSTEL is an extension of PPCLUST for clustering HDLLSS data. A nonparametric test of no simple effect of group is developed, and the p-value from the test is used as a measure of similarity between groups of variables. PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications and, in the case of PPCLUSTEL, the algorithm requires neither a large number of time points nor equally spaced time points.
PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications to data from previous microarray studies show promising results, and simulation studies reveal that the algorithm outperforms a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
dc.description.advisor: Haiyan Wang
dc.description.degree: Doctor of Philosophy
dc.description.department: Department of Statistics
dc.description.level: Doctoral
dc.identifier.uri: http://hdl.handle.net/2097/590
dc.language.iso: en_US
dc.publisher: Kansas State University
dc.rights: © the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: High Dimensional Data
dc.subject: Clustering
dc.subject: Nonparametric Inference
dc.subject: Bioinformatics
dc.subject: Statistical Learning
dc.subject: Data Mining
dc.subject.umi: Statistics (0463)
dc.title: Partition clustering of High Dimensional Low Sample Size data based on P-Values
dc.type: Dissertation
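
The abstract's central idea, using the p-value of a rank test as a similarity measure between groups of variables, can be illustrated with a small sketch. This is not the PPCLUST algorithm from the thesis: it substitutes a plain two-sample Wilcoxon rank-sum test with a normal approximation and no tie correction, and it pools the replicates of each group as if they were independent draws. The function names and the data values are hypothetical and chosen only for illustration.

```python
import math


def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_pvalue(group_a, group_b):
    """Two-sided rank-sum p-value (normal approximation, no tie correction).

    A large p-value means the test finds no evidence that the two groups
    differ in distribution, which is read here as "similar".
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])                      # rank sum of the first group
    mean = n_a * (n_a + n_b + 1) / 2          # null mean of the rank sum
    var = n_a * n_b * (n_a + n_b + 1) / 12    # null variance of the rank sum
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))   # 2 * (1 - Phi(|z|))


# Hypothetical replicate values for three groups of variables.
low_1 = [1.1, 0.9, 1.3, 1.0]
low_2 = [1.2, 0.8, 1.4, 0.95]
high = [3.1, 2.9, 3.4, 3.0]

p_similar = rank_sum_pvalue(low_1, low_2)
p_different = rank_sum_pvalue(low_1, high)

# Ranks are unchanged by a strictly increasing transformation such as exp,
# so the p-value is unchanged as well.
p_transformed = rank_sum_pvalue([math.exp(x) for x in low_1],
                                [math.exp(x) for x in high])

print(p_similar, p_different, p_transformed)
```

In this toy setting the two low groups receive a much larger p-value than the low and high pair, and transforming the data with `exp` leaves the p-value untouched, which is the invariance property the abstract attributes to rank procedures. With only four replicates per group the normal approximation is crude; it is used here purely to keep the sketch dependency-free.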

Files

Original bundle

Name: GeorgevonBorries2008.pdf
Size: 4.69 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.7 KB
Format: Item-specific license agreed upon to submission