Partition clustering of High Dimensional Low Sample Size data based on P-Values

dc.contributor.author: Von Borries, George Freitas
dc.date.accessioned: 2008-03-28T13:55:43Z
dc.date.available: 2008-03-28T13:55:43Z
dc.date.graduationmonth: May
dc.date.issued: 2008-03-28T13:55:43Z
dc.date.published: 2008
dc.description.abstract: This thesis introduces a new partitioning algorithm to cluster variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable, and HDLLSS data are HDLSS data observed over time. Clustering techniques play an important role in analyzing high dimensional low sample size data, as commonly seen in microarray experiments, mass spectrometry data, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. Current algorithms show poor performance when applied to high dimensional data, especially in small sample size cases. In addition, available algorithms often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used in high dimensional data are not robust to monotone transformations. The proposed clustering algorithm, PPCLUST, is a powerful tool for clustering HDLSS data; it uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. PPCLUSTEL is an extension of PPCLUST for clustering HDLLSS data. A nonparametric test of no simple effect of group is developed, and the p-value from the test is used as a measure of similarity between groups of variables. PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications and, in the case of PPCLUSTEL, the algorithm requires neither a large number of time points nor equally spaced ones. PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications to available data from previous microarray studies show promising results, and simulation studies reveal that the algorithm outperforms a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
dc.description.advisor: Haiyan Wang
dc.description.degree: Doctor of Philosophy
dc.description.department: Department of Statistics
dc.description.level: Doctoral
dc.identifier.uri: http://hdl.handle.net/2097/590
dc.language.iso: en_US
dc.publisher: Kansas State University
dc.subject: High Dimensional Data
dc.subject: Clustering
dc.subject: Nonparametric Inference
dc.subject: Bioinformatics
dc.subject: Statistical Learning
dc.subject: Data Mining
dc.subject.umi: Statistics (0463)
dc.title: Partition clustering of High Dimensional Low Sample Size data based on P-Values
dc.type: Dissertation
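
The abstract's central idea, using a rank-test p-value as a similarity measure that is invariant to monotone transformations, can be illustrated with a minimal sketch. This is not the PPCLUST test statistic from the thesis: it substitutes the classical two-sample Wilcoxon rank-sum test with a normal approximation, assumes continuous data with no ties, and uses hypothetical function names (`ranks`, `rank_sum_pvalue`) and made-up toy values.

```python
import math

def ranks(values):
    """Return 1-based ranks of values. Assumes no ties (continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation).

    A stand-in for the thesis's rank test, used here only to show how a
    p-value can act as a similarity score: large p means "similar".
    """
    n, m = len(x), len(y)
    r = ranks(list(x) + list(y))
    w = sum(r[:n])                       # rank sum of the first sample
    mean = n * (n + m + 1) / 2           # null mean of the rank sum
    var = n * m * (n + m + 1) / 12       # null variance (no ties)
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

# Toy replications for three hypothetical variables (illustrative values only).
a = [1.2, 0.8, 1.5, 1.1]
b = [1.0, 1.4, 0.9, 1.3]
c = [3.1, 2.9, 3.4, 3.0]

p_ab = rank_sum_pvalue(a, b)   # overlapping samples: large p-value
p_ac = rank_sum_pvalue(a, c)   # separated samples: small p-value

# A monotone transformation (here exp) leaves the ranks, and so the p-value, unchanged.
p_ac_exp = rank_sum_pvalue([math.exp(v) for v in a], [math.exp(v) for v in c])
```

Under these assumptions, variables whose pairwise p-value stays above a chosen threshold would be treated as candidates for the same cluster; the thesis's own tests and grouping procedure are defined in the full text.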

Files

Original bundle
Name: GeorgevonBorries2008.pdf
Size: 4.69 MB
Format: Adobe Portable Document Format