Partition clustering of High Dimensional Low Sample Size data based on P-Values

Von Borries, George Freitas

dc.contributor.author	Von Borries, George Freitas
dc.date.accessioned	2008-03-28T13:55:43Z
dc.date.available	2008-03-28T13:55:43Z
dc.date.graduationmonth	May
dc.date.issued	2008-03-28T13:55:43Z
dc.date.published	2008
dc.description.abstract	This thesis introduces a new partitioning algorithm to cluster variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable, and HDLLSS data are HDLSS data observed over time. Clustering techniques play an important role in analyzing high dimensional low sample size data, as commonly seen in microarray experiments, mass spectrometry data, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. Current algorithms show poor performance when applied to high dimensional data, especially in small sample size cases. In addition, available algorithms often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used in high dimensional data are not robust to monotone transformations. The proposed clustering algorithm, PPCLUST, is a powerful tool for clustering HDLSS data; it uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. PPCLUSTEL is an extension of PPCLUST for clustering HDLLSS data. A nonparametric test of no simple effect of group is developed, and the p-value from the test is used as a measure of similarity between groups of variables. PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications and, in the case of PPCLUSTEL, the algorithm requires neither a large number of time points nor equally spaced ones.
PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications to available data from previous microarray studies show promising results, and simulation studies reveal that the algorithm outperforms a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
dc.description.advisor	Haiyan Wang
dc.description.degree	Doctor of Philosophy
dc.description.department	Department of Statistics
dc.description.level	Doctoral
dc.identifier.uri	http://hdl.handle.net/2097/590
dc.language.iso	en_US
dc.publisher	Kansas State University
dc.rights	© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	High Dimensional Data
dc.subject	Clustering
dc.subject	Nonparametric Inference
dc.subject	Bioinformatics
dc.subject	Statistical Learning
dc.subject	Data Mining
dc.subject.umi	Statistics (0463)
dc.title	Partition clustering of High Dimensional Low Sample Size data based on P-Values
dc.type	Dissertation
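
The abstract states that a p-value from a rank test serves as the similarity measure between groups of variables, and that the method is invariant to monotone transformations. The sketch below only illustrates that idea and is not the thesis's PPCLUST test: it swaps in a plain two-group Kruskal-Wallis statistic with the chi-square approximation (1 degree of freedom, no tie correction). The function names `midranks` and `two_group_rank_p` and the sample values are hypothetical.

```python
import math


def midranks(values):
    """Return 1-based ranks for values, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def two_group_rank_p(a, b):
    """Two-group Kruskal-Wallis p-value (stand-in test, assumption).

    Uses the chi-square approximation with 1 degree of freedom and
    no tie correction. A large p-value means the two groups look alike.
    """
    n = len(a) + len(b)
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)
    ) - 3 * (n + 1)
    h = max(h, 0.0)  # guard against tiny negative rounding error
    # Survival function of chi-square with 1 df: P(X > h) = erfc(sqrt(h / 2)).
    return math.erfc(math.sqrt(h / 2))


# Hypothetical pooled observations from two groups of variables.
group_a = [1.2, 0.8, 1.5, 1.1]
group_b = [3.4, 2.9, 3.8, 3.1]

p_raw = two_group_rank_p(group_a, group_b)
# exp() is strictly increasing, so the ranks (and the p-value) do not change.
p_exp = two_group_rank_p([math.exp(x) for x in group_a],
                         [math.exp(x) for x in group_b])

# Interleaved values give similar rank sums and hence a large p-value.
p_similar = two_group_rank_p([1, 3, 5, 7], [2, 4, 6, 8])
```

Because the statistic depends on the data only through ranks, `p_raw` and `p_exp` coincide, which is the invariance property the abstract refers to. A partitioning procedure in this spirit would keep groups together when the p-value is large and separate them when it is small; the actual decision rule and test used by PPCLUST are described in the dissertation itself.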

Files

Original bundle

Name: GeorgevonBorries2008.pdf
Size: 4.69 MB
Format: Adobe Portable Document Format

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -