Threshold clustering for massive data

dc.contributor.author: Luo, Jianmei
dc.date.accessioned: 2019-07-19T19:50:21Z
dc.date.available: 2019-07-19T19:50:21Z
dc.date.graduationmonth: August
dc.date.issued: 2019-08-01
dc.description.abstract: Statistical clustering is the process of partitioning objects into clusters so that units within the same cluster have similar characteristics. Threshold clustering (TC) is a recently developed method designed so that, given a pre-specified threshold t*, each cluster contains at least t* units and the maximum within-cluster dissimilarity (MWCD) is small. In fact, among all clusterings in which each cluster contains at least t* units, the MWCD obtained through TC is at most 4 times the smallest possible MWCD; finding the clustering with the smallest possible MWCD is known as the bottleneck threshold partitioning problem (BTPP). TC requires only O(t*n) time and space beyond the construction of a (t*-1)-nearest-neighbors graph. This dissertation extends the TC methodology in several ways.

First, we extend TC to instance selection problems on massive data through a procedure called iterated threshold instance selection (ITIS). It works as follows. First, TC is performed on the n units to form n* clusters, each containing t* or more units. Then, prototypes are formed by computing a center point for each of the n* groups (e.g., a centroid or medoid). Finally, if the data is not yet sufficiently reduced, TC is applied again to the n* prototypes. These steps are repeated until the data has been sufficiently reduced (rough R sketches of ITIS, IHTC, and HATC follow this record). Besides scaling down the data, ITIS serves as a pre-processing step that allows more sophisticated clustering methods to be applied. A statistical clustering of all n units can then be obtained through an approach called iterative hybridized threshold clustering (IHTC). In addition to improving efficiency, IHTC prevents individual data points from being overfit by the chosen clustering method; specifically, after m iterations of ITIS at size threshold t*, IHTC ensures that each cluster contains at least (t*)^m units. Results from simulations and applications to real datasets show that our methods reduce the run time and memory usage of the original clustering algorithms while preserving their performance.

Second, we extend TC to hierarchical clustering in two ways. The first method, called hierarchical agglomerative threshold clustering (HATC), closely follows the hierarchical agglomerative clustering (HAC) paradigm. First, TC is performed on the n units to form several clusters, each containing t* or more units. Then, the cluster with the smallest within-cluster dissimilarity is condensed into a single point, and TC is again performed on the reduced data. These steps are repeated until fewer than t* data points remain. The second method repeatedly performs ITIS to form a cluster hierarchy: TC is performed on the n units, a prototype is formed for each cluster, and TC is performed again on these prototypes, with these steps repeated until fewer than t* points remain. Results from simulations show that both methods perform comparably to, if not better than, HAC.

Finally, we fill in gaps in our understanding of how well BTPP can be approximated in polynomial time. Previous work has shown that a polynomial-time 4-approximation algorithm for BTPP exists for arbitrary t*, and that, when t* > 2, no (2-ε)-approximation algorithm runs in polynomial time unless P = NP. We show that a polynomial-time 2-approximation algorithm exists when t* = 2 and a polynomial-time 3-approximation algorithm exists when t* = 3.
dc.description.advisor: Michael J. Higgins
dc.description.degree: Doctor of Philosophy
dc.description.department: Department of Statistics
dc.description.level: Doctoral
dc.identifier.uri: http://hdl.handle.net/2097/39823
dc.language.iso: en_US
dc.publisher: Kansas State University
dc.rights: © the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Threshold clustering
dc.subject: Hybridized clustering
dc.subject: Instance selection
dc.subject: Hierarchical clustering
dc.subject: Massive data
dc.subject: Approximation algorithm
dc.title: Threshold clustering for massive data
dc.type: Dissertation
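
The following is a minimal, illustrative R sketch of the ITIS loop described in the abstract. All function names here (greedy_tc, itis_step, itis) are hypothetical, and greedy_tc is only a crude stand-in for true threshold clustering: it greedily attaches each unassigned unit to its t* - 1 nearest unassigned neighbors and does not carry TC's 4-approximation guarantee on the MWCD. This is a sketch of the idea, not the dissertation's implementation (which is in the attached .R files).

# Crude stand-in for threshold clustering (hypothetical; NOT the actual TC
# algorithm): each unassigned unit grabs its (t_star - 1) nearest unassigned
# neighbors, so clusters contain at least t_star units (one small leftover
# cluster is possible at the end; real TC avoids this).
greedy_tc <- function(X, t_star) {
  n <- nrow(X)
  D <- as.matrix(dist(X))          # pairwise Euclidean distances
  labels <- rep(NA_integer_, n)
  cl <- 0L
  for (i in seq_len(n)) {
    if (!is.na(labels[i])) next
    cl <- cl + 1L
    free <- setdiff(which(is.na(labels)), i)
    take <- free[order(D[i, free])][seq_len(min(t_star - 1L, length(free)))]
    labels[c(i, take)] <- cl
  }
  labels
}

# One ITIS pass: cluster, then replace each cluster by a prototype (here its
# centroid; a medoid would also fit the abstract's description).
itis_step <- function(X, t_star) {
  labels <- greedy_tc(X, t_star)
  protos <- do.call(rbind, lapply(split(seq_len(nrow(X)), labels),
                                  function(idx) colMeans(X[idx, , drop = FALSE])))
  list(prototypes = protos, labels = labels)
}

# Iterate until the data are "sufficiently reduced" (here: at most max_n rows).
itis <- function(X, t_star, max_n) {
  stopifnot(t_star >= 2, max_n >= 1)
  while (nrow(X) > max_n) X <- itis_step(X, t_star)$prototypes
  X
}

For example, itis(as.matrix(iris[, 1:4]), t_star = 3, max_n = 20) reduces the 150 iris rows to at most 20 prototypes. Each pass shrinks the data by a factor of roughly t*, which is the source of the run-time and memory savings described in the abstract.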
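
Building on the sketch above, the IHTC idea can be illustrated by clustering the final prototypes with an off-the-shelf method (k-means here, purely as an example) and mapping labels back down through the recorded ITIS passes. Again, ihtc_sketch is a hypothetical name and interface, not the dissertation's code.

# IHTC-style sketch (assumed interface): reduce with ITIS while recording each
# pass's labels, cluster the surviving prototypes, then propagate the labels
# back down to all n original units.
ihtc_sketch <- function(X, t_star, max_n, k) {
  passes <- list()
  while (nrow(X) > max_n) {
    step <- itis_step(X, t_star)              # from the ITIS sketch above
    passes[[length(passes) + 1L]] <- step$labels
    X <- step$prototypes
  }
  final <- kmeans(X, centers = k)$cluster     # cluster surviving prototypes
  # Propagate back: a unit inherits the label of its pass-1 cluster, which
  # inherits the label of its pass-2 cluster, and so on up to `final`.
  for (labels in rev(passes)) final <- final[labels]
  final                                       # one label per original unit
}

Because every final cluster is a union of ITIS clusters, after m recorded passes it contains at least (t*)^m original units, which matches the overfitting-prevention property stated in the abstract.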
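
Finally, a rough sketch of the HATC reduction loop (hypothetical hatc_sketch, reusing greedy_tc from the first sketch and assuming t* >= 2): repeatedly run TC, condense the cluster with the smallest within-cluster dissimilarity into its centroid, and repeat until fewer than t* points remain. A full HATC implementation would also record the merge history to produce a dendrogram; that bookkeeping is omitted here, and the dissertation's exact condensation rule may differ.

hatc_sketch <- function(X, t_star) {
  stopifnot(t_star >= 2)                      # needed for the loop to shrink X
  while (nrow(X) >= t_star) {
    labels <- greedy_tc(X, t_star)
    # Within-cluster dissimilarity: the largest pairwise distance inside a
    # cluster (the quantity whose maximum over clusters is the MWCD).
    # Singleton leftover clusters are skipped via Inf.
    wcd <- sapply(split(seq_len(nrow(X)), labels), function(idx)
      if (length(idx) < 2) Inf else max(dist(X[idx, , drop = FALSE])))
    tight <- as.integer(names(which.min(wcd)))
    idx <- which(labels == tight)
    X <- rbind(X[-idx, , drop = FALSE],       # condense the tightest cluster
               colMeans(X[idx, , drop = FALSE]))
  }
  X
}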

Files

Original bundle (3 files)

dissertation_ch2_code.R (7.3 KB, unknown data format): R code for methods in Chapter 2
dissertation_ch3_code.R (10.2 KB, unknown data format): R code for methods in Chapter 3
JianmeiLuo2019.pdf (746.86 KB, Adobe Portable Document Format): Main dissertation

License bundle (1 file)

license.txt (1.62 KB, item-specific license agreed upon to submission)