# Threshold clustering for massive data

## Date

## Authors

## Journal Title

## Journal ISSN

## Volume Title

## Publisher

## Abstract

Statistical clustering is the process of partitioning objects into clusters so that units within the same cluster have similar characteristics. Threshold clustering (TC) is a recently-developed method designed so that, given a pre-specified threshold t*, each cluster contains at least t* units and the maximum within-cluster dissimilarity (MWCD) is small. In fact, among all clusterings which contain at least t* units within each cluster, the MWCD obtained through TC is, at most, 4 times as large as the smallest possible MWCD; finding the clustering with the smallest MWCD possible is known as the bottleneck threshold partitioning problem (BTPP). TC requires only O(t*n) time and space outside of the construction of a (t*-1)--nearest-neighbors graph.

This dissertation centers on extending the TC methodology in several different ways. First, we extend TC to instance selection problems on massive data through a procedure called iterated threshold instance selection (ITIS). It works as follows. First, TC is performed on the n units to form n* clusters, each containing t* or more units. Then, prototypes are formed by computing a center point for each of the n* groups (e.g. a centroid or medoid). Finally, if the data is not yet sufficiently reduced, TC is applied again to the n* prototypes. These steps are repeated until the data has been sufficiently reduced. Besides scaling down the data size, we use ITIS as a pre-processing step to allow for the use of more sophisticated clustering methods. A statistical clustering of all n units can be obtained by an approach called Iterative Hybridized Threshold Clustering (IHTC). In addition to improving efficiency, IHTC also prevents singular data points from being overfit by desired clustering methods; specifically, for m iterations of ITIS at size threshold t*, IHTC ensures that clusters contain at least (𝑡*)^𝑚 units. We give results from simulations and applications to real datasets showing that our methods reduce the run time and memory usage of the original clustering algorithms while still preserving their performance.

Second, we extend TC to hierarchical clustering in two ways. The first method, called hierarchical agglomerative threshold clustering (HATC), follows the hierarchical agglomerative clustering (HAC) paradigm closely. First, TC is performed on the n units to form several clusters, each containing t* or more units. Then, the cluster with the smallest within-cluster dissimilarity is condensed into a single point, and TC is again performed on the reduced data. These steps are repeated until there are fewer than t* data points. The second method repeatedly performs ITIS to form a cluster hierarchy. Initially, TC is performed on the n units. Then, a prototype is formed for each cluster, and TC is performed again on these prototypes. As before, these steps are repeated until there are fewer than t* points remaining. Results from simulations show that these clustering methods perform comparably to, if not better than, HAC.

Finally, we fill in some gaps in understanding on how well an algorithm can approximate BTPP in polynomial time. Previous work has shown the existence of a 4-approximation polynomial-time algorithm to BTPP for arbitrary t* and has shown that, when t* > 2, no (2-𝜀)--approximation algorithm can be performed in polynomial time unless P=NP. We show that a polynomial-time 2--approximation algorithm exists when t* = 2 and a polynomial-time 3--approximation algorithm exists when t* = 3.