Parallel iterative hybridized threshold clustering for massive data

K-REx Repository

Yang, Yang 2020-10-08T13:29:50Z 2020-10-08T13:29:50Z 2020-12-11
dc.description.abstract Iterative hybridized threshold clustering (IHTC) is a recently developed algorithm designed to decrease the runtime and memory usage of commonly used clustering algorithms in massive data settings. IHTC pre-processes the data with iterative threshold instance selection (ITIS) to scale down the size of the data before proceeding with a standard clustering analysis, such as k-means or hierarchical clustering. However, when dealing with massive amounts of data, for example, when the number of data points n > 10⁸, the computational cost of IHTC may still be prohibitive. An efficient parallel implementation may provide a pathway to further reduce the computational cost and memory usage of IHTC. In this study, we partition the data points into batches and perform IHTC on each batch. The prototypes generated from IHTC are then collected for k-means clustering. We implement the parallelization using the R packages “Rdsm” and “parallel”, and test our implementation on simulated data on the Beocat high-performance cluster by varying the number of cores and batches. Performance is evaluated through accuracy, runtime in seconds, and memory usage in GB. We find that parallelization improves the runtime of IHTC substantially. For example, for a dataset of size n = 10⁹, dividing the data into 500 batches and applying parallelization through the parallel package on a node with 8 cores decreases runtime by a factor of 4.36. Additionally, Rdsm parallelism for smaller-scale data (n = 10⁸) may decrease memory usage while preserving clustering accuracy. We conclude that a parallel programming design should create a proper number of threads to provide enough work for all cores to efficiently use the available computing resources. en_US
dc.language.iso en en_US
dc.subject Parallel en_US
dc.subject Iterative hybridized threshold clustering en_US
dc.subject Massive data en_US
dc.title Parallel iterative hybridized threshold clustering for massive data en_US
dc.type Report en_US Master of Science en_US
dc.description.level Masters en_US
dc.description.department Department of Statistics en_US
dc.description.advisor Michael J. Higgins en_US 2020 en_US December en_US
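
The batching scheme described in the abstract (split the data into batches, reduce each batch to prototypes in parallel, pool the prototypes for a final clustering step) can be sketched roughly as follows. This is an illustration only, not the report's implementation: the report uses the R packages Rdsm and parallel, whereas this sketch uses Python. The function names, the default parameters, and the round-robin batch assignment are all assumptions. The reduction step is a deliberately simple stand-in (sort the points, group at least t consecutive points, keep each group's centroid) and is not threshold clustering or ITIS. A thread pool keeps the example self-contained; for CPU-bound work in Python a process pool would likely be the better choice.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def centroid(group):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(group)
    return tuple(sum(coords) / n for coords in zip(*group))


def reduce_once(points, t):
    """Stand-in reduction step (NOT threshold clustering).

    Sorts the points, cuts them into consecutive groups of at least t
    points each, and returns one centroid (prototype) per group.
    """
    if not points:
        return []
    ordered = sorted(points)
    n_groups = max(1, len(ordered) // t)
    groups = [ordered[i * t:(i + 1) * t] for i in range(n_groups)]
    groups[-1].extend(ordered[n_groups * t:])  # leftovers join the last group
    return [centroid(g) for g in groups]


def reduce_batch(batch, t, iterations):
    """Apply the reduction repeatedly; each pass shrinks the batch by about 1/t."""
    prototypes = list(batch)
    for _ in range(iterations):
        prototypes = reduce_once(prototypes, t)
    return prototypes


def batched_prototypes(points, n_batches, t=2, iterations=2, workers=4):
    """Split points into batches, reduce each batch concurrently, pool the results."""
    # Round-robin assignment (an assumption) keeps batch sizes nearly equal.
    batches = [points[i::n_batches] for i in range(n_batches)]
    work = partial(reduce_batch, t=t, iterations=iterations)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_batch = list(pool.map(work, batches))
    # The pooled prototypes are what a final k-means step would consume.
    return [p for protos in per_batch for p in protos]
```

With t = 2 and two iterations, each batch shrinks to roughly a quarter of its size, so the pooled prototype set is far smaller than the input. The abstract's conclusion about giving every core enough work corresponds here to choosing `n_batches` at least as large as `workers`.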
