Parallel iterative hybridized threshold clustering for massive data

K-REx Repository


dc.contributor.author Yang, Yang
dc.date.accessioned 2020-10-08T13:29:50Z
dc.date.available 2020-10-08T13:29:50Z
dc.date.issued 2020-12-11
dc.identifier.uri https://hdl.handle.net/2097/40872
dc.description.abstract Iterative hybridized threshold clustering (IHTC) is a recently developed algorithm designed to decrease the runtime and memory usage of commonly used clustering algorithms in massive data settings. IHTC pre-processes the data with iterative threshold instance selection (ITIS) to scale down the size of the data before proceeding with a standard clustering analysis, such as k-means or hierarchical clustering. However, when dealing with massive amounts of data, for example, when the number of data points n > 10⁸, the computational cost of IHTC may still be prohibitive. An efficient parallel implementation may provide a pathway to further reduce the computational cost and memory usage of IHTC. In this study, we partition the data points into batches and perform IHTC on each batch. The prototypes generated by IHTC are then collected for k-means clustering. We implement the parallelization using the R packages “Rdsm” and “parallel”, and test our implementation on simulated data on the Beocat high-performance cluster by varying the number of cores and batches. Performance is evaluated through accuracy, runtime in seconds, and memory usage in GB. We find that parallelization improves the runtime of IHTC substantially. For example, for a dataset of size n = 10⁹, dividing the data into 500 batches and applying parallelization through the parallel package on a node with 8 cores decreases runtime by a factor of 4.36. Additionally, Rdsm parallelism for smaller-scale data (n = 10⁸) may decrease memory usage while preserving clustering accuracy. We conclude that a parallel programming design should create a suitable number of threads, so that all cores have enough work and the available computing resources are used efficiently. en_US
dc.language.iso en en_US
dc.subject Parallel en_US
dc.subject Iterative hybridized threshold clustering en_US
dc.subject Massive data en_US
dc.title Parallel iterative hybridized threshold clustering for massive data en_US
dc.type Report en_US
dc.description.degree Master of Science en_US
dc.description.level Masters en_US
dc.description.department Department of Statistics en_US
dc.description.advisor Michael J. Higgins en_US
dc.date.published 2020 en_US
dc.date.graduationmonth December en_US
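
The batch-then-pool workflow described in the abstract (partition the data into batches, reduce each batch to prototypes in parallel, then run k-means on the pooled prototypes) can be sketched roughly as follows. This is a minimal illustration in Python rather than the R packages used in the report, and it is not the author's implementation. The function names `reduce_batch`, `kmeans`, and `parallel_prototype_kmeans` are hypothetical, and `reduce_batch` is a deliberately simple stand-in (averaging consecutive groups of sorted points) for the ITIS step, not threshold clustering itself. A thread pool is used only to keep the sketch portable; CPU-bound work in practice would more likely use a process pool.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def reduce_batch(batch, group_size=4):
    """Hypothetical stand-in for the ITIS reduction step (NOT threshold clustering).

    Sorts the batch and averages each run of `group_size` consecutive points,
    returning one prototype per group.
    """
    pts = sorted(batch)
    prototypes = []
    for i in range(0, len(pts), group_size):
        group = pts[i:i + group_size]
        prototypes.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return prototypes


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its previous center.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers


def parallel_prototype_kmeans(data, n_batches, k, workers=4):
    """Split data into batches, reduce each batch concurrently, cluster the prototypes."""
    batches = [data[i::n_batches] for i in range(n_batches)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        proto_lists = list(pool.map(reduce_batch, batches))
    prototypes = [p for protos in proto_lists for p in protos]
    return prototypes, kmeans(prototypes, k)
```

Calling `parallel_prototype_kmeans(data, n_batches=5, k=2)` on a list of 2-D tuples returns the pooled prototypes and the k-means centers fitted to them. The number of batches relative to the number of workers is the tuning knob the abstract highlights: too few batches leaves cores idle, while many small batches add scheduling overhead.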


Center for the Advancement of Digital Scholarship
cads@k-state.edu