Parallel iterative hybridized threshold clustering for massive data

Date

2020-12-11

Abstract

Iterative hybridized threshold clustering (IHTC) is a recently developed algorithm designed to decrease the runtime and memory usage of commonly used clustering algorithms in massive data settings. IHTC pre-processes the data with iterative threshold instance selection (ITIS) to reduce the size of the data before proceeding with a standard clustering analysis, such as k-means or hierarchical clustering. However, when dealing with massive amounts of data, for example, when the number of data points n > 10⁸, the computational cost of IHTC may still be prohibitive. An efficient parallel implementation may provide a pathway to further reduce the computational cost and memory usage of IHTC. In this study, we partition the data points into batches and perform IHTC on each batch. The prototypes generated by IHTC are then collected for k-means clustering. We implement the parallelization using the R packages “Rdsm” and “parallel”, and test our implementation on simulated data on the Beocat high-performance cluster, varying the number of cores and batches. Performance is evaluated through accuracy, runtime in seconds, and memory usage in GB. We find that parallelization improves the runtime of IHTC substantially. For example, for a dataset of size n = 10⁹, dividing the data into 500 batches and applying parallelization through the parallel package on a node with 8 cores decreases runtime by a factor of 4.36. Additionally, Rdsm parallelism for smaller-scale data (n = 10⁸) may decrease memory usage while preserving clustering accuracy. We conclude that a parallel programming design should create an appropriate number of threads, enough to give every core work, so that the available computing resources are used efficiently.
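
The batch-and-combine workflow described in the abstract can be sketched roughly as follows. This is only an illustration in Python, not the report's R implementation (which uses the Rdsm and parallel packages), and it rests on several assumptions. The data are one-dimensional. `reduce_batch` is a hypothetical stand-in for ITIS that averages runs of at least t sorted neighbours; it is not the actual threshold clustering step. `kmeans_1d` is a plain Lloyd iteration with a quantile-based start. All function and parameter names are invented for this example. Threads are used by default for portability; a process-based executor could be passed in through `executor_cls` when real CPU parallelism is wanted.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from statistics import fmean


def reduce_batch(batch, t=2, iterations=2):
    """Hypothetical stand-in for ITIS (NOT the real threshold clustering step).

    Each pass sorts the points, groups them into runs of at least t
    neighbours, and replaces every run by its mean (a "prototype").
    """
    points = sorted(batch)
    for _ in range(iterations):
        if len(points) < 2 * t:
            break  # too few points to form more than one group
        groups = [points[i:i + t] for i in range(0, len(points), t)]
        if len(groups[-1]) < t:
            # Fold a short trailing group into its neighbour so that
            # every group keeps at least t points.
            tail = groups.pop()
            groups[-1].extend(tail)
        points = [fmean(g) for g in groups]
    return points


def kmeans_1d(points, k, max_iter=50):
    """Plain Lloyd iteration on 1-D data with a deterministic quantile start."""
    pts = sorted(points)
    n = len(pts)
    centers = [pts[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        updated = [fmean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


def parallel_prototype_kmeans(data, n_batches, k, workers=4, t=2,
                              iterations=2, executor_cls=ThreadPoolExecutor):
    """Split data into batches, reduce each batch in parallel, then cluster."""
    batches = [data[i::n_batches] for i in range(n_batches)]
    reduce_fn = partial(reduce_batch, t=t, iterations=iterations)
    with executor_cls(max_workers=workers) as pool:
        reduced = list(pool.map(reduce_fn, batches))
    # Collect the prototypes from all batches for the final k-means step.
    prototypes = [p for batch in reduced for p in batch]
    return kmeans_1d(prototypes, k), len(prototypes)
```

With the defaults above (t = 2, two passes), each batch shrinks to roughly a quarter of its size before the final k-means step, so 400 points split into 4 batches yield 100 prototypes. The number of batches relative to the number of workers is the tuning knob the abstract's conclusion refers to: too few batches leave workers idle.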

Keywords

Parallel, Iterative hybridized threshold clustering, Massive data

Graduation Month

December

Degree

Master of Science

Department

Department of Statistics

Major Professor

Michael J. Higgins

Date

2020

Type

Report
