Instance selection with threshold clustering for support vector machines

Abstract

Tremendous advances in computing power have allowed the size of datasets to grow massively. Many machine learning approaches have been developed to deal with massive data by reducing the number of features, observations, or both. Instance selection (IS) is a data mining process that scales down the number of observations in a dataset. In this research, we focus on IS methods that rely on clustering algorithms, particularly on threshold clustering (TC). TC is a recent, efficient clustering method. Given a fixed size threshold t*, TC forms clusters of t* or more units while ensuring that the maximum within-cluster dissimilarity is small. Unlike most traditional clustering methods, TC is designed to form many small clusters of units, making it ideal for IS.

The support vector machine (SVM) is a powerful method for classification. However, training an SVM may be computationally infeasible for large datasets: training requires O(N^3) runtime, where N is the size of the training data. In this dissertation, we propose an IS method for training SVM in big data settings, called support vector machines with threshold clustering (SVMTC). Our proposed method begins by clustering each class in the training set separately using TC. The centroids of all clusters then form the reduced set; if the data reduction is insufficient, TC may be repeated. The SVM is then trained on the reduced dataset (a schematic sketch follows the abstract). In this way, our proposed method can reduce the training set for SVM by a factor of (t*)^r or more, where r is the number of iterations of TC, dramatically reducing the runtime required to train the SVM. Furthermore, we prove that, under the Gaussian radial basis function kernel, the maximum distance between the Gram matrix of the original data (which is used to find the support vectors) and the Gram matrix of the reduced data is bounded by a function of the maximum within-cluster distance for TC. We then show, via simulation and application to datasets, that SVMTC efficiently reduces the size of training sets without sacrificing the prediction accuracy of SVM. Moreover, it often outperforms competing IS methods in terms of runtime, memory usage, and prediction accuracy.

Next, we explore best practices for applying feature reduction methods with SVMTC when the number of features is large. We investigate the usefulness of various feature selection and feature extraction methods, including principal component analysis (PCA), linear discriminant analysis (LDA), the LASSO, and Fisher scores, as an initial step of SVMTC. For feature reduction methods that return linear combinations of the original features, such as PCA, we also investigate whether prototypes should be formed from the original features or from the transformed features. We compare, via application to datasets, the performance of SVMTC under these feature reduction methods. We find that the LASSO tends to be an effective feature selection method and, overall, show that SVMTC is significantly improved under the proposed methods.

Finally, we perform a comparative study of iterative threshold instance selection (ITIS) and other IS methods. ITIS is a recent extension of TC designed for IS. We use simulation to compare ITIS with competing methods. The results illustrate that ITIS is effective in massive data settings when compared against other instance selection methods such as k-means and its variants.
In addition, we demonstrate the efficacy of hybrid clustering algorithms that use ITIS as an initial step, and we show, via a simulation study, that these methods outperform other hybrid clustering methods in terms of runtime and memory usage without sacrificing performance.
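The SVMTC procedure described above is straightforward to prototype. Below is a minimal Python sketch, not the dissertation's implementation: threshold clustering is not available in standard libraries, so scikit-learn's KMeans with roughly N/t* clusters per class stands in for TC (cluster sizes are therefore only approximately t*), and the function name svmtc_fit and its defaults are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def svmtc_fit(X, y, t_star=10, n_iter=1):
        """Sketch of SVM with threshold-clustering-style instance selection.
        KMeans (about N/t* clusters per class) is a stand-in for TC."""
        Xr, yr = X, y
        for _ in range(n_iter):                    # r rounds of reduction
            protos, labels = [], []
            for c in np.unique(yr):                # cluster each class separately
                Xc = Xr[yr == c]
                k = max(1, len(Xc) // t_star)      # aim for ~t* points per cluster
                km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
                protos.append(km.cluster_centers_) # centroids become the reduced set
                labels.append(np.full(k, c))
            Xr, yr = np.vstack(protos), np.concatenate(labels)
        return SVC(kernel="rbf").fit(Xr, yr)       # train SVM on the prototypes

For instance, with N = 100,000 training points, t* = 10, and r = 2 iterations, the reduced set contains roughly N/(t*)^2 = 1,000 prototypes, shrinking the nominal O(N^3) training cost by several orders of magnitude.

The kernel bound claimed in the abstract can be written in notation as follows, reading "maximum distance" as the entrywise maximum; the exact form of the bounding function f is derived in the dissertation and is not reproduced here. With the Gaussian radial basis function kernel,

    K_{ij} = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \qquad \max_{i,j} \bigl| K_{ij} - \widetilde{K}_{ij} \bigr| \le f(\delta),

where \widetilde{K} is the Gram matrix after each point is replaced by its cluster centroid and \delta is the maximum within-cluster distance produced by TC.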

Keywords

Instance Selection

Graduation Month

May

Degree

Doctor of Philosophy

Department

Department of Statistics

Major Professor

Michael J Higgins

Date

2023

Type

Dissertation
