A classification for complex imbalanced data

Date

2022

Publisher

Kansas State University

Abstract

Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods, such as logistic regression and support vector machines (SVMs), often perform poorly when the class distribution is severely skewed, and even more so when the data are high-dimensional and longitudinal. Given the ubiquity of big data in areas such as modern health research, face recognition, and object identification, such complex data structures are expected to add a further level of difficulty to imbalanced classification. In this dissertation, a nonparametric classification approach is proposed for binary imbalanced data in longitudinal and high-dimensional settings. The approach involves two stages. First, functional principal component analysis (FPCA) is applied for feature extraction under the sparse and irregular longitudinal data structure. Then, the proposed univariate exponential loss function, coupled with a group LASSO penalty, is used for classification in high-dimensional settings. Along with improving AUC and sensitivity under imbalanced classification, the proposed approach provides meaningful feature selection for interpretation and is computationally efficient. The proposed method is illustrated with real data sets on Alzheimer’s disease, Pima Indians diabetes, and Phoneme, and its finite-sample empirical performance is extensively evaluated by simulations.
Furthermore, the proposed method is extended to the multi-class scenario, in which the aforementioned complications become more challenging. To accommodate dense longitudinal/functional data, natural cubic splines are used for feature extraction and dimension reduction instead of FPCA. Functional biomarkers are efficiently characterized by spline coefficients, which are treated as features in the subsequent classification procedure. With these transformed features, a novel exponential loss function is proposed that casts the multi-class classification task as a single optimization problem. Coupled with the group LASSO penalty, the proposed approach can also perform variable selection for each class individually. In addition, a simple weight-adjusted margin can be incorporated into the proposed loss function to address imbalance in multi-class data. The overall empirical performance of the proposed framework is evaluated by simulations in both high- and low-dimensional settings. Finally, the proposed multi-class classification framework is illustrated using real data on Alzheimer’s disease, gene expression, and human walking.
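
As a rough illustration of the second stage described in the abstract, the sketch below minimizes an exponential loss with a group LASSO penalty by proximal gradient descent. It is not the dissertation's method: the abstract does not state the form of the proposed univariate exponential loss, so the standard loss exp(-y f(x)) with a linear score is assumed here, and the function names, the sqrt(group size) penalty weights, the step size, and the simulated data are all hypothetical choices made for this example.

```python
import numpy as np


def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrinks the whole group, possibly to zero."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v


def fit_group_lasso_exp(X, y, groups, lam=0.2, step=0.05, n_iter=500):
    """Proximal gradient descent for
        mean(exp(-y * X @ beta)) + lam * sum_g sqrt(p_g) * ||beta_g||_2.

    X: (n, p) feature matrix; y: labels in {-1, +1};
    groups: list of index arrays, one per feature group (assumed non-overlapping).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        weights = np.exp(-y * (X @ beta))      # per-sample exponential loss
        grad = -(X.T @ (weights * y)) / n      # gradient of the mean loss
        z = beta - step * grad                 # plain gradient step
        for g in groups:                       # group-wise shrinkage step
            z[g] = group_soft_threshold(z[g], step * lam * np.sqrt(len(g)))
        beta = z
    return beta


# Simulated example (assumption): three groups of two features,
# only the first group carries signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6))
signal = X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(300)
y = np.where(signal >= 0, 1.0, -1.0)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

beta = fit_group_lasso_exp(X, y, groups)
for k, g in enumerate(groups):
    print(f"group {k}: coefficient norm = {np.linalg.norm(beta[g]):.3f}")
```

The shrinkage step acts on each group as a whole, which is why a group LASSO penalty can remove all features of an uninformative group together. In the setting of the abstract, a group would plausibly correspond to the extracted scores or spline coefficients of one longitudinal biomarker, although that mapping is an assumption here.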

Keywords

Imbalanced classification, AUC, Multi-class data, Group variable selection, Longitudinal structure, Alzheimer's disease

Graduation Month

August

Degree

Doctor of Philosophy

Department

Department of Statistics

Major Professor

Wei-Wen Hsu; Weixing Song

Type

Dissertation
