Li, Yiming
2022-08-01
2022-08-01
2022
https://hdl.handle.net/2097/42394
Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods, such as logistic regression and support vector machines (SVMs), often perform poorly when the class distribution is severely skewed, and even more so under a high-dimensional longitudinal data structure. Given the ubiquity of big data in areas including modern health research, face recognition, and object identification, imbalanced classification is expected to face an additional level of difficulty imposed by such complex data structures. In this dissertation, a nonparametric classification approach is proposed for binary imbalanced data in longitudinal and high-dimensional settings. The proposed approach involves two stages. Functional principal component analysis (FPCA) is first applied for feature extraction under the sparse and irregular longitudinal data structure. The proposed univariate exponential loss function, coupled with a group LASSO penalty, is then used in the classification procedure in high-dimensional settings. Along with improving AUC and sensitivity under imbalanced classification, the proposed approach provides meaningful feature selection for interpretation while achieving remarkable computational efficiency. The proposed method is illustrated with real data on Alzheimer’s disease, Pima Indians diabetes, and Phoneme, and its finite-sample empirical performance is extensively evaluated by simulations. Furthermore, the proposed method is extended to the multi-class scenario, in which the aforementioned complications become more challenging. To accommodate dense longitudinal/functional data, natural cubic splines are adopted for feature extraction and dimension reduction instead of FPCA.
Functional biomarkers are efficiently characterized by spline coefficients, which are treated as features for the subsequent classification procedure. With these transformed features, a novel exponential loss function is then proposed to cast the multi-class classification task as a single optimization problem. Coupled with the group LASSO penalty, the proposed approach is also capable of performing variable selection for each class individually. In addition, a simple weight-adjusted margin can be easily incorporated into the proposed loss function to address imbalance in multi-class data. The overall empirical performance of the proposed framework is evaluated by simulations in both high- and low-dimensional settings. Finally, the proposed multi-class classification framework is illustrated using real data on Alzheimer’s disease, gene expression, and human walking.
en-US
© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
http://rightsstatements.org/vocab/InC/1.0/
Imbalanced classification
AUC
Multi-class data
Group variable selection
Longitudinal structure
Alzheimer's disease
A classification for complex imbalanced data
Dissertation