Prediction of university student attrition rate using Ridge and Lasso Regression
Abstract
One of the major challenges faced by many institutions is attrition. Institutional attrition is the phenomenon of individuals leaving an institution before completing a term-limited program; the term can apply to employees (e.g., postdoctoral fellows) or students (Bani, J., Haji, & M., https://pdfs.semanticscholar.org/94b1/, 2017). In the context of this project, which focuses on student attrition, it covers students who drop out, are dismissed, or do not return to their studies before completing their degree. The student attrition rate at a university is often measured as the net change in enrollment per year due to students discontinuing their studies at that university. One consequence of attrition is that students fail to graduate despite significant investments in the form of funding from scholarship-granting institutions or governments. This project studies the factors contributing to student attrition at a land-grant state university and predicts whether a student will drop out based on factors such as gender, race, and cumulative GPA. The study is timely and necessary because a predictive model may allow an institution to recognize the factors that contribute to dropping out, helping it retain students and prevent dropout and “stopout” in certain cases. A decrease in preventable attrition may similarly enable more students to earn the degrees they were pursuing, at a point where they can realize more of the professional and economic benefits of those degrees. The report comprises a brief review of the supporting literature on student attrition prediction and describes a machine learning and data science project centered on further explorations of a previously developed experimental test bed.
These explorations involve extracting data from historical archives (raw data from the university registrar’s office and other sources), cleaning the data, building the training and testing data sets for the supervised learning algorithms, training and evaluating the models, and reviewing the models to derive actionable insights. Logistic regression, a supervised inductive learning algorithm, is used to train a classification model, which in turn is used to predict student dropout on a case-by-case basis. Regression models that use L2 regularization (ridge regression) and L1 regularization (lasso regression) are also used to predict student dropout. These regularized models are used for feature selection and for building a flexible model when the data contain a large set of features. Performance metrics such as accuracy, precision, recall, and F1 score are used to compare the models.
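As a rough illustration of the modeling approach described in the abstract, the sketch below trains logistic regression with an L2 (ridge-style) and an L1 (lasso-style) penalty and compares the two using accuracy, precision, recall, and F1 score. It is not the report's implementation. The data are synthetic, the two features (a standardized GPA and an unrelated noise column) are invented for the example, and every function name, the penalty strength `lam`, and the learning rate are assumptions chosen for readability. The optimizer is plain batch gradient descent, with a soft-thresholding step standing in for the L1 penalty.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal step for the L1 penalty: shrink toward zero, clip at zero.
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_logistic(X, y, penalty="l2", lam=0.05, lr=0.1, epochs=500):
    """Batch gradient descent for logistic regression with an L1 or L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        for j in range(d):
            if penalty == "l2":
                w[j] -= lr * (grad_w[j] + lam * w[j])
            else:
                w[j] = soft_threshold(w[j] - lr * grad_w[j], lr * lam)
    return w, b


def predict(X, w, b, threshold=0.5):
    # 1 = predicted dropout, 0 = predicted to persist.
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
            for xi in X]


def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic stand-in for student records (assumption: not the registrar data).
rng = random.Random(0)
X, y = [], []
for _ in range(300):
    gpa = rng.gauss(0, 1)    # standardized cumulative GPA
    noise = rng.gauss(0, 1)  # feature unrelated to the outcome
    X.append([gpa, noise])
    y.append(int(gpa + 0.3 * rng.gauss(0, 1) < 0))  # 1 = dropped out

X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

results = {}
for penalty in ("l2", "l1"):
    w, b = fit_logistic(X_train, y_train, penalty=penalty)
    metrics = classification_metrics(y_test, predict(X_test, w, b))
    results[penalty] = (w, metrics)
    print(penalty, [round(v, 3) for v in w], metrics)
```

Comparing the printed weights for the two penalties shows the practical difference between them: the L2 penalty shrinks all coefficients smoothly, while the soft-thresholding step used for L1 can set a weak coefficient to exactly zero, which is what makes lasso-style models useful for feature selection. In practice a library implementation would normally replace this hand-written optimizer.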