Prediction and variable selection in sparse ultrahigh dimensional additive models

dc.contributor.author: Ramirez, Girly Manguba
dc.date.accessioned: 2013-07-17T16:34:11Z
dc.date.available: 2013-07-17T16:34:11Z
dc.date.graduationmonth: August [en_US]
dc.date.issued: 2013-07-17
dc.date.published: 2013 [en_US]
dc.description.abstract: Advances in technology have enabled many fields to collect datasets in which the number of covariates (p) tends to be much larger than the number of observations (n), the so-called ultrahigh dimensionality. In this setting, classical regression methodologies are invalid. There is a great need to develop methods that can explain the variation of the response variable using only a parsimonious set of covariates. In recent years, there have been significant developments in variable selection procedures. However, the available procedures usually select too many false variables. In addition, most of them are appropriate only when the response variable is linearly associated with the covariates. Motivated by these concerns, we propose another procedure for variable selection in the ultrahigh dimensional setting that can reduce the number of false positive variables. Moreover, this procedure can be applied when the response variable is continuous or binary, and when the response variable is linearly or non-linearly related to the covariates. Inspired by the Least Angle Regression approach, we develop two multi-step algorithms to select variables in sparse ultrahigh dimensional additive models. The variables go through a series of nonlinear dependence evaluations following a Most Significant Regression (MSR) algorithm. The MSR algorithm is also designed to predict the response variable. The first algorithm, MSR-continuous (MSRc), is appropriate for a dataset with a continuous response variable. Simulation results demonstrate that this algorithm works well. Comparisons with other methods, such as greedy-INIS by Fan et al. (2011) and the generalized correlation procedure by Hall and Miller (2009), showed that MSRc not only has a false positive rate significantly lower than both methods, but also has accuracy and a true positive rate comparable with greedy-INIS. The second algorithm, MSR-binary (MSRb), is appropriate when the response variable is binary. Simulations demonstrate that MSRb is competitive in terms of prediction accuracy and true positive rate, and better than GLMNET in terms of false positive rate. Application of MSRb to real datasets is also presented. In general, the MSR algorithm selects fewer variables while preserving the accuracy of predictions. [en_US]
dc.description.advisor: Haiyan Wang [en_US]
dc.description.degree: Doctor of Philosophy [en_US]
dc.description.department: Department of Statistics [en_US]
dc.description.level: Doctoral [en_US]
dc.identifier.uri: http://hdl.handle.net/2097/15989
dc.language.iso: en_US [en_US]
dc.publisher: Kansas State University [en]
dc.subject: Variable selection [en_US]
dc.subject: Prediction [en_US]
dc.subject: Smoothing [en_US]
dc.subject: Additive models [en_US]
dc.subject: Parsity [en_US]
dc.subject: Ultrahigh dimensional [en_US]
dc.subject.umi: Statistics (0463) [en_US]
dc.title: Prediction and variable selection in sparse ultrahigh dimensional additive models [en_US]
dc.type: Dissertation [en_US]

Files

Original bundle
Name: GirlyRamirez2013.pdf
Size: 547.2 KB
Format: Adobe Portable Document Format