A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction

dc.citation.doi: doi:10.1007/s00726-014-1667-5 (en_US)
dc.citation.epage: 1119 (en_US)
dc.citation.issue: 4 (en_US)
dc.citation.jtitle: Amino Acids (en_US)
dc.citation.spage: 1105 (en_US)
dc.citation.volume: 46 (en_US)
dc.contributor.author: Dai, Zhijun
dc.contributor.author: Wang, Lifeng
dc.contributor.author: Chen, Yuan
dc.contributor.author: Wang, Haiyan
dc.contributor.author: Bai, Lianyang
dc.contributor.author: Yuan, Zheming
dc.contributor.authoreid: hwang (en_US)
dc.date.accessioned: 2014-06-24T20:09:25Z
dc.date.available: 2014-06-24T20:09:25Z
dc.date.issued: 2014-06-24
dc.date.published: 2014 (en_US)
dc.description.abstract: In this paper, we present a pipeline for improved QSAR analysis of peptides. The modeling involves a double selection procedure that first performs feature selection and then conducts sample selection before the final regression analysis. Five hundred and thirty-one physicochemical property parameters of amino acids were used as descriptors to characterize the structure of peptides. These high-dimensional descriptors were passed through a feature selection process, the Binary Matrix Shuffling Filter (BMSF), to obtain a low-dimensional set of important features. Each descriptor that passed the BMSF filtering also received a weight defined by its contribution to reducing the estimation error. The selected features served as the predictors for subsequent sample selection and modeling. Based on the weighted Euclidean distances between samples, a common range was determined with a high-dimensional semivariogram and then used as a threshold to select the near-neighbor samples from the training set. For each sample to be predicted, the QSAR model was established using SVR with the weighted, selected features, based on that sample's exclusive set of near-neighbor training samples. Prediction was conducted for each test sample accordingly. The performance of this pipeline was tested with the QSAR analysis of angiotensin-converting enzyme (ACE) inhibitors and HLA-A*0201 data sets. Improved prediction accuracy was obtained in both applications. This pipeline can optimize QSAR modeling from both the feature selection and the sample selection perspectives, which leads to improved accuracy over single selection methods. We expect this pipeline to have broad application prospects in the field of regression prediction. (en_US)
dc.identifier.uri: http://hdl.handle.net/2097/17878
dc.language.iso: en_US (en_US)
dc.relation.uri: http://link.springer.com/article/10.1007%2Fs00726-014-1667-5 (en_US)
dc.rights: The final publication is available at link.springer.com (en_US)
dc.subject: Peptides (en_US)
dc.subject: Quantitative structure-activity regression (en_US)
dc.subject: Feature selection (en_US)
dc.subject: Semivariogram (en_US)
dc.subject: Support vector regression (en_US)
dc.title: A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction (en_US)
dc.type: Article (author version) (en_US)
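
The sample selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give the exact distance formula, so the form sqrt(sum of w_k * (x_k - y_k)^2) is an assumption, and the function names, toy data, weights, bin width, and range value are all hypothetical. BMSF feature selection, estimation of the range from the semivariogram, and the SVR fit are not shown.

```python
import math

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance (assumed form): sqrt(sum_k w_k * (x_k - y_k)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def empirical_semivariogram(X, y, weights, bin_width):
    """Map each distance-bin midpoint to half the mean squared activity difference."""
    sums, counts = {}, {}
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            h = weighted_distance(X[i], X[j], weights)
            b = int(h // bin_width)  # index of the distance bin for this pair
            sums[b] = sums.get(b, 0.0) + (y[i] - y[j]) ** 2
            counts[b] = counts.get(b, 0) + 1
    return {(b + 0.5) * bin_width: sums[b] / (2 * counts[b]) for b in sorted(sums)}

def near_neighbors(query, X, weights, max_range):
    """Indices of training samples whose weighted distance to the query is within max_range."""
    return [i for i, x in enumerate(X)
            if weighted_distance(query, x, weights) <= max_range]

# Hypothetical toy data: four training samples, two selected features.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
y = [1.0, 1.2, 2.0, 5.0]
weights = [1.0, 0.25]   # stand-ins for the BMSF-derived feature weights

gamma = empirical_semivariogram(X, y, weights, bin_width=1.0)
selected = near_neighbors([0.0, 0.0], X, weights, max_range=1.5)
```

In the pipeline as the abstract describes it, the threshold passed as `max_range` would be the common range read off the semivariogram, and a separate SVR model would then be fitted on the `selected` training samples for each test sample.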

Files

Original bundle
Name: WangAminoAcids2014.pdf
Size: 250.16 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission