Improved shrunken centroid method for better variable selection in cancer classification with high throughput molecular data

K-REx Repository


dc.contributor.author Xukun, Li
dc.date.accessioned 2017-11-10T19:13:02Z
dc.date.available 2017-11-10T19:13:02Z
dc.date.issued 2017-12-01 en_US
dc.identifier.uri http://hdl.handle.net/2097/38188
dc.description.abstract Cancer type classification with high-throughput molecular data has received much attention, and many methods have been published in this area. One of them, PAM (the nearest shrunken centroid algorithm), is simple and efficient and can give very good prediction accuracy. A problem with PAM is that it selects too many genes, some of which may have no influence on cancer type. One reason is that PAM assumes all genes have an identical distribution and applies a common threshold parameter for gene selection. This assumption may not hold in reality, since expression levels of different genes can have very different distributions due to complicated biological processes. We propose a new method that aims to improve the ability of PAM to select informative genes. Keeping informative genes while reducing false positive variables can lead to more accurate classification results and help pinpoint target genes for further study. To achieve this goal, we introduce a variable-specific test based on the Edgeworth expansion. We apply this test to each gene and retain genes based on its result, so that a large number of genes are excluded. Afterward, soft thresholding with cross-validation can be applied to decide a common threshold value. Simulation and real-data application show that our method can reduce irrelevant information and select the informative genes more precisely. The simulation results give more insight into where the newly proposed procedure can improve accuracy, especially when the data set is skewed or unbalanced. The method can be applied to a broad range of molecular data, including, for example, lipidomic data from mass spectrometry, copy number data from genomics, and eQTL analysis with GWAS data. We expect the proposed method will help life scientists accelerate discoveries with high-throughput data. en_US
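The soft-thresholding step mentioned in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: it assumes a matrix `d` of standardized centroid differences (one row per class, one column per gene) has already been computed, and the function names `soft_threshold` and `selected_genes` are hypothetical. It shows only the common-threshold shrinkage used in PAM-style methods, not the variable-specific Edgeworth-based test.

```python
import numpy as np


def soft_threshold(d, delta):
    """Shrink every entry of d toward zero by delta.

    Entries whose absolute value is at most delta become zero.
    """
    d = np.asarray(d, dtype=float)
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)


def selected_genes(d, delta):
    """Return a boolean mask over genes (columns of d).

    A gene is kept if its shrunken difference is nonzero for at
    least one class (row). Assumption: d has shape (classes, genes).
    """
    shrunk = soft_threshold(d, delta)
    return np.any(shrunk != 0.0, axis=0)


if __name__ == "__main__":
    # Illustrative values only, not data from the report.
    d = np.array([[3.0, 0.2, -0.5],
                  [-1.5, 0.1, 0.9]])
    print(soft_threshold(d, 1.0))
    print(selected_genes(d, 1.0))
```

A larger `delta` removes more genes. In practice the value would be chosen by cross-validation, as the abstract describes.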
dc.description.sponsorship This work was partially supported by a grant from the Simons Foundation (#246077) to Haiyan Wang. en_US
dc.language.iso en_US en_US
dc.publisher Kansas State University en
dc.subject Feature selection en_US
dc.subject High dimensional classification en_US
dc.subject Cornish-Fisher expansion en_US
dc.subject Shrunken centroid en_US
dc.title Improved shrunken centroid method for better variable selection in cancer classification with high throughput molecular data en_US
dc.type Report en_US
dc.description.degree Master of Science en_US
dc.description.level Masters en_US
dc.description.department Department of Statistics en_US
dc.description.advisor Haiyan Wang en_US
dc.date.published 2017 en_US
dc.date.graduationmonth December en_US

