Domain adaptation algorithms for biological sequence classification

K-REx Repository


dc.contributor.author Herndon, Nic
dc.date.accessioned 2017-02-17T20:04:43Z
dc.date.available 2017-02-17T20:04:43Z
dc.date.issued 2016-05-01 en_US
dc.identifier.uri http://hdl.handle.net/2097/35242
dc.description.abstract The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next-generation sequencing technologies make it faster and cheaper to determine the exact order of nucleotides within a DNA or RNA fragment. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data, but they require some of the data to be manually labeled by human experts, a process that is costly and time-consuming. One alternative to labeling data is to use existing labeled data from a related domain (the source domain), if any such data are available, to train a classifier for the domain of interest (the target domain). However, classification accuracy on the target domain usually decreases as the distance between the source and target domains increases. Another alternative is to label some data, complement them with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. This work considers a third alternative, domain adaptation, in which the goal is to train an accurate classifier for a target domain with limited labeled data and abundant unlabeled data by leveraging labeled data from a related source domain. Several domain adaptation classifiers are proposed, each derived from either a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how the unlabeled data are incorporated, and how all available data are combined. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction.
The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction. en_US
dc.language.iso en_US en_US
dc.publisher Kansas State University en
dc.subject Machine learning en_US
dc.subject Biological sequence classification en_US
dc.subject Protein localization en_US
dc.subject Domain adaptation en_US
dc.subject Splice site prediction en_US
dc.title Domain adaptation algorithms for biological sequence classification en_US
dc.type Dissertation en_US
dc.description.degree Doctor of Philosophy en_US
dc.description.level Doctoral en_US
dc.description.department Department of Computing and Information Sciences en_US
dc.description.advisor Doina Caragea en_US
dc.date.published 2016 en_US
dc.date.graduationmonth May en_US

