Domain adaptation algorithms for biological sequence classification

Herndon, Nic

dc.contributor.author	Herndon, Nic
dc.date.accessioned	2017-02-17T20:04:43Z
dc.date.available	2017-02-17T20:04:43Z
dc.date.graduationmonth	May
dc.date.issued	2016-05-01
dc.description.abstract	The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next-generation sequencing technologies determine the exact order of nucleotides within a DNA or RNA fragment faster and at lower cost. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data but require some of the data to be manually labeled by human experts, a process that is costly and time-consuming. An alternative to labeling data is to use existing labeled data from a related domain, the source domain, if any such data is available, to train a classifier for the domain of interest, the target domain. However, classification accuracy on the target domain usually decreases as the distance between the source and target domains increases. Another alternative is to label some data, complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. This work considers a third alternative, domain adaptation, in which the goal is to train an accurate classifier for a domain with limited labeled data and abundant unlabeled data, the target domain, by leveraging labeled data from a related domain, the source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how the unlabeled data is incorporated, and how all available data is combined. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.
dc.description.advisor	Doina Caragea
dc.description.degree	Doctor of Philosophy
dc.description.department	Department of Computing and Information Sciences
dc.description.level	Doctoral
dc.identifier.uri	http://hdl.handle.net/2097/35242
dc.language.iso	en_US
dc.publisher	Kansas State University
dc.rights	© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Machine learning
dc.subject	Biological sequence classification
dc.subject	Protein localization
dc.subject	Domain adaptation
dc.subject	Splice site prediction
dc.title	Domain adaptation algorithms for biological sequence classification
dc.type	Dissertation

Files

Original bundle

Name: NicHerndon2016.pdf
Size: 49.97 MB
Format: Adobe Portable Document Format

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -