Domain adaptation algorithms for biological sequence classification

dc.contributor.authorHerndon, Nic
dc.date.accessioned2017-02-17T20:04:43Z
dc.date.available2017-02-17T20:04:43Z
dc.date.graduationmonthMayen_US
dc.date.issued2016-05-01en_US
dc.date.published2016en_US
dc.description.abstractThe large volume of data generated in the recent years has created opportunities for discoveries in various fields. In biology, next generation sequencing technologies determine faster and cheaper the exact order of nucleotides present within a DNA or RNA fragment. This large volume of data requires the use of automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data but require some of these data to be manually labeled by human experts, a process that is costly and time consuming. An alternative to labeling data is to use existing labeled data from a related domain, the source domain, if any such data is available, to train a classifier for the domain of interest, the target domain. However, the classification accuracy usually decreases for the domain of interest as the distance between the source and target domains increases. Another alternative is to label some data and complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such classifier. In this work another alternative is considered, domain adaptation, in which the goal is to train an accurate classifier for a domain with limited labeled data and abundant unlabeled data, the target domain, by leveraging labeled data from a related domain, the source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: features, data used from the source domain, how to incorporate the unlabeled data, and how to combine all available data. The proposed approaches were evaluated on two biological problems -- protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication for its function, whereas the latter is an essential step in gene prediction.en_US
dc.description.advisorDoina Carageaen_US
dc.description.degreeDoctor of Philosophyen_US
dc.description.departmentDepartment of Computing and Information Sciencesen_US
dc.description.levelDoctoralen_US
dc.identifier.urihttp://hdl.handle.net/2097/35242
dc.language.isoen_USen_US
dc.publisherKansas State Universityen
dc.subjectMachine learningen_US
dc.subjectBiological sequence classificationen_US
dc.subjectProtein localizationen_US
dc.subjectDomain adapationen_US
dc.subjectSplice site predictionen_US
dc.titleDomain adaptation algorithms for biological sequence classificationen_US
dc.typeDissertationen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
NicHerndon2016.pdf
Size:
49.97 MB
Format:
Adobe Portable Document Format
Description:
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.62 KB
Format:
Item-specific license agreed upon to submission
Description: