LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification

dc.contributor.author: Mungre, Surbhi
dc.date.accessioned: 2011-05-06T21:29:35Z
dc.date.available: 2011-05-06T21:29:35Z
dc.date.graduationmonth: May
dc.date.issued: 2011-05-06
dc.date.published: 2011
dc.description.abstract: Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms depends largely on the availability of labeled data and on the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce, while unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences, LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a generative probabilistic model used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are the "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers).
In the LDAW method, we use the top k-mers in each topic (i.e., the k-mers with the highest probability) as our features, while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, which is an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
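The two feature constructions described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the toy collapsed Gibbs sampler, the example sequences, the hyperparameters (`alpha`, `beta`, `n_topics`, `top_n`), and the choice of k-mer counts as LDAW feature values are all assumptions made for illustration, and every function name is hypothetical.

```python
import random
from collections import Counter

def kmers(seq, k=3):
    """Slide a window of width k over the sequence; each k-mer is one 'word'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]           # topic counts per document
    topic_word = [Counter() for _ in range(n_topics)]    # k-mer counts per topic
    topic_total = [0] * n_topics                         # total words per topic
    assign = []
    # Random initial topic assignment for every k-mer occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample the topic.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word

def ldad_features(doc_topic, alpha=0.1):
    """LDAD: represent each sequence by its smoothed topic distribution."""
    feats = []
    for counts in doc_topic:
        total = sum(counts) + alpha * len(counts)
        feats.append([(c + alpha) / total for c in counts])
    return feats

def ldaw_vocabulary(topic_word, top_n=3):
    """LDAW: keep the top_n most probable k-mers of each topic."""
    selected = set()
    for counts in topic_word:
        selected.update(w for w, c in counts.most_common(top_n) if c > 0)
    return sorted(selected)

def ldaw_features(docs, selected):
    """Assumed feature value: how often each selected k-mer occurs in the sequence."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[w] for w in selected])
    return rows

# Made-up sequences, used only to exercise the functions.
sequences = ["ACGTACGTAGGT", "ACGTAGGTACGT", "TTTTCCCCTTTT", "CCCCTTTTCCCC"]
docs = [kmers(s) for s in sequences]
vocab, doc_topic, topic_word = fit_lda(docs)
ldad = ldad_features(doc_topic)
selected = ldaw_vocabulary(topic_word)
ldaw = ldaw_features(docs, selected)
```

Under these assumptions, `ldad` has one column per topic, while `ldaw` has one column per selected k-mer; both are much narrower than a feature vector over the full k-mer vocabulary, which is the dimensionality reduction the abstract refers to. Ranking k-mers by their count within a topic gives the same order as ranking by smoothed topic probability, since the smoothing is the same for every k-mer in that topic.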
dc.description.advisor: Doina Caragea
dc.description.degree: Master of Science
dc.description.department: Department of Computing and Information Sciences
dc.description.level: Masters
dc.description.sponsorship: National Science Foundation (NSF 0711396) and Arthropod Genomics Center
dc.identifier.uri: http://hdl.handle.net/2097/8846
dc.language.iso: en_US
dc.publisher: Kansas State University
dc.subject: Domain Adaptation
dc.subject: Splice Site Prediction
dc.subject: Latent Dirichlet Allocation
dc.subject: DNA Sequence Classification
dc.subject: Dimensionality Reduction
dc.subject.umi: Computer Science (0984)
dc.title: LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification
dc.type: Thesis

Files

Name: SurbhiMungre2011.pdf
Size: 4.14 MB
Format: Adobe Portable Document Format