LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification

K-REx Repository

Show simple item record

dc.contributor.author Mungre, Surbhi
dc.date.accessioned 2011-05-06T21:29:35Z
dc.date.available 2011-05-06T21:29:35Z
dc.date.issued 2011-05-06
dc.identifier.uri http://hdl.handle.net/2097/8846
dc.description.abstract Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and the approach used to represent DNA sequences as {\it feature vectors}. For many organisms, the labeled DNA data is scarce, while the unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for {\it domain adaptation} approaches, which can transfer knowledge from a {\it source} domain, for which labeled data is available, to a {\it target} domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. \emph{Latent Dirichlet Allocation} (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features, that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches, LDA Words (LDAW) and LDA Distribution (LDAD) for DNA sequences. LDA is a probabilistic model, which is generative in nature, and is used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a ``document" and k-mers obtained from a sequence are ``document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic as our features (i.e., k-mers with the highest probability); while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification, as well as domain adaptation approaches. We apply the proposed approaches on the splice site predication problem, which is an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distances between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain and with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems. en_US
dc.description.sponsorship National Science Foundation (NSF 0711396) and Arthropod Genomics Center en_US
dc.language.iso en_US en_US
dc.publisher Kansas State University en
dc.subject Domain Adaptation en_US
dc.subject Splice Site Prediction en_US
dc.subject Latent Dirichlet Allocation en_US
dc.subject DNA Sequence Classification en_US
dc.subject Dimentionality Reduction en_US
dc.title LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification en_US
dc.type Thesis en_US
dc.description.degree Master of Science en_US
dc.description.level Masters en_US
dc.description.department Department of Computing and Information Sciences en_US
dc.description.advisor Doina Caragea en_US
dc.subject.umi Computer Science (0984) en_US
dc.date.published 2011 en_US
dc.date.graduationmonth May en_US


Files in this item

This item appears in the following Collection(s)

Show simple item record

Search K-REx


Advanced Search

Browse

My Account

Statistics








Center for the

Advancement of Digital

Scholarship

118 Hale Library

Manhattan KS 66506


(785) 532-7444

cads@k-state.edu