LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification

Mungre, Surbhi

dc.contributor.author	Mungre, Surbhi
dc.date.accessioned	2011-05-06T21:29:35Z
dc.date.available	2011-05-06T21:29:35Z
dc.date.graduationmonth	May
dc.date.issued	2011-05-06
dc.date.published	2011
dc.description.abstract	Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms depends largely on the availability of labeled data and on the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce, while unlabeled data is readily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences, LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a generative probabilistic model used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are its "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). 
In the LDAW method, we use the top k-mers in each topic (i.e., the k-mers with the highest probability) as our features, while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, which is an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
dc.description.advisor	Doina Caragea
dc.description.degree	Master of Science
dc.description.department	Department of Computing and Information Sciences
dc.description.level	Masters
dc.description.sponsorship	National Science Foundation (NSF 0711396) and Arthropod Genomics Center
dc.identifier.uri	http://hdl.handle.net/2097/8846
dc.language.iso	en_US
dc.publisher	Kansas State University
dc.rights	© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Domain Adaptation
dc.subject	Splice Site Prediction
dc.subject	Latent Dirichlet Allocation
dc.subject	DNA Sequence Classification
dc.subject	Dimensionality Reduction
dc.subject.umi	Computer Science (0984)
dc.title	LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification
dc.type	Thesis
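
The two feature constructions described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the record does not say how the LDA model was fitted, so collapsed Gibbs sampling is assumed here, and the function names, the toy sequences, k = 3, the number of topics, the hyperparameters alpha and beta, and the use of raw k-mer counts as LDAW feature values are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence (the 'document words')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method).

    docs is a list of token lists. Returns (vocab, doc_topic, topic_word);
    every row of doc_topic and topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    ndk = [[0] * n_topics for _ in docs]           # topic counts per document
    nkw = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                             # total tokens per topic
    z = []                                          # topic of every token
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, t in zip(doc, z[d]):
            ndk[d][t] += 1
            nkw[t][index[w]] += 1
            nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, t = index[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][t] -= 1
                nkw[t][wi] -= 1
                nk[t] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][wi] + beta)
                    / (nk[j] + n_words * beta)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1
                nkw[t][wi] += 1
                nk[t] += 1
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in ndk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(c + beta) / (nk[t] + n_words * beta) for c in nkw[t]]
        for t in range(n_topics)
    ]
    return vocab, doc_topic, topic_word


def ldaw_vocabulary(vocab, topic_word, top_n):
    """LDAW: union of the top_n most probable k-mers of every topic."""
    selected = set()
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        selected.update(vocab[i] for i in ranked[:top_n])
    return sorted(selected)


def ldaw_features(doc, selected):
    """Count how often each selected k-mer occurs in one document."""
    counts = Counter(doc)
    return [counts[w] for w in selected]


# Toy sequences, invented for illustration only.
sequences = ["ACGTACGTAGGT", "TTGACCGTAAGT", "ACGTTTGACAGG", "GGTAAGTCCGTA"]
docs = [kmers(s, 3) for s in sequences]
vocab, doc_topic, topic_word = fit_lda(docs, n_topics=2)

ldad = doc_topic                                    # LDAD: topic distribution
selected = ldaw_vocabulary(vocab, topic_word, top_n=5)
ldaw = [ldaw_features(doc, selected) for doc in docs]
```

In this sketch each row of `ldad` has one entry per topic, while each row of `ldaw` has one entry per selected k-mer. Either matrix could then be passed to any supervised classifier; the record does not state which classifier the thesis used.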

Files

Original bundle

Name: SurbhiMungre2011.pdf
Size: 4.14 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.61 KB
Format: Item-specific license agreed upon to submission

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -