# LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification

## K-REx Repository

- **Author:** Mungre, Surbhi
- **Date issued:** 2011-05-06
- **URI:** http://hdl.handle.net/2097/8846

### Abstract

Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms largely depends on the availability of labeled data and on the approach used to represent DNA sequences as *feature vectors*. For many organisms, labeled DNA data is scarce, while unlabeled data is easily available; for a small number of well-studied model organisms, however, large amounts of labeled data are available. This calls for *domain adaptation* approaches, which can transfer knowledge from a *source* domain, for which labeled data is available, to a *target* domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation is to extract and represent the features that the source domain and target domain sequences share. *Latent Dirichlet Allocation* (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences: LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a generative probabilistic model used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are the "document words". We use LDA to model our sequence collection.
Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic (i.e., the k-mers with the highest probability) as our features, while in the LDAD method, we use the topic distribution itself to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
- **Sponsorship:** National Science Foundation (NSF 0711396) and Arthropod Genomics Center
- **Publisher:** Kansas State University
- **Subjects:** Domain Adaptation; Splice Site Prediction; Latent Dirichlet Allocation; DNA Sequence Classification; Dimensionality Reduction; Computer Science (0984)
- **Type:** Thesis
- **Degree:** Master of Science
- **Department:** Department of Computing and Information Sciences
- **Advisor:** Doina Caragea
- **Graduated:** May 2011