Semi-supervised learning for biological sequence classification

Date

2015-08-01

Publisher

Kansas State University

Abstract

Advances in biochemical technologies have made it inexpensive and fast to produce massive volumes of DNA and protein sequence data. As a result, numerous computational methods for genome annotation have emerged, including machine learning and statistical analysis approaches that analyze and interpret these data efficiently. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data to build quality classifiers. Labeling data can be expensive and time-consuming, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches, which use unlabeled data in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on semi-supervised learning approaches for biological sequence classification. Although an attractive concept, semi-supervised learning does not invariably work as intended. Because the assumptions made by learning algorithms cannot be easily verified without considerable domain knowledge or data exploration, semi-supervised learning is not always "safe" to use. Whether unlabeled data helps is problem-dependent, and more research is needed to identify algorithms that increase the effectiveness of semi-supervised learning in general and for bioinformatics problems in particular. At a high level, we aim to identify semi-supervised algorithms and data representations that can be used to learn effective classifiers for genome annotation tasks such as cassette exon identification, splice site identification, and protein localization. In addition, one specific challenge that we address is the "data imbalance" problem, which is prevalent in many domains, including bioinformatics.
The data imbalance phenomenon arises when one of the classes to be predicted is underrepresented in the data because instances belonging to that class are rare (noteworthy cases) or difficult to obtain. Ironically, minority classes are typically the most important to learn, because they may be associated with special cases, as in splice site prediction. We propose two main techniques to deal with the data imbalance problem: one based on "dynamic balancing" (augmenting the originally labeled data only with positive instances during the semi-supervised iterations of the algorithms) and another based on ensemble approaches. The results show that, with limited amounts of labeled data, semi-supervised approaches can successfully leverage the unlabeled data and thereby surpass their fully supervised counterparts. One type of semi-supervised learning, known as "transductive" learning, aims to classify the unlabeled data without generalizing to new, previously unseen instances. In theory, this makes transductive learning particularly suitable for genome annotation, in which an entirely sequenced genome is typically available, sometimes accompanied by limited annotation. We study and evaluate various transductive approaches (such as transductive support vector machines and graph-based approaches) and sequence representations for the problem of cassette exon identification. The results obtained demonstrate the effectiveness of transductive algorithms in sequence annotation tasks.
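
The "dynamic balancing" idea in the abstract can be illustrated with a minimal self-training sketch. This is an illustration under stated assumptions, not the dissertation's implementation: the nearest-centroid base learner, the function names, the confidence threshold, and the synthetic two-dimensional features (standing in for real sequence representations) are all hypothetical choices made for brevity.

```python
import numpy as np


def fit_centroids(X, y):
    """Toy base learner (an assumption): one centroid per class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)


def positive_score(centroids, X):
    """Score in (0, 1); higher means closer to the positive centroid."""
    neg_c, pos_c = centroids
    d_neg = np.linalg.norm(X - neg_c, axis=1)
    d_pos = np.linalg.norm(X - pos_c, axis=1)
    return 1.0 / (1.0 + np.exp(d_pos - d_neg))


def self_train_dynamic_balancing(X_lab, y_lab, X_unlab,
                                 n_iter=5, per_iter=5, threshold=0.8):
    """Self-training that adds ONLY confident predicted positives each round."""
    X_lab, y_lab, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(n_iter):
        if len(pool) == 0:
            break
        model = fit_centroids(X_lab, y_lab)
        scores = positive_score(model, pool)
        # Take the highest-scoring candidates, then keep those above threshold.
        order = np.argsort(-scores)[:per_iter]
        chosen = order[scores[order] >= threshold]
        if len(chosen) == 0:
            break  # nothing confident enough to add
        # Grow only the minority (positive) class; negatives are never added.
        X_lab = np.vstack([X_lab, pool[chosen]])
        y_lab = np.concatenate([y_lab, np.ones(len(chosen), dtype=y_lab.dtype)])
        pool = np.delete(pool, chosen, axis=0)
    return fit_centroids(X_lab, y_lab), X_lab, y_lab


# Synthetic, imbalanced toy data (hypothetical stand-in for sequence features).
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 0.5, size=(80, 2))
pos = rng.normal(4.0, 0.5, size=(13, 2))
X_lab = np.vstack([neg[:20], pos[:3]])      # 20 negatives, 3 positives labeled
y_lab = np.array([0] * 20 + [1] * 3)
X_unlab = np.vstack([neg[20:], pos[3:]])    # the rest is treated as unlabeled

model, X_aug, y_aug = self_train_dynamic_balancing(X_lab, y_lab, X_unlab)
print("positives before:", int(y_lab.sum()), "after:", int(y_aug.sum()))
```

Because only predicted positives are added, the labeled set drifts toward balance as the iterations proceed, while the number of labeled negatives stays fixed. A real application would replace the toy learner with a stronger classifier and would need to guard against adding false positives.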

Keywords

Computer science, Artificial intelligence, Bioinformatics, Machine learning

Graduation Month

August

Degree

Doctor of Philosophy

Department

Department of Computing and Information Sciences

Major Professor

Doina Caragea

Date

2015

Type

Dissertation
