Unsupervised feature construction approaches for biological sequence classification

K-REx Repository

dc.contributor.author Tangirala, Karthik
dc.date.accessioned 2015-04-24T21:15:17Z
dc.date.available 2015-04-24T21:15:17Z
dc.date.issued 2015-04-24
dc.identifier.uri http://hdl.handle.net/2097/19123
dc.description.abstract Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data (DNA and protein sequences). Biological sequence data can be annotated using machine learning techniques, but most learning algorithms require data to be represented by a vector of features. In the absence of biologically informative features, k-mers generated using a sliding window-based approach are commonly used to represent biological sequences. A larger k value typically results in better features; however, the number of k-mer features is exponential in k, and many k-mers are not informative. Feature selection is widely used to reduce the dimensionality of the input feature space. Most feature selection techniques use feature-class dependency scores to rank the features. However, when the amount of available labeled data is small, feature selection techniques may not accurately capture feature-class dependency scores. Therefore, instead of working with all k-mers, this dissertation proposes the construction of a reduced set of informative k-mers that can be used to represent biological sequences. This work resulted in three novel unsupervised approaches to construct features:
1. Burrows Wheeler Transform-based approach, which uses the sorted permutations of a given sequence to construct sequential features (subsequences) that occur multiple times in a given sequence.
2. Community detection-based approach, which uses a community detection algorithm to group similar subsequences into communities and refines the communities to form motifs (groups of similar subsequences). Motifs obtained using the community detection-based approach satisfy the ZOMOPS constraint (Zero, One or Multiple Occurrences of a Motif Per Sequence). All possible unique subsequences of the obtained motifs are then used as features to represent the sequences.
3. Hybrid-based approach, which combines the Burrows Wheeler Transform-based approach and the community detection-based approach to allow certain mismatches in the features constructed using the Burrows Wheeler Transform-based approach.
To evaluate the predictive power of the features constructed using the proposed approaches, experiments were conducted in three learning scenarios (supervised, semi-supervised, and domain adaptation) for both nucleotide and protein sequence classification problems. The performance of classifiers learned using features generated with the proposed approaches was compared with the performance of the classifiers learned using k-mers (with feature selection) and feature hashing (another unsupervised dimensionality reduction technique). Experimental results from the three learning scenarios showed that features constructed with the proposed approaches were typically more informative than k-mers and feature hashing. en_US
dc.language.iso en_US en_US
dc.publisher Kansas State University en
dc.subject Machine learning en_US
dc.subject Bioinformatics en_US
dc.subject Feature construction en_US
dc.subject Biological sequence classification en_US
dc.subject Unsupervised approaches en_US
dc.title Unsupervised feature construction approaches for biological sequence classification en_US
dc.type Dissertation en_US
dc.description.degree Doctor of Philosophy en_US
dc.description.level Doctoral en_US
dc.description.department Department of Computing and Information Sciences en_US
dc.description.advisor Doina Caragea en_US
dc.subject.umi Bioinformatics (0715) en_US
dc.subject.umi Computer Science (0984) en_US
dc.date.published 2015 en_US
dc.date.graduationmonth May en_US
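
The baseline representation named in the abstract, k-mers generated with a sliding window, can be illustrated with a minimal Python sketch. This is not code from the dissertation: the function names `kmer_counts` and `kmer_vector`, the DNA alphabet "ACGT", and the example sequence are assumptions chosen for illustration. The sketch also shows why the feature space is exponential in k: an alphabet of size 4 yields 4^k possible k-mers.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count k-mers by sliding a window of width k one position at a time.

    Hypothetical helper for illustration; not the dissertation's code.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k, alphabet="ACGT"):
    """Return (vocabulary, counts) over all len(alphabet)**k possible k-mers.

    The alphabet is an assumption (DNA); a protein alphabet would have
    20 symbols and an even larger vocabulary for the same k.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = kmer_counts(sequence, k)
    return vocabulary, [counts.get(kmer, 0) for kmer in vocabulary]


# Example sequence is made up for illustration.
counts = kmer_counts("ACGTACGT", 3)
vocabulary, vector = kmer_vector("ACGTACGT", 3)
print(counts)           # k-mers that actually occur in the sequence
print(len(vocabulary))  # size of the full feature space for k = 3
```

Most entries of the resulting vector are zero even for small k, which is the sparsity and dimensionality problem that the abstract says feature selection, feature hashing, and the proposed feature construction approaches aim to address.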
