Unsupervised feature construction approaches for biological sequence classification

dc.contributor.author: Tangirala, Karthik
dc.date.accessioned: 2015-04-24T21:15:17Z
dc.date.available: 2015-04-24T21:15:17Z
dc.date.graduationmonth: May (en_US)
dc.date.issued: 2015-04-24
dc.date.published: 2015 (en_US)
dc.description.abstract: Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data (DNA and protein sequences). Biological sequence data can be annotated using machine learning techniques, but most learning algorithms require data to be represented by a vector of features. In the absence of biologically informative features, k-mers generated using a sliding window-based approach are commonly used to represent biological sequences. A larger k value typically results in better features; however, the number of k-mer features is exponential in k, and many k-mers are not informative. Feature selection is widely used to reduce the dimensionality of the input feature space. Most feature selection techniques use feature-class dependency scores to rank the features. However, when the amount of available labeled data is small, feature selection techniques may not accurately capture feature-class dependency scores. Therefore, instead of working with all k-mers, this dissertation proposes the construction of a reduced set of informative k-mers that can be used to represent biological sequences. This work resulted in three novel unsupervised approaches to construct features: (1) a Burrows-Wheeler Transform-based approach, which uses the sorted permutations of a given sequence to construct sequential features (subsequences) that occur multiple times in that sequence; (2) a community detection-based approach, which uses a community detection algorithm to group similar subsequences into communities and refines the communities to form motifs (groups of similar subsequences). Motifs obtained using the community detection-based approach satisfy the ZOMOPS constraint (Zero, One or Multiple Occurrences of a Motif Per Sequence). All possible unique subsequences of the obtained motifs are then used as features to represent the sequences; (3) a hybrid approach, which combines the Burrows-Wheeler Transform-based approach and the community detection-based approach to allow certain mismatches in the features constructed using the Burrows-Wheeler Transform-based approach. To evaluate the predictive power of the features constructed using the proposed approaches, experiments were conducted in three learning scenarios (supervised, semi-supervised, and domain adaptation) for both nucleotide and protein sequence classification problems. The performance of classifiers learned using features generated with the proposed approaches was compared with the performance of classifiers learned using k-mers (with feature selection) and feature hashing (another unsupervised dimensionality reduction technique). Experimental results from the three learning scenarios showed that features constructed with the proposed approaches were typically more informative than k-mers and feature hashing. (en_US)
dc.description.advisor: Doina Caragea (en_US)
dc.description.degree: Doctor of Philosophy (en_US)
dc.description.department: Department of Computing and Information Sciences (en_US)
dc.description.level: Doctoral (en_US)
dc.identifier.uri: http://hdl.handle.net/2097/19123
dc.language.iso: en_US (en_US)
dc.publisher: Kansas State University (en)
dc.subject: Machine learning (en_US)
dc.subject: Bioinformatics (en_US)
dc.subject: Feature construction (en_US)
dc.subject: Biological sequence classification (en_US)
dc.subject: Unsupervised approaches (en_US)
dc.subject.umi: Bioinformatics (0715) (en_US)
dc.subject.umi: Computer Science (0984) (en_US)
dc.title: Unsupervised feature construction approaches for biological sequence classification (en_US)
dc.type: Dissertation (en_US)
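
Illustrative note on the k-mer baseline named in the abstract: the sketch below shows one common way to build sliding-window k-mer count features, and why the feature space grows exponentially in k (4^k possible k-mers over a DNA alphabet). It is not taken from the dissertation; the function names kmer_counts and kmer_vector, the DNA alphabet "ACGT", and the example sequence are assumptions chosen for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count each length-k substring seen by a window that slides one position at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence: str, k: int, alphabet: str = "ACGT") -> list:
    """Map a sequence to a fixed-length count vector over every possible k-mer.

    The vocabulary has len(alphabet) ** k entries, which is the exponential
    growth in k that the abstract refers to.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = kmer_counts(sequence, k)
    return [counts[kmer] for kmer in vocabulary]


# Hypothetical example sequence, not data from the dissertation.
example = "ACGTACGA"
counts = kmer_counts(example, 3)
vector = kmer_vector(example, 3)
```

For this 8-character example with k = 3 the window visits 6 positions, yet the vector has 4^3 = 64 entries, most of them zero. That sparsity is what motivates reducing the feature set, whether by feature selection, feature hashing, or constructed features such as those the dissertation proposes.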

Files

Original bundle
Name: KarthikTangirala2015.pdf
Size: 4 MB
Format: Adobe Portable Document Format
Description: Dissertation - Karthik Tangirala