Unsupervised feature construction approaches for biological sequence classification
dc.contributor.author | Tangirala, Karthik | |
dc.date.accessioned | 2015-04-24T21:15:17Z | |
dc.date.available | 2015-04-24T21:15:17Z | |
dc.date.graduationmonth | May | |
dc.date.issued | 2015-04-24 | |
dc.description.abstract | Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data (DNA and protein sequences). Biological sequence data can be annotated using machine learning techniques, but most learning algorithms require data to be represented by a vector of features. In the absence of biologically informative features, k-mers generated using a sliding window-based approach are commonly used to represent biological sequences. A larger k value typically results in better features; however, the number of k-mer features is exponential in k, and many k-mers are not informative. Feature selection is widely used to reduce the dimensionality of the input feature space. Most feature selection techniques use feature-class dependency scores to rank the features. However, when the amount of available labeled data is small, feature selection techniques may not accurately capture feature-class dependency scores. Therefore, instead of working with all k-mers, this dissertation proposes the construction of a reduced set of informative k-mers that can be used to represent biological sequences. This work resulted in three novel unsupervised approaches to construct features: 1. A Burrows-Wheeler Transform-based approach, which uses the sorted permutations of a given sequence to construct sequential features (subsequences) that occur multiple times in that sequence. 2. A community detection-based approach, which uses a community detection algorithm to group similar subsequences into communities and refines the communities to form motifs (groups of similar subsequences). Motifs obtained using the community detection-based approach satisfy the ZOMOPS constraint (Zero, One or Multiple Occurrences of a Motif Per Sequence). All possible unique subsequences of the obtained motifs are then used as features to represent the sequences. 3. A hybrid approach, which combines the Burrows-Wheeler Transform-based approach and the community detection-based approach to allow certain mismatches in the features constructed using the Burrows-Wheeler Transform-based approach. To evaluate the predictive power of the features constructed using the proposed approaches, experiments were conducted in three learning scenarios (supervised, semi-supervised, and domain adaptation) for both nucleotide and protein sequence classification problems. The performance of classifiers learned using features generated with the proposed approaches was compared with the performance of classifiers learned using k-mers (with feature selection) and feature hashing (another unsupervised dimensionality reduction technique). Experimental results from the three learning scenarios showed that features constructed with the proposed approaches were typically more informative than k-mers and feature hashing. | |
dc.description.advisor | Doina Caragea | |
dc.description.degree | Doctor of Philosophy | |
dc.description.department | Department of Computing and Information Sciences | |
dc.description.level | Doctoral | |
dc.identifier.uri | http://hdl.handle.net/2097/19123 | |
dc.language.iso | en_US | |
dc.publisher | Kansas State University | |
dc.rights | © the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). | |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | |
dc.subject | Machine learning | |
dc.subject | Bioinformatics | |
dc.subject | Feature construction | |
dc.subject | Biological sequence classification | |
dc.subject | Unsupervised approaches | |
dc.subject.umi | Bioinformatics (0715) | |
dc.subject.umi | Computer Science (0984) | |
dc.title | Unsupervised feature construction approaches for biological sequence classification | |
dc.type | Dissertation |
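The abstract refers to sliding-window k-mers, feature hashing, and the sorted rotations behind the Burrows-Wheeler Transform. The following is a minimal illustrative sketch of those three building blocks only, not the dissertation's feature construction methods. All function names, the choice of CRC32 as the hash, and the `$` terminator are assumptions made for this example.

```python
from collections import Counter
import zlib


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every length-k substring seen by a sliding window with step 1."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def hashed_features(counts: Counter, n_buckets: int) -> list:
    """Fold k-mer counts into a fixed-length vector (plain feature hashing)."""
    vector = [0] * n_buckets
    for kmer, count in counts.items():
        # Assumption: CRC32 as the hash; it is deterministic across runs,
        # unlike Python's built-in hash() on strings.
        vector[zlib.crc32(kmer.encode("ascii")) % n_buckets] += count
    return vector


def bwt_rotations(sequence: str, terminator: str = "$") -> list:
    """Return the sorted cyclic rotations of sequence + terminator."""
    text = sequence + terminator
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def bwt(sequence: str) -> str:
    """Burrows-Wheeler Transform: last column of the sorted rotations."""
    return "".join(row[-1] for row in bwt_rotations(sequence))


if __name__ == "__main__":
    dna = "ACGTACGT"
    counts = kmer_counts(dna, k=3)
    print(counts)                      # observed 3-mers and their counts
    print(4 ** 3)                      # size of the full 3-mer space for DNA
    print(hashed_features(counts, 8))  # same counts folded into 8 buckets
    print(bwt("banana"))               # annb$aa
```

The full k-mer space for DNA has 4^k entries, which is why the abstract calls the feature count exponential in k. Feature hashing caps the vector length at `n_buckets` at the cost of collisions between unrelated k-mers.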
Files
Original bundle
- Name: KarthikTangirala2015.pdf
- Size: 4 MB
- Format: Adobe Portable Document Format
- Description: Dissertation - Karthik Tangirala
License bundle
- Name: license.txt
- Size: 1.62 KB
- Format: Item-specific license agreed upon to submission