Cross-domain sentiment classification using grams derived from syntax trees and an adapted naive Bayes approach

dc.contributor.author: Cheeti, Srilaxmi
dc.date.accessioned: 2012-04-27T19:51:16Z
dc.date.available: 2012-04-27T19:51:16Z
dc.date.graduationmonth: May
dc.date.issued: 2012-04-27
dc.date.published: 2012
dc.description.abstract: There is an increasing amount of user-generated information in online documents, including user opinions on various topics and products such as movies, DVDs, kitchen appliances, etc. To make use of such opinions, it is useful to identify the polarity of the opinion, in other words, to perform sentiment classification. The goal of sentiment classification is to classify a given text/document as positive, negative, or neutral based on the words present in the document. Supervised learning approaches have been successfully used for sentiment classification in domains that are rich in labeled data. Some of these approaches make use of features such as unigrams, bigrams, sentiment words, adjective words, syntax trees (or variations of trees obtained using pruning strategies), etc. However, for some domains the amount of labeled data can be relatively small and we cannot train an accurate classifier using the supervised learning approach. Therefore, it is useful to study domain adaptation techniques that can transfer knowledge from a source domain that has labeled data to a target domain that has little or no labeled data, but a large amount of unlabeled data. We address this problem in the context of product reviews, specifically reviews of movies, DVDs and kitchen appliances. Our approach uses an Adapted Naive Bayes classifier (ANB) on top of the Expectation Maximization (EM) algorithm to predict the sentiment of a sentence. We use grams derived from complete syntax trees or from syntax subtrees as features when training the ANB classifier. More precisely, we extract grams from syntax trees corresponding to sentences in either the source or target domains. To be able to transfer knowledge from source to target, we identify generalized features (grams) using the frequently co-occurring entropy (FCE) method, and represent the source instances using these generalized features. The target instances are represented with all grams occurring in the target, or with a reduced gram set obtained by removing infrequent grams. We experiment with different types of grams in a supervised framework in order to identify the most predictive types of grams, and further use those grams in the domain adaptation framework. Experimental results on several cross-domain tasks show that domain adaptation approaches that combine source and target data (a small amount of labeled and some unlabeled data) can help learn classifiers for the target that are better than those learned from the labeled target data alone. (An illustrative sketch of the gram-extraction and EM-based training steps appears below this record.)
dc.description.advisor: Doina Caragea
dc.description.degree: Master of Science
dc.description.department: Department of Computing and Information Sciences
dc.description.level: Masters
dc.identifier.uri: http://hdl.handle.net/2097/13733
dc.language.iso: en_US
dc.publisher: Kansas State University
dc.subject: Adapted naive Bayes algorithm
dc.subject: Cross-domain sentiment classification
dc.subject: Grams
dc.subject: Domain adaptation
dc.subject: Syntax subtrees
dc.subject.umi: Computer Engineering (0464)
dc.subject.umi: Computer Science (0984)
dc.subject.umi: Information Science (0723)
dc.title: Cross-domain sentiment classification using grams derived from syntax trees and an adapted naive Bayes approach
dc.type: Thesis
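
The abstract above describes a three-part pipeline: extracting grams from syntax trees or subtrees, selecting generalized features with the frequently co-occurring entropy (FCE) method, and training an Adapted Naive Bayes (ANB) classifier with Expectation Maximization over labeled source data and mostly unlabeled target data. The Python sketch below is only a minimal illustration of the first and last of these steps under simplifying assumptions; it is not the thesis implementation. The gram definition, the EM weighting scheme, the helper names (tree_grams, em_adapted_nb), and the toy parses are all hypothetical, and the FCE feature-selection step is omitted.

import numpy as np
from nltk import Tree
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def tree_grams(parse_str, max_height=3):
    """One possible gram definition: flatten every shallow subtree of a bracketed
    parse into a single token made of its root label and its leaf words."""
    tree = Tree.fromstring(parse_str)
    grams = []
    for sub in tree.subtrees():
        if sub.height() <= max_height:          # keep only shallow subtrees
            grams.append(sub.label() + "_" + "_".join(sub.leaves()))
    return " ".join(grams)


def em_adapted_nb(Xs, ys, Xt, n_iters=5, source_weight=0.3):
    """EM-style semi-supervised naive Bayes: initialize on the labeled source,
    then repeatedly retrain on source plus soft-labeled target documents."""
    classes = np.unique(ys)
    clf = MultinomialNB()
    clf.fit(Xs, ys)                             # source-only starting point
    for _ in range(n_iters):
        post = clf.predict_proba(Xt)            # E-step: posteriors for target docs
        # M-step: source examples are down-weighted; each target document is
        # replicated once per class and weighted by its posterior probability.
        X_all = vstack([Xs] + [Xt for _ in classes])
        y_all = np.concatenate([ys] + [np.full(Xt.shape[0], c) for c in classes])
        w_all = np.concatenate([np.full(Xs.shape[0], source_weight)]
                               + [post[:, i] for i in range(len(classes))])
        clf = MultinomialNB()
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf


# Toy usage with hypothetical bracketed parses (e.g., source = DVD reviews,
# target = kitchen-appliance reviews); labels: 1 = positive, 0 = negative.
source_parses = [
    "(S (NP (DT the) (NN movie)) (VP (VBD was) (ADJP (JJ great))))",
    "(S (NP (DT the) (NN plot)) (VP (VBD was) (ADJP (JJ boring))))",
]
source_labels = np.array([1, 0])
target_parses = ["(S (NP (DT the) (NN blender)) (VP (VBZ works) (ADVP (RB well))))"]

vec = CountVectorizer(token_pattern=r"\S+")     # grams are whitespace-separated tokens
Xs = vec.fit_transform(tree_grams(p) for p in source_parses)
Xt = vec.transform(tree_grams(p) for p in target_parses)
model = em_adapted_nb(Xs, source_labels, Xt)
print(model.predict(Xt))                        # predicted sentiment for the target doc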

Files

Original bundle
Name: SrilaxmiCheeti2012.pdf
Size: 331.59 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission