Disaster tweet classification using parts-of-speech tags: a domain adaptation approach

Date

2016-12-01

Journal Title

Journal ISSN

Volume Title

Publisher

Kansas State University

Abstract

Twitter is one of the most active social media sites today. Almost everyone is using it, as it is a medium by which people stay in touch and inform others about events in their lives. Among many other types of events, people tweet about disaster events. Both man made and natural disasters, unfortunately, occur all the time. When these tragedies transpire, people tend to cope in their own ways. One of the most popular ways people convey their feelings towards disaster events is by offering or asking for support, providing valuable information about the disaster, and voicing their disapproval towards those who may be the cause. However, not all of the tweets posted during a disaster are guaranteed to be useful or informative to authorities nor to the general public. As the number of tweets that are posted during a disaster can reach the hundred thousands range, it is necessary to automatically distinguish tweets that provide useful information from those that don't. Manual annotation cannot scale up to the large number of tweets, as it takes significant time and effort, which makes it unsuitable for real-time disaster tweet annotation. Alternatively, supervised machine learning has been traditionally used to learn classifiers that can quickly annotate new unseen tweets. But supervised machine learning algorithms make use of labeled training data from the disaster of interest, which is presumably not available for a current target disaster. However, it is reasonable to assume that some amount of labeled data is available for a prior source disaster. Therefore, domain adaptation algorithms that make use of labeled data from a source disaster to learn classifiers for the target disaster provide a promising direction in the area of tweet classification for disaster management. In prior work, domain adaptation algorithms have been trained based on tweets represented as bag-of-words. In this research, I studied the effect of Part of Speech (POS) tag unigrams and bigrams on the performance of the domain adaptation classifiers. Specifically, I used POS tag unigram and bigram features in conjunction with a Naive Bayes Domain Adaptation algorithm to learn classifiers from source labeled data together with target unlabeled data, and subsequently used the resulting classifiers to classify target disaster tweets. The main research question addressed through this work was if the POS tags can help improve the performance of the classifiers learned from tweet bag-of-words representations only. Experimental results have shown that the POS tags can improve the performance of the classifiers learned from words only, but not always. Furthermore, the results of the experiments show that POS tag bigrams contain more information as compared to POS tag unigrams, as the classifiers learned from bigrams have better performance than those learned from unigrams.

Description

Keywords

Domain Adaptation, Text classification, Tweet, Disaster management, Part of speech, Naive Bayes

Graduation Month

December

Degree

Master of Science

Department

Department of Computer Science

Major Professor

Doina Caragea

Date

2016

Type

Thesis

Citation