Domain adaptation approaches for classifying social media crisis data

Date

2021-05-01

Abstract

Social media platforms such as Twitter provide valuable information for aiding first response during emergency events. Machine learning can be used to build automatic tools for filtering and categorizing useful information from the flood of posts published by eyewitnesses during a disaster. However, supervised learning algorithms rely on labeled data, which is not readily available for an emerging target disaster. While labeled data might be available for a prior source disaster (or a set of prior source disasters), supervised classifiers learned only from the source disaster(s) may not perform well on the target disaster, as each event has unique characteristics (e.g., type, location, culture) and may elicit different social media responses. Therefore, domain adaptation approaches, which address this limitation by learning classifiers from unlabeled target data in addition to labeled source data, represent a promising direction for social media crisis data classification tasks. This thesis focuses on disaster tweet classification tasks, including classification of tweets as relevant or not relevant to a disaster, and classification of tweets as informative or not informative to disaster response teams. In the single-source setting, we propose several domain adaptation approaches for such tasks. More precisely, we first propose approaches based on Expectation Maximization and Self-training, performed on top of supervised Naive Bayes classifiers, to classify tweets into categories of interest. We also employ a feature adaptation method (called Correlation Alignment) and combine it with Self-training to train weighted Naive Bayes classifiers. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers outperform the supervised baselines learned only from labeled source data.
In addition to the single-source setting, we also consider a multi-source setting, where several source disasters are used to transfer knowledge to a target disaster. Under the multi-source domain adaptation setting, we evaluate how different representations based on pre-trained word embeddings and sentence encoding models perform when used with supervised classifiers. The word embeddings are pre-trained on very large unlabeled corpora, and can thus capture semantic information (e.g., similar words are close in the embedding space). We use the pre-trained word embeddings and sentence encoding models to design simple but effective representation-based adaptation approaches for disaster tweet classification. We further apply the Self-training approach on top of these models, and obtain domain adaptation models that are shown experimentally to perform better than the supervised models on the task of identifying relevant versus irrelevant tweets. We also train crisis-specific word embeddings on our own crisis tweet corpora. The resulting embeddings can be used for a variety of crisis tweet classification tasks. Finally, we design domain adaptation models on top of state-of-the-art pre-trained language models (e.g., BERT) for social media crisis data classification, and show the effectiveness of such models for disaster tweet classification. This thesis contributes to crisis informatics research by introducing domain adaptation approaches for social media crisis data classification. The proposed approaches have the potential to be used in practice and to help with the information overload problem that disaster response teams currently face.
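
The single-source idea of running Self-training on top of a Naive Bayes classifier can be illustrated with a minimal sketch. This is not the thesis implementation: the function names (`train_nb`, `predict_proba`, `self_train`), the confidence threshold, the iteration limit, and the toy tweets are all assumptions made for illustration. The sketch trains a multinomial Naive Bayes model on labeled source tweets, pseudo-labels the unlabeled target tweets it is confident about, and retrains on the union.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents (Laplace smoothing)."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    class_counts = Counter(labels)
    vocab = set()
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
        vocab.update(doc)
    return {
        "classes": classes,
        "vocab": vocab,
        "alpha": alpha,
        "word_counts": word_counts,
        "log_prior": {c: math.log(class_counts[c] / len(labels)) for c in classes},
        # Smoothed denominator: total tokens in the class plus alpha per vocabulary word.
        "denom": {c: sum(word_counts[c].values()) + alpha * len(vocab) for c in classes},
    }


def predict_proba(model, doc):
    """Return class posteriors for one document; out-of-vocabulary tokens are ignored."""
    scores = {}
    for c in model["classes"]:
        s = model["log_prior"][c]
        for tok in doc:
            if tok in model["vocab"]:
                s += math.log((model["word_counts"][c][tok] + model["alpha"]) / model["denom"][c])
        scores[c] = s
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def self_train(source_docs, source_labels, target_docs, threshold=0.75, max_iter=5):
    """Hypothetical self-training loop: add confident target predictions as pseudo-labels."""
    model = train_nb(source_docs, source_labels)
    pseudo = {}  # target index -> pseudo-label
    for _ in range(max_iter):
        added = False
        for i, doc in enumerate(target_docs):
            if i in pseudo:
                continue
            probs = predict_proba(model, doc)
            label, p = max(probs.items(), key=lambda kv: kv[1])
            if p >= threshold:
                pseudo[i] = label
                added = True
        if not added:
            break  # no new confident predictions, so stop early
        docs = source_docs + [target_docs[i] for i in pseudo]
        labels = source_labels + [pseudo[i] for i in pseudo]
        model = train_nb(docs, labels)
    return model, pseudo


# Toy data (invented for illustration): a flood as source, an earthquake as target.
source_docs = [["flood", "water", "rescue"], ["flood", "damage", "help"],
               ["concert", "tickets", "fun"], ["movie", "night", "fun"]]
source_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
target_docs = [["earthquake", "rescue", "help"], ["earthquake", "damage"], ["fun", "tickets"]]

model, pseudo = self_train(source_docs, source_labels, target_docs)
print(pseudo)
```

In this toy run the word "earthquake" is unknown to the source-only model, so the second target tweet falls below the threshold at first. Once the first target tweet has been pseudo-labeled and the model retrained, "earthquake" carries evidence for the relevant class and the second tweet is picked up in the next iteration, which is the kind of target-specific signal that source-only training cannot provide.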

Keywords

Domain adaptation, Transfer learning, Text classification, Social media, Crisis tweets classification

Graduation Month

May

Degree

Doctor of Philosophy

Department

Department of Computer Science

Major Professor

Doina Caragea

Date

2021

Type

Dissertation
