Domain adaptation approaches for classifying social media crisis data

Li, Hongmin

dc.contributor.author	Li, Hongmin
dc.date.accessioned	2020-12-03T22:52:03Z
dc.date.available	2020-12-03T22:52:03Z
dc.date.graduationmonth	May
dc.date.issued	2021-05-01
dc.description.abstract	Social media platforms such as Twitter provide valuable information for aiding first response during emergency events. Machine learning could be used to build automatic tools that filter and categorize useful information from the flood of posts made by eyewitnesses during a disaster. However, supervised learning algorithms rely on labeled data, which is not readily available for an emerging target disaster. While labeled data might be available for a prior source disaster (or a set of prior source disasters), supervised classifiers learned only from the source disaster(s) may not perform well on the target disaster, as each event has unique characteristics (e.g., type, location, culture) and may elicit different social media responses. Domain adaptation approaches, which address this limitation by learning classifiers from unlabeled target data in addition to labeled source data, therefore represent a promising direction for social media crisis data classification. This thesis focuses on disaster tweet classification tasks, including classifying tweets as relevant or not relevant to a disaster, and classifying tweets as informative or not informative to disaster response teams. In the single-source setting, we propose several domain adaptation approaches for such tasks. More precisely, we first propose approaches based on Expectation Maximization and Self-training, performed on top of supervised Naive Bayes classifiers, to classify tweets into categories of interest. We also employ a feature adaptation method (Correlation Alignment) and combine it with Self-training to train weighted Naive Bayes classifiers. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers outperform the supervised baselines learned only from labeled source data.
In addition to the single-source setting, we also consider a multi-source setting, where several source disasters are used to transfer knowledge to a target disaster. Under the multi-source domain adaptation setting, we evaluate how different representations based on pre-trained word embeddings and sentence encoding models perform when used with supervised classifiers. The word embeddings are pre-trained on very large unlabeled corpora and can thus capture semantic information (e.g., similar words are close in the embedding space). We use the pre-trained word embeddings and sentence encoding models to design simple but effective representation-based adaptation approaches for disaster tweet classification. We further apply the Self-training approach on top of these models and obtain domain adaptation models that are shown experimentally to perform better than the supervised models on the task of identifying relevant versus irrelevant tweets. We also train crisis-specific word embeddings on our own crisis tweet corpora. The resulting embeddings can be used for a variety of crisis tweet classification tasks. Finally, we design domain adaptation models on top of state-of-the-art pre-trained language models (e.g., BERT) for social media crisis data classification, and show the effectiveness of such models for disaster tweet classification. This thesis contributes to crisis informatics research by introducing domain adaptation approaches for social media crisis data classification. The proposed approaches have the potential to be used in practice and to help with the information overload problem that disaster response teams currently face.
dc.description.advisor	Doina Caragea
dc.description.degree	Doctor of Philosophy
dc.description.department	Department of Computer Science
dc.description.level	Doctoral
dc.description.sponsorship	National Science Foundation
dc.identifier.uri	https://hdl.handle.net/2097/40987
dc.language.iso	en_US
dc.publisher	Kansas State University
dc.rights	© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Domain adaptation
dc.subject	Transfer learning
dc.subject	Text classification
dc.subject	Social media
dc.subject	Crisis tweets classification
dc.title	Domain adaptation approaches for classifying social media crisis data
dc.type	Dissertation

Files

Original bundle

Name: HongminLi2021.pdf
Size: 811.55 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -