Domain adaptation approaches for classifying social media crisis data

dc.contributor.author: Li, Hongmin
dc.date.accessioned: 2020-12-03T22:52:03Z
dc.date.available: 2020-12-03T22:52:03Z
dc.date.graduationmonth: May
dc.date.issued: 2021-05-01
dc.date.published: 2021
dc.description.abstract: Social media platforms such as Twitter provide valuable information for aiding first response during emergency events. Machine learning can be used to build automatic tools for filtering and categorizing useful information from the flood of posts made by eyewitnesses during a disaster. However, supervised learning algorithms rely on labeled data, which is not readily available for an emerging target disaster. While labeled data might be available for a prior source disaster (or a set of prior source disasters), supervised classifiers learned only from the source disaster(s) may not perform well on the target disaster, as each event has unique characteristics (e.g., type, location, culture) and may elicit different social media responses. Domain adaptation approaches, which address this limitation by learning classifiers from unlabeled target data in addition to labeled source data, therefore represent a promising direction for social media crisis data classification tasks. This thesis focuses on disaster tweet classification tasks, including classification of tweets as relevant or not relevant to a disaster, and classification of tweets as informative or not informative to disaster response teams. In the single-source setting, we propose several domain adaptation approaches for such tasks. More precisely, we first propose approaches based on Expectation Maximization and Self-training, performed on top of supervised Naive Bayes classifiers, to classify tweets into categories of interest. We also employ a feature adaptation method (called Correlation Alignment) and combine it with Self-training to train weighted Naive Bayes classifiers. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers outperform the supervised baselines learned only from labeled source data.
In addition to the single-source setting, we also consider a multi-source setting, where several source disasters are used to transfer knowledge to a target disaster. Under the multi-source domain adaptation setting, we evaluate how different representations based on pre-trained word embeddings and sentence encoding models perform when used with supervised classifiers. The word embeddings are pre-trained on very large unlabeled corpora, and can thus capture semantic information (e.g., similar words are close in the embedding space). We use the pre-trained word embeddings and sentence encoding models to design simple but effective representation-based adaptation approaches for disaster tweet classification. We further apply the Self-training approach on top of these models, and obtain domain adaptation models that are shown experimentally to perform better than the supervised models on the task of identifying relevant versus irrelevant tweets. We also train crisis-specific word embeddings on our own crisis tweet corpora. The resulting embeddings can be used for a variety of crisis tweet classification tasks. Finally, we design domain adaptation models on top of state-of-the-art pre-trained language models (e.g., BERT) for social media crisis data classification, and show the effectiveness of such models for disaster tweet classification. This thesis contributes to crisis informatics research by introducing domain adaptation approaches for social media crisis data classification. The proposed approaches have the potential to be used in practice and to help with the information overload problem that disaster response teams currently face.
dc.description.advisor: Doina Caragea
dc.description.degree: Doctor of Philosophy
dc.description.department: Department of Computer Science
dc.description.level: Doctoral
dc.description.sponsorship: National Science Foundation
dc.identifier.uri: https://hdl.handle.net/2097/40987
dc.language.iso: en_US
dc.subject: Domain adaptation
dc.subject: Transfer learning
dc.subject: Text classification
dc.subject: Social media
dc.subject: Crisis tweets classification
dc.title: Domain adaptation approaches for classifying social media crisis data
dc.type: Dissertation
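
The abstract describes Self-training on top of Naive Bayes: a classifier learned from labeled source tweets pseudo-labels the unlabeled target tweets it is confident about, and is then retrained on the enlarged set. The following is a minimal sketch of that general idea only, not the thesis's implementation. The function names (`train_nb`, `predict_proba`, `predict`, `self_train`), the confidence threshold, and the toy tweets are all assumptions made for illustration, and the instance weighting mentioned in the abstract is not modeled.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    vocab = {w for d in docs for w in d.split()}
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.split())
    class_counts = Counter(labels)
    model = {"vocab": vocab, "classes": classes, "log_prior": {}, "log_lik": {}}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(class_counts[c] / len(labels))
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model

def predict_proba(model, doc):
    """Return {class: posterior probability}; out-of-vocabulary words are skipped."""
    scores = {}
    for c in model["classes"]:
        s = model["log_prior"][c]
        for w in doc.split():
            if w in model["vocab"]:
                s += model["log_lik"][c][w]
        scores[c] = s
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def predict(model, doc):
    """Return the most probable class for one document."""
    probs = predict_proba(model, doc)
    return max(probs, key=probs.get)

def self_train(src_docs, src_labels, tgt_docs, threshold=0.75, max_iter=5):
    """Grow the training set with confidently pseudo-labeled target documents."""
    docs, labels = list(src_docs), list(src_labels)
    pool = list(tgt_docs)
    model = train_nb(docs, labels)
    for _ in range(max_iter):
        confident, remaining = [], []
        for doc in pool:
            probs = predict_proba(model, doc)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((doc, label))
            else:
                remaining.append(doc)
        if not confident:
            break  # nothing new to learn from
        for doc, label in confident:
            docs.append(doc)
            labels.append(label)
        pool = remaining
        model = train_nb(docs, labels)  # retrain on source + pseudo-labeled target
    return model

# Toy data (invented): labeled source disaster, unlabeled target disaster.
source_docs = [
    "hurricane flooding homes destroyed",
    "flooding roads closed rescue needed",
    "great pizza tonight",
    "watching football tonight",
]
source_labels = ["relevant", "relevant", "not_relevant", "not_relevant"]
target_docs = [
    "earthquake homes destroyed rescue needed",
    "earthquake rescue teams arrive",
    "football pizza party tonight",
]

source_only = train_nb(source_docs, source_labels)
adapted = self_train(source_docs, source_labels, target_docs)
```

In this toy setup the source-only model has never seen the word "earthquake", so it has no evidence either way for a tweet such as "earthquake teams arrive", whereas the self-trained model picks up target-specific vocabulary from its own pseudo-labels. Real disaster data is far noisier, and the threshold and stopping rule would need tuning.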

Files

Original bundle
Name: HongminLi2021.pdf
Size: 811.55 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission