Learning representations for information mining from text corpora with applications to cyber threat intelligence

dc.contributor.authorBose, Avishek
dc.date.accessioned2023-01-06T19:20:41Z
dc.date.available2023-01-06T19:20:41Z
dc.date.graduationmonthMayen_US
dc.date.published2023en_US
dc.description.abstractThis research develops learning representations and architectures for natural language understanding, within an information mining framework for analysis of open-source cyber threat intelligence (CTI). Both contextual (sequential) and topological (graph-based) encodings of short text documents are modeled. To accomplish this goal, a series of machine learning tasks are defined, and learning representations are developed to detect crucial information in these documents: cyber threat entities, types, and events. Using hybrid transformer-based implementations of these learning models, CTI-relevant key phrases are identified, and specific cyber threats are classified using classification models based upon graph neural networks (GNNs). The central scientific goal here is to learn features from corpora consisting of short texts for multiple document categorization and information extraction sub-tasks to improve the accuracy, precision, recall, and F1 score of a multimodal framework. To address a performance gap (e.g., classification accuracy) for text classification, a novel multi-dimensional Feature Attended Parametric Kernel Graph Neural Network (APKGNN) layer is introduced to construct a GNN model in this dissertation where the text classification task is transformed into a graph node classification task. To extract key phrases, contextual semantic tagging with text sequences as input to transformers is used which improves a transformer's learning representation. By deriving a set of characteristics ranging from low-level (lexical) natural language features to summative extracts, this research focuses on reducing human effort by adopting a combination of semi-supervised approaches for learning syntactic, semantic, and topological feature representation. The following central research questions are addressed: can CTI-relevant key phrases be identified effectively with reduced human effort; whether threats be classified into different types; and can threat events be detected and ranked from social media like Twitter data and other benchmark data sets. Developing an integrated system to answer these research questions showed that user-specific information in shared social media content, and connections (followers and followees) are effective and crucial for algorithmically tracing active CTI user accounts from open-source social network data. All these components, used in combination, facilitate the understanding of key analytical tasks and objectives of open-source cyber-threat intelligence.en_US
dc.description.advisorWilliam H Hsuen_US
dc.description.degreeDoctor of Philosophyen_US
dc.description.departmentDepartment of Computer Scienceen_US
dc.description.levelDoctoralen_US
dc.identifier.urihttps://hdl.handle.net/2097/42898
dc.language.isoen_USen_US
dc.subjectCyber threat intelligenceen_US
dc.subjectGraph neural networksen_US
dc.subjectLearning representationsen_US
dc.subjectTransformer modelsen_US
dc.subjectEvent detection and rankingen_US
dc.titleLearning representations for information mining from text corpora with applications to cyber threat intelligenceen_US
dc.typeDissertationen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
AvishekBose2023.pdf
Size:
5.43 MB
Format:
Adobe Portable Document Format
Description:
Main article
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.62 KB
Format:
Item-specific license agreed upon to submission
Description: