Learning representations for information mining from text corpora with applications to cyber threat intelligence

Bose, Avishek

Learning representations for information mining from text corpora with applications to cyber threat intelligence

dc.contributor.author	Bose, Avishek
dc.date.accessioned	2023-01-06T19:20:41Z
dc.date.available	2023-01-06T19:20:41Z
dc.date.graduationmonth	May
dc.date.issued	2023
dc.description.abstract	This research develops learning representations and architectures for natural language understanding, within an information mining framework for analysis of open-source cyber threat intelligence (CTI). Both contextual (sequential) and topological (graph-based) encodings of short text documents are modeled. To accomplish this goal, a series of machine learning tasks are defined, and learning representations are developed to detect crucial information in these documents: cyber threat entities, types, and events. Using hybrid transformer-based implementations of these learning models, CTI-relevant key phrases are identified, and specific cyber threats are classified using classification models based upon graph neural networks (GNNs). The central scientific goal here is to learn features from corpora consisting of short texts for multiple document categorization and information extraction sub-tasks to improve the accuracy, precision, recall, and F1 score of a multimodal framework. To address a performance gap (e.g., classification accuracy) for text classification, a novel multi-dimensional Feature Attended Parametric Kernel Graph Neural Network (APKGNN) layer is introduced to construct a GNN model in this dissertation where the text classification task is transformed into a graph node classification task. To extract key phrases, contextual semantic tagging with text sequences as input to transformers is used which improves a transformer's learning representation. By deriving a set of characteristics ranging from low-level (lexical) natural language features to summative extracts, this research focuses on reducing human effort by adopting a combination of semi-supervised approaches for learning syntactic, semantic, and topological feature representation. The following central research questions are addressed: can CTI-relevant key phrases be identified effectively with reduced human effort; whether threats be classified into different types; and can threat events be detected and ranked from social media like Twitter data and other benchmark data sets. Developing an integrated system to answer these research questions showed that user-specific information in shared social media content, and connections (followers and followees) are effective and crucial for algorithmically tracing active CTI user accounts from open-source social network data. All these components, used in combination, facilitate the understanding of key analytical tasks and objectives of open-source cyber-threat intelligence.
dc.description.advisor	William H. Hsu
dc.description.degree	Doctor of Philosophy
dc.description.department	Department of Computer Science
dc.description.level	Doctoral
dc.identifier.uri	https://hdl.handle.net/2097/42898
dc.language.iso	en_US
dc.publisher	Kansas State University
dc.rights	© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Cyber threat intelligence
dc.subject	Graph neural networks
dc.subject	Learning representations
dc.subject	Transformer models
dc.subject	Event detection and ranking
dc.title	Learning representations for information mining from text corpora with applications to cyber threat intelligence
dc.type	Dissertation

Files

Original bundle

Now showing 1 - 1 of 1

Name:: AvishekBose2023.pdf
Size:: 5.43 MB
Format:: Adobe Portable Document Format
Description:: Main article

Download

License bundle

Now showing 1 - 1 of 1

Name:: license.txt
Size:: 1.62 KB
Format:: Item-specific license agreed upon to submission
Description:

Download

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -