Explainable identification of hidden language patterns across documents using NLP and statistical techniques

Date

2024

Abstract

In Natural Language Processing (NLP), extracting detailed linguistic patterns and domain-specific insights is crucial for analyzing complex text data. Existing tools and large language models (LLMs) such as ChatGPT often perform this task with only partial success and do not provide explainable insights into the nature of the similarities they identify. This research addresses that gap by developing an automated framework that combines n-gram analysis with NLP libraries and a custom formula, developed through multiple iterations, that calculates a weighted score based on n-gram frequency differences and the presence of key linguistic features, which allows meaningful n-grams to be prioritized. The primary goal is a versatile framework capable of conducting comprehensive analyses across diverse fields, including biographies, historical events, and scientific literature, even when dealing with unfamiliar content. By focusing on recurring linguistic patterns, the framework uncovers subtle conceptual relationships between texts that are not immediately obvious. Visualization tools such as word clouds further enhance the representation of key terms and patterns. By addressing the limitations of current AI technologies and other tools, the framework enables researchers and analysts to explore complex textual data with greater depth and precision, offering a scalable and reliable solution for analyzing large datasets in various domains.
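As a rough illustration of the kind of scoring the abstract describes, the following minimal sketch ranks n-grams shared by two documents. It is not the thesis's formula, which is not given in this record. The function names, the tiny stopword list, the `feature_weight` parameter, and the use of "fraction of non-stopword tokens" as a stand-in for "key linguistic features" are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real pipeline would
# likely rely on an NLP library for this and for richer features.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "near"}


def ngrams(text, n):
    """Lowercase the text, split it into word tokens, and return its n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_freqs(text, n):
    """Map each n-gram to its relative frequency within one document."""
    grams = ngrams(text, n)
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def score_shared_ngrams(doc_a, doc_b, n=2, feature_weight=2.0):
    """Rank n-grams that occur in both documents.

    Assumed scoring (illustrative only):
      closeness = 1 - |f_a - f_b| / max(f_a, f_b)
      score     = closeness * (1 + feature_weight * content_fraction)
    where content_fraction is the share of non-stopword tokens in the n-gram.
    """
    freq_a = relative_freqs(doc_a, n)
    freq_b = relative_freqs(doc_b, n)
    scores = {}
    for gram in freq_a.keys() & freq_b.keys():
        fa, fb = freq_a[gram], freq_b[gram]
        closeness = 1.0 - abs(fa - fb) / max(fa, fb)
        content_fraction = sum(t not in STOPWORDS for t in gram) / len(gram)
        scores[gram] = closeness * (1.0 + feature_weight * content_fraction)
    # Highest-scoring n-grams first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


doc_a = "The black hole emits radiation and the black hole grows"
doc_b = "A black hole bends light near the black hole"
ranked = score_shared_ngrams(doc_a, doc_b, n=2)
for gram, score in ranked:
    print(" ".join(gram), round(score, 3))
```

Because every score is traceable to a specific n-gram and its two frequencies, a ranking of this kind can be inspected directly, which is the sense of explainability the abstract points to.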

Keywords

Natural Language Processing (NLP), Linguistic patterns, Explainable AI, N-gram analysis, Textual similarity, AI limitations

Graduation Month

December

Degree

Master of Science

Department

Department of Computer Science

Major Professor

Lior Shamir

Type

Thesis
