Deep integrative information extraction from scientific literature

Date

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

This dissertation presents deep integrative methods from both visual and textual perspectives to address the challenges of extracting information from documents, particularly scientific literature. The number of publications in the academic literature has soared. Published literature includes large amounts of valuable information that can help scientists and researchers develop new directions in their fields of interest. Moreover, this information can be used in many applications, among them scholar search engines, relevant paper recommendations, and citation analysis. However, the increased production of scientific literature makes the process of literature review laborious and time-consuming, especially when large amounts of data are stored in heterogeneous unstructured formats, both numerical and image-based text, both of which are challenging to read and analyze. Thus, the ability to automatically extract information from the scientific literature is necessary.

In this dissertation, we present integrative information extraction from scientific literature using deep learning approaches. We first investigated a vision-based approach for understanding layout and extracting metadata from scanned scientific literature images. We tried convolutional neural network and transformer-based approaches to document layout. Furthermore, for vision-based metadata information extraction, we proposed a trainable recurrent convolutional neural network that integrated scientific document layout detection and character recognition to extract metadata information from the scientific literature. In doing so, we addressed the problem of existing methods that cannot combine the techniques of layout extraction and text recognition efficiently because different publishers use different formats to present information. This framework requires no additional text features added into the network during the training process and will generate text content and appropriate labels of major sections of scientific documents.

We then extracted key-information from unstructured texts in the scientific literature using technologies based on Natural Language Processing (NLP). Key-information could include the named entity and the relationship between pairs of entities in the scientific literature. This information can help provide researchers with key insights into the scientific literature. We proposed the attention-based deep learning method to extract key-information with limited annotated data sets. This method enhances contextualized word representations using pre-trained language models like a Bidirectional Encoder Representations from Transformers (BERT) that, unlike conventional machine learning approaches, does not require hand-crafted features or training with massive data. The dissertation concludes by identifying additional challenges and future work in extracting information from the scientific literature.

Description

Keywords

Deep learning, Information extraction, Artificial intelligence, Natural language processing, Pattern recognition, Machine learning

Graduation Month

May

Degree

Doctor of Philosophy

Department

Department of Computer Science

Major Professor

William H Hsu

Date

2022

Type

Dissertation

Citation