Deep integrative information extraction from scientific literature

dc.contributor.author: Yang, Huichen
dc.date.accessioned: 2022-04-13T19:23:44Z
dc.date.available: 2022-04-13T19:23:44Z
dc.date.graduationmonth: May (en_US)
dc.date.published: 2022 (en_US)
dc.description.abstract: This dissertation presents deep integrative methods, from both visual and textual perspectives, to address the challenges of extracting information from documents, particularly scientific literature. The number of publications in the academic literature has soared. Published literature contains large amounts of valuable information that can help scientists and researchers develop new directions in their fields of interest. Moreover, this information can be used in many applications, among them scholarly search engines, relevant-paper recommendation, and citation analysis. However, the increased production of scientific literature makes literature review laborious and time-consuming, especially when large amounts of data are stored in heterogeneous, unstructured formats, including numerical data and image-based text, that are challenging to read and analyze. Thus, the ability to automatically extract information from the scientific literature is necessary. In this dissertation, we present integrative information extraction from scientific literature using deep learning approaches. We first investigated a vision-based approach for understanding layout and extracting metadata from scanned images of scientific literature. We evaluated convolutional neural network and transformer-based approaches to document layout analysis. Furthermore, for vision-based metadata extraction, we proposed a trainable recurrent convolutional neural network that integrates scientific document layout detection and character recognition to extract metadata from the scientific literature. In doing so, we addressed a limitation of existing methods, which cannot efficiently combine layout extraction and text recognition because different publishers use different formats to present information. This framework requires no additional text features to be added to the network during training, and it generates the text content and appropriate labels of the major sections of scientific documents.
We then extracted key information from unstructured text in the scientific literature using technologies based on Natural Language Processing (NLP). Key information can include named entities and the relationships between pairs of entities in the scientific literature. This information can give researchers key insights into the scientific literature. We proposed an attention-based deep learning method to extract key information from limited annotated data sets. This method enhances contextualized word representations using pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) and, unlike conventional machine learning approaches, does not require hand-crafted features or training on massive data. The dissertation concludes by identifying additional challenges and future work in extracting information from the scientific literature. (en_US)
dc.description.advisor: William H Hsu (en_US)
dc.description.degree: Doctor of Philosophy (en_US)
dc.description.department: Department of Computer Science (en_US)
dc.description.level: Doctoral (en_US)
dc.identifier.uri: https://hdl.handle.net/2097/42110
dc.language.iso: en_US (en_US)
dc.subject: Deep learning (en_US)
dc.subject: Information extraction (en_US)
dc.subject: Artificial intelligence (en_US)
dc.subject: Natural language processing (en_US)
dc.subject: Pattern recognition (en_US)
dc.subject: Machine learning (en_US)
dc.title: Deep integrative information extraction from scientific literature (en_US)
dc.type: Dissertation (en_US)

Files

Original bundle
Name: HuichenYang2022.pdf
Size: 15.37 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission