A semi-supervised clustering method for payload extraction

dc.contributor.author: Lee, Kyu Seok
dc.date.accessioned: 2020-05-11T16:12:50Z
dc.date.available: 2020-05-11T16:12:50Z
dc.date.graduationmonth: May (en_US)
dc.date.issued: 2020-05-01
dc.date.modified: 2020-05-13
dc.date.published: 2020 (en_US)
dc.description.abstract: This thesis addresses payload extraction, the information extraction task of capturing the text of an article from a formatted document such as a PDF file. It focuses on the application and improvement of density-based clustering algorithms as an alternative or supplement to rule-based methods for this task domain. While supervised learning performs well on classification-based subtasks of payload extraction, such as relevance filtering of documents or sections in a collection, the labeled data it requires for training are often prohibitively expensive to obtain in terms of annotator and developer time. Unlabeled data, on the other hand, are often readily available in large quantities at no cost, but there have been few ways to exploit them. Semi-supervised learning addresses this problem by using large amounts of unlabeled data, together with the labeled data, to build better classifiers. In this thesis, I present a semi-supervised learning-driven approach for the analysis of scientific literature which either already contains unlabeled metadata, or from which this metadata can be computed. Furthermore, machine learning-based analysis techniques are exploited to make this system robust and flexible to its data environment. The overall goal of this research is to develop a methodology to support the document analysis functions of layout-based document segmentation and section classification. This methodology is implemented within an information extraction system, which frames the empirical evaluation and engineering objectives of this work. As an example application, my implementation supports detection and classification of titles, authors, additional author information, abstract, and the titles and body of subsections such as ‘Introduction’, ‘Method’, ‘Result’, ‘Discussion’, ‘Acknowledgement’, ‘Reference’, etc.
The novel contribution of this work also includes payload extraction as an intermediate functional stage within a pipeline for procedural information extraction from the scientific literature. My experimental results show that this approach outperforms a state-of-the-field heuristic pattern analysis system on a corpus from the domain of nanomaterials synthesis. (en_US)
dc.description.advisor: Don M. Gruenbacher (en_US)
dc.description.advisor: William H. Hsu (en_US)
dc.description.degree: Master of Science (en_US)
dc.description.department: Department of Electrical and Computer Engineering (en_US)
dc.description.level: Masters (en_US)
dc.identifier.uri: https://hdl.handle.net/2097/40657
dc.language.iso: en_US (en_US)
dc.subject: Structured information extraction (en_US)
dc.subject: Document analysis (en_US)
dc.subject: Text analytics (en_US)
dc.subject: Section classification (en_US)
dc.subject: DBSCAN (en_US)
dc.title: A semi-supervised clustering method for payload extraction (en_US)
dc.type: Thesis (en_US)

Files

Original bundle
Name: KyuseokLee2020.pdf
Size: 3.8 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission