A semi-supervised clustering method for payload extraction

K-REx Repository

dc.contributor.author Lee, Kyu Seok
dc.date.accessioned 2020-05-11T16:12:50Z
dc.date.available 2020-05-11T16:12:50Z
dc.date.issued 2020-05-01
dc.identifier.uri https://hdl.handle.net/2097/40657
dc.description.abstract This thesis addresses payload extraction, the information extraction task of capturing the text of an article from a formatted document such as a PDF file, and focuses on the application and improvement of density-based clustering algorithms as an alternative or supplement to rule-based methods for this task domain. While supervised learning performs well on classification-based subtasks of payload extraction, such as relevance filtering of documents or sections in a collection, the labeled data it requires for training are often prohibitively expensive to obtain (in terms of the time of annotators and developers). Unlabeled data, on the other hand, are often readily available in large quantities without cost, but few methods exist to exploit them. Semi-supervised learning addresses this problem by using large amounts of unlabeled data, together with the labeled data, to build better classifiers. In this thesis, I present a semi-supervised learning-driven approach for the analysis of scientific literature that either already contains unlabeled metadata or from which this metadata can be computed. Furthermore, machine learning-based analysis techniques are exploited to make this system robust and flexible with respect to its data environment. The overall goal of this research is to develop a methodology to support the document analysis functions of layout-based document segmentation and section classification. This methodology is implemented within an information extraction system, which frames the empirical evaluation and engineering objectives of this work. As an example application, my implementation supports detection and classification of titles, authors, additional author information, abstracts, and the titles and bodies of subsections such as ‘Introduction’, ‘Method’, ‘Result’, ‘Discussion’, ‘Acknowledgement’, ‘Reference’, etc.
The novel contribution of this work also includes payload extraction as an intermediate functional stage within a pipeline for procedural information extraction from the scientific literature. My experimental results show that this approach outperforms a state-of-the-field heuristic pattern analysis system on a corpus from the domain of nanomaterials synthesis. en_US
dc.language.iso en_US en_US
dc.subject Structured information extraction en_US
dc.subject Document analysis en_US
dc.subject Text analytics en_US
dc.subject Section classification en_US
dc.subject DBSCAN en_US
dc.title A semi-supervised clustering method for payload extraction en_US
dc.type Thesis en_US
dc.description.degree Master of Science en_US
dc.description.level Masters en_US
dc.description.department Department of Electrical and Computer Engineering en_US
dc.description.advisor Don M. Gruenbacher en_US
dc.description.advisor William H. Hsu en_US
dc.date.published 2020 en_US
dc.date.graduationmonth May en_US
dc.date.modified 2020-05-13
