Keyphrase extraction and its applications to digital libraries

dc.contributor.authorPatel, Krutarth Indubhai
dc.date.accessioned2021-03-26T21:58:22Z
dc.date.available2021-03-26T21:58:22Z
dc.date.graduationmonthMay
dc.date.issued2021
dc.description.abstractScholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. Moreover, they are very useful in many applications such as document and citation recommendation, expert search, scientific paper summarization, collaborator recommendation, topic classification, and keyphrase extraction. Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. Furthermore, keyphrases associated with research papers provide an effective way to find useful information in the large and growing scholarly digital collections. Keyphrases are useful in many applications such as document indexing and summarization, topic tracking, contextual advertising, and opinion mining. However, keyphrases are not always provided with the papers and must instead be extracted from their content. A growing number of scholarly digital libraries, museums, and archives around the world are embracing web archiving as a mechanism to collect born-digital material made available via the web. To create specialized collections from web-archived data, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection. In this dissertation, we first explore keyphrase extraction as a supervised task formulated as sequence labeling, and we use Conditional Random Fields (CRFs) to capture label dependencies through a transition parameter matrix consisting of the transition probabilities from one label to the neighboring label. Our proposed CRF-based supervised approach exploits word embeddings as features along with traditional, document-specific features.
Our results on five datasets of research papers show that the word embeddings combined with document-specific features achieve high performance and outperform strong baselines for this task. We also propose KPRank, an unsupervised graph-based algorithm for keyphrase extraction that incorporates both positional information and contextual word embeddings into a biased PageRank. Our experimental results on five benchmark datasets show that KPRank, using contextual word embeddings with an additional position signal, outperforms previous approaches and strong baselines for this task. Furthermore, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in the CiteSeerX digital library for extracting high-quality keyphrases. Further, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ≈ 267,000 unique research papers through our fully automated framework using ≈ 76,000 queries, resulting in almost 200,000 more papers than the number of queries. Furthermore, we propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library, including CiteSeerX, to crawl scientific documents. We use Self-Training to reduce the labeling effort and to utilize the unlabeled data to train an efficient researcher homepage classifier. Our experiments on a large-scale dataset highlight the effectiveness of our approach and position Web search as an effective method for acquiring authors' homepages. Finally, we explore different learning models and feature representations to determine the best-performing ones for identifying the documents of interest from web-archived data.
Specifically, we study both machine learning and deep learning models, along with "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. Moreover, we explore dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types. We propose two dynamic classifier selection algorithms: Dynamic Classifier Selection for Document Classification (or DCSDC), and Dynamic Decision level Fusion for Document Classification (or DDFC). Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on all three datasets.
dc.description.advisorCornelia Caragea
dc.description.advisorDoina Caragea
dc.description.degreeDoctor of Philosophy
dc.description.departmentDepartment of Computer Science
dc.description.levelDoctoral
dc.identifier.urihttps://hdl.handle.net/2097/41306
dc.language.isoen_US
dc.publisherKansas State University
dc.rights© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/
dc.subjectDigital libraries
dc.subjectKeyphrase extraction
dc.subjectMachine learning
dc.subjectDeep learning
dc.subjectWeb archiving
dc.subjectText classification
dc.titleKeyphrase extraction and its applications to digital libraries
dc.typeDissertation

Files

Original bundle

Name: KrutarthPatel2021.pdf
Size: 10.39 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission