Keyphrase extraction and its applications to digital libraries

Patel, Krutarth Indubhai

dc.contributor.author	Patel, Krutarth Indubhai
dc.date.accessioned	2021-03-26T21:58:22Z
dc.date.available	2021-03-26T21:58:22Z
dc.date.graduationmonth	May
dc.date.issued	2021
dc.description.abstract	Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. Moreover, they are very useful in many applications such as document and citation recommendation, expert search, scientific paper summarization, collaborator recommendation, topic classification, and keyphrase extraction. Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. Furthermore, keyphrases associated with research papers provide an effective way to find useful information in the large and growing scholarly digital collections. Keyphrases are useful in many applications such as document indexing and summarization, topic tracking, contextual advertising, and opinion mining. However, keyphrases are not always provided with the papers and instead need to be extracted from their content. A growing number of scholarly digital libraries, museums, and archives around the world are embracing web archiving as a mechanism to collect born-digital material made available via the web. To create a specialized collection from web-archived data, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection. In this dissertation, we first explore keyphrase extraction as a supervised task formulated as sequence labeling, and utilize the power of Conditional Random Fields to capture label dependencies through a transition parameter matrix consisting of the transition probabilities from one label to the neighboring label. Our proposed CRF-based supervised approach exploits word embeddings as features along with traditional, document-specific features. 
Our results on five datasets of research papers show that the word embeddings combined with document-specific features achieve high performance and outperform strong baselines for this task. We also propose KPRank, an unsupervised graph-based algorithm for keyphrase extraction that incorporates both positional information and contextual word embeddings into a biased PageRank. Our experimental results on five benchmark datasets show that KPRank, which uses contextual word embeddings with an additional position signal, outperforms previous approaches and strong baselines for this task. Furthermore, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in the CiteSeerX digital library for extracting high-quality keyphrases. Further, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ≈ 267,000 unique research papers through our fully automated framework using ≈ 76,000 queries, resulting in almost 200,000 more papers than the number of queries. Furthermore, we propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library, including CiteSeerX, to crawl scientific documents. We use self-training to reduce the labeling effort and to utilize the unlabeled data to train an efficient researcher homepage classifier. Our experiments on a large-scale dataset highlight the effectiveness of our approach and position Web search as an effective method for acquiring authors' homepages. Finally, we explore different learning models and feature representations to determine the best-performing ones for identifying the documents of interest from web-archived data. 
Specifically, we study both machine learning and deep learning models, using "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. Moreover, we explore dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types. We propose two dynamic classifier selection algorithms: Dynamic Classifier Selection for Document Classification (DCSDC) and Dynamic Decision level Fusion for Document Classification (DDFC). Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on all three datasets.
dc.description.advisor	Cornelia Caragea
dc.description.advisor	Doina Caragea
dc.description.degree	Doctor of Philosophy
dc.description.department	Department of Computer Science
dc.description.level	Doctoral
dc.identifier.uri	https://hdl.handle.net/2097/41306
dc.language.iso	en_US
dc.publisher	Kansas State University
dc.rights	© the author. This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Digital libraries
dc.subject	Keyphrase extraction
dc.subject	Machine learning
dc.subject	Deep learning
dc.subject	Web archiving
dc.subject	Text classification
dc.title	Keyphrase extraction and its applications to digital libraries
dc.type	Dissertation

Files

Original bundle

Name: KrutarthPatel2021.pdf
Size: 10.39 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed upon to submission

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -