Machine learning for text categorization: experiments using clustering and classification

Bikki, Poojitha

dc.contributor.author	Bikki, Poojitha
dc.date.accessioned	2018-04-23T13:47:21Z
dc.date.available	2018-04-23T13:47:21Z
dc.date.graduationmonth	May	en_US
dc.date.issued	2018-05-01	en_US
dc.date.published	2018	en_US
dc.description.abstract	This work describes a comparative study of empirical methods for categorization of news articles within text corpora: unsupervised learning for an unlabeled corpus of text documents and supervised learning for a hand-labeled corpus. The goal of text categorization is to organize natural language (i.e., human language) documents into categories that are either predefined or that are inherently grouped by similar meaning. The first approach, automatic classification of texts, is useful when handling massive amounts of data and has many applications, such as automated indexing of scientific articles, spam filtering, and classification of news articles. Classification using supervised or semi-supervised inductive learning involves labeled data, which can be expensive to acquire and may require a semantically deep understanding of the meaning of texts. The second approach falls under the general rubric of document clustering, based on the statistical distribution and co-occurrence of words in a full-text document. Developing a full pipeline for document categorization draws on methods from information retrieval (IR), natural language processing (NLP), and machine learning (ML). In this project, experiments are conducted on two text corpora: news aggregator data, which contains news headlines collected from a web aggregator, and a news data set consisting of original news articles from the British Broadcasting Corporation (BBC). First, the training data is developed from these corpora. Next, common types of supervised classifiers, such as linear models, Bayesian models, ensemble models, and support vector machines (SVM), are trained on the labeled data, and the trained classification models are used to predict the category of an article, given the related text. The results obtained are analyzed and compared to determine the best-performing model.
Then, two unsupervised learning techniques, k-means and Latent Dirichlet Allocation (LDA), are applied to obtain clusters of data points. k-means separates the documents into disjoint clusters of similar news. LDA, which treats each document as a mixture of topics, is used to find latent topics in the text. Finally, visualizations of the results are produced for evaluation: to allow qualitative assessment of cluster separation in the case of unsupervised learning, and to interpret the confusion matrix for the supervised classification task through heat map visualization, alongside precision, recall, and other holistic metrics. From an application standpoint, the unsupervised techniques applied can be used to find news items that are similar in content and can be categorized under a specific topic.	en_US
dc.description.advisor	William Hsu	en_US
dc.description.degree	Master of Science	en_US
dc.description.department	Department of Computer Science	en_US
dc.description.level	Masters	en_US
dc.identifier.uri	http://hdl.handle.net/2097/38889
dc.language.iso	en_US	en_US
dc.subject	Text categorization	en_US
dc.subject	Machine learning	en_US
dc.subject	Classification	en_US
dc.subject	Clustering	en_US
dc.title	Machine learning for text categorization: experiments using clustering and classification	en_US
dc.type	Report	en_US

Files

Original bundle

Name: PoojithaBikki2018.pdf
Size: 2.13 MB
Format: Adobe Portable Document Format

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -