Machine learning for text categorization: experiments using clustering and classification

K-REx Repository

Show simple item record Bikki, Poojitha 2018-04-23T13:47:21Z 2018-04-23T13:47:21Z 2018-05-01 en_US
dc.description.abstract This work describes a comparative study of empirical methods for categorization of new articles within text corpora: unsupervised learning for an unlabeled corpus of text documents and supervised learning for hand-labeled corpus. The goal of text categorization is to organize natural language (i.e. human language) documents into categories that are either predefined or that are inherently grouped by similar meaning. The first approach, automatic classification of texts, can be handy when handling massive amounts of data and has many applications such as automated indexing of scientific articles, spam filtering, classification of news articles etc. Classification using supervised or semi-supervised inductive learning involves labeled data, which can be expensive to acquire and may require semantically deep understanding of the meaning of texts. The second approach falls under the general rubric of document clustering, based on the statistical distribution and co-occurrence of words in a full-text document. Developing a full pipeline for document categorization draws on methods from information retrieval (IR), natural language processing (NLP), and machine learning (ML). In this project, experiments are conducted on two text corpora: news aggregator data, which contains news headlines collected from a web aggregator and a news data set consisting of original news articles from the British Broadcasting Corporation (BBC). First, the training data is developed from these corpora. Next, common types of supervised classifiers, such as linear, Bayesian, ensemble models and support vector machines (SVM) are trained, on the labelled data and the trained classification models are used to predict the category of an article, given the related text. The results obtained are analyzed and compared to determine the best performing model. Then, two unsupervised learning techniques – k-means and Latent Dirichlet Allocation (LDA) are applied to obtain clusters of data points. k-means separates the documents into disjoint clusters of similar news. Additionally, LDA was used, which treats documents as a mixture of topics, to find latent topics in text. Finally, visualizations of the results are produced for evaluation: to allow qualitative assessment of cluster separation in the case of unsupervised learning, or to understand the confusion matrix for the supervised classification task by heat map visualization as well as precision, recall, and other holistic metrics. From an application standpoint, the unsupervised techniques applied can be used to find news that are similar in content and can be categorized under a specific topic. en_US
dc.language.iso en_US en_US
dc.subject Text categorization en_US
dc.subject Machine learning en_US
dc.subject Classification en_US
dc.subject Clustering en_US
dc.title Machine learning for text categorization: experiments using clustering and classification en_US
dc.type Report en_US Master of Science en_US
dc.description.level Masters en_US
dc.description.department Department of Computer Science en_US
dc.description.advisor William Hsu en_US 2018 en_US May en_US

Files in this item

This item appears in the following Collection(s)

Show simple item record

Search K-REx

Advanced Search


My Account


Center for the

Advancement of Digital