Web genre classification using feature selection and semi-supervised learning

Chetry, Roshan

dc.contributor.author	Chetry, Roshan
dc.date.accessioned	2011-05-09T14:16:48Z
dc.date.available	2011-05-09T14:16:48Z
dc.date.graduationmonth	May	en_US
dc.date.issued	2011-05-09
dc.date.published	2011	en_US
dc.description.abstract	As web pages continuously change and their number grows exponentially, the need for genre classification of web pages also increases. One simple reason is the need to group web pages into genre categories in order to reduce the complexity of various web tasks (e.g., search). Experts agree on the huge potential of genre classification of web pages. However, while genre classification is widely regarded as necessary, researchers have difficulty finding enough labeled data to perform supervised classification of web pages into genres. The high cost of skilled manual labor, the rapidly changing nature of the web, and the never-ending growth in the number of web pages are the main reasons for the limited amount of labeled data. In contrast, unlabeled data can be acquired relatively inexpensively. This suggests using semi-supervised learning approaches for genre classification instead of supervised approaches. Semi-supervised learning makes use of both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data. Semi-supervised learning has been used extensively in text classification problems. Given the link structure of the web, web-page classification can use link features in addition to the content features used for general text classification. Hence, the feature set corresponding to web pages can be easily divided into two views, namely content-based and link-based feature views. Intuitively, the two feature views are conditionally independent given the genre category, and each is able to predict the class on its own.
The scarcity of labeled data, the availability of large amounts of unlabeled data, and a richer set of features than in conventional text classification tasks (specifically, complementary and sufficient feature views) have encouraged us to use co-training as a tool for semi-supervised learning. During co-training, labeled examples represented using the two views are used to learn distinct classifiers, which keep improving at each iteration by sharing their most confident predictions on the unlabeled data. In this work, we use co-training to classify web pages from the .eu domain, consisting of 1232 labeled hosts and 20000 unlabeled hosts (provided by the European Archive Foundation [Benczur et al., 2010]), into six different genres. We compare our results with those produced by standard supervised methods. We find that co-training can be an effective and cheap alternative to costly supervised learning. This is mainly due to the two independent and complementary feature sets of the web: content-based features and link-based features.	en_US
dc.description.advisor	Doina Caragea	en_US
dc.description.degree	Master of Science	en_US
dc.description.department	Department of Computing and Information Sciences	en_US
dc.description.level	Masters	en_US
dc.identifier.uri	http://hdl.handle.net/2097/8855
dc.language.iso	en_US	en_US
dc.publisher	Kansas State University	en
dc.subject	Web genre classification	en_US
dc.subject	Co-training	en_US
dc.subject	Semi-supervised learning	en_US
dc.subject	Feature selection	en_US
dc.subject	Roshan Chetry	en_US
dc.subject.umi	Computer Science (0984)	en_US
dc.subject.umi	Information Technology (0489)	en_US
dc.subject.umi	Web Studies (0646)	en_US
dc.title	Web genre classification using feature selection and semi-supervised learning	en_US
dc.type	Report	en_US
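
The co-training loop described in the abstract can be illustrated with a minimal sketch. This is not the report's implementation: the nearest-centroid classifier, the margin-based confidence score, the function names, the `rounds` and `per_round` parameters, and the toy "blog"/"shop" genres and feature values are all assumptions chosen to keep the example short and dependency-free. It only shows the general idea of two view-specific classifiers sharing their most confident predictions on unlabeled data.

```python
import math

def fit_centroids(X, y):
    """Nearest-centroid 'classifier': return {label: mean feature vector}."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is the distance margin
    between the closest and second-closest centroid (an assumed heuristic)."""
    dists = sorted((math.dist(x, c), label) for label, c in centroids.items())
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
    return best_label, margin

def co_train(content_l, link_l, labels, content_u, link_u, rounds=5, per_round=2):
    """Grow the labeled set using two feature views (content and link)."""
    content_l, link_l, labels = list(content_l), list(link_l), list(labels)
    pool = list(range(len(content_u)))  # indices of still-unlabeled examples
    for _ in range(rounds):
        if not pool:
            break
        picked = {}
        # Each view trains its own classifier on the current labeled set.
        for labeled_view, unlabeled_view in ((content_l, content_u), (link_l, link_u)):
            model = fit_centroids(labeled_view, labels)
            scored = sorted(
                ((predict_with_confidence(model, unlabeled_view[i]), i) for i in pool),
                key=lambda t: t[0][1],
                reverse=True,
            )
            # The view's most confident predictions become new labels.
            for (label, _), i in scored[:per_round]:
                picked.setdefault(i, label)
        # Newly labeled examples are shared with both views.
        for i, label in picked.items():
            content_l.append(content_u[i])
            link_l.append(link_u[i])
            labels.append(label)
            pool.remove(i)
    return fit_centroids(content_l, labels), fit_centroids(link_l, labels)

# Toy data (hypothetical): two labeled hosts, four unlabeled hosts, two genres.
content_model, link_model = co_train(
    content_l=[[0, 0], [10, 10]],
    link_l=[[0], [10]],
    labels=["blog", "shop"],
    content_u=[[1, 0], [9, 10], [0, 1], [10, 9]],
    link_u=[[1], [9], [0.5], [9.5]],
)
```

In this sketch an example labeled by either view is added to the shared labeled set, so each classifier benefits from the other's confident predictions; a practical setup would substitute stronger per-view classifiers and calibrated confidence scores.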

Files

Original bundle

Name: RoshanChetry2011.pdf
Size: 718.95 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.61 KB
Format: Item-specific license agreed upon to submission

Collections

K-State Electronic Theses, Dissertations, and Reports: 2004 -