Web genre classification using feature selection and semi-supervised learning

dc.contributor.author: Chetry, Roshan
dc.date.accessioned: 2011-05-09T14:16:48Z
dc.date.available: 2011-05-09T14:16:48Z
dc.date.graduationmonth: May (en_US)
dc.date.issued: 2011-05-09
dc.date.published: 2011 (en_US)
dc.description.abstract: As web pages change continuously and their number grows exponentially, the need for genre classification of web pages also increases. One simple reason is the need to group web pages into genre categories in order to reduce the complexity of various web tasks (e.g., search). Experts agree on the huge potential of genre classification of web pages. However, while its necessity is widely accepted, researchers have difficulty finding enough labeled data to perform supervised classification of web pages into genres. The high cost of skilled manual labor, the rapidly changing nature of the web, and the never-ending growth in the number of web pages are the main reasons for the limited amount of labeled data. In contrast, unlabeled data can be acquired relatively inexpensively. This suggests using semi-supervised learning approaches for genre classification instead of supervised approaches. Semi-supervised learning makes use of both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data. Semi-supervised learning has been used extensively in text classification problems. Given the link structure of the web, web-page classification can use link features in addition to the content features used for general text classification. Hence, the feature set for web pages can be easily divided into two views: content-based and link-based features. Intuitively, the two feature views are conditionally independent given the genre category, and each can predict the class on its own.
The scarcity of labeled data, the availability of large amounts of unlabeled data, and a richer set of features than in conventional text classification tasks (specifically, complementary and sufficient feature views) have encouraged us to use co-training to perform semi-supervised learning. During co-training, labeled examples represented using the two views are used to learn distinct classifiers, which keep improving at each iteration by sharing their most confident predictions on the unlabeled data. In this work, we use co-training to classify web pages from the .eu domain, consisting of 1232 labeled hosts and 20000 unlabeled hosts (provided by the European Archive Foundation [Benczur et al., 2010]), into six different genres. We compare our results with those produced by standard supervised methods. We find that co-training can be an effective and cheap alternative to costly supervised learning. This is mainly due to the two independent and complementary feature sets of the web: content-based features and link-based features. (en_US)
dc.description.advisor: Doina Caragea (en_US)
dc.description.degree: Master of Science (en_US)
dc.description.department: Department of Computing and Information Sciences (en_US)
dc.description.level: Masters (en_US)
dc.identifier.uri: http://hdl.handle.net/2097/8855
dc.language.iso: en_US (en_US)
dc.publisher: Kansas State University (en)
dc.subject: Web genre classification (en_US)
dc.subject: Co-training (en_US)
dc.subject: Semi-supervised learning (en_US)
dc.subject: Feature selection (en_US)
dc.subject: Roshan Chetry (en_US)
dc.subject.umi: Computer Science (0984) (en_US)
dc.subject.umi: Information Technology (0489) (en_US)
dc.subject.umi: Web Studies (0646) (en_US)
dc.title: Web genre classification using feature selection and semi-supervised learning (en_US)
dc.type: Report (en_US)
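
The co-training loop summarized in the abstract (two classifiers, one per feature view, each passing its most confident predictions on unlabeled data to a shared labeled set) can be illustrated with a minimal sketch. This is not the report's implementation: the nearest-centroid learner, the softmax-style confidence score, the function names, the toy feature vectors, and the genre labels "blog" and "shop" are all assumptions made purely for illustration.

```python
import math


class CentroidClassifier:
    """Tiny nearest-centroid learner; a hypothetical stand-in for any per-view classifier."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_with_confidence(self, x):
        # Assumed confidence score: softmax over negative distances to the centroids.
        weights = {c: math.exp(-math.dist(x, m)) for c, m in self.centroids.items()}
        label = max(weights, key=weights.get)
        return label, weights[label] / sum(weights.values())


def train_views(labeled):
    """Train one classifier on the content view and one on the link view."""
    labels = [y for _, _, y in labeled]
    clf_content = CentroidClassifier().fit([c for c, _, _ in labeled], labels)
    clf_link = CentroidClassifier().fit([l for _, l, _ in labeled], labels)
    return clf_content, clf_link


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: (content_vec, link_vec, label) tuples; unlabeled: (content_vec, link_vec) tuples."""
    labeled, pool = list(labeled), list(unlabeled)
    classifiers = train_views(labeled)
    for _ in range(rounds):
        if not pool:
            break
        picks = {}  # pool index -> predicted label
        for view, clf in enumerate(classifiers):
            scored = [(clf.predict_with_confidence(ex[view]), idx)
                      for idx, ex in enumerate(pool)]
            scored.sort(key=lambda item: item[0][1], reverse=True)
            for (label, _), idx in scored[:per_round]:
                # If both views pick the same example, the content view's label is kept.
                picks.setdefault(idx, label)
        # Move the confidently labeled examples into the shared labeled set.
        for idx in sorted(picks, reverse=True):
            content, link = pool.pop(idx)
            labeled.append((content, link, picks[idx]))
        classifiers = train_views(labeled)
    return classifiers, labeled


# Toy data (invented): 2-d content features, 1-d link features, two genres.
seed = [([0.0, 0.1], [0.0], "blog"), ([1.0, 0.9], [1.0], "shop")]
unlabeled_pool = [([0.1, 0.0], [0.1]), ([0.9, 1.0], [0.9]),
                  ([0.2, 0.2], [0.2]), ([0.8, 0.7], [0.95])]
classifiers, final_labeled = co_train(seed, unlabeled_pool)
```

In this sketch each view contributes one example per round, so the labeled set grows gradually; a real setup would likely cap additions by a confidence threshold and use stronger learners than nearest-centroid.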

Files

Original bundle
Name: RoshanChetry2011.pdf
Size: 718.95 KB
Format: Adobe Portable Document Format