International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 88 - Number 13
Year of Publication: 2014
Authors: K. Pranitha Kumari, A. Venugopal Reddy, S. Sameen Fatima |
10.5120/15412-3907 |
K. Pranitha Kumari, A. Venugopal Reddy, S. Sameen Fatima. Web Page Genre Classification: Impact of n-Gram Lengths. International Journal of Computer Applications. 88, 13 (February 2014), 13-17. DOI=10.5120/15412-3907
Web pages can be distinguished by both topic and genre. Genre information can help modern search engines focus on the user's information need. In this paper, web pages are represented using character n-grams. This representation is language independent, allows features to be extracted from a web page automatically, and can be used efficiently to classify a web page by genre. A Support Vector Machine (SVM) model is used for classification, and experiments were carried out on the 7-Genre corpus with varying n-gram lengths. Performance in terms of F-measure improves as the n-gram length is increased from 3 to 5 and degrades as the length is increased further.
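
As an illustration of the representation described above, the following minimal sketch extracts overlapping character n-grams from a string for lengths 3 to 5 and computes a balanced F-measure from raw counts. It is not the authors' implementation: the function names `char_ngrams` and `f_measure`, the use of raw counts as feature values, the absence of any text normalization, and the sample string are all assumptions made for illustration. The SVM training step is omitted.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams of length n in text.

    Assumption: raw counts are used as feature values; the abstract does
    not state the weighting scheme the authors applied.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A text shorter than n yields an empty range and therefore no n-grams.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    page_text = "web page genre classification"  # hypothetical sample text
    for n in range(3, 6):  # the n-gram lengths 3, 4 and 5
        profile = char_ngrams(page_text, n)
        print(n, len(profile), profile.most_common(3))
```

Because the features are plain substrings, no tokenizer or language-specific resource is needed, which is what makes the representation language independent. In a full pipeline, each page's counts would be mapped to a fixed-size vector over a chosen n-gram vocabulary before being passed to a classifier.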