Research Article

Web Page Genre Classification: Impact of n-Gram Lengths

by K. Pranitha Kumari, A. Venugopal Reddy, S. Sameen Fatima
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 88 - Number 13
Year of Publication: 2014
Authors: K. Pranitha Kumari, A. Venugopal Reddy, S. Sameen Fatima
10.5120/15412-3907

K. Pranitha Kumari, A. Venugopal Reddy, S. Sameen Fatima. Web Page Genre Classification: Impact of n-Gram Lengths. International Journal of Computer Applications. 88, 13 (February 2014), 13-17. DOI=10.5120/15412-3907

@article{ 10.5120/15412-3907,
author = { K. Pranitha Kumari and A. Venugopal Reddy and S. Sameen Fatima },
title = { Web Page Genre Classification: Impact of n-Gram Lengths },
journal = { International Journal of Computer Applications },
issue_date = { February 2014 },
volume = { 88 },
number = { 13 },
month = { February },
year = { 2014 },
issn = { 0975-8887 },
pages = { 13-17 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume88/number13/15412-3907/ },
doi = { 10.5120/15412-3907 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:07:31.307887+05:30
%A K. Pranitha Kumari
%A A. Venugopal Reddy
%A S. Sameen Fatima
%T Web Page Genre Classification: Impact of n-Gram Lengths
%J International Journal of Computer Applications
%@ 0975-8887
%V 88
%N 13
%P 13-17
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web pages can be distinguished by topic and by genre. Genre information can help modern search engines focus on the user's information need. In this paper, web pages are represented using character n-grams. This representation is language independent, allows features to be extracted automatically from a web page, and can be used efficiently to classify a web page by genre. A Support Vector Machine (SVM) model is used for classification, and experiments were carried out on the 7-Genre corpus while varying the n-gram length. Performance in terms of F-measure improves as the n-gram length increases from 3 to 5 and degrades as the length is increased further.
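
As a rough illustration of the character n-gram representation described in the abstract, the sketch below counts overlapping character n-grams in a string and reports term frequencies for lengths 3 to 5. The function name `char_ngram_tf` and the sample string are hypothetical and do not come from the paper, whose preprocessing and SVM setup are not specified here. In a full pipeline, frequency vectors like these would be passed to an SVM classifier.

```python
from collections import Counter


def char_ngram_tf(text: str, n: int) -> Counter:
    """Return term frequencies of all overlapping character n-grams of length n.

    Illustrative helper only (not the authors' code): it applies no
    normalisation, so case, whitespace and punctuation are kept as-is.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # A text of length L yields L - n + 1 n-grams (none if L < n).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    # Hypothetical stand-in for the text extracted from one web page.
    page_text = "web page genre"
    for n in (3, 4, 5):
        tf = char_ngram_tf(page_text, n)
        print(f"n={n}: {len(tf)} distinct n-grams, most common: {tf.most_common(2)}")
```

Longer n-grams produce more distinct and sparser features, which is one plausible reason the choice of length matters for classification.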

Index Terms

Computer Science
Information Sciences

Keywords

Character n-gram, feature extraction, n-gram length, web page representation, SVM classifier, term frequency