Research Article

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

by B. Leela Devi, A. Sankar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 69 - Number 2
Year of Publication: 2013
Authors: B. Leela Devi, A. Sankar
10.5120/11818-7494

B. Leela Devi, A. Sankar. Web Page Structure Enhanced Feature Selection for Classification of Web Pages. International Journal of Computer Applications. 69, 2 (May 2013), 41-47. DOI=10.5120/11818-7494

@article{ 10.5120/11818-7494,
author = { B. Leela Devi and A. Sankar },
title = { Web Page Structure Enhanced Feature Selection for Classification of Web Pages },
journal = { International Journal of Computer Applications },
issue_date = { May 2013 },
volume = { 69 },
number = { 2 },
month = { May },
year = { 2013 },
issn = { 0975-8887 },
pages = { 41-47 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume69/number2/11818-7494/ },
doi = { 10.5120/11818-7494 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:29:42.333230+05:30
%A B. Leela Devi
%A A. Sankar
%T Web Page Structure Enhanced Feature Selection for Classification of Web Pages
%J International Journal of Computer Applications
%@ 0975-8887
%V 69
%N 2
%P 41-47
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web page classification is typically performed using text classification techniques. It differs from traditional text classification because web page structure supplies additional information about content importance. HTML tags control the visual representation of a web page and can be treated as a parameter that highlights content importance. Textual keywords are the basis on which information retrieval systems index and retrieve documents. Keyword-based retrieval returns inaccurate or incomplete results when documents and queries use different keywords to describe the same concept. Concept-based retrieval has tried to tackle this by using manual thesauri with term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. Semantic search has been a motivation for the Semantic Web since its inception, for both classification and retrieval processes. In this paper, a model that exploits semantic-based feature selection is proposed to improve search and retrieval of web pages over large document repositories. The features are classified using a Support Vector Machine (SVM) with different kernels. The experimental results show improved precision and recall for the proposed method compared with keyword-based search.
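
The idea of letting HTML tags signal content importance can be illustrated with a minimal sketch. It is not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the rule of taking the highest weight among enclosing tags, and the function names are all assumptions made for illustration. The inverse document frequency uses the common form log(N / df). The SVM classification step is not shown.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state any values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}


class WeightedTermParser(HTMLParser):
    """Counts terms, scaling each occurrence by the weight of its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: a term takes the highest weight among its enclosing tags.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[term] += weight


def weighted_counts(html):
    """Return tag-weighted term counts for one HTML page."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.counts


def tf_idf(pages):
    """Return one {term: score} dict per page, with idf = log(N / df)."""
    counts = [weighted_counts(page) for page in pages]
    df = Counter(term for c in counts for term in c)
    n = len(pages)
    return [{t: tf * math.log(n / df[t]) for t, tf in c.items()} for c in counts]
```

Under these assumptions, a term that appears only in one page's `<title>` scores higher than the same term in body text, and a term present in every page scores zero. The resulting per-page dictionaries could then be turned into feature vectors for a classifier.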

Index Terms

Computer Science
Information Sciences

Keywords

Web Mining
Feature extraction
Inverse document frequency
HTML Tag
Support Vector Machines