Research Article

Detecting and Removing Noisy Data on Web Document using Text Density Approach

by Hassan F. Eldirdiery, A. H. Ahmed
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 112 - Number 5
Year of Publication: 2015
DOI: 10.5120/19663-1328

Hassan F. Eldirdiery, A. H. Ahmed. Detecting and Removing Noisy Data on Web Document using Text Density Approach. International Journal of Computer Applications. 112, 5 (February 2015), 32-36. DOI=10.5120/19663-1328

@article{ 10.5120/19663-1328,
author = { Hassan F. Eldirdiery and A. H. Ahmed },
title = { Detecting and Removing Noisy Data on Web Document using Text Density Approach },
journal = { International Journal of Computer Applications },
issue_date = { February 2015 },
volume = { 112 },
number = { 5 },
month = { February },
year = { 2015 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume112/number5/19663-1328/ },
doi = { 10.5120/19663-1328 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:48:39.426995+05:30
%A Hassan F. Eldirdiery
%A A. H. Ahmed
%T Detecting and Removing Noisy Data on Web Document using Text Density Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 112
%N 5
%P 32-36
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The content of web documents is a useful resource for many applications. However, this content can be classified as relevant or irrelevant with respect to the application at hand. Irrelevant content, such as advertisement banners, copyright information, and navigation menus, is regarded as noisy data. Noisy data found within a web document degrades the performance of most applications that process web page content. Detecting and removing noisy data is therefore an important pre-processing step in many applications, such as web page classification, web page clustering, and information retrieval. We developed a unified algorithm that automatically detects noisy data, removes it from the web page, and produces a clean web document that can be used effectively in later steps. The suggested approach was evaluated on a dataset composed of different classes. The results of the conducted experiments showed a significant improvement in detecting and removing noisy data.
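
The abstract does not spell out the algorithm, so the following is only a minimal sketch of a generic text-density heuristic, not the authors' method. It assumes that a page can be split into blocks at block-level tags, that a block's density can be approximated as text characters per inline tag, and that navigation-like blocks consist mostly of link text. The names `BlockSplitter`, `remove_noise`, and `SAMPLE_PAGE`, the tag set, and both thresholds are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that delimit blocks; a real system would tune this.
BLOCK_TAGS = {"body", "div", "p", "td", "tr", "table", "li", "ul", "ol",
              "h1", "h2", "h3", "section", "article", "nav", "header", "footer"}


class BlockSplitter(HTMLParser):
    """Split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: (text, link_chars, tag_count)
        self._text = []       # text fragments of the block being built
        self._link_chars = 0  # characters that sit inside <a> elements
        self._tags = 0        # inline tags seen inside the current block
        self._in_link = 0     # nesting depth of <a>
        self._skip = 0        # nesting depth of <script>/<style>

    def flush(self):
        """Close the current block and store it if it holds any text."""
        text = " ".join(self._text)
        if text:
            self.blocks.append((text, self._link_chars, self._tags))
        self._text, self._link_chars, self._tags = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        else:
            self._tags += 1
            if tag == "a":
                self._in_link += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip or not stripped:
            return
        self._text.append(stripped)
        if self._in_link:
            self._link_chars += len(stripped)


def remove_noise(html, min_density=20.0, max_link_ratio=0.5):
    """Return the text of blocks judged to be content rather than noise.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # store any trailing block
    kept = []
    for text, link_chars, tags in parser.blocks:
        density = len(text) / (tags + 1)     # text characters per inline tag
        link_ratio = link_chars / len(text)  # share of text inside links
        if density >= min_density and link_ratio <= max_link_ratio:
            kept.append(text)
    return kept


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="main"><p>Noisy data found within a web document degrades the
performance of applications that process web page content.</p></div>
<div id="footer">Copyright 2015</div>
</body></html>
"""
```

Calling `remove_noise(SAMPLE_PAGE)` on this toy page keeps only the paragraph in the main block: the navigation block is mostly link text with a low density, and the short copyright line falls below the density threshold. Real pages would need thresholds tuned on labelled data.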

Index Terms

Computer Science
Information Sciences

Keywords

Web Page Segmentation, Noise Removal, Information Extraction