Detecting and Removing Noisy Data on Web Document using Text Density Approach

Hassan F. Eldirdiery; A. H. Ahmed

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 112 - Number 5

Year of Publication: 2015

Authors: Hassan F. Eldirdiery, A. H. Ahmed

DOI: 10.5120/19663-1328

Hassan F. Eldirdiery, A. H. Ahmed. Detecting and Removing Noisy Data on Web Document using Text Density Approach. International Journal of Computer Applications. 112, 5 (February 2015), 32-36. DOI=10.5120/19663-1328

@article{10.5120/19663-1328,
  author = {Hassan F. Eldirdiery and A. H. Ahmed},
  title = {Detecting and Removing Noisy Data on Web Document using Text Density Approach},
  journal = {International Journal of Computer Applications},
  issue_date = {February 2015},
  volume = {112},
  number = {5},
  month = {February},
  year = {2015},
  issn = {0975-8887},
  pages = {32-36},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume112/number5/19663-1328/},
  doi = {10.5120/19663-1328},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:48:39.426995+05:30
%A Hassan F. Eldirdiery
%A A. H. Ahmed
%T Detecting and Removing Noisy Data on Web Document using Text Density Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 112
%N 5
%P 32-36
%D 2015
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The content of web documents is a useful resource for many applications. With respect to a given application, however, this content can be divided into relevant and irrelevant content. Irrelevant content, such as advertisement banners, copyright information, and navigation menus, is regarded as noisy data. Noisy data within a web document degrades the performance of most applications that process web page content. Detecting and removing noisy data is therefore an important pre-processing step in many applications, such as web page classification, web page clustering, and information retrieval. We developed a unified algorithm that automatically detects noisy data, removes it from the web page, and produces a clean web document that can be used effectively in later steps. The suggested approach was examined using a dataset composed of different classes. The results of the conducted experiments showed a significant improvement in detecting and removing noisy data.
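The abstract does not spell out the algorithm, so the following is only a minimal sketch of a generic text density heuristic, not the authors' method. It assumes that a page can be split into blocks at block-level tags, that a block's density can be approximated as characters of text per inline tag, and that blocks dominated by link text are noise. The names `BlockCollector` and `extract_main_text`, the tag sets, and the thresholds `min_density` and `max_link_ratio` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets: tags that delimit blocks, and tags whose content is ignored.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table", "tr",
              "section", "article", "nav", "header", "footer",
              "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records simple per-block counts."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._tags = 0        # inline tags seen inside the current block
        self._link_chars = 0  # characters of text that sit inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block: normalize whitespace and store it if non-empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "tags": self._tags,
                                "link_chars": self._link_chars})
        self._text, self._tags, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        else:
            self._tags += 1
            if tag == "a":
                self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_density=20.0, max_link_ratio=0.5):
    """Keep blocks with high text density and a low share of link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for block in parser.blocks:
        length = len(block["text"])
        density = length / (block["tags"] + 1)     # characters per inline tag
        link_ratio = block["link_chars"] / length  # share of text inside links
        if density >= min_density and link_ratio <= max_link_ratio:
            kept.append(block["text"])
    return "\n".join(kept)
```

Under these assumptions, a navigation bar made of several short links has few characters per tag and a high link ratio, so it is dropped, while a long paragraph of prose is kept. The thresholds would need tuning on a labelled dataset before the sketch could be used in practice.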

Index Terms

Computer Science

Information Sciences

Keywords

Web Page Segmentation, Noise Removal, Information Extraction