Research Article

An Efficient Method of Web Page Noise Cleaning for Effective Web Mining

by S. S. Bhamare, B. V. Pawar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 146 - Number 3
Year of Publication: 2016
10.5120/ijca2016910657

S. S. Bhamare, B. V. Pawar. An Efficient Method of Web Page Noise Cleaning for Effective Web Mining. International Journal of Computer Applications. 146, 3 (Jul 2016), 18-22. DOI=10.5120/ijca2016910657

@article{ 10.5120/ijca2016910657,
author = { S. S. Bhamare and B. V. Pawar },
title = { An Efficient Method of Web Page Noise Cleaning for Effective Web Mining },
journal = { International Journal of Computer Applications },
issue_date = { Jul 2016 },
volume = { 146 },
number = { 3 },
month = { Jul },
year = { 2016 },
issn = { 0975-8887 },
pages = { 18-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume146/number3/25378-2016910657/ },
doi = { 10.5120/ijca2016910657 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:49:18.348442+05:30
%A S. S. Bhamare
%A B. V. Pawar
%T An Efficient Method of Web Page Noise Cleaning for Effective Web Mining
%J International Journal of Computer Applications
%@ 0975-8887
%V 146
%N 3
%P 18-22
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In the huge network of the World Wide Web, web pages contain a large amount of information. Web researchers often need the main content (e.g., an article text) of web pages to be gathered, processed, and stored quickly and efficiently. Mining data on the Web has become a major task for locating useful information. Web pages that carry useful information usually also contain large amounts of noise, such as navigation bars, links, advertisements, and copyright notices. The performance of Web mining can be improved by identifying and removing this noise from Web pages. In this paper, a new method is proposed that removes noise content tags and extracts the information in main content tags from web pages.
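
As a rough illustration of tag-based noise cleaning, the sketch below drops every subtree rooted at a black-listed tag and keeps the remaining text. It is not the authors' method: the tag list, the class name NoiseCleaner, and the function extract_main_text are assumptions made for this example, and the depth counting assumes well-formed HTML in which every non-void tag is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical black list: tags whose whole subtree is treated as noise.
BLACK_LISTED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NoiseCleaner(HTMLParser):
    """Collects text that lies outside black-listed subtrees."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside a black-listed subtree
        self.chunks = []      # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Enter a noise subtree, or go one level deeper inside one.
        if self.noise_depth or tag in BLACK_LISTED_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.noise_depth:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.noise_depth:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of the page with black-listed subtrees removed."""
    cleaner = NoiseCleaner()
    cleaner.feed(html)
    cleaner.close()
    return " ".join(cleaner.chunks)


page = (
    "<html><body>"
    "<nav><a href='/'>Home</a><br></nav>"
    "<div><p>Main article text.</p></div>"
    "<footer>Copyright 2016</footer>"
    "</body></html>"
)
main_text = extract_main_text(page)
print(main_text)
```

A fixed black list is the simplest possible choice; real pages often place noise inside generic tags such as div, which this sketch would not catch.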

Index Terms

Computer Science
Information Sciences

Keywords

WPNC, Noise Block, HTML Tag, White Listed Tags, HDT, LDT, Black Listed Tags