Research Article

Noise Reduction and Content Retrieval from Web Pages

by Surabhi Lingwal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 73 - Number 4
Year of Publication: 2013
Authors: Surabhi Lingwal
10.5120/12729-9573

Surabhi Lingwal. Noise Reduction and Content Retrieval from Web Pages. International Journal of Computer Applications. 73, 4 (July 2013), 24-30. DOI=10.5120/12729-9573

@article{ 10.5120/12729-9573,
author = { Surabhi Lingwal },
title = { Noise Reduction and Content Retrieval from Web Pages },
journal = { International Journal of Computer Applications },
issue_date = { July 2013 },
volume = { 73 },
number = { 4 },
month = { July },
year = { 2013 },
issn = { 0975-8887 },
pages = { 24-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume73/number4/12729-9573/ },
doi = { 10.5120/12729-9573 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:39:10.555941+05:30
%A Surabhi Lingwal
%T Noise Reduction and Content Retrieval from Web Pages
%J International Journal of Computer Applications
%@ 0975-8887
%V 73
%N 4
%P 24-30
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The World Wide Web is a rapidly growing and widely accessible source of information. It offers content from many fields that can be valuable to users, including multimedia data and structured, semi-structured and unstructured data. However, only part of this information is useful for a particular application; the rest is considered noise. Web pages contain formatting code, advertisements, navigation links and similar elements. This mixture of unwanted noise and real content complicates automatic information extraction and processing, and it can ruin the effectiveness of Web mining techniques unless useful, noise-free information is extracted first. This paper proposes a novel method to filter web pages and retrieve their actual content, removing noise from a given web page in order to improve the performance of web content mining. First, the web page is divided into blocks, which are then tokenized to separate the informative content from noise. The paper presents an algorithm for removing noise from a web page and automatically extracting the important web content, as well as an algorithm for global noise removal.
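
The block-then-tokenize idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify: the scoring rule below (a minimum token count plus a maximum share of tokens inside links) is a common link-density heuristic swapped in for illustration, and the names `BlockSplitter`, `extract_content`, `min_tokens` and `max_link_ratio` are all hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "blocks"; the paper may segment pages differently.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}  # formatting/code noise that is never content


class BlockSplitter(HTMLParser):
    """Split a page into text blocks and count the tokens that sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (tokens, link_token_count)
        self._tokens = []
        self._link_tokens = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any tokens, and start a new one.
        if self._tokens:
            self.blocks.append((self._tokens, self._link_tokens))
        self._tokens, self._link_tokens = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()  # whitespace tokenization
        self._tokens.extend(words)
        if self._in_link:
            self._link_tokens += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_tokens=5, max_link_ratio=0.5):
    """Return the text of blocks that look informative under the heuristic."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    kept = []
    for tokens, link_tokens in parser.blocks:
        # Hypothetical rule: short blocks and link-heavy blocks count as noise.
        if len(tokens) >= min_tokens and link_tokens / len(tokens) <= max_link_ratio:
            kept.append(" ".join(tokens))
    return kept
```

Under these assumed thresholds, a navigation bar made only of links or a two-word advertisement block would be dropped, while a paragraph of running text would be kept. The thresholds are placeholders and would need tuning against labeled pages.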

Index Terms

Computer Science
Information Sciences

Keywords

Web content mining, content retrieval, noises, outlier, redundancy, precision, recall, accuracy