Research Article

A Survey of Web Information Extraction Tools

by Noha Negm, Passent Elkafrawy, Abdel Badea Salem
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 43 - Number 7
Year of Publication: 2012

Noha Negm, Passent Elkafrawy, Abdel Badea Salem. A Survey of Web Information Extraction Tools. International Journal of Computer Applications. 43, 7 (April 2012), 19-27. DOI=10.5120/6115-8296

@article{ 10.5120/6115-8296,
author = { Noha Negm and Passent Elkafrawy and Abdel Badea Salem },
title = { A Survey of Web Information Extraction Tools },
journal = { International Journal of Computer Applications },
issue_date = { April 2012 },
volume = { 43 },
number = { 7 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 19-27 },
numpages = {9},
url = { },
doi = { 10.5120/6115-8296 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:32:48.210711+05:30
%A Noha Negm
%A Passent Elkafrawy
%A Abdel Badea Salem
%T A Survey of Web Information Extraction Tools
%J International Journal of Computer Applications
%@ 0975-8887
%V 43
%N 7
%P 19-27
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Access to the huge number of information sources on the internet has been limited to browsing and searching because web information sources are heterogeneous and lack structure. This has created a need for automated Web Information Extraction (IE) tools that analyze Web pages and harvest useful information from noisy content for further analysis. The goal of this survey is to provide a comprehensive review of the major Web IE tools that are used for Web text and are based on the Document Object Model for representing web pages. The paper compares them along three dimensions: (1) the source of content extraction, (2) the techniques used, and (3) the features of the tools, together with the advantages and disadvantages of each tool. Based on this survey, we can decide which Web IE tool is suitable for integration into our future work in Web Text Mining.

Index Terms

Computer Science
Information Sciences


Knowledge Engineering, Document Engineering, Information Extraction, Document Object Model, Web Documents