A Survey of Web Information Extraction Tools

Noha Negm; Passent Elkafrawy; Abdel Badea Salem

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 43 - Number 7

Year of Publication: 2012

DOI: 10.5120/6115-8296

Noha Negm, Passent Elkafrawy, Abdel Badea Salem. A Survey of Web Information Extraction Tools. International Journal of Computer Applications. 43, 7 (April 2012), 19-27. DOI=10.5120/6115-8296

@article{10.5120/6115-8296,
  author = {Noha Negm and Passent Elkafrawy and Abdel Badea Salem},
  title = {A Survey of Web Information Extraction Tools},
  journal = {International Journal of Computer Applications},
  issue_date = {April 2012},
  volume = {43},
  number = {7},
  month = {April},
  year = {2012},
  issn = {0975-8887},
  pages = {19-27},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume43/number7/6115-8296/},
  doi = {10.5120/6115-8296},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

Abstract

Access to the huge amount of information on the Internet has been limited to browsing and searching because Web information sources are heterogeneous and lack structure. This has created a need for automated Web Information Extraction (IE) tools that analyze Web pages and harvest useful information from noisy content for further analysis. The goal of this survey is to provide a comprehensive review of the major Web IE tools that are used for Web text and are based on the Document Object Model for representing Web pages. The paper compares them along three dimensions: (1) the source of content extraction, (2) the techniques used, and (3) the features of the tools. It also discusses the advantages and disadvantages of each tool. Based on this survey, we can decide which Web IE tool is suitable for integration into our future work in Web Text Mining.

Index Terms

Computer Science

Information Sciences

Keywords

Knowledge Engineering, Document Engineering, Information Extraction, Document Object Model, Web Documents