Research Article

Trend of Supervised Web Data Extraction

by Galih Hendro Martono, Azhari Azhari, Khabib Mustafa
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 180 - Number 19
Year of Publication: 2018
Authors: Galih Hendro Martono, Azhari Azhari, Khabib Mustafa
DOI: 10.5120/ijca2018916431

Galih Hendro Martono, Azhari Azhari, Khabib Mustafa. Trend of Supervised Web Data Extraction. International Journal of Computer Applications. 180, 19 (Feb 2018), 13-20. DOI=10.5120/ijca2018916431

@article{ 10.5120/ijca2018916431,
author = { Galih Hendro Martono, Azhari Azhari, Khabib Mustafa },
title = { Trend of Supervised Web Data Extraction },
journal = { International Journal of Computer Applications },
issue_date = { Feb 2018 },
volume = { 180 },
number = { 19 },
month = { Feb },
year = { 2018 },
issn = { 0975-8887 },
pages = { 13-20 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume180/number19/29039-2018916431/ },
doi = { 10.5120/ijca2018916431 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:01:06.224274+05:30
%A Galih Hendro Martono
%A Azhari Azhari
%A Khabib Mustafa
%T Trend of Supervised Web Data Extraction
%J International Journal of Computer Applications
%@ 0975-8887
%V 180
%N 19
%P 13-20
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The web has evolved since it was first developed in 1990 and has grown rapidly ever since. According to http://www.worldwidewebsize.com, it currently contains at least 4.54 billion pages. At that scale, the web stores a great deal of information that can be put to use. The challenge of making use of that information gives rise to the concept of data extraction. Web data extraction aims to retrieve the contents of websites so that they can easily be used for other purposes. Applications include product catalogs, news, bookstores, travel, and more. Many systems have been built using different techniques: manual, supervised, unsupervised, and semi-supervised. This paper discusses supervised learning techniques for web data extraction. Several previous surveys have reviewed wrapper induction systems that use supervised techniques to extract web data, covering work up to 2007. The aim of this paper is to present a comprehensive overview of research in supervised web data extraction by providing the latest research results.

Index Terms

Computer Science
Information Sciences

Keywords

Extraction, web data, supervised learning, wrapper, machine learning, wrapper induction system