Research Article

Unsupervised Technique for Web Data Extraction: Trinity

by Sayali Khodade, Nilav Mukherjee
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 115 - Number 19
Year of Publication: 2015
Authors: Sayali Khodade, Nilav Mukherjee
DOI: 10.5120/20263-2668

Sayali Khodade, Nilav Mukherjee. Unsupervised Technique for Web Data Extraction: Trinity. International Journal of Computer Applications. 115, 19 (April 2015), 43-48. DOI=10.5120/20263-2668

@article{ 10.5120/20263-2668,
author = { Sayali Khodade and Nilav Mukherjee },
title = { Unsupervised Technique for Web Data Extraction: Trinity },
journal = { International Journal of Computer Applications },
issue_date = { April 2015 },
volume = { 115 },
number = { 19 },
month = { April },
year = { 2015 },
issn = { 0975-8887 },
pages = { 43-48 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume115/number19/20263-2668/ },
doi = { 10.5120/20263-2668 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:55:20.833578+05:30
%A Sayali Khodade
%A Nilav Mukherjee
%T Unsupervised Technique for Web Data Extraction: Trinity
%J International Journal of Computer Applications
%@ 0975-8887
%V 115
%N 19
%P 43-48
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

A search engine is a program that searches for specific information in a huge amount of data. The technique presented in this article aims to return such results effectively and in less time. It works on two or more web documents that are generated from the same server-side template. The technique does not itself provide any relevant data; instead, it searches for shared patterns, separates the input into three sub-parts, applies different ranking functions, and stores the result in a database. Compared with other techniques, its effectiveness is not negatively affected by the input documents, and it returns results in less time and in exact form.
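
The shared-pattern step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it works on characters rather than HTML tokens, performs a single split instead of a recursive one, and leaves out the ranking functions and the database. The function names `longest_shared_pattern` and `split_three_ways` are hypothetical, and the sketch assumes that the three sub-parts are the text before the first occurrence of the pattern, the text between its first and last occurrence, and the text after the last occurrence.

```python
from typing import List, Optional, Tuple


def longest_shared_pattern(docs: List[str], min_len: int = 1) -> Optional[str]:
    """Return the longest substring found in every document, or None.

    Brute-force search, meant only to illustrate the idea on short inputs.
    """
    if not docs:
        return None
    shortest = min(docs, key=len)
    # Try the longest candidates first so the first hit is the longest one.
    for size in range(len(shortest), min_len - 1, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in doc for doc in docs):
                return candidate
    return None


def split_three_ways(doc: str, pattern: str) -> Tuple[str, str, str]:
    """Split doc around a pattern that is known to occur in it.

    Assumed layout (an illustration, not taken from the article):
    text before the first occurrence, text between the first and last
    occurrence, and text after the last occurrence.
    """
    first = doc.find(pattern)
    last = doc.rfind(pattern)
    before = doc[:first]
    # Empty when the pattern occurs only once.
    between = doc[first + len(pattern):last] if last > first else ""
    after = doc[last + len(pattern):]
    return before, between, after


if __name__ == "__main__":
    # Two hypothetical pages produced by the same template.
    pages = ["Lamp | Price: 10", "Desk | Price: 25"]
    shared = longest_shared_pattern(pages)
    for page in pages:
        print(split_three_ways(page, shared))
```

In this toy input the template text is the shared pattern, and the parts that differ between the two pages (the item name and the price) end up in the sub-parts on either side of it. These are the candidates for extracted data.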

Index Terms

Computer Science
Information Sciences

Keywords

Web Data Extractor
Automatic Wrapper Generation
Wrapper
Unsupervised Technique