Research Article

Duplicate Detection of Query Results from Multiple Web Databases

Published in 2011 by Hemalatha S, Raja K, Tholkappia Arasu
Computational Science - New Dimensions & Perspectives
Foundation of Computer Science USA
NCCSE - Number 2
2011
Authors: Hemalatha S, Raja K, Tholkappia Arasu

Hemalatha S, Raja K, Tholkappia Arasu. Duplicate Detection of Query Results from Multiple Web Databases. Computational Science - New Dimensions & Perspectives. NCCSE, 2 (2011), 71-75.

@article{
author = { Hemalatha S, Raja K, Tholkappia Arasu },
title = { Duplicate Detection of Query Results from Multiple Web Databases },
journal = { Computational Science - New Dimensions & Perspectives },
issue_date = { 2011 },
volume = { NCCSE },
number = { 2 },
year = { 2011 },
issn = { 0975-8887 },
pages = { 71-75 },
numpages = { 5 },
url = { /specialissues/nccse/number2/1863-165/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Special Issue Article
%1 Computational Science - New Dimensions & Perspectives
%A Hemalatha S
%A Raja K
%A Tholkappia Arasu
%T Duplicate Detection of Query Results from Multiple Web Databases
%J Computational Science - New Dimensions & Perspectives
%@ 0975-8887
%V NCCSE
%N 2
%P 71-75
%D 2011
%I International Journal of Computer Applications
Abstract

The results returned from multiple Web databases compose the deep or hidden Web, which is estimated to contain a far larger amount of high-quality, usually structured information than the static Web and to grow at a faster rate. For a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, an important task is to match the records from different sources that refer to the same real-world entity. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. For Web databases, however, the records are not available in advance: they are query-dependent and can be obtained only after the user submits a query. After removal of the same-source duplicates, the presumed non-duplicate records from the same source can be used as training examples. The method uses two classifiers, a weighted component similarity summing (WCSS) classifier and a Support Vector Machine (SVM) classifier that works along with a Gaussian mixture model (GMM), to identify the duplicates iteratively. The classifiers work cooperatively to identify the duplicate records. The complete GMM is parameterized by the mean vectors, covariance matrices, and mixture weights of all the records.
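The WCSS idea described above can be sketched in a few lines: each record field contributes a similarity score, the scores are combined by a weighted sum, and a threshold decides duplicate vs. non-duplicate. The field names, weights, threshold, and the use of Jaccard token similarity below are illustrative assumptions for the sketch, not the paper's exact formulation.

```python
# Minimal sketch of a weighted component similarity summing (WCSS) classifier.
# Fields, weights, threshold, and Jaccard similarity are illustrative choices.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def wcss_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities (weights assumed to sum to 1)."""
    return sum(w * jaccard(rec1.get(f, ""), rec2.get(f, ""))
               for f, w in weights.items())

def is_duplicate(rec1: dict, rec2: dict, weights: dict,
                 threshold: float = 0.75) -> bool:
    """Classify a record pair as duplicate if the weighted score clears the threshold."""
    return wcss_score(rec1, rec2, weights) >= threshold

# Illustrative weights: title dominates, author and year refine the decision.
weights = {"title": 0.6, "author": 0.3, "year": 0.1}
r1 = {"title": "Record Matching over Query Results", "author": "W Su", "year": "2010"}
r2 = {"title": "record matching over query results", "author": "W. Su", "year": "2010"}
r3 = {"title": "Fast Blocking Methods for Record Linkage", "author": "R Baxter", "year": "2003"}
```

In the iterative scheme the abstract outlines, pairs that WCSS labels with high confidence would then serve as training examples for the SVM, which refines the decision boundary on the remaining pairs.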

References
  1. W. Su, J. Wang, and F.H. Lochovsky, “Record Matching over Query Results from Multiple Web Databases,” IEEE Transactions on Knowledge and Data Engineering.
  2. R. Baxter, P. Christen, and T. Churches, “A Comparison of Fast Blocking Methods for Record Linkage,” Proc. KDD Workshop Data Cleaning, Record Linkage, and Object Consolidation, pp. 25-27, 2003.
  3. S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning,” Proc. ACM SIGMOD, pp. 313-324, 2003.
  4. P. Christen, T. Churches, and M. Hegland, “Febrl—A Parallel Open Source Data Linkage System,” Advances in Knowledge Discovery and Data Mining, pp. 638-647, Springer, 2004.
  5. O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S.E. Whang, and J. Widom, “Swoosh: A Generic Approach to Entity Resolution,” The VLDB J., vol. 18, no. 1, pp. 255-276, 2009.
  6. M. Bilenko and R.J. Mooney, “Adaptive Duplicate Detection Using Learnable String Similarity Measures,” Proc. ACM SIGKDD, pp. 39-48, 2003.
  7. P. Christen, “Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification,” Proc. ACM SIGKDD, pp. 151-159, 2008.
  8. W.E. Winkler, “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage,” Proc. Section Survey Research Methods, pp. 667-671, 1988.
  9. S. Chaudhuri, V. Ganti, and R. Motwani, “Robust Identification of Fuzzy Duplicates,” Proc. 21st IEEE Int’l Conf. Data Eng. (ICDE ’05), pp. 865-876, 2005.
  10. V.S. Verykios, A.K. Elmagarmid, and E.N. Houstis, “Automating the Approximate Record Matching Process,” Information Sciences, vol. 126, nos. 1-4, pp. 83-98, July 2000.
  11. A. Blum and T. Mitchell, “Combining Labeled and Unlabeled Data with Co-Training,” COLT ’98: Proc. 11th Ann. Conf. Computational Learning Theory, pp. 92-100, 1998.
  12. W.W. Cohen and J. Richman, “Learning to Match and Cluster Large High-Dimensional Datasets for Data Integration,” Proc. ACM SIGKDD, pp. 475-480, 2002.
  13. P. Christen and K. Goiser, “Quality and Complexity Measures for Data Linkage and Deduplication,” Quality Measures in Data Mining, F. Guillet and H. Hamilton, eds., vol. 43, pp. 127-151, Springer, 2007.
  14. W.W. Cohen, H. Kautz, and D. McAllester, “Hardening Soft Information Sources,” Proc. ACM SIGKDD, pp. 255-259, 2000.
  15. D. Reynolds, “Gaussian Mixture Models,” MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02140, USA.
Index Terms

Computer Science
Information Sciences

Keywords

Gaussian Mixture Model, supervised learning, Support Vector Machine