Research Article

Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm

Published on December 2014 by Amita Fulsundar
Innovations and Trends in Computer and Communication Engineering
Foundation of Computer Science USA
ITCCE - Number 1
December 2014
Authors: Amita Fulsundar

Amita Fulsundar. Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm. Innovations and Trends in Computer and Communication Engineering. ITCCE, 1 (December 2014), 1-4.

@article{
author = { Amita Fulsundar },
title = { Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm },
journal = { Innovations and Trends in Computer and Communication Engineering },
issue_date = { December 2014 },
volume = { ITCCE },
number = { 1 },
month = { December },
year = { 2014 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = 4,
url = { /proceedings/itcce/number1/19037-2001/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 Innovations and Trends in Computer and Communication Engineering
%A Amita Fulsundar
%T Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm
%J Innovations and Trends in Computer and Communication Engineering
%@ 0975-8887
%V ITCCE
%N 1
%P 1-4
%D 2014
%I International Journal of Computer Applications
Abstract

The goal of the data mining process is to extract information from various data sources. Documents from different sources may structure their data differently yet represent the same conceptual information. Duplicate detection addresses this problem: it identifies multiple representations of the same real-world entity across data sources, and it is a necessary task in data cleansing. Many algorithms have been proposed for detecting duplicates in relational data, but very few solutions focus on hierarchical data such as XML. XMLDup is a method for duplicate detection in XML data. It uses a Bayesian network to evaluate the probability that two XML elements are duplicates, considering not only the content within the elements but also the way that content is structured. To improve the run-time efficiency of network evaluation, a lossless pruning strategy is used. The algorithm achieves high accuracy and recall scores on several data sets. XMLDup delivers state-of-the-art duplicate detection performance in terms of both effectiveness and efficiency.
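
The pruning idea in the abstract can be illustrated with a much simplified sketch. It is not the XMLDup Bayesian network: it assumes flat elements, uses a plain string-similarity ratio as a stand-in for leaf probabilities, and combines them by multiplication. The function names, the `fields` list, and the 0.8 threshold are all hypothetical. Because every factor is at most 1, the running product is an upper bound on the final score, so evaluation can stop as soon as that bound falls below the threshold without changing the decision.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity of two strings in [0, 1] (stand-in for a leaf probability)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def duplicate_score(e1, e2, fields, threshold):
    """Multiply per-field similarities, stopping once the pair cannot pass.

    Returns (score, evaluated): the running product and the number of fields
    actually compared. Each factor is at most 1, so the running product is an
    upper bound on the final score.
    """
    score = 1.0
    evaluated = 0
    for field in fields:
        # A field missing from an element is treated as an empty string
        # (an assumption of this sketch).
        v1 = e1.findtext(field, default="")
        v2 = e2.findtext(field, default="")
        score *= text_similarity(v1, v2)
        evaluated += 1
        if score < threshold:
            break  # remaining fields can only lower the score further
    return score, evaluated


def is_duplicate(e1, e2, fields, threshold=0.8):
    """Classify a pair as duplicates when the combined score reaches the threshold."""
    score, _ = duplicate_score(e1, e2, fields, threshold)
    return score >= threshold


# Hypothetical example data.
a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>the matrix</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")
fields = ["title", "year"]

print(is_duplicate(a, b, fields))
print(duplicate_score(a, c, fields, 0.8))
```

In this sketch the comparison of `a` and `c` stops after the title field, since the score is already below the threshold and the year cannot raise it. The pruning is lossless in the sense that the duplicate/non-duplicate decision matches what a full evaluation of all fields would give.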

Index Terms

Computer Science
Information Sciences

Keywords

Duplicate Detection, XML, Bayesian Networks, Data Cleaning, Optimization