Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm

Research Article

Published in December 2014 by Amita Fulsundar

Innovations and Trends in Computer and Communication Engineering

Foundation of Computer Science USA

ITCCE - Number 1

December 2014

Authors: Amita Fulsundar

Amita Fulsundar. Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm. Innovations and Trends in Computer and Communication Engineering. ITCCE, 1 (December 2014), 1-4.

@article{
author = { Amita Fulsundar },
title = { Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm },
journal = { Innovations and Trends in Computer and Communication Engineering },
issue_date = { December 2014 },
volume = { ITCCE },
number = { 1 },
month = { December },
year = { 2014 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = { 4 },
url = { /proceedings/itcce/number1/19037-2001/ },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}

%0 Proceeding Article
%1 Innovations and Trends in Computer and Communication Engineering
%A Amita Fulsundar
%T Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm
%J Innovations and Trends in Computer and Communication Engineering
%@ 0975-8887
%V ITCCE
%N 1
%P 1-4
%D 2014
%I International Journal of Computer Applications

Abstract

The goal of data mining is to extract information from various data sources. Different sources can provide documents whose data differ in structure yet represent the same conceptual information. Duplicate detection addresses this problem: it identifies multiple representations of the same real-world entity across data sources, determining whether two data items are duplicates or not, and it is a necessary task in data cleansing. Many algorithms have been proposed for detecting duplicates in relational data, but very few solutions focus on hierarchical data such as XML. A dedicated method, XMLDup, has been introduced for duplicate detection in XML data. XMLDup uses a Bayesian network to evaluate the probability that two XML elements are duplicates. It considers not only the content within the elements but also the way that content is structured. To improve the run-time efficiency of network evaluation, a lossless pruning strategy is used. The algorithm achieves high accuracy and recall scores on several data sets. XMLDup delivers state-of-the-art duplicate detection in terms of both effectiveness and efficiency.
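The idea behind threshold-based pruning can be illustrated with a minimal Python sketch. This is not the XMLDup implementation: it replaces the Bayesian network with a simple product of similarity factors, and the names `text_similarity`, `duplicate_score`, the `stats` counter, the sample records, and the 0.5 threshold are all assumptions made for illustration. Because every factor is at most 1, the running product is an upper bound on the final score, so evaluation can stop as soon as that bound falls below the threshold without changing the duplicate/non-duplicate decision.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity of two text values in [0, 1] (stand-in for any field measure)."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def duplicate_score(e1, e2, threshold=0.0, stats=None):
    """Score in [0, 1] that two elements describe the same entity.

    Returns the exact score when it is >= threshold. Otherwise it may return
    early with an upper bound that is already below the threshold.
    """
    if e1.tag != e2.tag:
        return 0.0
    if len(e1) == 0 and len(e2) == 0:
        # Leaf elements: compare their text content.
        if stats is not None:
            stats["leaf_comparisons"] = stats.get("leaf_comparisons", 0) + 1
        return text_similarity(e1.text, e2.text)
    # Structure matters: only child tags present in both elements are compared.
    shared = sorted({c.tag for c in e1} & {c.tag for c in e2})
    if not shared:
        return 0.0
    score = 1.0
    for tag in shared:
        # Same-tag children may repeat, so keep the best-matching pair.
        factor = max(
            duplicate_score(c1, c2, 0.0, stats)
            for c1 in e1.findall(tag)
            for c2 in e2.findall(tag)
        )
        score *= factor
        if score < threshold:
            return score  # remaining factors cannot raise the score: prune
    return score


# Hypothetical sample records; child order differs between A and B.
A = ET.fromstring(
    "<movie><title>The Matrix</title><year>1999</year>"
    "<director>Wachowski</director></movie>"
)
B = ET.fromstring(
    "<movie><year>1999</year><title>Matrix</title>"
    "<director>Wachowskis</director></movie>"
)
C = ET.fromstring(
    "<movie><title>Toy Story</title><year>1995</year>"
    "<director>Lasseter</director></movie>"
)

THRESHOLD = 0.5  # assumed cut-off for calling a pair duplicates
for name, other in (("B", B), ("C", C)):
    stats = {}
    score = duplicate_score(A, other, THRESHOLD, stats)
    print(name, score >= THRESHOLD, stats["leaf_comparisons"])
```

In this sketch the pair (A, C) is rejected after a single leaf comparison, while the pair (A, B) is evaluated in full. A real system would typically also order the factors so that the cheapest or most discriminative ones are evaluated first, which makes pruning trigger earlier.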

Index Terms

Computer Science

Information Sciences

Keywords

Duplicate Detection, XML, Bayesian Networks, Data Cleaning, Optimization