International Journal of Computer Applications |
Foundation of Computer Science (FCS), NY, USA |
Volume 122 - Number 12 |
Year of Publication: 2015 |
Authors: Manjusha R. Pawar, J. V. Shinde |
10.5120/21751-5018 |
Manjusha R. Pawar, J. V. Shinde . Efhcient Duplicate Detection and Elimination in Hierarchical Multimedia Data. International Journal of Computer Applications. 122, 12 ( July 2015), 15-21. DOI=10.5120/21751-5018
Today's important task is to clean data in data warehouses which has complex hierarchical structure. This is possibly done by detecting duplicates in large databases to increase the efficiency of data mining and to make it effective. Recently new algorithms are proposed that consider relations in a single table; hence by comparing records pairwise they can easily find out duplications. But now a day the data is being stored in more complex and semi-structured or hierarchical structure and the problem arose is how to detect duplicates on XML data. Also due to differences between various data models, the algorithms which are for single relations cannot be applied on XML data. The objective of this project is to detect duplicates in hierarchical data which contain textual data and multimedia data like images, audio and video. It also focuses on eliminating the duplicates by using elimination technique such as delete. Here Bayesian network is used with modified pruning algorithm for duplicate detection, and experiments are performed on both artificial and real world datasets. The new XMLMultiDup method is able to perform duplicate detection with high efficiency and effectiveness on multimedia datasets. This method compares each level of XML tree from root to the leaves computing probabilities of similarity by assigning weights. It goes through the comparison of structure, each descendant of both datasets and find duplicates despite difference in data.