Research Article

A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine

by M. Padmanaban, T. Bhuvaneswari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 61 - Number 12
Year of Publication: 2013
Authors: M. Padmanaban, T. Bhuvaneswari
10.5120/9977-4100

M. Padmanaban, T. Bhuvaneswari. A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine. International Journal of Computer Applications 61, 12 (January 2013), 1-9. DOI=10.5120/9977-4100

@article{ 10.5120/9977-4100,
author = { M. Padmanaban, T. Bhuvaneswari },
title = { A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 61 },
number = { 12 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 1-9 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume61/number12/9977-4100/ },
doi = { 10.5120/9977-4100 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A M. Padmanaban
%A T. Bhuvaneswari
%T A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 12
%P 1-9
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Consequently, private and government organizations have invested significantly in methods for removing replicas from their data repositories. In our previous work, duplicate record detection was performed by generating a feature vector from three different similarity measures and then using a neural network to identify the duplicate records. In this paper, we develop the Q-gram concept with a support vector machine for the deduplication process. Three similarity functions are used for similarity measurement: the Dice coefficient, the Damerau–Levenshtein distance, and the Tversky index. A set of data generated from these similarity measures is used as the input to the proposed system, and a support vector machine then tests whether a data record is a duplicate or not. Two phases characterize the proposed deduplication technique: the training phase and the testing phase. The experimental results showed that the proposed deduplication technique has higher accuracy than the existing method, achieving an accuracy of 88%.
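The feature-extraction step described in the abstract, mapping each candidate record pair to a similarity vector (Dice coefficient and Tversky index over q-gram sets, plus the Damerau–Levenshtein distance) that an SVM classifier would then label as duplicate or non-duplicate, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of q=2, and the Tversky weights are assumptions.

```python
# Illustrative sketch of the similarity-feature step; names, q=2, and the
# Tversky alpha/beta weights are assumptions, not values from the paper.

def qgrams(s, q=2):
    """Return the set of q-grams (length-q substrings) of a string."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)} if len(s) >= q else {s}

def dice_coefficient(a, b, q=2):
    """Dice similarity over the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def tversky_index(a, b, q=2, alpha=0.5, beta=0.5):
    """Tversky similarity over q-gram sets; alpha = beta = 0.5 reduces to Dice."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    inter = len(ga & gb)
    denom = inter + alpha * len(ga - gb) + beta * len(gb - ga)
    return inter / denom if denom else 1.0

def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein edit distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def feature_vector(rec_a, rec_b):
    """Similarity features for one candidate record pair."""
    return [dice_coefficient(rec_a, rec_b),
            tversky_index(rec_a, rec_b),
            damerau_levenshtein(rec_a, rec_b)]

print(feature_vector("John Smith", "Jon Smith"))
```

In the training and testing phases described in the abstract, such feature vectors (computed for labeled duplicate and non-duplicate pairs) would be fed to an off-the-shelf SVM implementation, for example scikit-learn's `svm.SVC`, to fit the classifier and then predict labels for unseen pairs.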

References
  1. Moises G. de Carvalho, Alberto H. F. Laender, Marcos Andre Goncalves, Altigran S. da Silva, "A Genetic Programming Approach to Record Deduplication", IEEE Transactions on Knowledge and Data Engineering, 2011.
  2. Luís Leitão and Pável Calado, "Duplicate detection through structure optimization", ACM International Conference on Information and Knowledge Management, pp. 443-452, 2011.
  3. M. Ektefa, F. Sidi, H. Ibrahim, M. A. Jabar, S. Memar, A. Ramli, "A threshold-based similarity measure for duplicate detection", IEEE Conference on Open Systems, pp. 37-41, 2011.
  4. M. Elhadi, A. Al-Tobi, "Duplicate Detection in Documents and Web Pages Using Improved Longest Common Subsequence and Documents' Syntactical Structures", International Conference on Computer Sciences and Convergence Information Technology, pp. 679-684, 2009.
  5. Dutch T. Meyer and William J. Bolosky, "A Study of Practical Deduplication", Computer and Information Science, pp. 1-13, 2011.
  6. Danny Harnik, Benny Pinkas, Alexandra Shulman-Peleg, "Side channels in cloud services: the case of deduplication in cloud storage", vol. 8, no. 6, pp. 40-47, 2010.
  7. Yujuan Tan, Hong Jiang, Dan Feng, Lei Tian, Zhichao Yan, Guohui Zhou, "SAM: A Semantic-Aware Multi-Tiered Source De-duplication Framework for Cloud Backup", International Conference on Parallel Processing (ICPP), pp. 614-623, 2010.
  8. N. Koudas, S. Sarawagi, and D. Srivastava, "Record linkage: similarity measures and algorithms", in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 802-803, 2006.
  9. C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki, "HYDRAstor: a scalable secondary storage", in Proc. 7th USENIX Conference on File and Storage Technologies, 2009.
  10. C. Ungureanu, B. Atkin, A. Aranya, S. Gokhale, S. Rago, G. Cakowski, C. Dubnicki, and A. Bohra, "HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system", in Proc. 8th USENIX Conference on File and Storage Technologies, 2010.
  11. W. Bolosky, S. Corbin, D. Goebel, and J. Douceur, "Single instance storage in Windows 2000", in Proc. 4th USENIX Windows Systems Symposium, 2000.
  12. S. Dorward and S. Quinlan, "Venti: A new approach to archival data storage", in Proc. 1st USENIX Conference on File and Storage Technologies, 2002.
  13. H. S. Gunawi, N. Agrawal, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and J. Schindler, "Deconstructing commodity storage clusters", in ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, IEEE Computer Society, pp. 60-71, 2005.
  14. J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer, "Reclaiming space from duplicate files in a serverless distributed file system", International Conference on Distributed Computing Systems, p. 617, 2002.
  15. S. Quinlan and S. Dorward, "Venti: a new approach to archival storage", in First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.
  16. A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system", in Symposium on Operating Systems Principles, pp. 174-187, 2001.
  17. M. Vrable, S. Savage, and G. M. Voelker, "Cumulus: Filesystem Backup to the Cloud", in FAST '09, Feb. 2009.
  18. Syncsort Backup Express and NetApp, http://www.syncsort.com
  19. EMC Avamar, http://www.emc.com
  20. NetBackup PureDisk, http://www.symantec.com
  21. CommVault Simpana, http://www.commvault.com
  22. B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system", in FAST '08, Feb. 2008.
  23. S. Rhea, R. Cox, and A. Pesterev, "Fast, inexpensive content-addressed storage in Foundation", in USENIX '08, Jun. 2008.
  24. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Campbell, "Sparse Indexing: Large scale, inline deduplication using sampling and locality", in FAST '09, Feb. 2009.
  25. D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup", HP Laboratories, Tech. Rep. HPL-2009-10R2, Sep. 2009.
  26. I. Guyon, J. Weston, S. Barnhill, V. Vapnik, "Gene Selection for Cancer Classification using Support Vector Machines", Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
  27. J. Zhang, Y. Liu, "Cervical Cancer Detection Using SVM Based Feature Screening", Proc. of the 7th Medical Image Computing and Computer-Assisted Intervention, vol. 2, pp. 873-880, 2004.
  28. K. Zhang, H. X. Cao, H. Yan, "Application of support vector machines on network abnormal intrusion detection", Application Research of Computers, vol. 5, pp. 98-100, 2006.
  29. D. Kim, J. Park, "Network-based intrusion detection with support vector machines", Lecture Notes in Computer Science, vol. 2662, pp. 747-756, 2003.
  30. http://www.cs.utexas.edu/users/ml/riddle/data.html
Index Terms

Computer Science
Information Sciences

Keywords

Deduplication, Support Vector Machine, Training, Testing