Research Article

Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases

by Robinson Silvester. A, J. Cruz Antony, M. Pratheepa
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 78 - Number 9
Year of Publication: 2013
Authors: Robinson Silvester. A, J. Cruz Antony, M. Pratheepa
DOI: 10.5120/13516-1295

Robinson Silvester. A, J. Cruz Antony, M. Pratheepa. Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases. International Journal of Computer Applications. 78, 9 (September 2013), 13-17. DOI=10.5120/13516-1295

@article{ 10.5120/13516-1295,
author = { Robinson Silvester. A and J. Cruz Antony and M. Pratheepa },
title = { Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases },
journal = { International Journal of Computer Applications },
issue_date = { September 2013 },
volume = { 78 },
number = { 9 },
month = { September },
year = { 2013 },
issn = { 0975-8887 },
pages = { 13-17 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume78/number9/13516-1295/ },
doi = { 10.5120/13516-1295 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:51:49.082080+05:30
%A Robinson Silvester. A
%A J. Cruz Antony
%A M. Pratheepa
%T Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases
%J International Journal of Computer Applications
%@ 0975-8887
%V 78
%N 9
%P 13-17
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Growing interest in genomic research has led to the creation of huge biological sequence databases. However, conventional search and retrieval of relevant information from these databases takes a great deal of processing time because of the sheer size of the DNA sequence data. An efficient search mechanism is therefore essential. In this paper we present an efficient search mechanism based on hashing techniques. Initially, the data is hashed and indexed according to different window sizes. During this process, we eliminate redundancies, recording only distinct patterns and assigning each a corresponding hash value. During the search phase, the length of the search string is checked against the window size; if it exceeds the maximum limit of 4, the string is divided. The first part is used as the search string and looked up in the index. Once the index entry is confirmed, the strings that follow the indexed string are matched against the remainder of the search string to confirm the match. The simulation results show that the proposed methodology returns results faster while occupying less memory.
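
The indexing and search phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 2-bit base encoding, the base-4 hash, and the names `hash_window`, `build_index`, and `search` are all assumptions made for illustration. The only detail taken from the abstract is the maximum window size of 4 and the idea of looking up the first part of a long search string, then matching the characters that follow. The sketch also assumes sequences contain only the letters A, C, G, and T.

```python
from collections import defaultdict

MAX_WINDOW = 4  # maximum window size stated in the abstract
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base encoding


def hash_window(s):
    """Assumed hash: read the window as a base-4 number."""
    h = 0
    for ch in s:
        h = h * 4 + CODE[ch]
    return h


def build_index(seq, max_window=MAX_WINDOW):
    """Map (window size, hash) to start positions.

    Each distinct pattern gets a single entry, so repeated
    occurrences only add a position to an existing list.
    """
    index = defaultdict(list)
    for w in range(1, max_window + 1):
        for i in range(len(seq) - w + 1):
            index[(w, hash_window(seq[i:i + w]))].append(i)
    return index


def search(seq, index, pattern, max_window=MAX_WINDOW):
    """Return every start position of pattern in seq."""
    if not pattern:
        return []
    seed = pattern[:max_window]   # first part of the search string
    rest = pattern[len(seed):]    # empty when the pattern fits in one window
    hits = []
    for pos in index.get((len(seed), hash_window(seed)), []):
        tail = pos + len(seed)
        # Confirm the match by comparing the characters after the seed.
        if seq[tail:tail + len(rest)] == rest:
            hits.append(pos)
    return hits


seq = "ACGTACGTTACG"
index = build_index(seq)
print(search(seq, index, "ACGTT"))  # [4]
```

Because the index key includes the window size, the base-4 value identifies each window uniquely, so a seed lookup never returns a false candidate; only the remainder of a long search string needs character-by-character confirmation.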

Index Terms

Computer Science
Information Sciences

Keywords

Hashing, Sequence Similarity by Hashing, Substring Extraction, DNA Sequence