Research Article

Propose a Substitution Model for DNA Data Compression

by Ayad E. Korial, Ali Kamal Taqi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 179 - Number 39
Year of Publication: 2018
Authors: Ayad E. Korial, Ali Kamal Taqi
10.5120/ijca2018916484

Ayad E. Korial, Ali Kamal Taqi. Propose a Substitution Model for DNA Data Compression. International Journal of Computer Applications. 179, 39 (May 2018), 20-26. DOI=10.5120/ijca2018916484

@article{ 10.5120/ijca2018916484,
author = { Ayad E. Korial, Ali Kamal Taqi },
title = { Propose a Substitution Model for DNA Data Compression },
journal = { International Journal of Computer Applications },
issue_date = { May 2018 },
volume = { 179 },
number = { 39 },
month = { May },
year = { 2018 },
issn = { 0975-8887 },
pages = { 20-26 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume179/number39/29338-2018916484/ },
doi = { 10.5120/ijca2018916484 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:57:52.455875+05:30
%A Ayad E. Korial
%A Ali Kamal Taqi
%T Propose a Substitution Model for DNA Data Compression
%J International Journal of Computer Applications
%@ 0975-8887
%V 179
%N 39
%P 20-26
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

One of the most difficult challenges in lossless data compression is finding the right model for Deoxyribonucleic Acid (DNA) compression. DNA sequences consist of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T), and the information in DNA is stored as a code made up of these bases. These sequences are not random; if they were totally random, the most efficient and logical way to store them would be two bits per base. This paper proposes an algorithm called A2 for DNA data compression. The proposed algorithm consists of four stages that build a substitutional model. The first stage uses a modified version of run-length coding. The second and third stages apply a mapping model that formats the data to suit the final stage, in which the data is fed into the Burrows-Wheeler Transform, a permutation technique that groups related symbols together as far as possible, to improve dictionary coding with Lempel-Ziv (LZ77). The output file is stored with the .a2 extension. The A2 algorithm was implemented and tested on data from GenBank and shows an acceptable file size and processing time ratio.
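The building blocks named in the abstract can be illustrated with a minimal Python sketch. This is not the A2 algorithm: the abstract does not specify the modified run-length scheme or the mapping tables, so the sketch uses plain run-length coding, a naive Burrows-Wheeler Transform, and `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for the dictionary-coding stage. All function names, the `$` sentinel, and the bit assignment in `BASE_TO_BITS` are hypothetical choices made for illustration.

```python
import zlib
from itertools import groupby

# Assumed bit assignment; any fixed mapping of the four bases to 2 bits works.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq: str) -> bytes:
    """Baseline for random data: two bits per base, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def run_length_encode(seq: str) -> list:
    """Plain run-length coding (not the paper's modified version)."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Rebuild the sorted rotation table column by column, then pick
    the row that ends with the sentinel (the original text)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


def compress_sketch(seq: str) -> bytes:
    """BWT followed by zlib, standing in for the LZ77 stage."""
    return zlib.compress(bwt(seq).encode("ascii"), 9)


def decompress_sketch(blob: bytes) -> str:
    """Undo compress_sketch: zlib decompress, then invert the BWT."""
    return inverse_bwt(zlib.decompress(blob).decode("ascii"))
```

The rotation-sorting BWT here costs far more time and memory than a suffix-array implementation and is only suitable for short sequences. Comparing `len(pack_2bit(seq))` with `len(compress_sketch(seq))` on a repetitive sequence gives a rough sense of why a model that exploits non-randomness can beat the two-bit baseline.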

Index Terms

Computer Science
Information Sciences

Keywords

DNA compression, Compression Model, Run-length coding, Mapping Table, BWT, Dictionary Coding, A2 algorithm