International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 50 - Number 22
Year of Publication: 2012
Authors: Himanshu S. Mazumdar, Maulika S. Patel
DOI: 10.5120/7935-1246
Himanshu S. Mazumdar, Maulika S. Patel. Protein Sequence Similarity Search Suitable for Parallel Implementation. International Journal of Computer Applications. 50, 22 (July 2012), 1-3. DOI=10.5120/7935-1246
In the post-genomic era, a wealth of genomic and proteomic information is available, which provides ample material for applying computational and machine learning strategies to problems of biological relevance. Searching biological databases for similar or homologous sequences is a fundamental step in many bioinformatics tasks. On discovering a new protein sequence or drug, a biologist would like to confirm the discovery by comparing it against the largest available protein database. Alignment-based methods become too complex and time-consuming as the number of sequences grows, so alignment-free sequence comparison is often used as a filtering step before alignment is applied. A novel method of searching for similar sequences in a very large protein database is proposed. The method has two notable aspects. The first is a divide-and-conquer approach combined with a hashing-like scheme for indexing the large database; the index consists of the addresses of the 15-residue words in the UniRef100.fasta database. The second is the possibility of data parallelism, since the database is divided into m segments for indexing, which can further increase the efficiency of the algorithm. Creating the index is time-consuming, but the search time is constant and affordable. The method is particularly useful with large databases such as UniRef100.fasta, which contained 9,757,328 protein sequences as of May 2010. The index-based searching algorithm is implemented in C# .NET.
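The indexing idea described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' C# .NET implementation. It assumes an in-memory dictionary in place of the hashing-like on-disk index, and it ranks sequences by a simple count of shared 15-residue words. The function names, the round-robin segmentation, and the toy sequences are all hypothetical.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

WORD_LEN = 15  # word length stated in the abstract


def split_into_segments(sequences, m):
    """Divide (seq_id, sequence) pairs into m segments (round-robin, an assumption)."""
    return [sequences[i::m] for i in range(m)]


def build_segment_index(segment):
    """Map each 15-residue word to the (seq_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in segment:
        for offset in range(len(seq) - WORD_LEN + 1):
            index[seq[offset:offset + WORD_LEN]].append((seq_id, offset))
    return index


def search_segment(index, query):
    """Count word matches between the query and each sequence in one segment."""
    hits = Counter()
    for offset in range(len(query) - WORD_LEN + 1):
        for seq_id, _ in index.get(query[offset:offset + WORD_LEN], ()):
            hits[seq_id] += 1
    return hits


def search(indexes, query):
    """Query every segment index independently and merge the per-segment counts."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for hits in pool.map(lambda idx: search_segment(idx, query), indexes):
            total.update(hits)
    return total.most_common()


# Toy data (hypothetical sequences, not taken from UniRef100).
db = [
    ("P1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("P2", "GSHMASMTGGQQMGRDLYDDDDKDPSS"),
    ("P3", "AYIAKQRQISFVKSHFSRQAAAA"),
]
indexes = [build_segment_index(seg) for seg in split_into_segments(db, 2)]
print(search(indexes, "TAYIAKQRQISFVKSHFSR"))  # [('P1', 5), ('P3', 4)]
```

Because each segment index is built and queried without reference to the others, the per-segment work could be distributed across processes or machines. The thread pool here only demonstrates that independence and is not expected to give a speed-up in CPython.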