Research Article

Hadoop based Text Mining System for Identification of Chemicals Associated with Disease of Interest

by Kritika Bhowmik, Tejal Aher, Vaibhav Kale, K. Rajeswari, M. Karthikeyan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 131 - Number 4
Year of Publication: 2015
Authors: Kritika Bhowmik, Tejal Aher, Vaibhav Kale, K. Rajeswari, M. Karthikeyan
10.5120/ijca2015907296

Kritika Bhowmik, Tejal Aher, Vaibhav Kale, K. Rajeswari, M. Karthikeyan. Hadoop based Text Mining System for Identification of Chemicals Associated with Disease of Interest. International Journal of Computer Applications. 131, 4 (December 2015), 26-29. DOI=10.5120/ijca2015907296

@article{ 10.5120/ijca2015907296,
author = { Kritika Bhowmik and Tejal Aher and Vaibhav Kale and K. Rajeswari and M. Karthikeyan },
title = { Hadoop based Text Mining System for Identification of Chemicals Associated with Disease of Interest },
journal = { International Journal of Computer Applications },
issue_date = { December 2015 },
volume = { 131 },
number = { 4 },
month = { December },
year = { 2015 },
issn = { 0975-8887 },
pages = { 26-29 },
numpages = {4},
url = { https://ijcaonline.org/archives/volume131/number4/23439-2015907296/ },
doi = { 10.5120/ijca2015907296 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:26:23.158680+05:30
%A Kritika Bhowmik
%A Tejal Aher
%A Vaibhav Kale
%A K. Rajeswari
%A M. Karthikeyan
%T Hadoop based Text Mining System for Identification of Chemicals Associated with Disease of Interest
%J International Journal of Computer Applications
%@ 0975-8887
%V 131
%N 4
%P 26-29
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Biomedical literature grows daily, and manually extracting statistical information about the chemicals mentioned in such large databases is tedious and time-consuming. Our system, designed mainly for non-expert users, automates data collection and knowledge extraction from chemical literature in a user-friendly and efficient way on the Hadoop platform. The system downloads abstracts related to the disease of interest from the PubMed database. The text of the abstracts is then parsed for bioentities such as protein/gene names and chemical compound names, which are classified into different classes. This analysis should prove helpful in the biomedical and pharmaceutical industries. Information is extracted using the LingPipe API: LingPipe is given a training dataset and classifies the extracted bioentities into their respective classes. Deployed on Hadoop, the system is scalable and distributed and processes a large number of abstracts in a short time. It also provides a user-friendly interface that makes the Hadoop system easy to use for non-technical users.
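The map and reduce stages implied by the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the small `LEXICON` dictionary is a hypothetical stand-in for a trained LingPipe model, the class labels `CHEMICAL` and `PROTEIN_GENE` are assumed names, and `run_job` simulates in a single process what Hadoop would distribute across a cluster.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon standing in for a trained LingPipe classifier.
# A real system would learn entity classes from a training dataset.
LEXICON = {
    "aspirin": "CHEMICAL",
    "metformin": "CHEMICAL",
    "insulin": "PROTEIN_GENE",
    "tp53": "PROTEIN_GENE",
}

TOKEN = re.compile(r"[A-Za-z0-9]+")


def map_abstract(abstract):
    """Map step: emit ((entity, class), 1) for each lexicon hit in one abstract."""
    for token in TOKEN.findall(abstract.lower()):
        label = LEXICON.get(token)
        if label is not None:
            yield (token, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (entity, class) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(abstracts):
    """Run map then reduce over all abstracts in one process (no cluster)."""
    pairs = (pair for text in abstracts for pair in map_abstract(text))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = [
        "Metformin lowers glucose; metformin and insulin were compared.",
        "Aspirin reduced TP53 expression.",
    ]
    for (entity, label), n in sorted(run_job(sample).items()):
        print(entity, label, n)
```

Keying the intermediate pairs on (entity, class) lets the reduce step aggregate mention counts per class without any shared state between mappers, which is what makes this kind of counting straightforward to distribute.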

Index Terms

Computer Science
Information Sciences

Keywords

Text mining, chemicals, Hadoop, LingPipe, bioentities, data extraction, data classification