Research Article

Efficient Approach to find Bigram Frequency in Text Document using E-VSM

by Ankit Bhakkad, S. C. Dharamadhikari, Parag Kulkarni
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 68 - Number 19
Year of Publication: 2013
10.5120/11686-7356

Ankit Bhakkad, S. C. Dharamadhikari, Parag Kulkarni. Efficient Approach to find Bigram Frequency in Text Document using E-VSM. International Journal of Computer Applications. 68, 19 (April 2013), 9-11. DOI=10.5120/11686-7356

@article{ 10.5120/11686-7356,
author = { Ankit Bhakkad and S. C. Dharamadhikari and Parag Kulkarni },
title = { Efficient Approach to find Bigram Frequency in Text Document using E-VSM },
journal = { International Journal of Computer Applications },
issue_date = { April 2013 },
volume = { 68 },
number = { 19 },
month = { April },
year = { 2013 },
issn = { 0975-8887 },
pages = { 9-11 },
numpages = {3},
url = { https://ijcaonline.org/archives/volume68/number19/11686-7356/ },
doi = { 10.5120/11686-7356 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:28:18.047942+05:30
%A Ankit Bhakkad
%A S. C. Dharamadhikari
%A Parag Kulkarni
%T Efficient Approach to find Bigram Frequency in Text Document using E-VSM
%J International Journal of Computer Applications
%@ 0975-8887
%V 68
%N 19
%P 9-11
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper proposes a novel and efficient approach to calculating bigram frequency that uses E-VSM as the basis for representing a text document. E-VSM (Enhanced Vector Space Model) is an extension of the simple VSM that stores the positions of tokens in addition to their frequencies in the document. Many recent methodologies in Information Retrieval and Text Mining use bigrams alongside unigrams, since bigrams give more information gain than unigrams. Recent efforts to provide a richer text document representation than the simple 'Bag of Words' have also used bigrams alongside unigrams. The proposed approach to calculating bigram frequency outperforms the state of the art in terms of time complexity; analysis shows that it improves time complexity to a significant extent.
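
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a position-aware representation could support bigram counting. It assumes that an E-VSM-like structure can be approximated by a mapping from each token to the list of positions where it occurs; the function names `build_positional_index` and `bigram_frequency` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_positional_index(tokens):
    """Map each token to the ascending list of positions where it occurs.

    Assumption: this stands in for an E-VSM-style representation, which
    keeps token positions in addition to token frequencies.
    """
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(index)


def bigram_frequency(index, first, second):
    """Count how often `first` is immediately followed by `second`.

    Only the stored position lists are consulted; the original token
    sequence is not rescanned.
    """
    second_positions = set(index.get(second, []))
    return sum(1 for p in index.get(first, []) if p + 1 in second_positions)


tokens = "the cat sat on the mat and the cat slept".split()
index = build_positional_index(tokens)

# Unigram frequency is the length of a token's position list.
the_count = len(index["the"])

# Bigram frequency is derived from the two position lists.
the_cat_count = bigram_frequency(index, "the", "cat")
```

In this sketch, a bigram query touches only the position lists of its two tokens, which illustrates why storing positions can avoid a second pass over the document. Whether this matches the complexity analysis in the paper is not something the abstract confirms.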

Index Terms

Computer Science
Information Sciences

Keywords

E-VSM, bigram, trigram, n-gram, frequency count