Research Article

A Review on Text Similarity Technique used in IR and its Application

by Nitesh Pradhan, Manasi Gyanchandani, Rajesh Wadhvani
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 120 - Number 9
Year of Publication: 2015
Authors: Nitesh Pradhan, Manasi Gyanchandani, Rajesh Wadhvani
10.5120/21257-4109

Nitesh Pradhan, Manasi Gyanchandani, Rajesh Wadhvani. A Review on Text Similarity Technique used in IR and its Application. International Journal of Computer Applications. 120, 9 (June 2015), 29-34. DOI=10.5120/21257-4109

@article{ 10.5120/21257-4109,
author = { Nitesh Pradhan and Manasi Gyanchandani and Rajesh Wadhvani },
title = { A Review on Text Similarity Technique used in IR and its Application },
journal = { International Journal of Computer Applications },
issue_date = { June 2015 },
volume = { 120 },
number = { 9 },
month = { June },
year = { 2015 },
issn = { 0975-8887 },
pages = { 29-34 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume120/number9/21257-4109/ },
doi = { 10.5120/21257-4109 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:05:48.501176+05:30
%A Nitesh Pradhan
%A Manasi Gyanchandani
%A Rajesh Wadhvani
%T A Review on Text Similarity Technique used in IR and its Application
%J International Journal of Computer Applications
%@ 0975-8887
%V 120
%N 9
%P 29-34
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With the large number of documents on the web, there is an increasing need to retrieve the most relevant documents. Several techniques exist for retrieving the most relevant documents from a large corpus. Similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. In text similarity, the user's query text is matched against the document text, and on the basis of this matching the most relevant documents are retrieved. Text similarity also plays an important role in the categorization of texts and documents: measuring the similarity between words, sentences, paragraphs and documents allows them to be categorized efficiently, and on the basis of this categorization the documents most relevant to the user's query can be retrieved. This paper describes different types of similarity, such as lexical similarity and semantic similarity.
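The query-to-document matching described in the abstract can be illustrated with a minimal sketch. It assumes two common lexical measures, Jaccard similarity over token sets and cosine similarity over raw term-frequency vectors; the tokenizer, the function names and the sample corpus are hypothetical choices made for illustration, not the paper's own implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(a, b):
    """Number of shared tokens divided by the number of distinct tokens."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents, similarity=cosine_similarity):
    """Return (score, document) pairs sorted from most to least similar."""
    scored = [(similarity(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical three-document corpus, for illustration only.
    corpus = [
        "Text similarity measures for information retrieval",
        "Cooking pasta with tomato sauce",
        "Document clustering using semantic similarity",
    ]
    for score, doc in rank_documents("text similarity in retrieval", corpus):
        print(f"{score:.3f}  {doc}")
```

Both measures are purely lexical: they reward shared surface tokens and give a score of zero to texts that express the same idea in different words, which is the gap that semantic similarity measures aim to close.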

Index Terms

Computer Science
Information Sciences

Keywords

Text similarity, Lexical similarity, Semantic similarity, Corpus-based similarity, Knowledge-based similarity