Research Article

Implementation of Text Similarity using Cosine Similarity Method in Python

by Ahmad Farhan AlShammari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 185 - Number 2
Year of Publication: 2023
Authors: Ahmad Farhan AlShammari
10.5120/ijca2023922667

Ahmad Farhan AlShammari. Implementation of Text Similarity using Cosine Similarity Method in Python. International Journal of Computer Applications. 185, 2 (Apr 2023), 11-14. DOI=10.5120/ijca2023922667

@article{ 10.5120/ijca2023922667,
author = { Ahmad Farhan AlShammari },
title = { Implementation of Text Similarity using Cosine Similarity Method in Python },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2023 },
volume = { 185 },
number = { 2 },
month = { Apr },
year = { 2023 },
issn = { 0975-8887 },
pages = { 11-14 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume185/number2/32676-2023922667/ },
doi = { 10.5120/ijca2023922667 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:25:03.274805+05:30
%A Ahmad Farhan AlShammari
%T Implementation of Text Similarity using Cosine Similarity Method in Python
%J International Journal of Computer Applications
%@ 0975-8887
%V 185
%N 2
%P 11-14
%D 2023
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The goal of this research is to develop a text similarity program using the cosine similarity method in Python. The steps of the text similarity process are: preprocessing the text, word-tokenization, creating a list of words, creating a bag of words, calculating word frequencies, and calculating cosine similarity. The developed program was tested on two experimental texts from Wikipedia. The program successfully performed the steps of text similarity and provided the required results.
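The steps listed in the abstract can be illustrated with a short sketch. This is not the author's program: it is a minimal example that assumes lowercasing and a simple regular-expression tokenizer for preprocessing, uses only the Python standard library, and introduces the hypothetical helper names `tokenize` and `cosine_similarity`. It computes the usual cosine measure, the dot product of the two word-frequency vectors divided by the product of their lengths.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocess and tokenize: lowercase, then keep runs of letters.

    Assumption: a regex tokenizer stands in for whatever tokenizer
    the paper actually uses.
    """
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-frequency vectors of two texts."""
    # Word frequency for each text.
    freq_a = Counter(tokenize(text_a))
    freq_b = Counter(tokenize(text_b))

    # Combined list of words (the shared vocabulary of both texts).
    vocabulary = sorted(set(freq_a) | set(freq_b))

    # Bag-of-words vectors aligned on the shared vocabulary.
    vec_a = [freq_a[word] for word in vocabulary]
    vec_b = [freq_b[word] for word in vocabulary]

    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))

    # An empty text has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling `cosine_similarity(text_a, text_b)` on two strings returns a value between 0 (no words in common) and 1 (identical word-frequency profiles). The measure ignores word order, so two texts with the same words in a different order score as identical.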

References
  1. Hotho, A., Nürnberger, A., & Paass, G. (2005). "A Brief Survey of Text Mining". LDV Forum - GLDV Journal for Computational Linguistics and Language Technology. 20, 19-62.
  2. Salton, G. & Lesk, M. E. (1965). "The SMART Automatic Document Retrieval Systems: An Illustration". Communications of the ACM. 8 (6): 391-398.
  3. Salton, G. (1971). "The SMART Retrieval System: Experiments in Automatic Document Retrieval". Englewood Cliffs, N.J.: Prentice Hall Inc.
  4. Salton, G., Wong, A., & Yang, C. (1975). "A Vector Space Model for Automatic Indexing". Communications of the ACM, 18(11), 613-620.
  5. Salton, G., & Buckley, C. (1988). "Term-Weighting Approaches in Automatic Text Retrieval". Information Processing and Management, 24(5), 513-523.
  6. Salton, G. & McGill, M. (1983). "Introduction to Modern Information Retrieval". McGraw Hill Book Co, New York.
  7. Salton, G., Allan, J., & Buckley, C. (1994). "Automatic Structuring and Retrieval of Large Text Files". Communications of the ACM, 37(2), 97-108.
  8. Python: https://www.python.org
  9. Numpy: https://www.numpy.org
  10. Pandas: https://pandas.pydata.org
  11. Matplotlib: https://www.matplotlib.org
  12. NLTK: https://www.nltk.org
  13. SciKit: https://scikit-learn.org
  14. Wikipedia: https://en.wikipedia.org
Index Terms

Computer Science
Information Sciences

Keywords

Artificial Intelligence, Machine Learning, Text Similarity, Natural Language Processing, Word-Tokenization, Word Frequency, Cosine Similarity, Python, Programming