Research Article

Implementation of Keyword Extraction using Term Frequency-Inverse Document Frequency (TF-IDF) in Python

by Ahmad Farhan AlShammari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 185 - Number 35
Year of Publication: 2023
Authors: Ahmad Farhan AlShammari
DOI: 10.5120/ijca2023923137

Ahmad Farhan AlShammari. Implementation of Keyword Extraction using Term Frequency-Inverse Document Frequency (TF-IDF) in Python. International Journal of Computer Applications. 185, 35 (Sep 2023), 9-14. DOI=10.5120/ijca2023923137

@article{ 10.5120/ijca2023923137,
author = { Ahmad Farhan AlShammari },
title = { Implementation of Keyword Extraction using Term Frequency-Inverse Document Frequency (TF-IDF) in Python },
journal = { International Journal of Computer Applications },
issue_date = { Sep 2023 },
volume = { 185 },
number = { 35 },
month = { Sep },
year = { 2023 },
issn = { 0975-8887 },
pages = { 9-14 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume185/number35/32916-2023923137/ },
doi = { 10.5120/ijca2023923137 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:27:51.591577+05:30
%A Ahmad Farhan AlShammari
%T Implementation of Keyword Extraction using Term Frequency-Inverse Document Frequency (TF-IDF) in Python
%J International Journal of Computer Applications
%@ 0975-8887
%V 185
%N 35
%P 9-14
%D 2023
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The goal of this research is to develop a keyword extraction program using Term Frequency-Inverse Document Frequency (TF-IDF) in Python. Keyword extraction identifies the set of words (keywords) that describe the content of a text, and the TF-IDF method measures how important each word is in that text. The basic steps of keyword extraction are explained: preprocessing the text, creating a list of words, creating a bag of words, computing the term frequency (TF), computing the inverse document frequency (IDF), computing the term frequency-inverse document frequency (TF-IDF), creating the keywords, and sorting the keywords. The developed program was tested on an experimental text from Wikipedia. The program successfully performed the basic steps of keyword extraction and provided the required results.
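
The steps listed in the abstract can be illustrated with a minimal sketch. It is not the program from the article: the function names (`preprocess`, `tokenize`, `extract_keywords`), the tiny stop-word list, the choice to treat each sentence as a document, the plain `log(N / df)` form of IDF, and the summing of TF-IDF scores across sentences are all assumptions made here for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase the text and split it into sentences, treated here as documents."""
    sentences = re.split(r"[.!?]+", text.lower())
    return [s.strip() for s in sentences if s.strip()]


def tokenize(sentence):
    """Create the list of words for one sentence, dropping stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence) if w not in STOP_WORDS]


def extract_keywords(text, top_n=5):
    """Return up to top_n (word, score) pairs, highest TF-IDF score first."""
    docs = [tokenize(s) for s in preprocess(text)]
    docs = [d for d in docs if d]  # skip sentences that contain only stop words
    n_docs = len(docs)
    if n_docs == 0:
        return []

    # Document frequency: in how many sentences each word appears.
    df = Counter(w for d in docs for w in set(d))
    # Plain (unsmoothed) IDF; a word found in every sentence scores 0.
    idf = {w: math.log(n_docs / df[w]) for w in df}

    scores = Counter()
    for d in docs:
        counts = Counter(d)  # bag of words for this sentence
        for w, c in counts.items():
            tf = c / len(d)  # relative term frequency within the sentence
            scores[w] += tf * idf[w]  # assumption: sum TF-IDF over sentences

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Calling `extract_keywords(text, top_n=10)` on a multi-sentence string returns the ten highest-scoring words with their scores. Libraries such as scikit-learn use a smoothed IDF, so their scores will generally differ from this sketch.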

Index Terms

Computer Science
Information Sciences

Keywords

Artificial Intelligence, Machine Learning, Natural Language Processing, Text Mining, Keyword Extraction, Term Frequency-Inverse Document Frequency, TF-IDF, Python Programming.