Research Article

Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News

by Putu Wira Buana, Sesaltina Jannet D.r.m., I Ketut Gede Darma Putra
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 50 - Number 11
Year of Publication: 2012
Authors: Putu Wira Buana, Sesaltina Jannet D.r.m., I Ketut Gede Darma Putra
DOI: 10.5120/7817-1105

Putu Wira Buana, Sesaltina Jannet D.r.m., I Ketut Gede Darma Putra. Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News. International Journal of Computer Applications. 50, 11 (July 2012), 37-42. DOI=10.5120/7817-1105

@article{ 10.5120/7817-1105,
author = { Putu Wira Buana and Sesaltina Jannet D.r.m. and I Ketut Gede Darma Putra },
title = { Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News },
journal = { International Journal of Computer Applications },
issue_date = { July 2012 },
volume = { 50 },
number = { 11 },
month = { July },
year = { 2012 },
issn = { 0975-8887 },
pages = { 37-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume50/number11/7817-1105/ },
doi = { 10.5120/7817-1105 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:48:03.277032+05:30
%A Putu Wira Buana
%A Sesaltina Jannet D.r.m.
%A I Ketut Gede Darma Putra
%T Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News
%J International Journal of Computer Applications
%@ 0975-8887
%V 50
%N 11
%P 37-42
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

KNN is a widely accepted classification tool, but it uses all training samples during classification, which leads to high computational complexity. To address this problem, this paper proposes combining the traditional KNN algorithm with the K-Means clustering algorithm. After preprocessing, the first step is to weight each word (term) using Term Frequency-Inverse Document Frequency (TF-IDF), which weights words based on how often they appear in a document. Second, the training samples of each category are clustered with the K-Means algorithm, and all cluster centers are taken as the new training samples. Third, the modified training samples are used for classification with the KNN algorithm. Finally, the results are evaluated using precision, recall, and F-measure. The simulation results show that the proposed combined algorithm reaches an accuracy of 87%, with an average F-measure of 0.8029 at the best k value of 5, and that computation takes 55 seconds per document.
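
The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the TF-IDF variant, the distance measure, or the number of clusters per category, so the plain idf = log(N/df) weighting, the Euclidean distance, the function names, and the toy English tokens below are all hypothetical choices.

```python
import math
import random
from collections import Counter

def fit_idf(docs):
    """Inverse document frequency per term: log(N / df). Assumed variant."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / c) for t, c in df.items()}

def tfidf(doc, idf, vocab):
    """Dense TF-IDF vector for one tokenized document; unseen terms are ignored."""
    tf = Counter(doc)
    return [tf[t] * idf[t] for t in vocab]

def dist(a, b):
    """Euclidean distance (an assumption; the paper may use another measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, m, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns m cluster centers."""
    centers = random.Random(seed).sample(points, m)
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for p in points:
            nearest = min(range(m), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def reduce_training_set(vectors, labels, m):
    """Cluster each category separately and keep only the labeled cluster centers."""
    reduced = []
    for cat in sorted(set(labels)):
        pts = [v for v, l in zip(vectors, labels) if l == cat]
        reduced += [(c, cat) for c in kmeans(pts, min(m, len(pts)))]
    return reduced

def knn_predict(x, reduced, k):
    """Majority vote among the k cluster centers nearest to x."""
    nearest = sorted(reduced, key=lambda cv: dist(x, cv[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def precision_recall_f(y_true, y_pred, cat):
    """Precision, recall, and F-measure for one category."""
    tp = sum(t == cat and p == cat for t, p in zip(y_true, y_pred))
    fp = sum(t != cat and p == cat for t, p in zip(y_true, y_pred))
    fn = sum(t == cat and p != cat for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

# Toy, already-preprocessed corpus (hypothetical tokens and categories).
train_docs = [
    ["goal", "match", "team"], ["team", "coach", "match"], ["goal", "league", "team"],
    ["bank", "rate", "market"], ["market", "stock", "bank"], ["rate", "inflation", "bank"],
]
train_labels = ["sport"] * 3 + ["economy"] * 3

idf = fit_idf(train_docs)
vocab = sorted(idf)
vectors = [tfidf(d, idf, vocab) for d in train_docs]
reduced = reduce_training_set(vectors, train_labels, m=2)  # 2 centers per category

test_docs = [["team", "goal", "match"], ["bank", "market", "rate"]]
test_labels = ["sport", "economy"]
predictions = [knn_predict(tfidf(d, idf, vocab), reduced, k=3) for d in test_docs]
scores = precision_recall_f(test_labels, predictions, "sport")
```

In this sketch KNN compares a test document against only the cluster centers (four here) rather than every training document, which is where the reduction in computation is expected to come from.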

Index Terms

Computer Science
Information Sciences

Keywords

Text Classification, KNN Classification Algorithm, K-means Cluster Algorithm, TF-IDF Method