Research Article

Investigate the Performance of WordNet and Association Rules for Hard Clustering Web Document

by Noha Negm, Hany Mahgoub
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 178 - Number 14
Year of Publication: 2019
Authors: Noha Negm, Hany Mahgoub
10.5120/ijca2019918905

Noha Negm, Hany Mahgoub. Investigate the Performance of WordNet and Association Rules for Hard Clustering Web Document. International Journal of Computer Applications. 178, 14 (May 2019), 22-32. DOI=10.5120/ijca2019918905

@article{ 10.5120/ijca2019918905,
author = { Noha Negm and Hany Mahgoub },
title = { Investigate the Performance of WordNet and Association Rules for Hard Clustering Web Document },
journal = { International Journal of Computer Applications },
issue_date = { May 2019 },
volume = { 178 },
number = { 14 },
month = { May },
year = { 2019 },
issn = { 0975-8887 },
pages = { 22-32 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume178/number14/30598-2019918905/ },
doi = { 10.5120/ijca2019918905 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:50:22.754780+05:30
%A Noha Negm
%A Hany Mahgoub
%T Investigate the Performance of WordNet and Association Rules for Hard Clustering Web Document
%J International Journal of Computer Applications
%@ 0975-8887
%V 178
%N 14
%P 22-32
%D 2019
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document clustering is a powerful, widely used technique for organizing a large number of web documents into a small number of general and meaningful clusters. Its major challenges are high dimensionality, scalability, accuracy, the extraction of semantic relations from text, and meaningful cluster labels. To improve clustering quality, we introduce Hard Document Clustering using Association Rules (HDCAR), a method that uses association rules instead of frequent term sets to cluster web documents into topical groups. HDCAR is designed to organize web documents and support effective navigation of them, keeping pace with the explosive growth in their number and size. Association rules have the equally important advantage of greater descriptive power than single words (frequent term sets). Moreover, external knowledge from WordNet synonyms and hypernyms is used to enrich the "bag of words" representation before clustering and to assist label generation after clustering. The Multi-Hash Tire Association Rule (MHTAR) algorithm is then used to discover a set of highly related association rules while overcoming the drawbacks of the Apriori algorithm. The resulting association rules first reveal the hidden topics, and the documents are then clustered around those topics. Finally, each document is assigned to exactly one cluster (hard clustering), the one with the highest Document Weighted-measure, and highly similar clusters are merged. To evaluate the performance of HDCAR, we conducted experiments on four datasets: Classic, Re0, WebKB, and REUTER.
The experimental results show that HDCAR outperforms major document clustering methods such as k-means, Bisecting k-means, FIHC, and UPGMA, with higher accuracy, greater efficiency, and lower execution time. Furthermore, HDCAR provides more general and meaningful labels for documents and speeds up document clustering as a result of the reduced dimensionality.
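The pipeline described in the abstract (term enrichment, rule mining, hard assignment) can be illustrated with a small Python sketch. This is not the authors' implementation. The synonym map stands in for a WordNet lookup, the brute-force pair counting stands in for MHTAR, and the score `support * confidence` is only a placeholder for the Document Weighted-measure, which the abstract does not define. All function names, thresholds, and the toy documents are hypothetical.

```python
from itertools import combinations

# Hypothetical stand-in for a WordNet synonym lookup (assumption, not real WordNet data).
SYNONYMS = {"automobile": "car"}


def normalize(doc):
    """Map each term to a canonical synonym so equivalent words are counted together."""
    return [SYNONYMS.get(term, term) for term in doc]


def mine_rules(docs, min_support=0.5, min_confidence=0.8):
    """Brute-force mining of one-term -> one-term association rules.

    Returns tuples (antecedent, consequent, support, confidence).
    A real system would use a scalable miner; this only shows the definitions.
    """
    n = len(docs)
    term_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*term_sets))
    count = {t: sum(t in s for s in term_sets) for t in vocab}
    rules = []
    for a, b in combinations(vocab, 2):
        both = sum(a in s and b in s for s in term_sets)
        support = both / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = both / count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


def assign_hard(docs, rules):
    """Assign each document to exactly one topic: the best-scoring rule it satisfies."""
    labels = []
    for d in docs:
        terms = set(d)
        best, best_score = None, 0.0
        for x, y, support, confidence in rules:
            if x in terms and y in terms:
                score = support * confidence  # placeholder weight (assumption)
                if score > best_score:
                    best, best_score = tuple(sorted((x, y))), score
        labels.append(best)  # None means no rule covers the document
    return labels


raw_docs = [
    ["car", "engine", "wheel"],
    ["automobile", "engine", "road"],
    ["bank", "loan", "interest"],
    ["bank", "loan", "credit"],
]

docs = [normalize(d) for d in raw_docs]
rules = mine_rules(docs)
labels = assign_hard(docs, rules)

# Same pipeline without synonym normalization, for comparison.
labels_raw = assign_hard(raw_docs, mine_rules(raw_docs))
```

In this toy run, the rule items double as cluster labels, and comparing `labels` with `labels_raw` shows why enrichment matters: without mapping "automobile" to "car", the pair (car, engine) falls below the support threshold and the first two documents are left without a topic.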

Index Terms

Computer Science
Information Sciences

Keywords

Web Mining, Document Clustering, Association Rule Mining, WordNet, Fuzzy Weighting Score