International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 182 - Number 48
Year of Publication: 2019
Authors: K.N.S.S.V. Prasad, S. K. Saritha
DOI: 10.5120/ijca2019918731
K.N.S.S.V. Prasad, S. K. Saritha. Concept Mining in Text Documents using Clustering. International Journal of Computer Applications. 182, 48 (Apr 2019), 24-33. DOI=10.5120/ijca2019918731
With the rapid daily growth of information, there is a considerable need to extract and discover valuable knowledge from data sources such as the World Wide Web. Common text mining methods are based mainly on statistical analysis of a term, either a word or a phrase. These methods treat documents as bags of words and give no importance to the meaning of the document content. In addition, statistical analysis of term frequency captures the significance of a term within a document only. Two terms may have the same frequency in their documents, yet one may contribute more to the meaning of its sentences than the other. This paper introduces a concept-based model that analyses terms at the sentence, document, and corpus levels rather than relying on the traditional analysis of the document alone. The proposed model consists of concept-based analysis, a concept-based similarity measure, and clustering using k-means. A term that contributes to sentence meaning is assigned two different weights by the concept-based statistical analyzer, and these two weights are combined into a new weight. The concept-based similarity measure is used to compute the similarity among documents, and it takes full advantage of the concept analysis measures at the sentence, document, and corpus levels. Text clustering experiments are carried out with the k-means algorithm on the concept-based model using different datasets. The experiments compare the concept-based weight obtained by the concept-based model with the statistical weight. The text clustering results show a significant improvement in clustering quality using concept-based term frequency (tf), conceptual term frequency (ctf), the concept-based statistical analyzer, and the concept-based combined model. The clustering results are evaluated using F-measure and entropy.
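The abstract names F-measure and entropy as the evaluation measures but does not give their formulas. The following is a minimal sketch, assuming the definitions commonly used in the text clustering literature: entropy is the size-weighted average of the per-cluster class entropies (lower is better), and the overall F-measure takes, for each true class, the best F score over all clusters and weights it by class size (higher is better). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def contingency(labels, clusters):
    """Count how many documents of each true class fall in each cluster."""
    counts = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        counts[cluster][label] += 1
    return counts


def clustering_entropy(labels, clusters):
    """Size-weighted average of per-cluster class entropy (assumed definition)."""
    n = len(labels)
    total = 0.0
    for members in contingency(labels, clusters).values():
        size = sum(members.values())
        # Entropy of the class distribution inside this cluster, in bits.
        e = -sum((c / size) * math.log2(c / size) for c in members.values())
        total += (size / n) * e
    return total


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F score per class (assumed definition)."""
    n = len(labels)
    class_sizes = Counter(labels)
    table = contingency(labels, clusters)
    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for members in table.values():
            overlap = members.get(cls, 0)
            if overlap == 0:
                continue
            precision = overlap / sum(members.values())
            recall = overlap / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total


if __name__ == "__main__":
    # Toy example: four documents, two true classes, two clusters.
    true_labels = ["sports", "sports", "politics", "politics"]
    assigned = [0, 0, 1, 0]
    print("entropy:", clustering_entropy(true_labels, assigned))
    print("F-measure:", clustering_f_measure(true_labels, assigned))
```

Both functions take only the true class labels and the cluster assignments, so they can be applied to the output of any clustering run, whichever term weighting produced it.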