Research Article

An Empirical Selection Method for Document Clustering

by P. Perumal, R. Nedunchezhian, D. Brindha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 31 - Number 3
Year of Publication: 2011
Authors: P. Perumal, R. Nedunchezhian, D. Brindha
DOI: 10.5120/3803-5249

P. Perumal, R. Nedunchezhian, D. Brindha. An Empirical Selection Method for Document Clustering. International Journal of Computer Applications. 31, 3 (October 2011), 15-19. DOI=10.5120/3803-5249

@article{ 10.5120/3803-5249,
author = { P. Perumal and R. Nedunchezhian and D. Brindha },
title = { An Empirical Selection Method for Document Clustering },
journal = { International Journal of Computer Applications },
issue_date = { October 2011 },
volume = { 31 },
number = { 3 },
month = { October },
year = { 2011 },
issn = { 0975-8887 },
pages = { 15-19 },
numpages = { 5 },
url = { https://ijcaonline.org/archives/volume31/number3/3803-5249/ },
doi = { 10.5120/3803-5249 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A P. Perumal
%A R. Nedunchezhian
%A D. Brindha
%T An Empirical Selection Method for Document Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 31
%N 3
%P 15-19
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Model selection is the task of choosing among a set of candidate models. Latent-variable models such as PLSA can establish hidden semantic relations among the observed features through a number of latent variables, so selecting the correct number of latent variables is critical. In most previous research, the number of latent topics was selected based on the number of invoked classes. This paper presents a method, based on a backward elimination approach, that is capable of unsupervised order selection in PLSA. During the elimination process, the central problem is the proper selection of the latent variables to delete, since this choice bears directly on the final performance of the pruned model. To treat this problem, we introduce a new combined pruning method that selects the best candidates for removal. The obtained results show that this algorithm leads to an optimized number of latent variables. In this paper, we also propose a novel approach, namely DPMFS, to address this issue.
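The backward-elimination order selection described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the pruning criterion (dropping the latent topic carrying the least expected mass) and the BIC stopping score are assumptions standing in for the paper's combined pruning method.

```python
import numpy as np

def plsa_em(X, p_z_d, p_w_z, n_iter=40):
    """Run EM for PLSA on a document-term count matrix X (docs x words)."""
    for _ in range(n_iter):
        # E-step: posterior responsibility P(z|d,w), shape (docs, topics, words)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w)P(z|d,w)
        nz = X[:, None, :] * post
        p_w_z = nz.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = nz.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def bic(X, p_z_d, p_w_z):
    """Bayesian information criterion for a fitted PLSA model; lower is better."""
    D, W = X.shape
    K = p_w_z.shape[0]
    log_lik = (X * np.log(p_z_d @ p_w_z + 1e-12)).sum()   # P(w|d) = sum_z P(z|d)P(w|z)
    n_params = K * (W - 1) + D * (K - 1)
    return -2.0 * log_lik + n_params * np.log(X.sum())

def backward_select(X, k_max=5, n_iter=40, seed=0):
    """Backward elimination over the number of PLSA latent topics."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    # start from the maximum model order with random stochastic matrices
    p_z_d = rng.random((D, k_max)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k_max, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d, p_w_z = plsa_em(X, p_z_d, p_w_z, n_iter)
    best = (bic(X, p_z_d, p_w_z), p_z_d, p_w_z)
    while p_w_z.shape[0] > 1:
        # pruning criterion (assumed): drop the topic with the least expected mass
        mass = (X.sum(axis=1, keepdims=True) * p_z_d).sum(axis=0)
        drop = int(np.argmin(mass))
        p_z_d = np.delete(p_z_d, drop, axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        p_w_z = np.delete(p_w_z, drop, axis=0)
        p_z_d, p_w_z = plsa_em(X, p_z_d, p_w_z, n_iter)   # refit the pruned model
        score = bic(X, p_z_d, p_w_z)
        if score < best[0]:
            best = (score, p_z_d, p_w_z)
    return best  # (best BIC, P(z|d), P(w|z)) at the selected model order
```

In this sketch the selected number of latent variables is simply `p_w_z.shape[0]` of the best-scoring model; the paper's combined pruning method would replace the single least-mass criterion used here.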

References
  1. Tahereh Emami Azadi, Farshad Almasganj (2009). “Using backward elimination with a new model order reduction algorithm to select best double mixture model for document clustering”, Expert Systems with Applications, 36, 10485–10493.
  2. M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no.3, pp. 381–396, Mar. 2002.
  3. M. W. Graham and D. J. Miller, “Unsupervised learning of parsimonious mixtures on large feature spaces,” Electrical Engineering Dept., Pennsylvania State, University Park, PA, Tech. Rep., 2004.
  4. Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the 22nd annual international ACM/SIGIR conference on research and development in information retrieval (pp. 50–57).
  5. D. J. Miller and J. Browning, “A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 25, no. 11, pp. 1468–1483, Nov. 2003.
  6. S.Vaithyanathan and B. Dom, “Generalized model selection for unsupervised learning in high dimensions,” in Adv. Neural Inf. Process. Syst., vol. 11, 1999, pp. 970–976.
  7. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
  8. Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In Proceeding knowledge discovery and data mining (KDD) and workshop text mining. Boston.
  9. E. I. George and R. E. McCulloch. (1992). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88:881-889.
  10. S. Kim. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93(4):877-893.
  11. Guan Yu, Ruizhang Huang, Zhaojun Wang. Document Clustering via Dirichlet Process Mixture Model with Feature Selection. KDD’10, July 25–28, 2010, Washington, DC, USA.
  12. Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. (2007). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581.
  13. A. Vlachos, Z. Ghahramani, and A. Korhonen. (2008). Dirichlet process mixture models for verb clustering. ICML Workshop on Prior Knowledge for Text and Language Processing, Helsinki, Finland.
Index Terms

Computer Science
Information Sciences

Keywords

Document clustering, Model selection, EM algorithm, Dirichlet process mixture model, Feature selection