Research Article

A New Text Mining Approach Based on HMM-SVM for Web News Classification

by Krishnalal G, S Babu Rengarajan, K G Srinivasagan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 1 - Number 19
Year of Publication: 2010
Authors: Krishnalal G, S Babu Rengarajan, K G Srinivasagan
DOI: 10.5120/395-589

Krishnalal G, S Babu Rengarajan, K G Srinivasagan. A New Text Mining Approach Based on HMM-SVM for Web News Classification. International Journal of Computer Applications. 1, 19 (February 2010), 98-104. DOI=10.5120/395-589

@article{ 10.5120/395-589,
author = { Krishnalal G and S Babu Rengarajan and K G Srinivasagan },
title = { A New Text Mining Approach Based on HMM-SVM for Web News Classification },
journal = { International Journal of Computer Applications },
issue_date = { February 2010 },
volume = { 1 },
number = { 19 },
month = { February },
year = { 2010 },
issn = { 0975-8887 },
pages = { 98-104 },
numpages = {7},
url = { https://ijcaonline.org/archives/volume1/number19/395-589/ },
doi = { 10.5120/395-589 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T19:47:00.810756+05:30
%A Krishnalal G
%A S Babu Rengarajan
%A K G Srinivasagan
%T A New Text Mining Approach Based on HMM-SVM for Web News Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 1
%N 19
%P 98-104
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Since the emergence of the World Wide Web, it has become essential to handle very large amounts of electronic data, most of it in the form of text. Data mining techniques can handle this volume effectively. This paper proposes an intelligent system for online news classification based on a Hidden Markov Model (HMM) and a Support Vector Machine (SVM). The system extracts keywords from online newspaper content and classifies each item into predefined categories. Classification proceeds in three stages: (1) text pre-processing, (2) HMM-based feature extraction, and (3) classification using SVM. Experimental data were collected from The Hindu, The New Indian Express, Times of India, Business Line, and The Economic Times. On the sports, finance, and politics categories, the system achieves accuracies of 92.45%, 96.34%, and 90.76%, respectively. These results compare favorably with those of other text classification methods.
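
The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the HMM produces features, so this sketch assumes, based on the POS keyword, that a part-of-speech HMM tags tokens and that nouns are kept as keywords. The two-state model, its hand-set probabilities, the stop list, the toy headlines, and every function name here are hypothetical. The SVM is a linear one fitted by plain subgradient descent in a one-vs-rest arrangement, which is not necessarily the solver or setup the authors used.

```python
import math
import re
from collections import Counter

# Toy resources, all hypothetical: a tiny stop list and a two-state HMM
# (NOUN vs OTHER) with hand-set probabilities instead of trained ones.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}
STATES = ("NOUN", "OTHER")
START = {"NOUN": 0.5, "OTHER": 0.5}
TRANS = {"NOUN": {"NOUN": 0.4, "OTHER": 0.6},
         "OTHER": {"NOUN": 0.7, "OTHER": 0.3}}
EMIT = {"NOUN": {"team": 0.2, "match": 0.2, "bank": 0.15, "market": 0.15,
                 "minister": 0.15, "election": 0.15},
        "OTHER": {"won": 0.25, "fell": 0.25, "raised": 0.25, "announced": 0.25}}
UNSEEN = 1e-3  # floor probability for words missing from a state's lexicon


def preprocess(text):
    """Stage 1: lowercase, tokenize on letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def viterbi(tokens):
    """Most likely state sequence under the toy HMM (log-space Viterbi)."""
    if not tokens:
        return []
    emit = lambda s, tok: math.log(EMIT[s].get(tok, UNSEEN))
    score = {s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}
    backpointers = []
    for tok in tokens[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            ptr[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + emit(s, tok)
        backpointers.append(ptr)
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(backpointers):  # follow backpointers to the start
        state = ptr[state]
        path.append(state)
    return path[::-1]


def extract_keywords(text):
    """Stage 2: keep the tokens that the HMM tags as NOUN."""
    tokens = preprocess(text)
    return [t for t, tag in zip(tokens, viterbi(tokens)) if tag == "NOUN"]


def train_linear_svm(X, y, dim, epochs=50, lam=0.01, lr=0.1):
    """Stage 3: linear SVM fitted by subgradient descent on the hinge loss."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):  # x: {feature index: count}, label: +1 or -1
            margin = label * (sum(w[i] * v for i, v in x.items()) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # L2 shrinkage
            if margin < 1.0:  # inside the margin: step toward the example
                for i, v in x.items():
                    w[i] += lr * label * v
                b += lr * label
    return w, b


def fit(docs, labels):
    """Train one binary SVM per category (one-vs-rest) on keyword counts."""
    keywords = [extract_keywords(d) for d in docs]
    vocab = {k: i for i, k in enumerate(sorted({k for ks in keywords for k in ks}))}
    X = [Counter(vocab[k] for k in ks) for ks in keywords]
    models = {c: train_linear_svm(X, [1 if l == c else -1 for l in labels], len(vocab))
              for c in sorted(set(labels))}
    return vocab, models


def predict(text, vocab, models):
    """Return the category whose SVM gives the highest decision value."""
    x = Counter(vocab[k] for k in extract_keywords(text) if k in vocab)

    def decision(c):
        w, b = models[c]
        return sum(w[i] * v for i, v in x.items()) + b

    return max(models, key=decision)


# Toy headlines (invented for illustration, not from the paper's data set).
DOCS = ["The team won the match", "The bank fell in the market",
        "The minister announced the election"]
LABELS = ["sports", "finance", "politics"]
vocab, models = fit(DOCS, LABELS)
print(predict("The team won", vocab, models))
```

In a realistic setting the HMM parameters would be estimated from a tagged corpus and the SVM would be trained with an established library. The sketch only shows how the output of each stage feeds the next.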

Index Terms

Computer Science
Information Sciences

Keywords

Feature Extraction, HMM, kNN, POS, SVM