Research Article

A Novel Method of Spam Mail Detection using Text Based Clustering Approach

by Dr. R. Prabhakar, M. Basavaraju
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 5 - Number 4
Year of Publication: 2010
Authors: Dr. R. Prabhakar, M. Basavaraju
10.5120/906-1283

Dr. R. Prabhakar, M. Basavaraju. A Novel Method of Spam Mail Detection using Text Based Clustering Approach. International Journal of Computer Applications. 5, 4 (August 2010), 15-25. DOI=10.5120/906-1283

@article{ 10.5120/906-1283,
author = { Dr. R. Prabhakar and M. Basavaraju },
title = { A Novel Method of Spam Mail Detection using Text Based Clustering Approach },
journal = { International Journal of Computer Applications },
issue_date = { August 2010 },
volume = { 5 },
number = { 4 },
month = { August },
year = { 2010 },
issn = { 0975-8887 },
pages = { 15-25 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume5/number4/906-1283/ },
doi = { 10.5120/906-1283 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T19:53:22.117045+05:30
%A Dr. R. Prabhakar
%A M. Basavaraju
%T A Novel Method of Spam Mail Detection using Text Based Clustering Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 5
%N 4
%P 15-25
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents a novel method of efficient spam mail classification using clustering techniques. E-mail spam is one of the major problems of today's internet, causing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. The paper proposes a new spam detection technique based on text clustering, with the data represented using a vector space model. With this method, spam and non-spam e-mail can be separated and spam detected efficiently. Clustering is the technique used for data reduction: it divides the data into groups based on pattern similarities, such that each group is abstracted by one or more representatives, and so models the data by its clusters. Recently there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns, which is known as data mining. Clustering is a type of classification imposed on a finite set of objects. If the objects are characterized as patterns, or points in an n-dimensional metric space, the proximity measure can be the Euclidean distance between a pair of points, or a similarity in the form of the cosine of the angle between the vectors corresponding to the documents. This work presents an efficient clustering algorithm that incorporates features of the K-means algorithm and the BIRCH algorithm. Nearest-neighbour distances and K-nearest-neighbour distances can serve as the basis for classifying test data under supervised learning. The predictive accuracy of the classifier is calculated for the clustering algorithm, and different evaluation measures are used to analyze the performance of the clustering algorithm in combination with the various classifiers. The results presented at the end of the paper show the effectiveness of the proposed method.
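
The two proximity measures named in the abstract, and the idea of classifying a message by its nearest cluster representative, can be illustrated with a small sketch. This is not the authors' K-means/BIRCH algorithm: it assumes plain whitespace tokenization, raw term-frequency vectors, and a single centroid per class as the "representative". The function names (`term_vector`, `centroid`, `classify`) and the toy messages are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a message to a sparse term-frequency vector (term -> count)."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors over the union of terms."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def centroid(vectors):
    """Mean vector of a group; stands in for a cluster representative."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def classify(text, representatives):
    """Return the label whose representative is most cosine-similar to text."""
    vec = term_vector(text)
    return max(
        representatives,
        key=lambda label: cosine_similarity(vec, representatives[label]),
    )


# Toy labelled messages (hypothetical, not taken from the paper's data).
spam_messages = ["win free money now", "free prize claim now"]
ham_messages = ["meeting agenda for monday", "project report attached for review"]

representatives = {
    "spam": centroid([term_vector(m) for m in spam_messages]),
    "ham": centroid([term_vector(m) for m in ham_messages]),
}

print(classify("claim your free money", representatives))
print(classify("agenda for the project meeting", representatives))
```

A fuller pipeline along the lines the abstract describes would keep several representatives per cluster and could vote over the K nearest ones rather than taking the single closest centroid.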

Index Terms

Computer Science
Information Sciences

Keywords

Spam Mail Detection, Text Based Clustering Approach, K-Nearest