We apologize for a recent technical issue with our email system, which temporarily affected account activations. Accounts have now been activated. Authors may proceed with paper submissions. PhDFocusTM
CFP last date
20 November 2024
Reseach Article

Article:Discovering Local Outliers using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters

by S. John Peter
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 9 - Number 7
Year of Publication: 2010
Authors: S. John Peter
10.5120/1396-1885

S. John Peter . Article:Discovering Local Outliers using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters. International Journal of Computer Applications. 9, 7 ( November 2010), 36-42. DOI=10.5120/1396-1885

@article{ 10.5120/1396-1885,
author = { S. John Peter },
title = { Article:Discovering Local Outliers using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters },
journal = { International Journal of Computer Applications },
issue_date = { November 2010 },
volume = { 9 },
number = { 7 },
month = { November },
year = { 2010 },
issn = { 0975-8887 },
pages = { 36-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume9/number7/1396-1885/ },
doi = { 10.5120/1396-1885 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T19:58:32.419396+05:30
%A S. John Peter
%T Article:Discovering Local Outliers using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters
%J International Journal of Computer Applications
%@ 0975-8887
%V 9
%N 7
%P 36-42
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Detecting outliers in database (as unusual objects) using Clustering and Distance-based approach is a big desire. Minimum spanning tree based clustering algorithm is capable of detecting clusters with irregular boundaries. In this paper we propose a new algorithm to detect outliers based on minimum spanning tree clustering and distance-based approach. Outlier detection is an extremely important task in a wide variety of application. The algorithm partition the dataset into optimal number of clusters. Small clusters are then determined and considered as outliers. The rest of the outliers (if any) are then detected in the clusters using Distance-based method. The algorithm uses a new cluster validation criterion based on the geometric property of data partition of the dataset in order to find the proper number of clusters. The algorithm works in two phases. The first phase of the algorithm creates optimal number of clusters, where as the second phase of the algorithm detect outliers in the clusters. The key feature of our approach is it combines the best features of Distance-based and Clustering-based outlier detection to find noise-free/error-free clusters for a given dataset without using any input parameters.

References
  1. C. Aggarwal and P. Yu, “Outlier Detection for High Dimensional Data”, In Proceedings of the ACM SIGMOD International Conference on Management of Data, Volume 30, Issue 2,pages 37 – 46, May 2001.
  2. J.Almeida, L.Barbosa, A.Pais and S.Formosinho, “Improving Hierarchical Cluster Analysis: A new method with OutlierDetection and Automatic Clustering”, Chemometrics and Intelligent Laboratory Systems 87:208-217, 2007.
  3. Anil K. Jain, Richard C. Dubes “Algorithm for Clustering Data”, Michigan State University, Prentice Hall, Englewood Cliffs, New Jersey 07632.1988.
  4. F.Angiulli, and C.Pizzuti, “Outlier Mining in Large High-Dimensional Data sets”, IEEETransactions on Kaoeledge and Data Engineering, 17(2): 203-215, 2005
  5. T. Asano, B. Bhattacharya, M.Keil and F.Yao,”Clustering Algorithms based on minimum and maximum spanning trees”. In Proceedings of the 4th Annual Symposium on ComputationalGeometry,Pages252-257,1988.
  6. D. Avis “Diameter partitioning”, Discrete and Computational Geometry, 1:265-276, 1986.
  7. S. Bay and M. Schwabacher, “Mining distance-based outliers in near Linear Time with Randomization and a Simple Pruning Rule”. SIGKDD ’03, Washington, DC, USA 2003.
  8. C. Bei and R. Gray, “An improvement of the minimum distortion encoding algorithm for vector quantization”, IEEE Trans. Commun, 33:1132-1133, 1985.
  9. S. Chen and W. Hsieh, “Fast algorithm for VQ codebook Design”, IEEE Proc., 138: 357-362, 1991.
  10. B.Custem and I.Gath, “Detection of Outliers and Robust Estimation using Fuzzy clustering”, Computational Statistics & data Analyses 15,pp.47-61, 1993.
  11. Fawaz A.M. Masoud, Moh’d Belal Al-Zoubi, Imad Salah and Ali AL-Dahoud, “Fast Algorithms for Outlier Detection”, Journal of Computer Science 4(2): 129-132, 2008.
  12. Feng Luo,Latifur Kahn, Farokh Bastani, I-Ling Yen, and Jizhong Zhou, “A dynamically growing self-organizing tree(DGOST) for hierarchical gene expression profile” ,Bioinformatics,Vol 20,no 16, pp 2605-2617, 2004.
  13. M. Fredman and D. Willard, “Trans-dichotomous algorithms for minimum spanning trees and shortest paths”, In Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science, pages 719-725, 1990.
  14. Gath and A.Geva, “Fuzzy Clustering for the estimation of the Parameters of the components of Mixtures of Normal distribution”, Pattern Recognition letters,9,pp.77-86, 1989.
  15. B. Ghosh-Dastidar and J.L. Schafer,” Outlier Detection and Editing Procedures for Continuous Multivariate Data”, ORP Working Papers, September 2003. (http://www.opr.princeton.edu/papers/), visited 20.09.2004.
  16. A. Hardy, “On the number of clusters”, Computational Statistics and Data Analysis, 23, pp. 83–96, 1996.
  17. T. Hastie, R. Tibshirani, and J. Friedman, “The elements of statistical learning: Data mining, inference and prediction”, Springer-Verlag, 2001.
  18. Z. He, X. Xu and S. Deng, “Discovering cluster-based Local Outliers”. Pattern Recognition Letters, Volume 24, Issue 9-10, pp 1641 – 1650, June 2003.
  19. H.Gabow, T.Spencer and R.Rarjan. “Efficient algorithms for finding minimum spanning trees in undirected and directed graphs”, Combinatorica, 6(2):pp 109-122, 1986.
  20. M. Jaing, S. Tseng and C. Su, “Two-phase Clustering Process for Outlier Detection “, Pattern Recognition Letters, Volume 22, Issue 6 – 7, pp 691 – 700, May 2001.
  21. S. John Peter, S.P. Victor, “A Novel Algorithm for Meta similarity clusters using Minimum spanning tree”, International Journal of computer science and Network Security. Vol.10 No.2 pp. 254 – 259, 2010
  22. D. Karger, P. Klein and R. Tarjan, “A randomized linear-time algorithm to find minimum spanning trees”. Journal of the ACM, 42(2):321-328, 1995.
  23. E. Knorr and R. Ng, “A Unified Notion of Outliers: Properties and Computation”. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 219 – 222, August 1997.
  24. E.Knorr and R.Ng, “Algorithms for Mining Distance-based Outliers in Large Data sets”, Proc.the 24th International Conference on Very Large Databases(VLDB),pp.392-403, 1998.
  25. E.Knorr, R.Ng and V.Tucakov, “Distance- Based Outliers: Algorithms and Applications”, VLDB Journal, 8(3-4):237-253, 2000.
  26. J. Kruskal,” On the shortest spanning subtree and the travelling salesman problem”, In Proceedings of the American Mathematical Society, pp 48-50, 1956.
  27. Y. Linde, A. Buzo and R. Gray, 1980. ”An algorithm for vector quantizer design”. IEEE Trans. Commun., 28: 89-95.
  28. A.Loureiro, L.Torgo and C.Soares, “Outlier detection using Clustering methods: A data cleaning Application”, In Proceedings of KDNet Symposium on Knowledge-based systems for the Public Sector. Bonn, Germany, 2004.
  29. Moh’d Belal Al-Zoubi, “An Effective Clustering-Based Approach for Outlier Detection”, European Journal of Scientific Research, Vol.28 No.2 ,pp.310-316, 2009
  30. Moh’d Belal Al-Zoubi and Nadim Obeid, “A Fast Distance-Based Approach to Detect Outliers”, Journal Of Computer Science 3 (12) ,pp.944-3947, 2007
  31. J. Nesetril, E.Milkova and H.Nesetrilova. Otakar boruvka “On Minimum spanning tree problem: Translation of both the 1926 papers, comments, history. DMATH:”, Discrete Mathematics, 233, 2001.
  32. K.Niu, C.Huang, S.Zhang and J.Chen, “ODDC: Outlier Detection Using Distance Distribution Clustering”, T.Washio et al. (Eds.): PAKDD 2007 Workshops, Lecture Notes in Artificial Intelligence (LNAI) 4819,pp.332-343,Springer-Verlag, 2007.
  33. Oleksandr Grygorash, Yan Zhou, Zach Jorgensen, “Minimum spanning Tree Based Clustering Algorithms”, Proceedings of the 18th IEEE International conference on tools with Artificial Intelligence (ICTAI’06), 2006.
  34. K. Paliwal and V. Ramasubramanian, “Effect of ordering the codebook on the efficiency of the partial distance search algorithm for vector quantization”, IEEE Trans. Commun., 37:538-540, 1989.
  35. F. Preparata and M.Shamos, “Computational Geometry: An Introduction”. Springer-Verlag, Newyr, NY ,USA, 1985¬.
  36. R. Prim, “Shortest connection networks and some generalization”. Bell systems Technical Journal,36:1389-1401, 1957.
  37. S. Ramaswamy, R. Rastogi and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets”, In Proceedings of the 2000 ACM SIGMOD InternationalConference on Management of Data, Volume 29, Issue 2, pages 427 – 438, May 2000.
  38. D. M. Rocke and J. J. Dai, “Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data”, Data Mining and Knowledge Discovery, 7, pp. 215–232, 2003.
  39. S. Salvador and P. Chan, “Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms”, in Proceedings Sixteenth IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004, Los Alamitos, CA, USA, IEEE Computer Society, pp. 576–584 , 2004.
  40. S. Still and W. Bialek, “How many clusters?, An information-theoretic perspective”, Neural Computation, 16, pp. 2483–2506, 2004.
  41. Stefan Wuchty and Peter F. Stadler, “Centers of Complex Networks”, 2006
  42. C. Sugar and G. James, “Finding the number of clusters in a data set, An information theoretic approach”, Journal of the American Statistical Association, 98 pp. 750–763, 2003.
  43. N. Venkateswarlu and P. Raju, “A new fast classifier for remotely sensed images”. Int.J. Remote Sens., 14:383-390, 1993.
  44. G. Williams, R. Baxter, H. He, S. Hawkins and L. Gu, “A Comparative Study for RNN for Outlier Detection in Data Mining”, In Proceedings of the 2nd IEEE International Conference on Data Mining, page 709, Maebashi City, Japan, December 2002.
  45. Y.Xu, V.Olman and D.Xu, “Minimum spanning trees for gene expression data clustering”, Genome Informatics, 12:24-33, 2001.
  46. K.Yoon, O.Kwon and D.Bae, “An approach to outlier Detection of Software Measurement Data using the K-means Clustering Method”, First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), Madrid.pp.443-445, 2007.
  47. C. Zahn, “Graph-theoretical methods for detecting and describing gestalt clusters”, IEEE Transactions on Computers, C-20:68-86, 1971
  48. J.Zhang and N.Wang, “Detecting outlying subspaces for high-dimensional data: the new task, Algorithms and Performance”, Knowledge and Information Systems, 10(3):333-555, 2006.
Index Terms

Computer Science
Information Sciences

Keywords

Euclidean minimum spanning tree Subtree Clustering Eccentricity Cluster validity Cluster Separation Small Clusters Outliers