International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 17 - Number 1
Year of Publication: 2011
Authors: Rishi Sayal, V. Vijay Kumar
DOI: 10.5120/2184-2757
Rishi Sayal, V. Vijay Kumar. A Novel Similarity Measure for Clustering Categorical Data Sets. International Journal of Computer Applications. 17, 1 (March 2011), 25-30. DOI=10.5120/2184-2757
Measuring similarity between two data objects is a challenging problem in data mining and knowledge discovery. Traditional clustering algorithms have focused mainly on numerical data, whose inherent geometric properties can be exploited to define a distance function between data points and, from it, a similarity measure. The problem becomes more complex when the data is categorical: categorical attributes have no natural ordering of values and can be regarded as non-geometric attributes. Clustering relational data sets in which the majority of attributes are categorical is therefore of particular interest. No earlier work has clustered the categorical attributes of relational data sets using functional dependency as a parameter for measuring similarity. This paper extends earlier work on clustering relational data sets in which domains are unique and similarity is context based, and it introduces a new notion of similarity based on the dependency of an attribute on the other attributes in the relational data set. The paper also gives a brief overview of popular similarity measures for categorical attributes. The proposed similarity measure can be applied to tuples and to their respective attribute values. An important property of categorical domains is that they contain a small number of attribute values. The similarity measure for relational data sets can then be applied to these smaller data sets for efficient results.
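To make the two ingredients named in the abstract concrete, the following minimal Python sketch shows a simple overlap measure between categorical tuples and an estimate of how strongly one attribute depends on another. It is an illustration under stated assumptions and not the measure proposed in the paper: the function names `overlap_similarity` and `dependency_strength`, the majority-count definition of dependency strength, and the toy rows are all hypothetical.

```python
from collections import defaultdict


def overlap_similarity(t1, t2):
    """Fraction of attribute positions on which two tuples agree.

    This is the plain overlap measure for categorical data, used here
    only as a baseline illustration.
    """
    if len(t1) != len(t2):
        raise ValueError("tuples must have the same number of attributes")
    if not t1:
        return 0.0
    matches = sum(1 for a, b in zip(t1, t2) if a == b)
    return matches / len(t1)


def dependency_strength(rows, lhs, rhs):
    """Fraction of rows consistent with the dependency lhs -> rhs.

    Assumption (not from the paper): for each value of attribute `lhs`,
    rows carrying the most frequent value of attribute `rhs` count as
    consistent. A result of 1.0 means lhs -> rhs holds exactly.
    """
    if not rows:
        return 0.0
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1
    consistent = sum(max(rhs_counts.values()) for rhs_counts in counts.values())
    return consistent / len(rows)


# Hypothetical toy relation: (color, fruit, taste)
rows = [
    ("red", "apple", "sweet"),
    ("red", "apple", "sour"),
    ("yellow", "banana", "sweet"),
    ("green", "apple", "sour"),
]

sim = overlap_similarity(rows[0], rows[1])       # agree on color and fruit
color_to_fruit = dependency_strength(rows, 0, 1)  # color -> fruit
fruit_to_color = dependency_strength(rows, 1, 0)  # fruit -> color
```

In this toy relation every color maps to a single fruit, so color -> fruit holds exactly, while fruit -> color holds only for three of the four rows. A dependency-aware similarity measure could, for instance, weight attribute matches by such strengths, although how the paper combines them is not stated in the abstract.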