International Journal of Computer Applications |
Foundation of Computer Science (FCS), NY, USA |
Volume 24 - Number 3 |
Year of Publication: 2011 |
Authors: Dipak V. Patil, R. S. Bichkar |
10.5120/2930-3878 |
Dipak V. Patil, R. S. Bichkar . An Optimistic Data Mining Approach for Handling Large Data Set using Data Partitioning Techniques. International Journal of Computer Applications. 24, 3 ( June 2011), 29-33. DOI=10.5120/2930-3878
The use of the Internet for various purposes leads to collection of large volume of data. The knowledge contents of large data can be utilized to improve decision-making process of an organization. The knowledge discovery on this high volume data becomes very slow, as it has to be done serially on currently available terabyte plus data sets. In some cases, mining of large data set may become impossible due to limitations of processor and memory. The proposed algorithm is based on Tim Oates and Davis Jensen’s [1] findings which state that increasing size of training data does not considerably increase classification accuracy of a classifier. The proposed algorithm also follows survival of the fittest principal used in genetic algorithm. The solution provides partitioning algorithm wherein decision trees can be learned on partitioned data that are disjoint subsets of a complete data set. These learned decision trees have comparable accuracies with each other and that is equivalent to the tree learned on complete data set. The algorithm finds a single tree with highest accuracy amongst the learned decision trees. The selected decision tree is used for classification of unseen data. The results on 12 benchmark data sets from UCI data repository indicate that the final learned decision tree have equal accuracy and in many cases, significant improvement in classification accuracy is observed, improvement in classification performance as compared to decision trees learned on the entire data set. An experiment on big data set Census-income (KDD) also supports the claim. The most important aspect of this approach is that it is very simple as compared to other methods with enhanced classification performance.