Research Article

Imbalanced Data Classification using Sampling Techniques and XGBoost

by Priyanka Lahoti, Ajeet Kumar Rai
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 182 - Number 12
Year of Publication: 2018
Authors: Priyanka Lahoti, Ajeet Kumar Rai
10.5120/ijca2018917735

Priyanka Lahoti, Ajeet Kumar Rai. Imbalanced Data Classification using Sampling Techniques and XGBoost. International Journal of Computer Applications. 182, 12 (Aug 2018), 19-22. DOI=10.5120/ijca2018917735

@article{ 10.5120/ijca2018917735,
author = { Priyanka Lahoti and Ajeet Kumar Rai },
title = { Imbalanced Data Classification using Sampling Techniques and XGBoost },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2018 },
volume = { 182 },
number = { 12 },
month = { Aug },
year = { 2018 },
issn = { 0975-8887 },
pages = { 19-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume182/number12/29872-2018917735/ },
doi = { 10.5120/ijca2018917735 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:11:13.945725+05:30
%A Priyanka Lahoti
%A Ajeet Kumar Rai
%T Imbalanced Data Classification using Sampling Techniques and XGBoost
%J International Journal of Computer Applications
%@ 0975-8887
%V 182
%N 12
%P 19-22
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Before applying any machine learning algorithm, it helps to have descriptive knowledge of the dataset. Consider a dataset in which more than 90% of the target values belong to class 1 and the remainder belong to class 2. On such data, accuracy is not a useful evaluation metric: a model that predicts class 1 for every unseen example already achieves more than 90% accuracy, so accuracy should not be relied on here. A problem with such a highly skewed target outcome is known as an imbalanced classification problem. A number of techniques exist for dealing with imbalanced datasets. In this paper, we examine how sampling techniques and XGBoost can be used when working with an imbalanced dataset.
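
The accuracy argument above can be checked with a few lines of arithmetic. The following is a minimal sketch in plain Python, not the authors' experiment: the 950/50 class split, the helper names (`accuracy`, `recall`, `random_oversample`), and the toy feature rows are all assumptions made for illustration. It uses simple random oversampling as a stand-in for the sampling techniques the paper refers to, and it does not involve XGBoost.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes
    have as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Assumed toy data: 950 examples of class 1 and 50 examples of class 2.
y_true = [1] * 950 + [2] * 50
X = [[i] for i in range(len(y_true))]

# A "classifier" that always predicts the majority class.
y_pred = [1] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)          # high despite learning nothing
minority_recall = recall(y_true, y_pred, positive=2)  # no class 2 example is found

# Balance the training data by oversampling the minority class.
X_bal, y_bal = random_oversample(X, y_true)
balanced_counts = Counter(y_bal)
```

In this sketch the majority-class baseline scores 95% accuracy while detecting none of the class 2 examples, which is why a metric such as recall or the area under the ROC curve is more informative on skewed targets. Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.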

Index Terms

Computer Science
Information Sciences

Keywords

Random Forest, XGBoost, ROC curve, Anomaly detection, ROSE