Research Article

Cat2Vec with Position Encoding: A New Approach for Handling Ordinal Features using Learned Embeddings with Positional Encoding

by Aditya Narvekar, Shubh Mehta
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 44
Year of Publication: 2024
Authors: Aditya Narvekar, Shubh Mehta
10.5120/ijca2024924052

Aditya Narvekar, Shubh Mehta. Cat2Vec with Position Encoding: A New Approach for Handling Ordinal Features using Learned Embeddings with Positional Encoding. International Journal of Computer Applications. 186, 44 (Oct 2024), 9-15. DOI=10.5120/ijca2024924052

@article{ 10.5120/ijca2024924052,
author = { Aditya Narvekar and Shubh Mehta },
title = { Cat2Vec with Position Encoding: A New Approach for Handling Ordinal Features using Learned Embeddings with Positional Encoding },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2024 },
volume = { 186 },
number = { 44 },
month = { Oct },
year = { 2024 },
issn = { 0975-8887 },
pages = { 9-15 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number44/cat2vec-with-position-encoding-a-new-approach-for-handling-ordinal-features-using-learned-embeddings-with-positional-encoding/ },
doi = { 10.5120/ijca2024924052 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-10-26T00:55:35+05:30
%A Aditya Narvekar
%A Shubh Mehta
%T Cat2Vec with Position Encoding: A New Approach for Handling Ordinal Features using Learned Embeddings with Positional Encoding
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 44
%P 9-15
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Machine learning projects spend a significant share of their time and budget on data preprocessing. The success of a project often depends on how features are handled and processed before model building begins, and without rigorous exploration and preprocessing of features, projects frequently suffer time and cost overruns. This study proposes a new technique, called “Cat2Vec with position,” for handling categorical features. For nominal features, it proposes the use of learned embeddings. For ordinal features, it proposes learned embeddings combined with positional encoding. Positional encoding is a technique used with transformers to encode the relative position of words in a sentence; this study adapts it to ordinal variables, which are categorical variables whose values have an inherent order. The authors wrote the code for learning position encodings for ordinal variables. Experiments were run on a large dataset containing a mix of nominal and ordinal variables and were built on sklearn pipelines, with each pipeline covering one approach to preprocessing. Pipelines were built using the typical approach, the new approach, and hybrid approaches that combine elements of both. The experiments demonstrate that the new approach outperforms traditional techniques for handling nominal and ordinal variables. To the best of the authors' knowledge, this is the first study to apply a positional encoding technique from NLP to encode ordinal variables.
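
The idea of encoding an ordinal value by its position can be illustrated with a minimal sketch. The code below is not the authors' implementation, which is not reproduced on this page and which learns its encodings. It shows only the fixed sinusoidal positional encoding from the transformer literature, applied to the rank of an ordinal level. The function names, the level names, and the dimension of 8 are assumptions made for illustration.

```python
import math


def sinusoidal_position_encoding(position: int, dim: int) -> list[float]:
    """Fixed sinusoidal encoding of one integer position (transformer-style)."""
    encoding = []
    for i in range(dim):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / dim).
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def encode_ordinal(value: str, ordered_levels: list[str], dim: int = 8) -> list[float]:
    """Map an ordinal category to the encoding of its rank in ordered_levels."""
    rank = ordered_levels.index(value)  # raises ValueError for an unseen level
    return sinusoidal_position_encoding(rank, dim)


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical ordinal feature; the level names are illustrative only.
LEVELS = ["low", "medium", "high", "very_high"]
vectors = {level: encode_ordinal(level, LEVELS) for level in LEVELS}
```

In this sketch, adjacent levels such as "low" and "medium" end up closer together than "low" and "high", which is the ordering information a one-hot encoding discards. The distance does not grow monotonically for arbitrarily large rank gaps, so a learned variant, as the study proposes, may behave differently.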

Index Terms

Computer Science
Information Sciences

Keywords

Machine learning, learned embeddings, learned embeddings with positions, categorical variable, nominal variable, ordinal variable, linear regression, k-nearest neighbors, random forests, support vector machines, XGBoost, positional encoding, big data