International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 44
Year of Publication: 2024
Authors: Aditya Narvekar, Shubh Mehta
DOI: 10.5120/ijca2024924052
Aditya Narvekar, Shubh Mehta. Cat2Vec with Position Encoding: A New Approach for Handling Ordinal Features using Learned Embeddings with Positional Encoding. International Journal of Computer Applications. 186, 44 (Oct 2024), 9-15. DOI=10.5120/ijca2024924052
Machine learning projects spend a significant share of their time and budget on data preprocessing. Their success often depends on how features are handled and processed before model building begins; without rigorous exploration and preprocessing of features, projects frequently suffer time and cost overruns. This study proposes a new technique, called “Cat2Vec with position,” for handling categorical features. For nominal features, it uses learned embeddings. For ordinal features, it combines learned embeddings with positional encoding. Positional encoding is a technique used with transformers to encode the relative position of words in a sentence, and this study adapts it to ordinal variables, which are categorical variables whose values have an inherent order. The authors wrote the code for learning position encodings for ordinal variables. Experiments were run on a large dataset containing a mix of nominal and ordinal variables. They were built on sklearn pipelines, with each pipeline implementing one preprocessing approach: the typical approach, the new approach, and hybrid pipelines that combine elements of both. The experiments demonstrate that “Cat2Vec with position” outperforms traditional techniques for handling nominal and ordinal variables. To the best of the authors' knowledge, this is the first study to apply a positional encoding technique from NLP to encode ordinal variables.
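The idea of pairing a per-category embedding with a position signal can be illustrated with a minimal sketch. The abstract does not specify the authors' implementation, so everything below is an assumption for illustration only: the fixed sinusoidal encoding from the original transformer formulation stands in for the position encoding, a randomly initialized table stands in for embeddings that would normally be learned during training, and the feature levels, the dimension of 8, and all function names are hypothetical.

```python
import math
import random

def sinusoidal_position_encoding(num_positions, dim):
    """Build a num_positions x dim table of transformer-style sinusoidal encodings."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            # Each sin/cos pair shares one frequency; frequency falls as i grows.
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

# Hypothetical ordinal feature: the list order defines each level's position.
levels = ["low", "medium", "high", "very_high"]
rank = {level: i for i, level in enumerate(levels)}

dim = 8  # assumed embedding size
rng = random.Random(0)

# Stand-in for learned embeddings: in practice these weights would be trained.
embedding_table = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in levels]
pos_table = sinusoidal_position_encoding(len(levels), dim)

def encode(values):
    """Map ordinal values to embedding + position vectors (element-wise sum)."""
    encoded = []
    for value in values:
        idx = rank[value]
        encoded.append([e + p for e, p in zip(embedding_table[idx], pos_table[idx])])
    return encoded

encoded = encode(["low", "high", "medium"])
```

Under these assumptions, the embedding gives each level a free vector of its own, while the position term adds a component that depends only on where the level sits in the order. How the two parts are combined and trained in the published method may differ from this sketch.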