Research Article

Systematic Review on Text Normalization Techniques and its Approach to Non-Standard Words

by Abubakar Ahmad Aliero, Bashir Sulaimon Adebayo, Hamzat Olanrewaju Aliyu, Amina Gogo Tafida, Bashar Umar Kangiwa, Nasiru Muhammad Dankolo
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 185 - Number 33
Year of Publication: 2023
DOI: 10.5120/ijca2023923106

Abubakar Ahmad Aliero, Bashir Sulaimon Adebayo, Hamzat Olanrewaju Aliyu, Amina Gogo Tafida, Bashar Umar Kangiwa, Nasiru Muhammad Dankolo. Systematic Review on Text Normalization Techniques and its Approach to Non-Standard Words. International Journal of Computer Applications. 185, 33 (Sep 2023), 44-55. DOI=10.5120/ijca2023923106

@article{ 10.5120/ijca2023923106,
author = { Abubakar Ahmad Aliero, Bashir Sulaimon Adebayo, Hamzat Olanrewaju Aliyu, Amina Gogo Tafida, Bashar Umar Kangiwa, Nasiru Muhammad Dankolo },
title = { Systematic Review on Text Normalization Techniques and its Approach to Non-Standard Words },
journal = { International Journal of Computer Applications },
issue_date = { Sep 2023 },
volume = { 185 },
number = { 33 },
month = { Sep },
year = { 2023 },
issn = { 0975-8887 },
pages = { 44-55 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume185/number33/32905-2023923106/ },
doi = { 10.5120/ijca2023923106 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:27:42.719312+05:30
%A Abubakar Ahmad Aliero
%A Bashir Sulaimon Adebayo
%A Hamzat Olanrewaju Aliyu
%A Amina Gogo Tafida
%A Bashar Umar Kangiwa
%A Nasiru Muhammad Dankolo
%T Systematic Review on Text Normalization Techniques and its Approach to Non-Standard Words
%J International Journal of Computer Applications
%@ 0975-8887
%V 185
%N 33
%P 44-55
%D 2023
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text normalization is the process of transforming text into a standardized, canonical form. It involves correcting spelling errors, expanding abbreviations, resolving contractions, and normalizing punctuation, capitalization, and other linguistic variations to ensure consistent and coherent representations of textual data. The goal of text normalization is to reduce the lexical and orthographic variation in text, making it easier to process, analyze, and understand. It is a critical preprocessing step in many natural language processing (NLP) tasks, such as machine translation, text-to-speech synthesis, sentiment analysis, and information retrieval. Many techniques and approaches have been used to normalize different kinds of text, including user-generated content (UGC). This normalization helps to improve the performance of downstream NLP tasks. This paper provides a broad picture of state-of-the-art research in the area of text normalization from 2018 to 2022. About 54 journal and conference papers were selected to identify and analyze trends in text normalization techniques, approaches, and issues in the field. The use of datasets and evaluation metrics was excluded and is left for future research.
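
As a rough illustration of the steps named in the abstract, the sketch below applies a few rule-based normalizations (lowercasing, collapsing repeated punctuation, resolving contractions, expanding abbreviations) to a noisy sentence. It is not taken from the paper: the lookup tables, the `normalize` function name, and the tokenization pattern are all assumptions made for this example, and spelling correction is deliberately left out.

```python
import re

# Hypothetical lookup tables, invented for this illustration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
ABBREVIATIONS = {"u": "you", "gr8": "great", "pls": "please", "2moro": "tomorrow"}


def normalize(text: str) -> str:
    """Apply a handful of simple rule-based normalization steps."""
    # Capitalization: map everything to lowercase.
    text = text.lower()
    # Punctuation: collapse runs of the same mark ("!!!" becomes "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Split into word tokens (keeping apostrophes) and punctuation tokens.
    tokens = re.findall(r"[a-z0-9']+|[!?.,]", text)
    normalized = []
    for token in tokens:
        # Resolve contractions, then expand abbreviations, via dictionary lookup.
        token = CONTRACTIONS.get(token, token)
        token = ABBREVIATIONS.get(token, token)
        normalized.append(token)
    # Rejoin tokens and reattach punctuation to the preceding word.
    return re.sub(r"\s+([!?.,])", r"\1", " ".join(normalized))


print(normalize("Pls come 2moro!!! I can't wait, u r gr8"))
```

Tokens missing from both tables (such as "r" in the input above) pass through unchanged, which is one reason purely dictionary-driven rules are usually combined with statistical or neural methods of the kind listed in the keywords.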

Index Terms

Computer Science
Information Sciences

Keywords

Text Normalization, Techniques, Method, Approach, Rule-based, Statistical Method, Neural Network, Similarity-based, Context-based