Research Article

Synthetic Data Generation for Automated JavaScript Vulnerability Detection using Fine-Tuned CodeBERT

by Harun Hadzagic, Zerina Altoka
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 90
Year of Publication: 2026
Authors: Harun Hadzagic, Zerina Altoka
10.5120/ijca2026926575

Harun Hadzagic, Zerina Altoka. Synthetic Data Generation for Automated JavaScript Vulnerability Detection using Fine-Tuned CodeBERT. International Journal of Computer Applications. 187, 90 (Mar 2026), 16-22. DOI=10.5120/ijca2026926575

@article{ 10.5120/ijca2026926575,
author = { Harun Hadzagic and Zerina Altoka },
title = { Synthetic Data Generation for Automated JavaScript Vulnerability Detection using Fine-Tuned CodeBERT },
journal = { International Journal of Computer Applications },
issue_date = { Mar 2026 },
volume = { 187 },
number = { 90 },
month = { Mar },
year = { 2026 },
issn = { 0975-8887 },
pages = { 16-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number90/synthetic-data-generation-for-automated-javascript-vulnerability-detection-using-fine-tuned-codebert/ },
doi = { 10.5120/ijca2026926575 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-03-20T22:55:35.589170+05:30
%A Harun Hadzagic
%A Zerina Altoka
%T Synthetic Data Generation for Automated JavaScript Vulnerability Detection using Fine-Tuned CodeBERT
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 90
%P 16-22
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The dynamic and flexible nature of JavaScript, the foundational language of modern web development, makes it highly susceptible to vulnerabilities such as Cross-Site Scripting (XSS), SQL Injection, and Hardcoded Secrets. Traditional security analysis tools and manual code review both struggle to maintain accuracy and scalability in complex codebases, especially as AI is increasingly used to produce code. To address this, this paper presents a CodeBERT transformer model fine-tuned for automated binary sequence classification. A balanced dataset was constructed of 71 vulnerabilities with 60 JavaScript code snippets (30 pairs of secure and insecure versions) generated through advanced LLMs. A Pair-ID splitting methodology kept both versions of each pair in the same split, so the model was evaluated on truly unseen vulnerability patterns, preventing data leakage and overfitting. The fine-tuned CodeBERT model achieved an F1-Score of 0.9413 on the held-out test set. Crucially, the model attained a Recall of 0.9468 for the 'Insecure' class, confirming its ability to minimize missed vulnerabilities, the most critical error in security screening. Furthermore, a generalization check using an alternating dataset validated the model's robustness, maintaining a high F1-Score. The findings demonstrate the viability of specialized Code LLMs for reliable vulnerability detection, paving the way for low-latency integration into continuous integration pipelines to enforce secure coding practices in real time.
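
The Pair-ID splitting and the reported metrics can be illustrated with a minimal Python sketch. It is not the authors' code: the field names `pair_id`, `label`, and `code`, the helper names, the 20% test fraction, and the toy data are all assumptions made for illustration. The idea shown is that the secure and insecure versions of a snippet share one pair ID and are always assigned to the same split, and that Recall and F1 are computed with the 'Insecure' class as the positive label.

```python
import random
from collections import defaultdict


def split_by_pair_id(samples, test_fraction=0.2, seed=0):
    """Split samples so both members of a pair land in the same split.

    `samples` is a list of dicts with a hypothetical "pair_id" key.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["pair_id"]].append(sample)

    # Shuffle pair IDs (not individual snippets) so pairs stay together.
    pair_ids = sorted(groups)
    random.Random(seed).shuffle(pair_ids)

    n_test = max(1, round(len(pair_ids) * test_fraction))
    test = [s for pid in pair_ids[:n_test] for s in groups[pid]]
    train = [s for pid in pair_ids[n_test:] for s in groups[pid]]
    return train, test


def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1 for the positive class (here 1 = insecure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Toy stand-in for the dataset: 30 pairs, label 0 = secure, 1 = insecure.
samples = [
    {"pair_id": i, "label": label, "code": f"/* snippet {i}-{label} */"}
    for i in range(30)
    for label in (0, 1)
]
train, test = split_by_pair_id(samples)

# No pair ID appears on both sides of the split.
train_ids = {s["pair_id"] for s in train}
test_ids = {s["pair_id"] for s in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting on pair IDs rather than on individual snippets matters because the two versions of a pair differ only in the vulnerable construct; if one version were in the training set and the other in the test set, the test score would partly measure memorization of the shared code.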

Index Terms

Computer Science
Information Sciences

Keywords

Synthetic Data Generation, JavaScript Vulnerability Detection, CodeBERT, Static Code Analysis, Secure JavaScript, Transformer Models