Research Article

Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN

by Srinivas Suresh Sikhakolli, Asha Kiran Sikhakolli
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 106
Year of Publication: 2026
10.5120/ijca372b529d7c61

Srinivas Suresh Sikhakolli, Asha Kiran Sikhakolli. Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN. International Journal of Computer Applications. 187, 106 (May 2026), 22-26. DOI=10.5120/ijca372b529d7c61

@article{ 10.5120/ijca372b529d7c61,
author = { Srinivas Suresh Sikhakolli and Asha Kiran Sikhakolli },
title = { Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN },
journal = { International Journal of Computer Applications },
issue_date = { May 2026 },
volume = { 187 },
number = { 106 },
month = { May },
year = { 2026 },
issn = { 0975-8887 },
pages = { 22-26 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number106/synthetic-medical-data-generation-using-transformer-based-generative-ai-a-performance-comparison-with-faker-and-ctgan/ },
doi = { 10.5120/ijca372b529d7c61 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-05-21T00:16:55.327031+05:30
%A Srinivas Suresh Sikhakolli
%A Asha Kiran Sikhakolli
%T Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 106
%P 22-26
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Access to medical data is essential for healthcare research and advanced analytics. However, strict privacy regulations significantly limit data availability and hinder machine learning applications. Because of these limitations, the use of synthetic data is rising across the world. Prior studies focused on building synthetic data using rule-based models such as Faker and deep learning models such as CTGAN. In recent years, ChatGPT, a transformer-based generative AI model, has emerged with advanced capabilities to generate a wide variety of synthetic data on demand. The aim of this research is to show that the transformer-based generative AI model produces quality synthetic data that yields better predictive performance than the Faker and CTGAN models. The synthetic data was generated with reference to the UCI Cleveland Heart data, and a Random Forest algorithm was used to evaluate performance. The experimental results show that synthetic data generated by ChatGPT yields better performance than data generated by the Faker and CTGAN models, and that its performance metrics are close to those of the actual Cleveland heart data. Our findings suggest that the ChatGPT model effectively captured clinical relationships and offers practical insights for researchers without compromising privacy in the synthetic data. This type of experiment is useful for non-clinical research.
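The evaluation idea described in the abstract (score each synthetic dataset by how well a Random Forest trained on it predicts real outcomes) can be illustrated with a minimal sketch. The abstract does not state the exact protocol, so the train-on-synthetic, test-on-real setup below is an assumption, as are the helper names `evaluate_training_set` and `toy_table`, the scikit-learn hyperparameters, and the randomly generated toy tables, which stand in for the Cleveland data and for the outputs of the three generators.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_training_set(X_train, y_train, X_real_test, y_real_test, seed=0):
    """Fit a Random Forest on one training table and score it on held-out real rows."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_real_test)
    return {
        "accuracy": accuracy_score(y_real_test, pred),
        "f1": f1_score(y_real_test, pred),
    }


def toy_table(n, rng):
    """Hypothetical stand-in for a clinical table: 5 numeric features, binary label."""
    X = rng.normal(size=(n, 5))
    # The label depends on the first two features plus noise.
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y


rng = np.random.default_rng(0)

# "Real" data, split so that every model is tested on the same real rows.
X_real, y_real = toy_table(600, rng)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0, stratify=y_real
)

# Placeholder synthetic tables (assumptions, not actual generator output):
# one keeps the feature-label relationship, the other assigns labels at random.
X_kept, y_kept = toy_table(420, rng)
X_rand = rng.normal(size=(420, 5))
y_rand = rng.integers(0, 2, size=420)

results = {
    "real": evaluate_training_set(X_tr, y_tr, X_te, y_te),
    "synthetic_structure_kept": evaluate_training_set(X_kept, y_kept, X_te, y_te),
    "synthetic_random_labels": evaluate_training_set(X_rand, y_rand, X_te, y_te),
}

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

In this setup, a synthetic table that preserves the relationships between features and the outcome should score close to the real-data baseline, while one that loses them should fall toward chance. That is the kind of gap the comparison in the abstract relies on.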

Index Terms

Computer Science
Information Sciences

Keywords

Synthetic Medical Data, Generative AI, Privacy-Preserving Data, Random Forest, Model Performance