Research Article

Evaluating Text-to-Text Generation from LLMs: A Case Study and Scalable Framework

by Ziqiao Ao, Juhi Singh, Sebastian Antinome
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 94
Year of Publication: 2026
Authors: Ziqiao Ao, Juhi Singh, Sebastian Antinome
10.5120/ijca2026926616

Ziqiao Ao, Juhi Singh, Sebastian Antinome. Evaluating Text-to-Text Generation from LLMs: A Case Study and Scalable Framework. International Journal of Computer Applications 187, 94 (Mar 2026), 1-10. DOI=10.5120/ijca2026926616

@article{10.5120/ijca2026926616,
author = {Ziqiao Ao and Juhi Singh and Sebastian Antinome},
title = {Evaluating Text-to-Text Generation from LLMs: A Case Study and Scalable Framework},
journal = {International Journal of Computer Applications},
issue_date = {Mar 2026},
volume = {187},
number = {94},
month = {Mar},
year = {2026},
issn = {0975-8887},
pages = {1-10},
numpages = {10},
url = {https://ijcaonline.org/archives/volume187/number94/evaluating-text-to-text-generation-from-llms-a-case-study-and-scalable-framework/},
doi = {10.5120/ijca2026926616},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-03-29T02:17:20+05:30
%A Ziqiao Ao
%A Juhi Singh
%A Sebastian Antinome
%T Evaluating Text-to-Text Generation from LLMs: A Case Study and Scalable Framework
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 94
%P 1-10
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Large Language Models (LLMs) have enabled a wide range of text-to-text generation applications across diverse domains, yet robust evaluation of their outputs remains challenging, particularly for open-ended tasks where ground truth is unavailable. This paper introduces a comprehensive and scalable evaluation framework for LLM-generated instructional content, integrating statistical, semantic, lexical, and domain-specific metrics. The effectiveness of the framework is demonstrated through a real-world case study that converts Microsoft Learn content into PowerPoint slides for Instructor-Led Training (ILT). The evaluation suite combines established metrics such as Perplexity, Entropy, and BERTScore with task-specific measures including Context Match Score and Rule Compliance Score, as well as rubric-driven assessments using an LLM-as-a-Judge approach. Experimental results from iterative prompt refinement demonstrate consistent gains in semantic fidelity, structural compliance, and instructional clarity. The framework facilitates reliable evaluation without reliance on ground truth and delivers actionable insights for prompt optimization in enterprise-scale generative workflows. While demonstrated in an instructional content generation setting, the framework generalizes to a broad class of text-to-text generation tasks.
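As a rough illustration of the ground-truth-free metrics the abstract describes, the sketch below computes a token-level Entropy and a toy Rule Compliance Score over generated slide bullets. The specific rules checked (bullet count, words per bullet, no trailing periods) are illustrative assumptions, not the paper's actual rubric.

```python
import math
from collections import Counter

def token_entropy(text: str) -> float:
    """Shannon entropy (bits) of the whitespace-token distribution:
    a simple lexical-diversity proxy that needs no reference text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rule_compliance_score(bullets: list[str],
                          max_bullets: int = 5,
                          max_words: int = 12) -> float:
    """Fraction of formatting rules the slide satisfies.
    The three rules below are hypothetical examples of the kind of
    structural constraints a slide-generation prompt might impose."""
    rules = [
        len(bullets) <= max_bullets,                          # bullet count
        all(len(b.split()) <= max_words for b in bullets),    # bullet length
        all(not b.endswith(".") for b in bullets),            # no trailing periods
    ]
    return sum(rules) / len(rules)

slide = ["Overview of Azure compute options", "Key service tiers"]
print(round(token_entropy(" ".join(slide)), 3))
print(rule_compliance_score(slide))
```

In a fuller pipeline, scores like these would be aggregated per slide deck and tracked across prompt-refinement iterations, alongside model-based metrics such as BERTScore and an LLM-as-a-Judge rubric.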

References
  1. OpenAI. OpenAI Evals, 10 2023.
  2. Maider Azanza, Beatriz Pérez Lamancha, and Eneko Pizarro. Tracking the moving target: A framework for continuous evaluation of llm test generation in industry, 2025.
  3. Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, and Apoorv Saxena. Enhancing presentation slide generation by LLMs with a multi-staged end-to-end approach. In Saad Mahamood, Nguyen Le Minh, and Daphne Ippolito, editors, Proceedings of the 17th International Natural Language Generation Conference, pages 222–229, Tokyo, Japan, September 2024. Association for Computational Linguistics.
  4. Jonas Becker, Jan Philip Wahle, Bela Gipp, and Terry Ruas. Text generation: A systematic literature review of tasks, evaluation, and challenges, 05 2024.
  5. Rishi Bommasani, Percy Liang, and Tong Lee. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525:140–146, 05 2023.
  6. Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. Evaluation of text generation: A survey. arXiv:2006.14799 [cs], 05 2021.
  7. Stanford University Center for Research on Foundation Models. Holistic evaluation of language models (helm), 2025.
  8. Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, and Ji-Rong Wen. Towards coarse-to-fine evaluation of inference efficiency for large language models, 2024.
  9. Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54:755–810, 06 2020.
  10. Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California, June 2016. Association for Computational Linguistics.
  11. Yi Li, Haonan Wang, Qixiang Zhang, Boyu Xiao, Chenchang Hu, Hualiang Wang, and Xiaomeng Li. Unieval: Unified holistic evaluation for unified multimodal understanding and generation, 2025.
  12. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment, 12 2023.
  13. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019.
  14. Harsh Saini, Md Tahmid Rahman Laskar, Cheng Chen, Elham Mohammadi, and David Rossouw. LLM evaluate: An industry-focused evaluation tool for large language models. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, and Apoorv Agarwal, editors, Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 286–294, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.
  15. Elena Samuylova. Llm-as-a-judge: a complete guide to using llms for evaluations, 06 2025.
  16. Liming Shao, Hong Yu, Wei Huang, Huiyuan Zhao, Lixin Zhang, and Jing Song. Deepseek-based multi-dimensional augmentation of short and highly domain-specific textual inquiries for aquaculture question-answering framework. Aquaculture International, 33, 04 2025.
  17. Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Jared D, Kaplan, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, Agarwal Sandhini, Herbert-Voss Ariel, Krueger Gretchen, Henighan Tom, Child Rewon, Ramesh Aditya, Ziegler Daniel, Wu Jeffrey, Winter Clemens, Hesse Chris, Chen Mark, Sigler Eric, Litwin Mateusz, Gray Scott, Chess Benjamin, Clark Jack, Berner Christopher, McCandlish Sam, Radford Alec, Sutskever Ilya, and Amodei Dario. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 2020.
  18. Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of retrieval-augmented generation: A survey, 05 2024.
  19. Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: evaluating generated text as text generation. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA, 2021. Curran Associates Inc.
  20. Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert, 09 2019.
  21. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 11 2023.
  22. Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, page 1097–1100, New York, NY, USA, 2018. Association for Computing Machinery.
Index Terms

Computer Science
Information Sciences

Keywords

Large Language Models; Instructional Content Generation; Text-to-Text Evaluation Framework; Prompt Optimization; Generative AI Assessment