International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 94
Year of Publication: 2026
Authors: Ziqiao Ao, Juhi Singh, Sebastian Antinome
DOI: 10.5120/ijca2026926616
Ziqiao Ao, Juhi Singh, Sebastian Antinome. Evaluating Text-to-Text Generation from LLMs: A Case Study and Scalable Framework. International Journal of Computer Applications. 187, 94 (Mar 2026), 1-10. DOI=10.5120/ijca2026926616
Large Language Models (LLMs) have enabled a wide range of text-to-text generation applications across diverse domains, yet robust evaluation of their outputs remains challenging, particularly for open-ended tasks where ground truth is unavailable. This paper introduces a comprehensive and scalable evaluation framework for LLM-generated instructional content, integrating statistical, semantic, lexical, and domain-specific metrics. The effectiveness of the framework is demonstrated through a real-world case study that converts Microsoft Learn content into PowerPoint slides for Instructor-Led Training (ILT). The evaluation suite combines established metrics such as Perplexity, Entropy, and BERTScore with task-specific measures including Context Match Score and Rule Compliance Score, as well as rubric-driven assessments using an LLM-as-a-Judge approach. Experimental results from iterative prompt refinement demonstrate consistent gains in semantic fidelity, structural compliance, and instructional clarity. The framework facilitates reliable evaluation without reliance on ground truth and delivers actionable insights for prompt optimization in enterprise-scale generative workflows. While demonstrated in an instructional content generation setting, the framework generalizes to a broad class of text-to-text generation tasks.
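To make the metric suite concrete, the sketch below shows how the established, reference-light metrics named in the abstract (Perplexity, Entropy, BERTScore) can be computed with the public `transformers` and `bert-score` libraries. The function names, the choice of GPT-2 as the reference language model, and the overall structure are illustrative assumptions rather than the authors' implementation; the task-specific Context Match Score and Rule Compliance Score are defined in the paper itself and are not reproduced here.

```python
# Minimal sketch of a multi-metric evaluation suite for generated text.
# Assumptions: GPT-2 as the reference LM and these function names are
# illustrative choices, not the paper's actual implementation.
import math

import torch
from bert_score import score as bert_score
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM (lower = more fluent)."""
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean token cross-entropy.
        loss = _lm(ids, labels=ids).loss
    return math.exp(loss.item())


def mean_token_entropy(text: str) -> float:
    """Average predictive entropy (in nats) across token positions."""
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = _lm(ids).logits
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    return ent.mean().item()


def semantic_fidelity(candidate: str, source: str) -> float:
    """BERTScore F1 of generated slide text against its source passage."""
    _, _, f1 = bert_score([candidate], [source], lang="en", verbose=False)
    return f1.item()
```

In a pipeline like the one the paper describes, such functions would be applied to each generated slide against its originating Microsoft Learn passage, with the scores aggregated across iterations of prompt refinement to track gains in fluency and semantic fidelity.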