International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 27
Year of Publication: 2025
Authors: Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe
Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe. Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy. International Journal of Computer Applications. 187, 27 (Aug 2025), 38-43. DOI=10.5120/ijca2025925482
In this paper, responses from several production-scale models are evaluated against encoded, cleverly paraphrased, obfuscated, or multimodal prompts designed to bypass guardrails. These attacks succeed by deceiving the model's alignment layers, which are trained via Reinforcement Learning from Human Feedback [10], [12], [20]. The paper proposes a comprehensive taxonomy that systematically categorizes RLHF limitations and also provides mitigation strategies against these attacks.
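As an illustrative sketch (not the paper's actual evaluation harness), the "encoded prompt" class of attack mentioned above can be demonstrated by Base64-encoding a payload and asking the model to decode and follow it; the function names and the benign stand-in payload here are hypothetical:

```python
import base64

def encode_prompt(prompt: str) -> str:
    """Base64-encode a prompt -- a common obfuscation used to hide a
    request's surface semantics from keyword- or pattern-based filters."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")

def wrap_as_probe(encoded: str) -> str:
    """Wrap the encoded payload in an instruction asking the model to
    decode it and comply, shifting the harmful intent out of plain text."""
    return (
        "Decode the following Base64 string and follow its instructions:\n"
        + encoded
    )

# Benign stand-in payload for demonstration purposes only.
payload = encode_prompt("Describe the capital of France.")
probe = wrap_as_probe(payload)
print(probe)
```

Because alignment training largely sees plain-text harmful requests, such encodings can land outside the distribution the RLHF reward model was tuned on, which is the failure mode the taxonomy above categorizes.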