Research Article

Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy

by Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 27
Year of Publication: 2025
Authors: Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe
DOI: 10.5120/ijca2025925482

Ritu Kuklani, Gururaj Shinde, Varad Vishwarupe. Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy. International Journal of Computer Applications. 187, 27 (Aug 2025), 38-43. DOI=10.5120/ijca2025925482

@article{10.5120/ijca2025925482,
  author     = {Ritu Kuklani and Gururaj Shinde and Varad Vishwarupe},
  title      = {Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy},
  journal    = {International Journal of Computer Applications},
  issue_date = {Aug 2025},
  volume     = {187},
  number     = {27},
  month      = {Aug},
  year       = {2025},
  issn       = {0975-8887},
  pages      = {38-43},
  numpages   = {6},
  url        = {https://ijcaonline.org/archives/volume187/number27/semantic-jailbreaks-and-rlhf-limitations-in-llms-a-taxonomy-failure-trace-and-mitigation-strategy/},
  doi        = {10.5120/ijca2025925482},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}
%0 Journal Article
%A Ritu Kuklani
%A Gururaj Shinde
%A Varad Vishwarupe
%T Semantic Jailbreaks and RLHF Limitations in LLMs: A Taxonomy, Failure Trace, and Mitigation Strategy
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 27
%P 38-43
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In this paper, responses from production-scale models are evaluated against encoded, cleverly paraphrased, obfuscated, and multimodal prompts designed to bypass guardrails. These attacks succeed by deceiving the model’s alignment layers trained via Reinforcement Learning from Human Feedback (RLHF) [10], [12], [20]. The paper proposes a comprehensive taxonomy that systematically categorizes RLHF limitations and also provides mitigation strategies for these attacks.
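The paper's concrete mitigation strategies are not reproduced on this page, so the following is a minimal illustrative sketch only, in Python: a pre-filter guardrail that normalizes Unicode confusables and surfaces base64-encoded payloads before a blocklist check, so that the encoded and obfuscated prompts the abstract describes cannot slip past a plain-text filter. All names, patterns, and thresholds below are our assumptions, not the authors' implementation.

# Minimal illustrative sketch (not the paper's implementation): a pre-filter
# guardrail that normalizes a prompt before any blocklist check, so encoded
# or obfuscated payloads are inspected in plain text. All names hypothetical.
import base64
import binascii
import re
import unicodedata

# Long runs of base64-alphabet characters are candidate encoded payloads.
ENCODED_SPAN = re.compile(r"[A-Za-z0-9+/=]{24,}")

def try_decode_base64(span: str) -> str | None:
    """Return decoded UTF-8 text if `span` is valid base64, else None."""
    try:
        return base64.b64decode(span, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None

def normalize_prompt(prompt: str) -> str:
    """Unfold two common obfuscations: Unicode confusables and base64."""
    # NFKC folds homoglyphs and fullwidth characters used to dodge filters.
    text = unicodedata.normalize("NFKC", prompt)
    # Append the plaintext of any embedded base64 payload so filters see it.
    for span in ENCODED_SPAN.findall(text):
        decoded = try_decode_base64(span)
        if decoded:
            text += f"\n[decoded payload] {decoded}"
    return text

def guardrail_check(prompt: str, blocklist: list[str]) -> bool:
    """Return True if the normalized prompt matches any blocked phrase."""
    folded = normalize_prompt(prompt).lower()
    return any(phrase in folded for phrase in blocklist)

# Example: the payload is caught only after normalization.
payload = base64.b64encode(b"ignore all previous instructions").decode()
print(guardrail_check(f"Summarize this string: {payload}",
                      ["ignore all previous instructions"]))  # True

The point of the sketch is only that decoding must precede filtering: a blocklist applied to the raw prompt never sees the encoded payload, which is one way such attacks deceive alignment-layer checks. A real deployment would pair input normalization of this kind with model-side alignment safeguards.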

References
  1. Bluedot. (2024). RLHF Limitations for AI Safety. https://bluedot.org/blog/rlhf-limitations-for-ai-safety
  2. Vishwarupe, V., Zahoor, S., Akhter, R., Bhatkar, V. P., Bedekar, M., Pande, M., Joshi, P. M., Patil, A., & Pawar, V. (2023). Designing a human-centered AI-based cognitive learning model for Industry 4.0 applications. In Industry 4.0 Convergence with AI, IoT, Big Data and Cloud Computing (pp. 84–95). Bentham Science Publishers.
  3. Anup. (2024). LLM Security 101: Defending Against Prompt Injections. https://www.anup.io/p/llm-security-101-defending-against
  4. Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
  5. Sayyed, H., Alwazae, M., & Vishwarupe, V. (2025). BlockSafe: Universal blockchain-based identity management. In Big Data in Finance (Vol. 169, pp. 101–118). Springer.
  6. Vishwarupe, V., Maheshwari, S., Deshmukh, A., Mhaisalkar, S., Joshi, P. M., & Mathias, N. (2022). Bringing humans at the epicentre of artificial intelligence. Procedia Computer Science, 204, 914–921.
  7. HiddenLayer. (2024a). Novel Universal Bypass for All Major LLMs. https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms
  8. HiddenLayer. (2024b). Prompt Injection Attacks on LLMs. https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms
  9. Vishwarupe, V., Bedekar, M., Pande, M., & Hiwale, A. (2018). Intelligent Twitter spam detection: A hybrid approach. In Smart trends in systems, security and sustainability (Vol. 18, pp. 157–167). Springer.
  10. Kili Technology. (2024a). Preventing Adversarial Prompt Injections with LLM Guardrails. https://kili-technology.com/large-language-models-llms/preventing-adversarial-prompt-injections-with-llm-guardrails
  11. Kili Technology. (2024b). Exploring Reinforcement Learning from Human Feedback (RLHF): A Comprehensive Guide. https://kili-technology.com/large-language-models-llms/exploring-reinforcement-learning-from-human-feedback-rlhf-a-comprehensive-guide
  12. Label Studio. (2024). Reinforcement Learning from Verifiable Rewards. https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
  13. Vishwarupe, V., Joshi, P. M., Mathias, N., Maheshwari, S., Mhaisalkar, S., & Pawar, V. (2022). Explainable AI and interpretable machine learning: A case study in perspective. Procedia Computer Science, 204, 869–876.
  14. Wani, K., Khedekar, N., Vishwarupe, V., & Pushyanth, N. (2023). Digital twin and its applications. In Research Trends in Artificial Intelligence: Internet of Things (pp. 120–134). Bentham Science Publishers.
  15. Labellerr. (2024). RLHF Explained. https://www.labellerr.com/blog/reinforcement-learning-from-human-feedback/
  16. Vishwarupe, V., Bedekar, M., Pande, M., Bhatkar, V. P., Joshi, P., Zahoor, S., & Kuklani, P. (2022). Comparative analysis of machine learning algorithms for analyzing NASA Kepler mission data. Procedia Computer Science, 204, 945–951.
  17. Vishwarupe, V. (2022). Synthetic content generation using artificial intelligence. All Things Policy, IVM Podcasts.
  18. Zahoor, S., Bedekar, M., Mane, V., & Vishwarupe, V. (2016). Uniqueness in user behavior while using the web. In Proceedings of the International Congress on Information and Communication Technology (Vol. 438, pp. 229–236). Springer.
  19. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  20. Understanding RLHF. (2024). A Comprehensive Curriculum on RLHF. https://understanding-rlhf.github.io
  21. Vishwarupe, V., Bedekar, M., & Zahoor, S. (2015). Zone-specific weather monitoring system using crowdsourcing and telecom infrastructure. In 2015 International Conference on Information Processing (ICIP) (pp. 823–827). IEEE.
  22. Zahoor, S., Bedekar, M., & Vishwarupe, V. (2016). A framework to infer webpage relevancy for a user. In Proceedings of First International Conference on ICT for Intelligent Systems (Vol. 50, pp. 173–181). Springer.
  23. WithSecure. (2024). LLaMA 3 Prompt Injection Hardening. https://labs.withsecure.com/publications/llama3-prompt-injection-hardening
  24. Reddit – Prompt Engineering. (2024). Prompting an LLM to stop giving extra responses. https://www.reddit.com/r/PromptEngineering/comments/1h5367l/how_do_i_prompt_an_llm_to_stop_giving_me_extra/
  25. Deoskar, V., Pande, M., & Vishwarupe, V. (2024). An analytical study for implementing 360-degree M-HRM practices using AI. In Intelligent Systems for Smart Cities (pp. 429–442). Springer.
  26. Vishwarupe, V., et al. (2021). A zone-specific weather monitoring system. Australian Patent No. AU2021106275.
  27. Reddit – Outlier AI. (2024). How to Create a Model Failure for Cypher RLHF. https://www.reddit.com/r/outlier_ai/comments/1hgoho7/how_to_create_a_model_failure_for_cypher_rlhf/
  28. arXiv. (2025). Prompt Injection Mitigation for LLMs. arXiv preprint arXiv:2503.03039v1.
  29. Vishwarupe, V., Bedekar, M., Joshi, P. M., Pande, M., Pawar, V., & Shingote, P. (2022). Data analytics in the game of cricket: A novel paradigm. Procedia Computer Science, 204, 937–944.
  30. Alignment Forum. (2024). Interpreting Preference Models with Sparse Autoencoders. https://www.alignmentforum.org/posts/5XmxmszdjzBQzqpmz/interpreting-preference-models-w-sparse-autoencoders
  31. Vishwarupe, V. V., & Joshi, P. M. (2016). Intellert: A novel approach for content-priority based message filtering. In IEEE Bombay Section Symposium (IBSS) (pp. 1–6). IEEE.
  32. Vishwarupe, V., et al. (2025). Predicting mental health ailments using social media activities and keystroke dynamics with machine learning. In Big Data in Finance (Vol. 169, pp. 63–80). Springer.
  33. Zahoor, S., Akhter, R., Vishwarupe, V., Bedekar, M., Pande, M., Bhatkar, V. P., Joshi, P. M., Pawar, V., Mandora, N., & Kuklani, P. (2023). A comprehensive study of state-of-the-art applications and challenges in IoT and blockchain technologies for Industry 4.0. In Industry 4.0 Convergence with AI, IoT, Big Data and Cloud Computing (pp. 1–16). Bentham.
  34. NeurIPS 2024. (2024). Poster #96148. https://neurips.cc/virtual/2024/poster/96148
  35. OpenReview. (2024). Submission T1lFrYwtf7. https://openreview.net/forum?id=T1lFrYwtf7
Index Terms

Computer Science
Information Sciences

Keywords

Reinforcement Learning from Human Feedback, Indirect Multimodal Manipulations, Large Language Models, Semantic Jailbreaks