Research Article

Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics

by Deepak Babu Piskala, Vijay Raajaa, Sachin Mishra, Bruno Bozza
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 51
Year of Publication: 2024
DOI: 10.5120/ijca2024924172

Deepak Babu Piskala, Vijay Raajaa, Sachin Mishra, Bruno Bozza. Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics. International Journal of Computer Applications. 186, 51 (Nov 2024), 1-7. DOI=10.5120/ijca2024924172

@article{10.5120/ijca2024924172,
author = {Deepak Babu Piskala and Vijay Raajaa and Sachin Mishra and Bruno Bozza},
title = {Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics},
journal = {International Journal of Computer Applications},
issue_date = {Nov 2024},
volume = {186},
number = {51},
month = {Nov},
year = {2024},
issn = {0975-8887},
pages = {1-7},
numpages = {7},
url = {https://ijcaonline.org/archives/volume186/number51/optiroute-dynamic-llm-routing-and-selection-based-on-user-preferences-balancing-performance-cost-and-ethics/},
doi = {10.5120/ijca2024924172},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Deepak Babu Piskala
%A Vijay Raajaa
%A Sachin Mishra
%A Bruno Bozza
%T Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 51
%P 1-7
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With the widespread deployment of large language models (LLMs) such as GPT-4 [12], BART [9], and LLaMA [5], the need for a system that can intelligently select the most suitable model for a given task, while balancing cost, latency, accuracy, and ethical considerations, has become increasingly important. Recognizing that not all tasks require models with over 100 billion parameters, we introduce OptiRoute, an advanced model routing engine that dynamically selects and routes tasks to the optimal LLM based on detailed user-defined requirements. OptiRoute captures both functional (e.g., accuracy, speed, cost) and non-functional (e.g., helpfulness, harmlessness, honesty) criteria, leveraging lightweight task analysis and complexity estimation to efficiently match tasks with the best-fit models from a diverse pool of LLMs. By employing a hybrid approach that combines k-nearest neighbors (kNN) search with hierarchical filtering, OptiRoute optimizes for user priorities while minimizing computational overhead, making it well suited to real-time applications in cloud-based ML platforms, personalized AI services, and regulated industries [4].
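The two-stage routing idea in the abstract can be pictured with a short sketch: hard constraints are applied first (hierarchical filtering), then a kNN search over normalized model profiles picks the candidate closest to the user's preference vector. The Python below is a minimal illustration under stated assumptions, not the paper's implementation; the model names, profile fields, normalization constants, and preference weights are all invented for the example, and OptiRoute's task analysis and complexity estimation stages are omitted.

```python
# Minimal sketch of hierarchical filtering + kNN routing.
# All model profiles and constants below are hypothetical.
from dataclasses import dataclass

import numpy as np


@dataclass
class ModelProfile:
    name: str
    accuracy: float       # benchmark score in [0, 1]
    latency_ms: float     # typical response latency
    cost_per_1k: float    # USD per 1k tokens
    harmlessness: float   # alignment score in [0, 1]


CANDIDATES = [
    ModelProfile("large-llm", 0.92, 1200.0, 0.0300, 0.90),
    ModelProfile("mid-llm",   0.85,  400.0, 0.0020, 0.88),
    ModelProfile("small-llm", 0.74,  120.0, 0.0004, 0.85),
]


def route(preferences: np.ndarray,
          max_latency_ms: float,
          max_cost_per_1k: float,
          k: int = 1) -> list[ModelProfile]:
    """Return the k models whose profiles are nearest to the preference
    vector (accuracy, speed, cheapness, harmlessness), after filtering
    out models that violate the hard latency/cost constraints."""
    # Stage 1: hierarchical filtering on hard constraints.
    feasible = [m for m in CANDIDATES
                if m.latency_ms <= max_latency_ms
                and m.cost_per_1k <= max_cost_per_1k]
    if not feasible:
        # Fallback: relax constraints rather than fail outright.
        feasible = list(CANDIDATES)

    # Stage 2: kNN search in a normalized feature space where every
    # dimension is oriented so that higher is better.
    feats = np.array([[m.accuracy,
                       1.0 - m.latency_ms / 2000.0,  # speed
                       1.0 - m.cost_per_1k / 0.05,   # cheapness
                       m.harmlessness] for m in feasible])
    dists = np.linalg.norm(feats - preferences, axis=1)
    return [feasible[i] for i in np.argsort(dists)[:k]]


if __name__ == "__main__":
    # A cost-sensitive user: cheapness and speed outweigh accuracy.
    prefs = np.array([0.6, 0.9, 1.0, 0.8])
    picks = route(prefs, max_latency_ms=500, max_cost_per_1k=0.01)
    print([m.name for m in picks])  # expected: ['small-llm']
```

In this toy setup the latency cap removes the large model before distances are computed, and the cost-weighted preference vector then selects the smallest feasible model, mirroring the paper's premise that not every task needs a 100-billion-parameter model.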

References
  1. Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, December 2021.
  2. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
  3. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, May 2023.
  4. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023.
  5. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models, 2024.
  6. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, March 2015.
  7. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, October 2021.
  8. Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. A comprehensive evaluation of quantization strategies for large language models, 2024.
  9. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, October 2019.
  10. Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, Bonita Bhaskaran, Bryan Catanzaro, Arjun Chaudhuri, Sharon Clay, Bill Dally, Laura Dang, Parikshit Deshpande, Siddhanth Dhodhi, Sameer Halepete, Eric Hill, Jiashang Hu, Sumit Jain, Ankit Jindal, Brucek Khailany, George Kokai, Kishor Kunal, Xiaowei Li, Charley Lind, Hao Liu, Stuart Oberman, Sujeet Omar, Ghasem Pasandi, Sreedhar Pratty, Jonathan Raiman, Ambar Sarkar, Zhengjiang Shao, Hanfei Sun, Pratik P Suthar, Varun Tej, Walker Turner, Kaizhe Xu, and Haoxing Ren. ChipNeMo: Domain-adapted LLMs for chip design, 2024.
  11. Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, November 2022.
  12. OpenAI. GPT-4 technical report. March 2024.
  13. Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580, December 2023.
  14. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, July 2020.
  15. Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022.
  16. Stanisław Woźniak, Bartłomiej Koptyra, Arkadiusz Janz, Przemysław Kazienko, and Jan Kocoń. Personalized large language models, 2024.
  17. Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. BloombergGPT: A large language model for finance, 2023.
  18. Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. FinGPT: Open-source financial large language models, 2023.
  19. Shih-Yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. LLM-FP4: 4-bit floating-point quantized transformers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 592–605, 2023.
Index Terms

Computer Science
Information Sciences
LLM Optimization
Benchmarks
Evaluation
Routing
Complexity Estimation
Feedback
Domain Adaptation

Keywords

GPT-4, LLaMA, Helpfulness, Honesty, Harmlessness, Latency, Accuracy, Cost, kNN, OptiRoute, Domain, Model Merging, Re-ranking, Fallback, Steerability, Instruction-following Ability, MLaaS, Healthcare, Finance, Legal, Hallucinations, Grounding, FLAN, BERT, BART