A METHOD OF IMPROVING LLM INFERENCE EFFICIENCY VIA CASCADE ROUTING BASED ON SEMANTIC ENTROPY EVALUATION

Authors

  • Vasyl Khrapko, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine
  • Artem Volokyta, Department of Computer Engineering, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine

Keywords:

large language models, LLM inference, cascade routing, semantic entropy

Abstract

The article presents a method for improving the inference efficiency of large language models (LLMs) using cascade routing based on semantic entropy evaluation. The method addresses the high cost and latency of modern LLMs, a problem made pressing by the rapid growth of LLM deployment in production and commercial systems. Its working principle is to automatically estimate the cognitive complexity of an input query with a combined semantic entropy metric and to dynamically route the query to the model whose efficiency and capability level best match that complexity. The method is compared with existing routing methods, and its functionality is verified on a test dataset of queries.
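The working principle can be illustrated with a minimal sketch, which is not the authors' implementation. The abstract does not define the combined semantic entropy metric, so the sketch substitutes plain discrete semantic entropy: several sampled answers to a query are grouped into meaning clusters, and the entropy of the cluster sizes serves as the complexity score. The equivalence check `same`, the threshold values, and the tier names in `TIERS` are all hypothetical placeholders.

```python
import math
from typing import Callable, List, Sequence, Tuple


def semantic_clusters(
    answers: Sequence[str], equivalent: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group answers that the `equivalent` predicate treats as the same meaning."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: Sequence[str], equivalent: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    total = len(answers)
    clusters = semantic_clusters(answers, equivalent)
    return -sum(
        (len(c) / total) * math.log(len(c) / total) for c in clusters
    )


def route(entropy: float, tiers: Sequence[Tuple[float, str]]) -> str:
    """Return the first tier whose entropy ceiling covers the score."""
    for max_entropy, model_name in tiers:
        if entropy <= max_entropy:
            return model_name
    return tiers[-1][1]  # fall back to the most capable tier


# Hypothetical equivalence check: a real system would more likely use an
# entailment model or embedding similarity instead of string normalisation.
def same(a: str, b: str) -> bool:
    return a.strip(" .").lower() == b.strip(" .").lower()


# Hypothetical thresholds and tier names, ordered from cheapest to most capable.
TIERS = [
    (0.3, "small-model"),
    (1.0, "medium-model"),
    (float("inf"), "large-model"),
]

if __name__ == "__main__":
    consistent = ["Paris", "paris", "Paris."]
    scattered = ["a", "b", "c", "d"]
    print(route(semantic_entropy(consistent, same), TIERS))
    print(route(semantic_entropy(scattered, same), TIERS))
```

In this sketch, answers that agree in meaning give low entropy and keep the query on the cheapest tier, while answers that disagree push it to a more capable model. In practice the thresholds would need to be tuned on a validation set of queries.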

References

Moslem Y. Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey [Electronic resource] / Y. Moslem // arXiv. – 2026. – arXiv:2603.04445. – Available at: https://arxiv.org/abs/2603.04445 – Accessed: April 2026.

Dekoninck J. A Unified Approach to Routing and Cascading for LLMs [Electronic resource] / J. Dekoninck // OpenReview. – 2024. – Available at: https://openreview.net/forum?id=AAl89VNNy1 – Accessed: April 2026.

Kuhn L. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation [Electronic resource] / L. Kuhn // arXiv. – 2023. – arXiv:2302.09664. – Available at: https://arxiv.org/abs/2302.09664 – Accessed: April 2026.

Farquhar S. Detecting hallucinations in large language models using semantic entropy [Electronic resource] / S. Farquhar // arXiv. – 2024. – arXiv:2406.02532. – Available at: https://arxiv.org/abs/2406.02532 – Accessed: April 2026.

Kossen J. Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity [Electronic resource] / J. Kossen // arXiv. – 2025. – arXiv:2506.00245. – Available at: https://arxiv.org/abs/2506.00245 – Accessed: April 2026.

Jiang A. Q. Mixtral of Experts [Electronic resource] / A. Q. Jiang // arXiv. – 2024. – arXiv:2401.04088. – Available at: https://arxiv.org/abs/2401.04088 – Accessed: April 2026.

Manakul P. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [Electronic resource] / P. Manakul // arXiv. – 2023. – arXiv:2303.08896. – Available at: https://arxiv.org/abs/2303.08896 – Accessed: April 2026.

Published

2026-05-08

Issue

Section

Machine learning, Big Data (AI)