AudioRouter: Data-Efficient Audio Understanding via RL-based Dual Reasoning

Liyang Chen1,3, Hongkai Chen3, Yujun Cai2, Sifan Li3,4, Qingwen Ye3, Yiwei Wang4
1University of California, Los Angeles   2The University of Queensland   3vivo Mobile Communication Co., Ltd.   4University of California, Merced
Preprint, 2026
AudioRouter overview figure

AudioRouter improves audio understanding by learning when and how to use external audio tools, via reinforcement learning over tool routing, while keeping the underlying reasoning model frozen.

Abstract

Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600× less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.

Framework

AudioRouter decouples tool usage from reasoning: a learned routing policy decides which tools to call, the selected tools are executed, and their structured outputs are fed into a frozen reasoning model.

AudioRouter framework
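To make the decoupling concrete, here is a toy policy-gradient sketch of the idea, assuming a REINFORCE-style update over a discrete tool choice. The tool names, reward function, and preference table are invented for illustration and stand in for the paper's actual routing policy and training objective:

```python
import math
import random

random.seed(0)

# Toy REINFORCE-style sketch: only the routing preferences are updated,
# while the "frozen reasoner" reward never changes. All names here are
# illustrative placeholders, not the paper's actual tools or objective.
TOOLS = ["none", "asr", "event_detector"]
prefs = {t: 0.0 for t in TOOLS}  # stand-in for a small routing network

def sample_tool() -> str:
    """Sample a tool from a softmax over the current preferences."""
    weights = [math.exp(prefs[t]) for t in TOOLS]
    return random.choices(TOOLS, weights=weights)[0]

def frozen_reasoner_reward(tool: str) -> float:
    """Stand-in for: run the frozen LALM on the tool's output and score
    the final answer. We pretend ASR helps on this speech-heavy example."""
    return 1.0 if tool == "asr" else 0.0

lr, baseline = 0.5, 1.0 / len(TOOLS)
for _ in range(200):
    tool = sample_tool()
    advantage = frozen_reasoner_reward(tool) - baseline
    prefs[tool] += lr * advantage  # update the routing policy only

print(max(prefs, key=prefs.get))
```

Because the reasoner is never updated, only the small routing policy needs gradient signal, which is the source of the data efficiency the paper reports.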

Experiments

AudioRouter achieves substantial improvements on audio understanding benchmarks while using far less training data to learn effective tool usage.

Main experimental results

Usage

The inference pipeline consists of three stages:

  1. Tool Routing: generate routing decisions for each audio-question pair.
  2. Tool Execution: execute selected audio tools and collect structured outputs.
  3. Final Reasoning: run the reasoning model with optional tool outputs.
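The three stages above can be sketched end to end as follows. All function names, tool names, and routing heuristics are illustrative placeholders, not the project's actual API:

```python
# Minimal sketch of the three-stage inference pipeline: routing,
# tool execution, then final reasoning over optional tool outputs.

def route(question: str, audio_path: str) -> list[str]:
    """Stage 1 (Tool Routing): decide which tools to call for this pair."""
    decisions = []
    if "said" in question or "speech" in question:
        decisions.append("asr")             # transcription likely helps
    if "sound" in question or "event" in question:
        decisions.append("event_detector")  # acoustic events likely help
    return decisions                        # may be empty: no tool needed

def execute_tools(decisions: list[str], audio_path: str) -> dict:
    """Stage 2 (Tool Execution): run selected tools, collect structured outputs."""
    tool_impls = {
        "asr": lambda a: {"transcript": f"<transcript of {a}>"},
        "event_detector": lambda a: {"events": [f"<events in {a}>"]},
    }
    return {name: tool_impls[name](audio_path) for name in decisions}

def reason(question: str, audio_path: str, tool_outputs: dict) -> str:
    """Stage 3 (Final Reasoning): frozen model answers, with optional tool context."""
    context = "; ".join(f"{k}={v}" for k, v in tool_outputs.items())
    return f"answer({question!r}, context=[{context}])"

question, audio = "What is said in the clip?", "clip.wav"
decisions = route(question, audio)
outputs = execute_tools(decisions, audio)
print(reason(question, audio, outputs))
```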

See the code repository for detailed setup and commands.

Citation

@article{chen2026audiorouter,
  title={AudioRouter: Data-Efficient Audio Understanding via RL-based Dual Reasoning},
  author={Chen, Liyang and Chen, Hongkai and Cai, Yujun and Li, Sifan and Ye, Qingwen and Wang, Yiwei},
  year={2026}
}

Acknowledgements

We gratefully acknowledge the training framework provided by Omni-CLST and the toolkits from Audio-Maestro, which enabled efficient audio processing and system integration.