Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Senkang Hu1,2, Yong Dai3, Yuzhi Zhao2, Yihang Tao1,2, Yu Guo1,2, Zhengru Fang1,2, Sam Tak Wu Kwong4, Yuguang Fang1,2
1Hong Kong JC STEM Lab of Smart City, 2City University of Hong Kong, 3Fudan University, 4Lingnan University

Abstract

Large reasoning models have shown strong capabilities in complex tasks by combining chain-of-thought reasoning with external retrieval. However, existing training signals for retrieval are often sparse and delayed, making it difficult to assign credit to intermediate search actions.

We introduce InfoReasoner, a unified framework that rewards retrieval by measuring synthetic semantic information gain. The key idea is simple: good retrieval should reduce uncertainty over the answer. We formalize this as uncertainty reduction over belief states and provide properties such as non-negativity and telescoping additivity to justify per-step optimization.

Across seven QA benchmarks, InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to a 5.4% average accuracy improvement.

Method Overview

InfoReasoner computes a dense intrinsic reward directly from model outputs without manual retrieval annotation:

  1. Sample multiple candidate answers with and without retrieved evidence.
  2. Group semantically equivalent answers via bidirectional textual entailment.
  3. Estimate belief distributions over semantic classes.
  4. Use the entropy reduction as an information-gain reward to train the retrieval policy.

The policy is optimized with GRPO, combining the output (correctness) reward with the information-gain reward to balance answer quality against uncertainty-reducing retrieval behavior.
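The clustering and entropy-estimation steps above can be sketched as follows. This is a minimal illustration, not the official pipeline: `entails` is a hypothetical placeholder (a real implementation would query an NLI model in both directions), and exact string matching is used here only to keep the example self-contained and runnable.

```python
import math
from collections import Counter

def entails(a: str, b: str) -> bool:
    """Hypothetical stand-in for a one-directional entailment check.
    A real system would call an NLI model; string equality keeps this runnable."""
    return a.strip().lower() == b.strip().lower()

def semantic_classes(answers):
    """Greedily group sampled answers into semantic equivalence classes
    using bidirectional entailment against each class representative."""
    classes = []  # one representative answer per class
    labels = []
    for ans in answers:
        for i, rep in enumerate(classes):
            if entails(ans, rep) and entails(rep, ans):
                labels.append(i)
                break
        else:
            classes.append(ans)
            labels.append(len(classes) - 1)
    return labels

def semantic_entropy(answers):
    """Monte Carlo estimate (in nats) of entropy over semantic classes,
    from the empirical class frequencies of the sampled answers."""
    labels = semantic_classes(answers)
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

With this sketch, a concentrated answer distribution (all samples in one class) yields zero entropy, while an even split across classes yields maximal entropy, which is exactly the quantity the retrieval reward is built to reduce.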

Framework Figure

Overview of InfoReasoner framework
Figure 1 (Overview): The pipeline has four connected stages: (1) input question and current reasoning context, (2) two parallel branches for baseline/no-retrieval and retrieval-conditioned generation, (3) semantic clustering over sampled answers, and (4) policy optimization with information-gain reward.
How to read: the key comparison is the uncertainty gap between the two branches. If retrieved evidence makes answer distributions more concentrated, semantic entropy drops and IG increases.
Main takeaway: InfoReasoner rewards retrieval actions that reduce uncertainty, not only actions that immediately produce a correct final answer.
Role in this page: this figure supports the Method section by showing why the reward is dense and why it improves multi-step retrieval behavior.

Theoretical Foundations

InfoReasoner views multi-step retrieval as an uncertainty-reduction process. Instead of rewarding only the final answer, it gives credit to retrieval actions that make the model more certain about the correct semantic answer class.

This gives a dense learning signal during reasoning, while still keeping the final correctness objective in training.
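One way to see the telescoping-additivity property mentioned in the abstract: if the idealized per-step gain is defined over successive belief contexts \(c_{t-1}\) and \(c_t\) (notation assumed here for illustration, not taken verbatim from the paper), the per-step rewards sum to the total uncertainty reduction over the trajectory:

```latex
\sum_{t=1}^{T}\mathrm{IG}_t
  = \sum_{t=1}^{T}\big(H(A \mid c_{t-1}) - H(A \mid c_t)\big)
  = H(A \mid c_0) - H(A \mid c_T)
```

Under this view, maximizing the dense per-step rewards is consistent with maximizing the overall entropy drop from the initial to the final context, which is why per-step optimization does not conflict with the final-answer objective.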

Practical Reward Formulation

In implementation, the model compares uncertainty before and after retrieval, and uses their difference as an intrinsic reward:

$$\widehat{\mathrm{IG}}_t(x_t,a_t)=H_{\text{sem}}(x_t,B)-H_{\text{sem}}(x_t,C_t).$$

Here \(H_{\text{sem}}\) denotes the semantic entropy of sampled answers, with \(B\) the no-retrieval baseline context and \(C_t\) the retrieval-augmented context at step \(t\). The training reward then combines exact-match correctness with this information-gain term, weighted by \(\lambda\).
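A minimal sketch of this combined reward under assumed notation: `step_reward` and `lam` are illustrative names, and the entropy here is taken over raw answer strings as a stand-in for the paper's semantic entropy over equivalence classes, so this is not the official implementation.

```python
import math
from collections import Counter

def entropy(answers):
    """Entropy (nats) of the empirical distribution over answer strings.
    Stand-in for H_sem; a faithful version would first cluster answers
    into semantic classes via bidirectional entailment."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def step_reward(no_retrieval_samples, with_retrieval_samples,
                predicted, gold, lam=0.6):
    """Exact-match correctness plus lambda-weighted information gain.
    `lam` mirrors the weight ablated in Table 2 (0.6 performed best there)."""
    ig = entropy(no_retrieval_samples) - entropy(with_retrieval_samples)
    em = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
    return em + lam * ig
```

For example, if retrieval collapses an even two-way split of sampled answers into a single answer, the information-gain term contributes \(\lambda \ln 2\) on top of the correctness reward.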

Main Results (QA Benchmarks)

  • Consistent gains across 7 datasets, including both single-hop and multi-hop QA settings.
  • Parameter-efficient performance: the 3B model surpasses several 7B retrieval baselines.
  • State-of-the-art at both scales: strong improvements are maintained at 3B and 7B.
  • Better retrieval behavior: reward shaping encourages exploration early and efficient evidence use later.
Method NQ TriviaQA PopQA HotpotQA 2Wiki MuSiQue Bamboogle Avg.
RAG (3B) 0.348 0.544 0.387 0.255 0.226 0.047 0.080 0.270
Search-R1-3B-Instruct 0.405 0.566 0.354 0.316 0.224 0.056 0.184 0.301
InfoReasoner-3B 0.453 0.634 0.442 0.344 0.324 0.080 0.144 0.346
Search-R1-7B-Instruct 0.407 0.590 0.390 0.340 0.194 0.080 0.360 0.337
AutoRefine-7B-Base 0.439 0.608 0.402 0.410 0.242 0.116 0.368 0.369
InfoReasoner-7B 0.447 0.614 0.416 0.414 0.302 0.120 0.424 0.391
Table 1 (Main Results): Comparison across seven QA benchmarks covering single-hop (NQ, TriviaQA, PopQA) and multi-hop (HotpotQA, 2Wiki, MuSiQue, Bamboogle) settings.
How to read: compare each baseline row with `InfoReasoner-3B` and `InfoReasoner-7B` in the same columns; the last column (`Avg.`) summarizes overall performance.
Main takeaway: InfoReasoner improves both 3B and 7B settings consistently, and the 3B model already outperforms several stronger 7B retrieval baselines on average.
Role in this page: this table is the primary quantitative evidence that the proposed information-gain reward improves retrieval-augmented reasoning quality.

Ablation: Information Gain Weight \(\lambda\)

Setting NQ TriviaQA PopQA HotpotQA 2Wiki MuSiQue Bamboogle Avg.
\(\lambda=1.0\) 0.435 0.600 0.432 0.320 0.274 0.052 0.128 0.320
\(\lambda=0.8\) 0.436 0.590 0.426 0.292 0.238 0.032 0.112 0.304
\(\lambda=0.6\) 0.452 0.634 0.442 0.344 0.324 0.080 0.144 0.346
\(\lambda=0.4\) 0.443 0.602 0.424 0.324 0.266 0.058 0.088 0.315
\(\lambda=0.2\) 0.448 0.618 0.432 0.322 0.288 0.044 0.136 0.327
\(\lambda=0.0\) 0.426 0.538 0.426 0.294 0.254 0.040 0.118 0.299
Table 2 (Ablation on \(\lambda\)): Sensitivity analysis of the information-gain weight across the same seven QA benchmarks.
How to read: \(\lambda=0.0\) means no information-gain term, while larger values increase the reward contribution from uncertainty reduction.
Main takeaway: a moderate weight works best in this setup (\(\lambda=0.6\)); values that are too small weaken retrieval guidance, while values that are too large over-emphasize the intrinsic reward at the expense of task correctness.
Role in this page: this table explains why performance gains come from balancing task correctness and information gain, rather than maximizing either term alone.

Case Study

Query: In what city was the band behind the album Love Bites formed?

Ground-truth: Bolton, England

<search> band Love Bites album </search>
<information> Love Bites is by Buzzcocks... </information>
<search> formation city Buzzcocks band </search>
<information> Buzzcocks were formed in Bolton, England... </information>
<answer> Bolton, England </answer>
Figure 2 (Case Study): This example illustrates two-hop evidence accumulation: the first retrieval identifies the target entity (Buzzcocks), and the second retrieval resolves the target attribute (formation city).
How to read: each `search -> information -> reasoning` step narrows the candidate answer space.
Main takeaway: the model succeeds by making retrieval steps complementary rather than redundant.
Role in this page: this figure-level example explains the qualitative mechanism behind the quantitative gains reported in the results tables.

Conclusion

InfoReasoner provides a theoretically grounded and practical path to optimize agentic reasoning with retrieval. By rewarding uncertainty reduction directly, it offers better credit assignment for intermediate retrieval steps and improves end-task performance in a scalable way.

BibTeX

@article{hu2026inforeasoner,
  title={Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward},
  author={Hu, Senkang and Dai, Yong and Zhao, Yuzhi and Tao, Yihang and Guo, Yu and Fang, Zhengru and Kwong, Sam Tak Wu and Fang, Yuguang},
  journal={arXiv preprint arXiv:2602.00845},
  year={2026}
}
Disclaimer: This page is an overview for quick reading. For formal definitions, complete derivations, experiment protocols, and final conclusions, please refer to the original paper and official release.