Abstract
Large reasoning models have shown strong capabilities in complex tasks by combining chain-of-thought reasoning with external retrieval. However, existing training signals for retrieval are often sparse and delayed, making it difficult to assign credit to intermediate search actions.
We introduce InfoReasoner, a unified framework that rewards retrieval by measuring synthetic semantic information gain. The key idea is simple: good retrieval should reduce uncertainty over the answer. We formalize this as uncertainty reduction over belief states and establish properties such as non-negativity and telescoping additivity that justify per-step optimization.
Across seven QA benchmarks, InfoReasoner consistently outperforms strong retrieval-augmented baselines, with an average accuracy improvement of up to 5.4%.
Method Overview
InfoReasoner computes a dense intrinsic reward directly from model outputs without manual retrieval annotation:
- Sample multiple candidate answers with and without retrieved evidence.
- Group semantically equivalent answers via bidirectional textual entailment.
- Estimate belief distributions over semantic classes.
- Use the resulting entropy reduction as an information-gain reward to train the retrieval policy.
The policy is optimized with GRPO, combining the final-answer reward and the information-gain reward to balance answer quality with uncertainty-reducing retrieval behavior.
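The clustering and entropy steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: exact string match stands in for an NLI-based bidirectional entailment checker, and `entails`, `cluster_answers`, and `semantic_entropy` are hypothetical names.

```python
import math

def cluster_answers(answers, entails):
    """Group answers into semantic classes: two answers are equivalent
    when entailment holds in both directions (bidirectional entailment)."""
    clusters = []  # each cluster is a list of mutually equivalent answers
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:  # no existing cluster matched: start a new semantic class
            clusters.append([ans])
    return clusters

def semantic_entropy(answers, entails):
    """Shannon entropy (natural log) of the empirical distribution
    over semantic answer classes."""
    clusters = cluster_answers(answers, entails)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Toy example: string equality as a stand-in entailment checker.
same = lambda a, b: a == b
before = ["Paris", "Lyon", "Paris", "Rome"]   # sampled without evidence
after = ["Paris", "Paris", "Paris", "Paris"]  # sampled with evidence
ig = semantic_entropy(before, same) - semantic_entropy(after, same)
```

In this toy run, retrieval collapses four samples into one semantic class, so the entropy drops to zero and the information gain is positive.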
Theoretical Foundations
InfoReasoner views multi-step retrieval as an uncertainty-reduction process. Instead of rewarding only the final answer, it gives credit to retrieval actions that make the model more certain about the correct semantic answer class.
This yields a dense learning signal throughout reasoning while retaining the final-correctness objective during training.
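The telescoping-additivity property mentioned in the abstract can be sketched as follows, writing \(H_t\) for the semantic entropy of the belief state after step \(t\) (a simplified stand-in for the paper's \(H_{\text{sem}}\) notation):

$$\sum_{t=1}^{T}\mathrm{IG}_t \;=\; \sum_{t=1}^{T}\big(H_{t-1}-H_{t}\big) \;=\; H_0 - H_T.$$

Under this assumption, the per-step information-gain rewards sum to the total uncertainty reduction over the trajectory, so optimizing each step's reward is consistent with optimizing the trajectory-level objective.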
Practical Reward Formulation
In implementation, the model compares its semantic uncertainty before and after retrieval and uses the difference as an intrinsic reward:
$$\widehat{\mathrm{IG}}_t(x_t,a_t)=H_{\text{sem}}(x_t,B)-H_{\text{sem}}(x_t,C_t).$$
The training reward then combines exact-match correctness and this information-gain term (weighted by \(\lambda\)).
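A minimal sketch of this combined reward, assuming class probabilities over semantic answer groups have already been estimated; `step_reward` and its signature are hypothetical, and the default \(\lambda=0.6\) follows the best setting in Table 2.

```python
import math

def semantic_entropy(class_probs):
    """Entropy (natural log) over semantic answer classes."""
    return -sum(p * math.log(p) for p in class_probs if p > 0)

def step_reward(em_correct, probs_before, probs_after, lam=0.6):
    """Combined training reward: exact-match correctness plus a
    lambda-weighted information-gain term, mirroring
    IG_t = H_sem(before) - H_sem(after)."""
    ig = semantic_entropy(probs_before) - semantic_entropy(probs_after)
    return float(em_correct) + lam * ig

# Retrieval that concentrates belief onto one class earns extra reward.
r = step_reward(True, [0.5, 0.25, 0.25], [1.0])
```

With \(\lambda=0\) this reduces to plain exact-match reward, which is exactly the ablation setting \(\lambda=0.0\) in Table 2.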
Main Results (QA Benchmarks)
- Consistent gains across 7 datasets, including both single-hop and multi-hop QA settings.
- Parameter-efficient performance: the 3B model surpasses several 7B retrieval baselines.
- State-of-the-art at both scales: strong improvements are maintained at 3B and 7B.
- Better retrieval behavior: reward shaping encourages exploration early and efficient evidence use later.
| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| RAG (3B) | 0.348 | 0.544 | 0.387 | 0.255 | 0.226 | 0.047 | 0.080 | 0.270 |
| Search-R1-3B-Instruct | 0.405 | 0.566 | 0.354 | 0.316 | 0.224 | 0.056 | 0.184 | 0.301 |
| InfoReasoner-3B | 0.453 | 0.634 | 0.442 | 0.344 | 0.324 | 0.080 | 0.144 | 0.346 |
| Search-R1-7B-Instruct | 0.407 | 0.590 | 0.390 | 0.340 | 0.194 | 0.080 | 0.360 | 0.337 |
| AutoRefine-7B-Base | 0.439 | 0.608 | 0.402 | 0.410 | 0.242 | 0.116 | 0.368 | 0.369 |
| InfoReasoner-7B | 0.447 | 0.614 | 0.416 | 0.414 | 0.302 | 0.120 | 0.424 | 0.391 |
Table 1 (Main Results): Comparison across seven QA benchmarks covering single-hop (NQ, TriviaQA, PopQA) and multi-hop (HotpotQA, 2Wiki, MuSiQue, Bamboogle) settings.
How to read: compare each baseline row with `InfoReasoner-3B` and `InfoReasoner-7B` in the same columns; the last column (`Avg.`) summarizes overall performance.
Main takeaway: InfoReasoner improves both 3B and 7B settings consistently, and the 3B model already outperforms several stronger 7B retrieval baselines on average.
Role in this page: this table is the primary quantitative evidence that the proposed information-gain reward improves retrieval-augmented reasoning quality.
Ablation: Information Gain Weight \(\lambda\)
| Setting | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| \(\lambda=1.0\) | 0.435 | 0.600 | 0.432 | 0.320 | 0.274 | 0.052 | 0.128 | 0.320 |
| \(\lambda=0.8\) | 0.436 | 0.590 | 0.426 | 0.292 | 0.238 | 0.032 | 0.112 | 0.304 |
| \(\lambda=0.6\) | 0.452 | 0.634 | 0.442 | 0.344 | 0.324 | 0.080 | 0.144 | 0.346 |
| \(\lambda=0.4\) | 0.443 | 0.602 | 0.424 | 0.324 | 0.266 | 0.058 | 0.088 | 0.315 |
| \(\lambda=0.2\) | 0.448 | 0.618 | 0.432 | 0.322 | 0.288 | 0.044 | 0.136 | 0.327 |
| \(\lambda=0.0\) | 0.426 | 0.538 | 0.426 | 0.294 | 0.254 | 0.040 | 0.118 | 0.299 |
Table 2 (Ablation on \(\lambda\)): Sensitivity analysis of the information-gain weight across the same seven QA benchmarks.
How to read: \(\lambda=0.0\) means no information-gain term, while larger values increase the reward contribution from uncertainty reduction.
Main takeaway: moderate weighting works best in this setup (\(\lambda=0.6\)); too small a weight weakens retrieval guidance, while too large a weight over-emphasizes the intrinsic reward.
Role in this page: this table explains why performance gains come from balancing task correctness and information gain, rather than maximizing either term alone.
Case Study
Query: In what city was the band behind the album Love Bites formed?
Ground-truth: Bolton, England
<search> band Love Bites album </search>
<information> Love Bites is by Buzzcocks... </information>
<search> formation city Buzzcocks band </search>
<information> Buzzcocks were formed in Bolton, England... </information>
<answer> Bolton, England </answer>
Figure 2 (Case Study): This example illustrates two-hop evidence accumulation: the first retrieval identifies the target entity (Buzzcocks), and the second retrieval resolves the target attribute (formation city).
How to read: each `search -> information -> reasoning` step narrows the candidate answer space.
Main takeaway: the model succeeds by making retrieval steps complementary rather than redundant.
Role in this page: this figure-level example explains the qualitative mechanism behind the quantitative gains reported in the results tables.
Conclusion
InfoReasoner provides a theoretically grounded and practical path to optimize agentic reasoning with retrieval. By rewarding uncertainty reduction directly, it offers better credit assignment for intermediate retrieval steps and improves end-task performance in a scalable way.
BibTeX
@article{hu2026inforeasoner,
title={Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward},
author={Hu, Senkang and Dai, Yong and Zhao, Yuzhi and Tao, Yihang and Guo, Yu and Fang, Zhengru and Kwong, Sam Tak Wu and Fang, Yuguang},
journal={arXiv preprint arXiv:2602.00845},
year={2026}
}
Disclaimer: This page is an overview for quick reading. For formal definitions, complete derivations, experiment protocols, and final conclusions, please refer to the original paper and official release.