Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Senkang Hu¹,², Yong Dai³, Yuzhi Zhao², Yihang Tao¹,², Yu Guo¹,², Zhengru Fang¹,², Sam Tak Wu Kwong⁴, Yuguang Fang¹,²
¹Hong Kong JC STEM Lab of Smart City, ²City University of Hong Kong, ³Fudan University, ⁴Lingnan University

Why This Project Exists

Retrieval gives an LLM agent a way to look outside its parametric memory. But learning when to search, what to search, and whether a search result actually helps is still difficult. In most training pipelines, the agent only receives a final answer reward. That reward arrives too late to explain which retrieval action changed the reasoning trajectory.

A good search step does not always contain the final answer directly. Sometimes it removes a wrong hypothesis. Sometimes it makes one candidate answer much more plausible than the rest. If training only asks whether the final answer is correct, this useful intermediate behavior is easy to miss.

The question behind InfoReasoner is simple: can we reward retrieval actions by measuring how much they reduce semantic uncertainty during reasoning?

The Core Intuition

Imagine asking the model the same question several times. Before retrieval, its answers may scatter across multiple semantic possibilities. After a useful search, those answers should concentrate around a better supported answer class. This concentration is information gain.

InfoReasoner turns that intuition into a reward. It samples answers with and without retrieved evidence, groups equivalent answers by meaning, estimates semantic belief distributions, and rewards retrieval when evidence increases support for the correct semantic answer class.

$$r_t = \log p(c^\star\mid C_t) - \log p(c^\star\mid B).$$
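
As a rough illustration of the formula above, the sketch below estimates the two probabilities from sampled answers and takes their log ratio. It reads \(C_t\) as the context that includes the evidence retrieved at step \(t\) and \(B\) as the retrieval-free baseline, which is an interpretation of the surrounding text rather than a definition taken from the paper. The function name `information_gain_reward`, the additive smoothing, and the sample labels are all assumptions made for this example.

```python
import math


def information_gain_reward(before, after, target, smoothing=1.0):
    """Log-probability gain for `target` between two sets of sampled cluster labels.

    `before` and `after` are lists of semantic-cluster labels for answers sampled
    without and with the retrieved evidence. Additive smoothing (an assumption of
    this sketch) keeps the logarithm finite when a class is never sampled.
    """
    classes = set(before) | set(after) | {target}

    def prob(samples):
        hits = sum(1 for label in samples if label == target)
        return (hits + smoothing) / (len(samples) + smoothing * len(classes))

    return math.log(prob(after)) - math.log(prob(before))


# Made-up cluster labels for eight sampled answers on each side.
before = ["manchester", "london", "bolton", "manchester",
          "london", "liverpool", "manchester", "bolton"]
after = ["bolton"] * 7 + ["manchester"]

reward = information_gain_reward(before, after, target="bolton")
print(f"information-gain reward: {reward:.3f}")
```

The reward is positive when evidence moves probability mass toward the correct class, zero when the belief is unchanged, and negative when the evidence pulls the model away from it.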

How InfoReasoner Works

Sample answer beliefs

For a reasoning state, the model samples candidate answers before and after a retrieval action.

Cluster semantic answers

Answers that express the same meaning are grouped together, so the reward tracks meaning rather than surface wording.

Estimate uncertainty

The semantic answer clusters form belief distributions, and their entropy describes how uncertain the model is.

Reward useful retrieval

A search action receives intrinsic reward when retrieved evidence increases probability mass on the correct semantic class.

Optimize the agent

The information-gain reward is combined with final answer reward, then used to train the retrieval policy with GRPO.
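
The steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: string normalization stands in for the semantic clustering step, the reward is mixed additively with a weight `lam`, and the advantage uses the group mean and standard deviation in the way GRPO is commonly described. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(answer):
    """Crude stand-in for semantic equivalence: lowercase, drop punctuation and articles."""
    text = re.sub(r"[^a-z0-9 ]", " ", answer.lower())
    return " ".join(t for t in text.split() if t not in {"a", "an", "the"})


def belief(answers):
    """Turn sampled answers into a probability distribution over clusters."""
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a cluster distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def combined_reward(answer_reward, info_gain, lam=0.6):
    """Assumed additive mix of final-answer reward and information gain."""
    return answer_reward + lam * info_gain


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up samples: answers before and after a retrieval step.
before = ["Manchester", "manchester.", "Bolton", "London"]
after = ["Bolton", "bolton", "Bolton.", "Manchester"]
print("entropy before:", round(entropy(belief(before)), 3))
print("entropy after: ", round(entropy(belief(after)), 3))

# Made-up (answer reward, information gain) pairs for a group of four rollouts.
rollouts = [(1.0, 0.5), (0.0, 0.0), (1.0, 0.0), (0.0, 0.3)]
rewards = [combined_reward(em, ig) for em, ig in rollouts]
advantages = group_relative_advantages(rewards)
print("advantages:", [round(a, 3) for a in advantages])
```

In practice the clustering step would need a real notion of semantic equivalence, since string normalization cannot tell that "Bolton" and "Bolton, England" mean the same thing.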

Overview of the InfoReasoner framework: InfoReasoner compares answer beliefs before and after retrieval, turns semantic information gain into a dense reward, and trains the agent to search when search meaningfully improves belief.

What We Observed

We evaluate InfoReasoner on seven question-answering benchmarks, including both single-hop and multi-hop settings. The results show that information-gain reward improves retrieval-augmented reasoning at both 3B and 7B scales.

Method NQ TriviaQA PopQA HotpotQA 2Wiki MuSiQue Bamboogle Avg.
RAG (3B) 0.348 0.544 0.387 0.255 0.226 0.047 0.080 0.270
Search-R1-3B-Instruct 0.405 0.566 0.354 0.316 0.224 0.056 0.184 0.301
InfoReasoner-3B 0.453 0.634 0.442 0.344 0.324 0.080 0.144 0.346
Search-R1-7B-Instruct 0.407 0.590 0.390 0.340 0.194 0.080 0.360 0.337
AutoRefine-7B-Base 0.439 0.608 0.402 0.410 0.242 0.116 0.368 0.369
InfoReasoner-7B 0.447 0.614 0.416 0.414 0.302 0.120 0.424 0.391
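Main result: InfoReasoner achieves the best average score at both scales. InfoReasoner-3B (0.346) improves on Search-R1-3B-Instruct on six of the seven benchmarks, Bamboogle being the exception, and its average already exceeds that of the larger Search-R1-7B-Instruct (0.337). InfoReasoner-7B (0.391) outperforms both 7B baselines on every benchmark.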
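
The Avg. column appears to be the unweighted mean of the seven benchmark scores. The short check below recomputes it from the rows of the table; the dictionary layout and the helper name `macro_average` are choices made for this example only.

```python
# Per-benchmark scores copied from the table, in the order
# NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, MuSiQue, Bamboogle.
scores = {
    "RAG (3B)": [0.348, 0.544, 0.387, 0.255, 0.226, 0.047, 0.080],
    "Search-R1-3B-Instruct": [0.405, 0.566, 0.354, 0.316, 0.224, 0.056, 0.184],
    "InfoReasoner-3B": [0.453, 0.634, 0.442, 0.344, 0.324, 0.080, 0.144],
    "Search-R1-7B-Instruct": [0.407, 0.590, 0.390, 0.340, 0.194, 0.080, 0.360],
    "AutoRefine-7B-Base": [0.439, 0.608, 0.402, 0.410, 0.242, 0.116, 0.368],
    "InfoReasoner-7B": [0.447, 0.614, 0.416, 0.414, 0.302, 0.120, 0.424],
}

# Averages as reported in the Avg. column.
reported = {
    "RAG (3B)": 0.270,
    "Search-R1-3B-Instruct": 0.301,
    "InfoReasoner-3B": 0.346,
    "Search-R1-7B-Instruct": 0.337,
    "AutoRefine-7B-Base": 0.369,
    "InfoReasoner-7B": 0.391,
}


def macro_average(values):
    """Unweighted mean over benchmarks, rounded to three decimals."""
    return round(sum(values) / len(values), 3)


for method, values in scores.items():
    print(f"{method:24s} recomputed={macro_average(values):.3f} reported={reported[method]:.3f}")
```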

The ablation tells the same story from another angle. A moderate information-gain weight works best: \(\lambda=0.6\) gives the highest average (0.346), while \(\lambda=0.0\) gives the lowest (0.299). Too little weight makes the retrieval signal weak. Too much can over-emphasize the intrinsic reward and drift away from final answer quality.

Setting NQ TriviaQA PopQA HotpotQA 2Wiki MuSiQue Bamboogle Avg.
\(\lambda=1.0\) 0.435 0.600 0.432 0.320 0.274 0.052 0.128 0.320
\(\lambda=0.8\) 0.436 0.590 0.426 0.292 0.238 0.032 0.112 0.304
\(\lambda=0.6\) 0.452 0.634 0.442 0.344 0.324 0.080 0.144 0.346
\(\lambda=0.4\) 0.443 0.602 0.424 0.324 0.266 0.058 0.088 0.315
\(\lambda=0.2\) 0.448 0.618 0.432 0.322 0.288 0.044 0.136 0.327
\(\lambda=0.0\) 0.426 0.538 0.426 0.294 0.254 0.040 0.118 0.299

Beyond short-answer QA, InfoReasoner also transfers to more agentic settings. On MATH500 with Python tool execution, InfoReasoner-7B reaches 85.8%. On WebDetective, a long-horizon deep-search benchmark, InfoReasoner-3B reaches a Pass@1 of 30.0%, up from 26.5% for Search-R1-3B-Base.

Training Dynamics

The training curves explain why the method works. InfoReasoner explores more in the early phase because useful retrieval can receive positive reward even before the final answer is correct. After that phase, EM rises, entropy stabilizes, and responses become shorter.

Training dynamics analysis: (a) EM score trajectories compare InfoReasoner with Search-R1 during training. (b) Reward components decompose the total training signal into Information Gain (IG) and EM scores. (c) Entropy loss shows how the policy first explores diverse retrieval paths and then becomes more certain. (d) Response length shows that InfoReasoner learns shorter and more stable outputs while maintaining better final accuracy.

A Small Example

The behavior we want is not more search by default. We want search that narrows the answer space. A two-hop question shows this clearly.

Query: In what city was the band behind the album Love Bites formed?

Ground truth: Bolton, England

<search> band Love Bites album </search>
<information> Love Bites is by Buzzcocks... </information>
<search> formation city Buzzcocks band </search>
<information> Buzzcocks were formed in Bolton, England... </information>
<answer> Bolton, England </answer>

The first search identifies the target entity. The second search resolves the target attribute. InfoReasoner rewards this kind of complementary retrieval because each step makes the final answer distribution more certain.
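
To make the structure of such a trajectory concrete, the sketch below splits the tagged trace into ordered steps, so that each search step could later be scored separately. The tag names come from the example above; the parser itself, including `parse_trace`, `search_queries`, and `final_answer`, is a hypothetical helper and not part of any released code.

```python
import re

# Matches <search>, <information>, and <answer> blocks with a matching close tag.
TAG_PATTERN = re.compile(r"<(search|information|answer)>(.*?)</\1>", re.DOTALL)


def parse_trace(trace):
    """Return the trace as an ordered list of (tag, text) steps."""
    return [(tag, body.strip()) for tag, body in TAG_PATTERN.findall(trace)]


def search_queries(steps):
    """All search queries, in the order the agent issued them."""
    return [text for tag, text in steps if tag == "search"]


def final_answer(steps):
    """The last <answer> block, or None if the agent never answered."""
    answers = [text for tag, text in steps if tag == "answer"]
    return answers[-1] if answers else None


trace = """
<search> band Love Bites album </search>
<information> Love Bites is by Buzzcocks... </information>
<search> formation city Buzzcocks band </search>
<information> Buzzcocks were formed in Bolton, England... </information>
<answer> Bolton, England </answer>
"""

steps = parse_trace(trace)
print(search_queries(steps))
print(final_answer(steps))
```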

Information gain analysis: (a) The boxplot compares different retrieval scenarios in the Love Bites example. Jointly observing both documents gives stronger information gain than using either document alone, showing that complementary evidence matters in multi-hop search. (b) The line plot studies group size sensitivity for reward estimation. The error drops quickly as the sampling group grows, then shows diminishing returns, which motivates a moderate group size for balancing reward quality and computation.
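
One simple way to see why larger groups give diminishing returns: if the probability of the correct class is estimated as the fraction of sampled answers that fall in it, the standard error of that estimate shrinks like \(1/\sqrt{G}\) for group size \(G\). The sketch below compares the analytic value with a small simulation. It illustrates this general sampling effect only; the error metric in the paper's plot may be defined differently, and the true probability of 0.25 is made up.

```python
import math
import random


def standard_error(p, group_size):
    """Analytic standard error of a sample proportion with true value p."""
    return math.sqrt(p * (1.0 - p) / group_size)


def simulated_rmse(p, group_size, trials=4000, seed=0):
    """Root-mean-square error of the sample proportion over repeated groups."""
    rng = random.Random(seed)
    squared_error = 0.0
    for _ in range(trials):
        hits = sum(1 for _ in range(group_size) if rng.random() < p)
        squared_error += (hits / group_size - p) ** 2
    return math.sqrt(squared_error / trials)


true_p = 0.25  # made-up probability of the correct semantic class
for g in (4, 8, 16, 32, 64):
    print(f"G={g:3d} analytic={standard_error(true_p, g):.3f} "
          f"simulated={simulated_rmse(true_p, g):.3f}")
```

Quadrupling the group size only halves the error, so each additional sample buys less accuracy than the one before it while costing the same compute.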

Takeaway

InfoReasoner teaches retrieval agents to search when search changes belief. This is a small but important shift: from rewarding final answers only, to rewarding the intermediate actions that make good answers more likely.

BibTeX

If you find our work useful for your research, please consider citing our ICML 2026 paper:

@inproceedings{hu2026inforeasoner,
  title={Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward},
  author={Hu, Senkang and Dai, Yong and Zhao, Yuzhi and Tao, Yihang and Guo, Yu and Fang, Zhengru and Kwong, Sam Tak Wu and Fang, Yuguang},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}
Disclaimer: This page is an overview for quick reading. For formal definitions, complete derivations, experiment protocols, and final conclusions, please refer to the original paper and official release.