Improving Context Fidelity via Native Retrieval-Augmented Reasoning

Suyuchen Wang, Jinlin Wang, Xinyu Wang, Shiqi Li, Xiangru Tang, Sirui Hong, Xiao-Wen Chang
Chenglin Wu, Bang Liu
University of Montreal and Mila, MetaGPT, McGill University, Yale University
EMNLP 2025

Large Language Models often struggle with context fidelity, producing answers that contradict the provided information—a problem known as context hallucination. To address this, we introduce CARE (Context-Aware Retrieval-Enhanced reasoning), a framework that teaches LLMs to dynamically identify and integrate evidence from the input context directly into their reasoning process.

Figure 1: Unlike standard approaches that may ignore or misinterpret context, CARE explicitly retrieves relevant text snippets (highlighted) and weaves them into its thought process, leading to a factually grounded, well-reasoned answer.

Abstract

Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence after the answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process using the model's own retrieval capabilities. Our method requires minimal labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

Our Approach: Native Retrieval-Augmented Reasoning

Conventional Retrieval-Augmented Generation (RAG) is powerful, but often feels like a patch. It bolts on external tools like vector databases, creating a pipeline that adds complexity, latency, and a critical point of failure. More importantly, it frequently underutilizes the rich information already present in the input context. Our key idea is to empower the LLM to perform this retrieval natively, using its inherent language understanding to identify and extract salient facts directly from the provided text.

CARE achieves this through a two-phase training process designed to be data-efficient and highly effective:

Figure 2: The CARE training pipeline. Top: We generate SFT data by injecting ground-truth facts into reasoning chains. Bottom: The model is first trained in the SFT phase and then refined in the RL phase using multiple rewards and a curriculum learning strategy.


Phase 1: Supervised Fine-Tuning (SFT)

To kickstart the learning process and solve the "cold-start" problem for reinforcement learning, we first familiarize the model with the target output format. We generate a high-quality dataset by taking existing QA pairs with annotated supporting facts (like HotpotQA) and programmatically injecting these facts into a chain-of-thought reasoning process. We wrap the injected evidence with special tokens (<RETRIEVAL> and </RETRIEVAL>). This phase teaches the model the structure of evidence-based reasoning.
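
To make the data-construction step concrete, below is a minimal sketch of how such an SFT example could be assembled from a HotpotQA-style record. The <THINK>/<RETRIEVAL> tags follow the description above; the prompt template, helper name, and reasoning wording are illustrative assumptions rather than the authors' exact implementation.

```python
RET_OPEN, RET_CLOSE = "<RETRIEVAL>", "</RETRIEVAL>"

def build_sft_example(question, context, supporting_facts, answer):
    """Inject annotated supporting facts into a chain-of-thought target (illustrative)."""
    # Each ground-truth supporting fact is wrapped in retrieval tags so the model
    # learns the structure of evidence-grounded reasoning during SFT.
    evidence_steps = "\n".join(
        f"Relevant evidence: {RET_OPEN}{fact}{RET_CLOSE}" for fact in supporting_facts
    )
    reasoning = f"<THINK>\n{evidence_steps}\nTherefore, the answer is {answer}.\n</THINK>"
    return {
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "target": f"{reasoning}\n{answer}",
    }

# Toy usage with a HotpotQA-style record (fields shortened for brevity):
example = build_sft_example(
    question="Which city hosted the 1992 Summer Olympics?",
    context="... Barcelona hosted the 1992 Summer Olympics. ...",
    supporting_facts=["Barcelona hosted the 1992 Summer Olympics."],
    answer="Barcelona",
)
```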

Phase 2: Reinforcement Learning (RL) with Curriculum

While SFT teaches the format, RL teaches the skill of self-retrieval. In this phase, the model learns to identify and retrieve relevant evidence on its own, without ground-truth evidence labels. We use Group Relative Policy Optimization (GRPO) with a custom, multi-faceted reward function that encourages the following (a sketch of the combined reward appears after this list):

  • Answer Accuracy (Racc): The generated answer must be correct.
  • Format Consistency (Rfmt): The reasoning must follow the desired <THINK> and <RETRIEVAL> structure.
  • Retrieval Reward (Rret): The model is rewarded for using the <RETRIEVAL> tokens and ensuring the enclosed text actually exists in the original context.
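
The sketch below illustrates how these three signals could be combined into a single scalar reward for GRPO. The tag names match the description above; the matching rules, answer-extraction heuristic, and reward weights are assumptions for illustration, not the authors' exact implementation.

```python
import re

THINK_RE = re.compile(r"<THINK>.*?</THINK>", re.DOTALL)
RET_RE = re.compile(r"<RETRIEVAL>(.*?)</RETRIEVAL>", re.DOTALL)

def format_reward(response):
    """R_fmt: response contains a well-formed <THINK> ... </THINK> block."""
    return 1.0 if THINK_RE.search(response) else 0.0

def retrieval_reward(response, context):
    """R_ret: fraction of <RETRIEVAL> spans that appear verbatim in the given context."""
    spans = [s.strip() for s in RET_RE.findall(response)]
    if not spans:
        return 0.0
    return sum(1.0 for s in spans if s and s in context) / len(spans)

def accuracy_reward(response, gold_answer):
    """R_acc: crude exact-match check on the text after the reasoning block."""
    final = response.split("</THINK>")[-1].strip().lower()
    return 1.0 if gold_answer.lower() in final else 0.0

def total_reward(response, context, gold_answer, w_acc=1.0, w_fmt=0.5, w_ret=0.5):
    """Weighted sum passed to GRPO; the weights here are illustrative assumptions."""
    return (w_acc * accuracy_reward(response, gold_answer)
            + w_fmt * format_reward(response)
            + w_ret * retrieval_reward(response, context))
```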

Crucially, we employ a curriculum learning strategy. The model starts training on simpler, short-context QA datasets (e.g., DROP) and gradually transitions to more complex, long-context, multi-hop datasets (e.g., MS MARCO). This structured progression allows the model to build robust retrieval capabilities without catastrophic forgetting.
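
One simple way to realize such a curriculum is a sampler that gradually shifts probability mass from short-context to long-context multi-hop training data as optimization proceeds. The linear schedule and pool names below are illustrative assumptions; the paper's exact staging may differ.

```python
import random

def curriculum_sample(step, total_steps, easy_pool, hard_pool, rng=random.Random(0)):
    """Draw one training example, favoring short-context (easy) data early in
    training and long-context multi-hop (hard) data later. The linear ramp is
    an illustrative choice, not the paper's exact schedule."""
    p_hard = min(1.0, step / max(1, total_steps))  # grows from 0 to 1 over training
    pool = hard_pool if rng.random() < p_hard else easy_pool
    return rng.choice(pool)

# Toy usage: early steps mostly draw short-context examples, late steps mostly multi-hop ones.
easy = [{"dataset": "short-context QA", "id": i} for i in range(4)]
hard = [{"dataset": "long-context multi-hop QA", "id": i} for i in range(4)]
for step in (0, 300, 600, 1000):
    print(step, curriculum_sample(step, 1000, easy, hard)["dataset"])
```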

Experiments and Results

Performance on Real-World QA

To rigorously test our approach, we benchmarked CARE against strong baselines, including the original LLMs and other retrieval-augmented methods, across a suite of demanding real-world multi-hop QA tasks. As shown in Table 1, CARE consistently achieves state-of-the-art performance across all tested models (Llama-3.1 8B, Qwen2.5 7B, and Qwen2.5 14B). Notably, with Llama-3.1 8B, CARE achieves a +15.29% average F1 improvement over the original model, with massive gains on complex datasets like 2WikiMQA (+29.42%) and MuSiQue (+18.92%).

Model          Method        MFQA    HotpotQA  2WikiMQA  MuSiQue  Average
Llama-3.1 8B   Original      45.57   54.64     45.87     32.08    44.54
               ReSearch      /       /         /         /        /
               R1-Searcher   28.44   53.71     67.10     41.41    47.67
               CRAG          44.04   37.88     25.95     24.10    32.99
               CARE          49.94   63.09     75.29     51.00    59.83
Qwen2.5 7B     Original      46.94   58.47     46.96     30.78    45.79
               ReSearch      32.45   54.24     55.78     47.61    47.52
               R1-Searcher   28.36   55.43     65.79     47.09    49.17
               CRAG          47.90   43.97     33.00     28.44    38.33
               CARE          48.11   63.45     70.11     45.57    56.81
Qwen2.5 14B    Original      47.58   61.94     59.05     37.99    51.64
               ReSearch      /       /         /         /        /
               R1-Searcher   /       /         /         /        /
               CRAG          50.89   44.74     34.68     28.17    39.62
               CARE          48.81   67.75     78.68     51.27    61.63

Table 1: Evaluation on the real-world QA datasets (F1 scores). "/" indicates results that are not available for that model.

Robustness to Counterfactual Information

A key test of context fidelity is whether a model can ignore its own pre-trained knowledge when the context presents conflicting (counterfactual) information. We tested this on the CofCA benchmark. Table 2 shows that while traditional online search methods often degrade performance by retrieving conflicting external knowledge, CARE excels. It demonstrates superior context fidelity, with gains as high as +13.69% on Llama-3.1 8B, proving its ability to ground its reasoning firmly in the provided text.

Model          Method        CofCA (F1)
Llama-3.1 8B   Original      48.14
               R1-Searcher   45.25
               CARE          61.83
Qwen2.5 7B     Original      58.38
               ReSearch      47.32
               R1-Searcher   43.61
               CRAG          56.01
               CARE          64.56
Qwen2.5 14B    Original      64.40
               CRAG          51.99
               CARE          67.75

Table 2: Evaluation on the counterfactual QA task (CofCA, F1 scores).

Ablation Studies

What makes CARE so effective? To dissect our framework and understand the contribution of each component, we conducted a series of ablation studies on Qwen2.5 7B. The results in Table 3 demonstrate that each part of CARE is crucial for optimal performance. While SFT alone offers only marginal benefits, adding Reinforcement Learning (`No Ret.`) provides a substantial boost. Incorporating our retrieval reward (`No Cur.`) and the full curriculum learning strategy (`CARE`) further improves performance and generalization, highlighting the synergistic effect of our complete method.

Settings   SFT  RL   Ret.  Cur.  MFQA    HotpotQA  2WikiMQA  MuSiQue  CofCA   Average
Baseline   –    –    –     –     46.64   58.47     46.96     30.78    58.38   48.25
SFT Only   ✓    –    –     –     42.24   47.08     61.51     33.82    59.21   48.77
No Ret.    ✓    ✓    –     –     37.66   62.59     70.57     43.85    57.26   54.39
No Cur.    ✓    ✓    ✓     –     38.33   64.10     70.69     47.49    60.60   56.24
CARE       ✓    ✓    ✓     ✓     48.11   63.45     70.11     45.57    64.56   58.36

Table 3: Ablation studies on the QA tasks based on Qwen2.5 7B. "Ret." stands for the retrieval reward and "Cur." for curriculum learning; a "✓" marks the components enabled in each setting.

Evidence Retrieval Evaluation

Does CARE actually retrieve better evidence? To measure this directly, we evaluated the quality of the retrieved text snippets on the LongBench HotpotQA benchmark using BLEU and ROUGE-L scores. Figure 3 clearly shows that across all model scales, CARE consistently achieves the highest scores, confirming that our framework effectively enhances the model's ability to extract relevant, high-quality evidence to support its reasoning.

Figure 3: Evidence Retrieval Evaluation. CARE consistently achieves the highest BLEU and ROUGE-L scores across all models, indicating superior evidence extraction quality.
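
As a pointer to how such an evaluation can be set up, the sketch below scores the spans a model wraps in <RETRIEVAL> tags against the gold supporting facts with BLEU and ROUGE-L, using the nltk and rouge-score packages. The tokenization, smoothing, and aggregation choices here are assumptions and may differ from the paper's exact setup.

```python
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

RET_RE = re.compile(r"<RETRIEVAL>(.*?)</RETRIEVAL>", re.DOTALL)
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def score_retrieval(response, gold_evidence):
    """Compare the concatenated retrieved spans against the gold supporting facts."""
    retrieved = " ".join(s.strip() for s in RET_RE.findall(response))
    if not retrieved:
        return 0.0, 0.0
    bleu = sentence_bleu([gold_evidence.split()], retrieved.split(),
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = scorer.score(gold_evidence, retrieved)["rougeL"].fmeasure
    return bleu, rouge_l
```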

Conclusion

In this work, we introduced CARE, a native retrieval-augmented reasoning framework designed to address the critical challenge of context fidelity in Large Language Models. By teaching models to dynamically identify and integrate evidence from the provided context, CARE significantly reduces hallucinations and improves answer accuracy without relying on complex external retrieval systems. Our two-phase training strategy, combining data-efficient Supervised Fine-Tuning with a curriculum-based Reinforcement Learning approach, proves to be highly effective across a range of benchmarks. The comprehensive results demonstrate that CARE not only outperforms strong baselines but also exhibits robust reasoning even in challenging counterfactual scenarios. This research represents a significant step toward building more reliable, trustworthy, and efficient LLMs that can faithfully ground their reasoning in the information they are given.

BibTeX

@inproceedings{wang2025improving,
  title     = {Improving Context Fidelity via Native Retrieval-Augmented Reasoning},
  author    = {Suyuchen Wang and Jinlin Wang and Xinyu Wang and Shiqi Li and Xiangru Tang and Sirui Hong and Xiao-Wen Chang and Chenglin Wu and Bang Liu},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://openreview.net/forum?id=24BhNX3LBK}
}