CARE: Improving Context Fidelity via Native Retrieval-Augmented Reasoning

Abstract

Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model's own retrieval capabilities. Our method requires minimal labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

Our Approach: Native Retrieval-Augmented Reasoning

Conventional Retrieval-Augmented Generation (RAG) is powerful, but often feels like a patch. It bolts on external tools like vector databases, creating a pipeline that adds complexity, latency, and a critical point of failure. More importantly, it frequently underutilizes the rich information already present in the input context. Our key idea is to empower the LLM to perform this retrieval natively, using its inherent language understanding to identify and extract salient facts directly from the provided text.

CARE achieves this through a two-phase training process designed to be data-efficient and highly effective:

Figure 2: The training pipeline for CARE, showing SFT data generation and the subsequent RL phase.

Figure 2: The CARE training pipeline. Top: We generate SFT data by injecting ground-truth facts into reasoning chains. Bottom: The model is first trained in the SFT phase and then refined in the RL phase using multiple rewards and a curriculum learning strategy.

Phase 1: Supervised Fine-Tuning (SFT)

To kickstart the learning process and solve the "cold-start" problem for reinforcement learning, we first familiarize the model with the target output format. We generate a high-quality dataset by taking existing QA pairs with annotated supporting facts (like HotpotQA) and programmatically injecting these facts into a chain-of-thought reasoning process. We wrap the injected evidence with special tokens (<RETRIEVAL> and </RETRIEVAL>). This phase teaches the model the structure of evidence-based reasoning.

Phase 2: Reinforcement Learning (RL) with Curriculum

While SFT teaches the format, RL teaches the skill of self-retrieval. In this phase, the model learns to identify and retrieve relevant evidence on its own, without ground-truth labels. We use Group Relative Policy Optimization (GRPO) with a custom, multi-faceted reward function that encourages:

Answer Accuracy (R_acc): The generated answer must be correct.
Format Consistency (R_fmt): The reasoning must follow the desired <THINK> and <RETRIEVAL> structure.
Retrieval Reward (R_ret): The model is rewarded for using the <RETRIEVAL> tokens and ensuring the enclosed text actually exists in the original context.

Crucially, we employ a curriculum learning strategy. The model starts training on simpler, short-context QA datasets (e.g., DROP) and gradually transitions to more complex, long-context, multi-hop datasets (e.g., MS MARCO). This structured progression allows the model to build robust retrieval capabilities without catastrophic forgetting.

Experiments and Results

Performance on Real-World QA

To rigorously test our approach, we benchmarked CARE against strong baselines across a suite of demanding real-world, multi-hop QA tasks, including the original LLMs and other RAG methods. As shown in Table 1, CARE consistently achieves state-of-the-art performance across all tested models (Llama-3.1 8B, Qwen2.5 7B, and Qwen2.5 14B). Notably, with Llama-3.1 8B, CARE achieves a +15.29% average F1 improvement over the original model, with massive gains on complex datasets like 2WikiMQA (+29.42%) and MuSiQue (+18.92%).

Model	Method	MFQA	HotpotQA	2WikiMQA	MuSiQue	Average
Llama-3.1 8B	Original	45.57	54.64	45.87	32.08	44.54
	ReSearch	/	/	/	/	/
	R1-Searcher	28.44	53.71	67.10	41.41	47.67
	CRAG	44.04	37.88	25.95	24.10	32.99
	CARE	49.94	63.09	75.29	51.00	59.83
Qwen2.5 7B	Original	46.94	58.47	46.96	30.78	45.79
	ReSearch	32.45	54.24	55.78	47.61	47.52
	R1-Searcher	28.36	55.43	65.79	47.09	49.17
	CRAG	47.90	43.97	33.00	28.44	38.33
	CARE	48.11	63.45	70.11	45.57	56.81
Qwen2.5 14B	Original	47.58	61.94	59.05	37.99	51.64
	ReSearch	/	/	/	/	/
	R1-Searcher	/	/	/	/	/
	CRAG	50.89	44.74	34.68	28.17	39.62
	CARE	48.81	67.75	78.68	51.27	61.63

Table 1: Evaluation on the real-world QA datasets. Best and second-best results are in bold and underline.

Robustness to Counterfactual Information

A key test of context fidelity is whether a model can ignore its own pre-trained knowledge when the context presents conflicting (counterfactual) information. We tested this on the CofCA benchmark. Table 2 shows that while traditional online search methods often degrade performance by retrieving conflicting external knowledge, CARE excels. It demonstrates superior context fidelity, with gains as high as +13.69% on Llama-3.1 8B, proving its ability to ground its reasoning firmly in the provided text.

Model	Method	CofCA (F1)
Llama-3.1 8B	Original	48.14
	R1-Searcher	45.25
	CARE	61.83
Qwen2.5 7B	Original	58.38
	ReSearch	47.32
	R1-Searcher	43.61
	CRAG	56.01
	CARE	64.56
Qwen2.5 14B	Original	64.40
	CRAG	51.99
	CARE	67.75

Table 2: Evaluation on the counterfactual QA task. Best and second-best results are in bold and underline.

Ablation Studies

What makes CARE so effective? To dissect our framework and understand the contribution of each component, we conducted a series of ablation studies on Qwen2.5 7B. The results in Table 3 demonstrate that each part of CARE is crucial for optimal performance. While SFT alone offers only marginal benefits, adding Reinforcement Learning (`No Ret.`) provides a substantial boost. Incorporating our retrieval reward (`No Cur.`) and the full curriculum learning strategy (`CARE`) further improves performance and generalization, highlighting the synergistic effect of our complete method.

Settings	SFT	RL	Ret.	Cur.	MFQA	HotpotQA	2WikiMQA	MuSiQue	CofCA	Average
Baseline	✗	✗	✗	✗	46.64	58.47	46.96	30.78	58.38	48.25
SFT Only	✓	✗	✗	✗	42.24	47.08	61.51	33.82	59.21	48.77
No Ret.	✓	✓	✗	✗	37.66	62.59	70.57	43.85	57.26	54.39
No Cur.	✓	✓	✓	✗	38.33	64.10	70.69	47.49	60.60	56.24
CARE	✓	✓	✓	✓	48.11	63.45	70.11	45.57	64.56	58.36

Table 3: Ablation studies on the QA tasks based on Qwen2.5 7B. "Ret." stands for retrieval reward, and "Cur." for curriculum learning.

Evidence Retrieval Evaluation

Does CARE actually retrieve better evidence? To measure this directly, we evaluated the quality of the retrieved text snippets on the LongBench HotpotQA benchmark using BLEU and ROUGE-L scores. Figure 3 clearly shows that across all model scales, CARE consistently achieves the highest scores, confirming that our framework effectively enhances the model's ability to extract relevant, high-quality evidence to support its reasoning.

Figure 3: Comparison of evidence retrieval performance using BLEU and ROUGE-L metrics.

Figure 3: Evidence Retrieval Evaluation. CARE consistently achieves the highest BLEU and ROUGE-L scores across all models, indicating superior evidence extraction quality.

Conclusion

In this work, we introduced CARE, a native retrieval-augmented reasoning framework designed to address the critical challenge of context fidelity in Large Language Models. By teaching models to dynamically identify and integrate evidence from the provided context, CARE significantly reduces hallucinations and improves answer accuracy without relying on complex external retrieval systems. Our two-phase training strategy, combining data-efficient Supervised Fine-Tuning with a curriculum-based Reinforcement Learning approach, proves to be highly effective across a range of benchmarks. The comprehensive results demonstrate that CARE not only outperforms strong baselines but also exhibits robust reasoning even in challenging counterfactual scenarios. This research represents a significant step toward building more reliable, trustworthy, and efficient LLMs that can faithfully ground their reasoning in the information they are given.

BibTeX

@inproceedings{wang2025improving,
        title     = {Improving Context Fidelity via Native Retrieval-Augmented Reasoning},
        author    = {Suyuchen Wang and Jinlin Wang and Xinyu Wang and Shiqi Li and 
                    Xiangru Tang and Sirui Hong and Xiao-Wen Chang and Chenglin Wu and 
                    Bang Liu},
        booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
        year      = {2025},
        url       = {https://openreview.net/forum?id=24BhNX3LBK}
      }

Improving Context Fidelity via Native Retrieval-Augmented Reasoning

Figure 1: Unlike standard approaches that may ignore or misinterpret context, CARE explicitly retrieves relevant text snippets (highlighted) and weaves them into its thought process, leading to a factually grounded and reasonable answer.