Performance on Real-World QA
To test our approach, we benchmarked CARE against strong baselines, including the original LLMs and other RAG methods, on a suite of demanding real-world, multi-hop QA tasks. As shown in Table 1, CARE achieves the highest average F1 on all three tested models (Llama-3.1 8B, Qwen2.5 7B, and Qwen2.5 14B), although a baseline occasionally leads on an individual dataset (for example, CRAG on MFQA with Qwen2.5 14B). Notably, with Llama-3.1 8B, CARE improves average F1 by 15.29 points over the original model (44.54 to 59.83), with the largest gains on complex datasets like 2WikiMQA (+29.42) and MuSiQue (+18.92).
| Model | Method | MFQA | HotpotQA | 2WikiMQA | MuSiQue | Average |
|---|---|---|---|---|---|---|
| Llama-3.1 8B | Original | 45.57 | 54.64 | 45.87 | 32.08 | 44.54 |
| | ReSearch | / | / | / | / | / |
| | R1-Searcher | 28.44 | 53.71 | 67.10 | 41.41 | 47.67 |
| | CRAG | 44.04 | 37.88 | 25.95 | 24.10 | 32.99 |
| | CARE | 49.94 | 63.09 | 75.29 | 51.00 | 59.83 |
| Qwen2.5 7B | Original | 46.94 | 58.47 | 46.96 | 30.78 | 45.79 |
| | ReSearch | 32.45 | 54.24 | 55.78 | 47.61 | 47.52 |
| | R1-Searcher | 28.36 | 55.43 | 65.79 | 47.09 | 49.17 |
| | CRAG | 47.90 | 43.97 | 33.00 | 28.44 | 38.33 |
| | CARE | 48.11 | 63.45 | 70.11 | 45.57 | 56.81 |
| Qwen2.5 14B | Original | 47.58 | 61.94 | 59.05 | 37.99 | 51.64 |
| | ReSearch | / | / | / | / | / |
| | R1-Searcher | / | / | / | / | / |
| | CRAG | 50.89 | 44.74 | 34.68 | 28.17 | 39.62 |
| | CARE | 48.81 | 67.75 | 78.68 | 51.27 | 61.63 |
Table 1: Evaluation on the real-world QA datasets. Best and second-best results are in bold and underline.
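The headline numbers quoted for Llama-3.1 8B can be re-derived from the table rows. The sketch below is only a sanity check, assuming the Average column is the unweighted mean of the four dataset scores; the helper names `macro_average` and `per_dataset_gain` are illustrative and not part of CARE.

```python
# F1 scores for Llama-3.1 8B, copied from Table 1.
DATASETS = ["MFQA", "HotpotQA", "2WikiMQA", "MuSiQue"]
ORIGINAL = [45.57, 54.64, 45.87, 32.08]
CARE = [49.94, 63.09, 75.29, 51.00]


def macro_average(scores):
    """Unweighted mean over datasets, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


def per_dataset_gain(base, new):
    """Absolute F1 difference (in points) for each dataset."""
    return {name: round(n - b, 2) for name, b, n in zip(DATASETS, base, new)}


avg_original = macro_average(ORIGINAL)
avg_care = macro_average(CARE)
average_gain = round(avg_care - avg_original, 2)
gains = per_dataset_gain(ORIGINAL, CARE)

print(f"Original avg: {avg_original}, CARE avg: {avg_care}, gain: {average_gain}")
for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

Under that assumption, the computed averages and gains match the Average column and the improvements quoted in the text.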
Robustness to Counterfactual Information
A key test of context fidelity is whether a model can set aside its own pre-trained knowledge when the context presents conflicting (counterfactual) information. We tested this on the CofCA benchmark. Table 2 shows that traditional online search methods often degrade performance by retrieving conflicting external knowledge (every reported baseline scores below the original model), whereas CARE improves on the original model in all three cases. The gain is as high as 13.69 F1 points on Llama-3.1 8B (48.14 to 61.83), indicating that CARE grounds its reasoning firmly in the provided text.
| Model | Method | CofCA (F1) |
|---|---|---|
| Llama-3.1 8B | Original | 48.14 |
| | R1-Searcher | 45.25 |
| | CARE | 61.83 |
| Qwen2.5 7B | Original | 58.38 |
| | ReSearch | 47.32 |
| | R1-Searcher | 43.61 |
| | CRAG | 56.01 |
| | CARE | 64.56 |
| Qwen2.5 14B | Original | 64.40 |
| | CRAG | 51.99 |
| | CARE | 67.75 |
Table 2: Evaluation on the counterfactual QA task. Best and second-best results are in bold and underline.
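The scores above are reported as F1, but the exact scoring script is not given here. A common choice for extractive QA is SQuAD-style token-overlap F1, and the sketch below assumes that definition. The normalization rules and the function names `normalize` and `token_f1` are assumptions for illustration, not the benchmark's official scorer.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counterfactual example: the context says the capital is "Lyon",
# so the reference follows the context rather than world knowledge.
reference = "Lyon"
print(token_f1("Lyon", reference))   # answer grounded in the context
print(token_f1("Paris", reference))  # answer recalled from pre-trained knowledge
```

In a counterfactual setting the reference answer follows the context, so an answer recalled from pre-trained knowledge shares no tokens with it and scores zero under this metric.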
Ablation Studies
What makes CARE effective? To understand the contribution of each component, we conducted a series of ablation studies on Qwen2.5 7B. The results in Table 3 show that each part of CARE contributes to the final performance. SFT alone offers only a marginal benefit (average 48.25 to 48.77), while adding Reinforcement Learning (`No Ret.`) provides a substantial boost (to 54.39). Incorporating our retrieval reward (`No Cur.`, 56.24) and the full curriculum learning strategy (`CARE`, 58.36) further improves average performance. Curriculum learning mainly helps generalization: it recovers MFQA (38.33 to 48.11) and raises CofCA (60.60 to 64.56), at a small cost on HotpotQA, 2WikiMQA, and MuSiQue.
| Settings | SFT | RL | Ret. | Cur. | MFQA | HotpotQA | 2WikiMQA | MuSiQue | CofCA | Average | 
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | ✗ | ✗ | 46.64 | 58.47 | 46.96 | 30.78 | 58.38 | 48.25 | 
| SFT Only | ✓ | ✗ | ✗ | ✗ | 42.24 | 47.08 | 61.51 | 33.82 | 59.21 | 48.77 | 
| No Ret. | ✓ | ✓ | ✗ | ✗ | 37.66 | 62.59 | 70.57 | 43.85 | 57.26 | 54.39 | 
| No Cur. | ✓ | ✓ | ✓ | ✗ | 38.33 | 64.10 | 70.69 | 47.49 | 60.60 | 56.24 | 
| CARE | ✓ | ✓ | ✓ | ✓ | 48.11 | 63.45 | 70.11 | 45.57 | 64.56 | 58.36 | 
Table 3: Ablation studies on the QA tasks based on Qwen2.5 7B. "Ret." stands for retrieval reward, and "Cur." for curriculum learning.
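To read the ablation as a sequence of additions, it can help to difference the Average column row by row. The sketch below does only that arithmetic on the values from Table 3; the name `incremental_gains` is illustrative. Attributing each step to a single component assumes the components are added in the order the table lists them.

```python
# (setting, average score) pairs copied from the Average column of Table 3.
ABLATION_AVERAGE = [
    ("Baseline", 48.25),
    ("SFT Only", 48.77),
    ("No Ret.", 54.39),
    ("No Cur.", 56.24),
    ("CARE", 58.36),
]


def incremental_gains(rows):
    """Average-score change of each setting relative to the previous row."""
    return [
        (name, round(score - prev_score, 2))
        for (_, prev_score), (name, score) in zip(rows, rows[1:])
    ]


steps = incremental_gains(ABLATION_AVERAGE)
total = round(ABLATION_AVERAGE[-1][1] - ABLATION_AVERAGE[0][1], 2)

for name, gain in steps:
    print(f"{name}: +{gain}")
print(f"Total over baseline: +{total}")
```

On these numbers the largest single step is the addition of RL, with the retrieval reward and curriculum learning each adding a smaller further gain.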
Evidence Retrieval Evaluation
Does CARE actually retrieve better evidence? To measure this directly, we evaluated the quality of the retrieved text snippets on the LongBench HotpotQA benchmark using BLEU and ROUGE-L scores. Figure 3 shows that CARE achieves the highest scores at every model scale, indicating that our framework enhances the model's ability to extract relevant, high-quality evidence to support its reasoning.
 
Figure 3: Evidence Retrieval Evaluation. CARE consistently achieves the highest BLEU and ROUGE-L scores across all models, indicating superior evidence extraction quality.
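For readers unfamiliar with ROUGE-L, it scores a retrieved snippet by its longest common subsequence (LCS) with the gold evidence. The sketch below is a minimal version assuming lowercase whitespace tokenization and the balanced F-measure. The evaluation script actually used is not specified here, so `lcs_length` and `rouge_l_f1` are illustrative names rather than the authors' implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical snippet pair: one word differs, so the LCS covers 5 of 6 tokens.
retrieved = "the cat sat on the mat"
gold = "the cat lay on the mat"
print(round(rouge_l_f1(retrieved, gold), 4))
```

Because the LCS respects word order without requiring contiguity, a snippet that quotes the gold evidence with a few words dropped or substituted still scores high, while an unrelated snippet scores near zero.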
 
      