
Commit dfdd854

Merge pull request #4 from FoundationAgents/fix/readme-captions
docs: add captions for benchmark and ablation tables; fix curriculum …
2 parents c9d2172 + d32502f

2 files changed: 11 additions & 5 deletions

File tree:

- README.md
- scripts/training_examples/run_qwen2_5_7b_retrieve_mix_musique.sh

README.md

Lines changed: 10 additions & 4 deletions
```diff
@@ -5,18 +5,22 @@
 Large language models (LLMs) often struggle with **context fidelity**, producing inconsistent or hallucinated answers even when relevant information is present.
 We propose **CARE**, a native retrieval-augmented reasoning framework that integrates in-context evidence directly into the reasoning chain.
 This work represents a step toward making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.
+---
+
 ### Results Overview
 <p align="center">
 <img src="assets/retrieval_results.png" alt="CARE Results" width="85%" style="display:inline-block;"/>
 </p>
+<p align="center"><em>Figure 1: Comparison of model performance across different settings. CARE demonstrates improved results over baselines on multiple QA benchmarks.</em></p>
 
 ### Method Overview
 <p align="center">
 <img src="assets/method.png" width="80%">
 </p>
 
----
+<p align="center"><em>Figure 2: A schematic illustration of the training data and process. The upper part shows SFT data generation (fact injection and special tokens), while the lower part shows the SFT training process together with reinforcement learning (RL) using multiple rewards.</em></p>
 
+---
 
 ## 🔧 Installation
 
```
````diff
@@ -26,7 +30,7 @@ Requirements:
 - `transformers>=4.51.0`
 - `flash-attn>=2.4.3`
 - `vllm>=0.8.3`
-
+---
 Clone and install:
 ```bash
 git clone https://github.com/FoundationAgents/CARE
````
````diff
@@ -44,6 +48,7 @@ Use the provided helper script to download Qwen models and datasets (DROP, MuSiQ
 ```bash
 python CARE/scripts/load_script/load_dataset.py
 ```
+
 ```bash
 python CARE/scripts/load_script/load_model.py
 ```
````
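The two loader invocations above can be chained into one step. A minimal sketch, with the script paths taken from the README; the actual `python` calls are left commented so the sketch is inert outside a CARE checkout:

```shell
# Run both loader scripts in sequence; stop on the first failure.
set -e
for script in load_dataset.py load_model.py; do
  echo "running CARE/scripts/load_script/$script"
  # python "CARE/scripts/load_script/$script"  # uncomment inside the repo
done
```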
```diff
@@ -87,8 +92,8 @@ Edit the script to change:
 | **Qwen2.5 14B** | Original | 47.58 | *61.94* | *59.05* | *37.99* | *51.64* |
 | | CRAG | **50.89** | 44.74 | 34.68 | 28.17 | 39.62 |
 | | **CARE** | *48.81* | **67.75** | **78.68** | **51.27** | **61.63** |
+<p align="center"><em>Table 1: Evaluation on real-world QA datasets. Results are grouped by the base LLM. The best and second-best results are shown in <b>bold</b> and <u>underline</u>, respectively. Slash (/) indicates unavailable checkpoints or unsupported models.</em></p>
 
----
 
 ### Ablation Study
 
```
```diff
@@ -99,6 +104,7 @@ Edit the script to change:
 | No Retrieval ||||| 37.66 | 62.59 | _70.57_ | 43.85 | 57.26 | 54.39 |
 | No Curriculum||||| 38.33 | **64.10** | **70.69** | **47.49** | _60.60_ | _56.24_ |
 | **CARE** ||||| **48.11**| _63.45_ | 70.11 | _45.57_ | **64.56** | **58.36** |
+<p align="center"><em>Table 2: Ablation studies on QA tasks based on Qwen2.5-7B. The best and second-best results are shown in <b>bold</b> and <u>underline</u>, respectively. “Ret.” indicates the retrieval reward, and “Cur.” indicates curriculum learning.</em></p>
 
 📌 *Whether to enable curriculum learning can be controlled in*
-[`CARE/verl/trainer/config.py`](CARE/verl/trainer/config.py).
+[`verl/trainer/config.py`](verl/trainer/config.py).
```
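The README points at `verl/trainer/config.py` for the curriculum-learning toggle, but the flag's exact name is not shown in this diff. A hedged way to locate it from the repo root; the search pattern "curriculum" is a guess:

```shell
# Look for curriculum-related settings in the trainer config.
# The file path comes from the README note above.
CONFIG="verl/trainer/config.py"
if [ -f "$CONFIG" ]; then
  grep -in "curriculum" "$CONFIG"
else
  echo "not found: $CONFIG (run this from the CARE repo root)"
fi
```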

scripts/training_examples/run_qwen2_5_7b_retrieve_mix_musique.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ export VLLM_USE_V1=0
 # Replace the path with either:
 # 1. Your local model checkpoint directory
 # 2. Or a Hugging Face Hub repo id, e.g. Qwen/Qwen2.5-7B-Instruct
-MODEL_PATH=/mnt/home/LLaMA-Factory/saves/Qwen2.5-7B-Instruct-Ret/full/sft
+MODEL_PATH=Qwen/Qwen2.5-7B-Instruct
 
 # -------------------------------
 # System prompt
```
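The change above swaps a machine-specific SFT checkpoint for a Hub repo id. One way to keep both options open is a directory-existence fallback; a sketch, assuming a hypothetical `LOCAL_CKPT` override variable (the default local path is the one this commit removed):

```shell
# Prefer a local checkpoint when it exists; otherwise fall back to the
# Hugging Face Hub repo id. LOCAL_CKPT is a hypothetical override knob.
LOCAL_CKPT="${LOCAL_CKPT:-/mnt/home/LLaMA-Factory/saves/Qwen2.5-7B-Instruct-Ret/full/sft}"
if [ -d "$LOCAL_CKPT" ]; then
  MODEL_PATH="$LOCAL_CKPT"
else
  MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
fi
echo "MODEL_PATH=$MODEL_PATH"
```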

0 commit comments
