This folder contains scripts for preprocessing the RAGBench/FinQA dataset to generate training and test data for hallucination detection.
The preprocessing pipeline consists of four main steps:
1. Preprocess → 2. Generate Response → 3. Generate Labels → 4. Filter
Script: preprocess.py
Prepares the raw dataset for response generation.
python scripts/preprocess/preprocess.py \
--output_dir ../../datasetsKey Arguments:
--output_dir: Directory for preprocessed output
Generate responses using either HuggingFace models or OpenAI GPT models.
Script: generate_response_hf.py
Use local HuggingFace models (e.g., Qwen, Llama) for response generation.
# Basic usage
python scripts/preprocess/generate_response_hf.py
# With custom parameters
python scripts/preprocess/generate_response_hf.py \
--train_path datasets/train/train.jsonl \
--test_path datasets/test/test.jsonl \
--model_name "Qwen/Qwen3-0.6B" \
--output_dir datasets \
--train_samples 3000 \
--test_samples 1176 \
--max_new_tokens 1024 \Key Arguments:
--model_name: HuggingFace model name (default:Qwen/Qwen3-0.6B)--device: Device to use (auto,cpu,cuda,mps)--max_new_tokens: Maximum tokens to generate--skip_train: Skip training dataset processing--skip_test: Skip test dataset processing
Script: generate_response_gpt.py
Use OpenAI API for response generation (requires API key).
# Basic usage
python scripts/preprocess/generate_response_gpt.py
# With custom parameters
python scripts/preprocess/generate_response_gpt.py \
--train_path datasets/train/train.jsonl \
--test_path datasets/test/test.jsonl \
--model_name "gpt-4.1-mini" \
--output_dir datasets \
--train_samples 3000 \
--test_samples 1176 \
--max_new_tokens 1024 \
--temperature 0.3Key Arguments:
--model_name: OpenAI model name (default:gpt-4o-mini)--temperature: Sampling temperature (default: 0.3)--max_new_tokens: Maximum tokens to generate
Setup:
- Set your OpenAI API key:
Or create a
export OPENAI_API_KEY='your-api-key-here'
.envfile in the project root:OPENAI_API_KEY=your-api-key-here
Output:
- Training:
datasets/train/train{N}_w_response_{model}.jsonl - Test:
datasets/test/test{N}_w_response_{model}.jsonl
Script: generate_labels.py
Generate hallucination labels using LettuceDetect and LLM-as-a-judge.
# Basic usage
python scripts/preprocess/generate_labels.py
# With custom parameters
python scripts/preprocess/generate_labels.py \
--train_path datasets/train/train3000_w_response.jsonl \
--test_path datasets/test/test1176_w_response.jsonl \
--output_dir datasets \
--llm_client "groq" \
--llm_model "llama-3.1-70b-versatile" \
--lettuce_method "transformer"Key Arguments:
--lettuce_method: LettuceDetect method (transformer,embedding, etc.)--lettuce_model: LettuceDetect model path--llm_client: LLM client for judge (openaiorgroq)--llm_model: LLM model name for judge evaluation
Setup:
- For OpenAI: Set
OPENAI_API_KEYin environment or.env - For Groq: Set
GROQ_API_KEYin environment or.env
Output:
- Training:
datasets/train/train{N}_w_labels.jsonl - Test:
datasets/test/test{N}_w_labels.jsonl
Script: filter.py
Filter datasets based on LLM judge agreement and confidence thresholds.
# Basic usage
python scripts/preprocess/filter.py
# With custom parameters
python scripts/preprocess/filter.py \
--train_path datasets/train/train3000_w_labels.jsonl \
--test_path datasets/test/test1176_w_labels.jsonl \
--output_dir datasets \
--use_confidence_threshold \
--confidence_threshold 0.8Key Arguments:
--use_confidence_threshold: Enable confidence-based filtering--confidence_threshold: Minimum confidence score (0.0-1.0)--skip_train: Skip training dataset filtering--skip_test: Skip test dataset filtering
Output:
- Training:
datasets/train/train{N}_w_labels_filtered.jsonl - Test:
datasets/test/test{N}_w_labels_filtered.jsonl
Run the entire pipeline from start to finish:
# Step 1: Preprocess (if needed)
python scripts/preprocess/preprocess.py
# Step 2: Generate responses using GPT
export OPENAI_API_KEY='your-key-here'
python scripts/preprocess/generate_response_gpt.py \
--model_name "gpt-4.1-mini" \
--train_samples 3000 \
--test_samples 1176
# Step 3: Generate labels
export GROQ_API_KEY='your-groq-key-here'
python scripts/preprocess/generate_labels.py \
--llm_client "groq" \
--llm_model "llama-3.1-70b-versatile"
# Step 4: Filter datasets
python scripts/preprocess/filter.py \
--use_confidence_threshold \
--confidence_threshold 0.8# Install core dependencies
pip install -r ../../requirements.txtCreate a .env file in the project root:
OPENAI_API_KEY=your-openai-key-here
GROQ_API_KEY=your-groq-key-here
Or export them in your shell:
export OPENAI_API_KEY='your-openai-key-here'
export GROQ_API_KEY='your-groq-key-here'After running the complete pipeline, you'll have:
datasets/
├── train/
│ ├── train.jsonl # Original data
│ ├── train3000_w_response.jsonl # With responses
│ ├── train3000_w_labels.jsonl # With labels
│ └── train3000_w_labels_filtered.jsonl # Filtered (final)
└── test/
├── test.jsonl # Original data
├── test1176_w_response.jsonl # With responses
├── test1176_w_labels.jsonl # With labels
└── test1176_w_labels_filtered.jsonl # Filtered (final)
helper.py: Utility functions for text processing and similarity computation
For more information, see the main project README.