Preprocessing Pipeline

This folder contains scripts for preprocessing the RAGBench/FinQA dataset to generate training and test data for hallucination detection.

Pipeline Overview

The preprocessing pipeline consists of four main steps:

1. Preprocess → 2. Generate Response → 3. Generate Labels → 4. Filter

Step 1: Preprocess

Script: preprocess.py

Prepares the raw dataset for response generation.

python scripts/preprocess/preprocess.py \
  --output_dir ../../datasets

Key Arguments:

--output_dir: Directory for preprocessed output

Step 2: Generate Response

Generate responses using either HuggingFace models or OpenAI GPT models.

Option A: HuggingFace Models

Script: generate_response_hf.py

Use local HuggingFace models (e.g., Qwen, Llama) for response generation.

# Basic usage
python scripts/preprocess/generate_response_hf.py

# With custom parameters
python scripts/preprocess/generate_response_hf.py \
  --train_path datasets/train/train.jsonl \
  --test_path datasets/test/test.jsonl \
  --model_name "Qwen/Qwen3-0.6B" \
  --output_dir datasets \
  --train_samples 3000 \
  --test_samples 1176 \
  --max_new_tokens 1024 \

Key Arguments:

--model_name: HuggingFace model name (default: Qwen/Qwen3-0.6B)
--device: Device to use (auto, cpu, cuda, mps)
--max_new_tokens: Maximum tokens to generate
--skip_train: Skip training dataset processing
--skip_test: Skip test dataset processing

Option B: OpenAI GPT Models

Script: generate_response_gpt.py

Use OpenAI API for response generation (requires API key).

# Basic usage
python scripts/preprocess/generate_response_gpt.py

# With custom parameters
python scripts/preprocess/generate_response_gpt.py \
  --train_path datasets/train/train.jsonl \
  --test_path datasets/test/test.jsonl \
  --model_name "gpt-4.1-mini" \
  --output_dir datasets \
  --train_samples 3000 \
  --test_samples 1176 \
  --max_new_tokens 1024 \
  --temperature 0.3

Key Arguments:

--model_name: OpenAI model name (default: gpt-4o-mini)
--temperature: Sampling temperature (default: 0.3)
--max_new_tokens: Maximum tokens to generate

Setup:

Set your OpenAI API key:

export OPENAI_API_KEY='your-api-key-here'

Or create a .env file in the project root:

OPENAI_API_KEY=your-api-key-here

Output:

Training: datasets/train/train{N}_w_response_{model}.jsonl
Test: datasets/test/test{N}_w_response_{model}.jsonl

Step 3: Generate Labels

Script: generate_labels.py

Generate hallucination labels using LettuceDetect and LLM-as-a-judge.

# Basic usage
python scripts/preprocess/generate_labels.py

# With custom parameters
python scripts/preprocess/generate_labels.py \
  --train_path datasets/train/train3000_w_response.jsonl \
  --test_path datasets/test/test1176_w_response.jsonl \
  --output_dir datasets \
  --llm_client "groq" \
  --llm_model "llama-3.1-70b-versatile" \
  --lettuce_method "transformer"

Key Arguments:

--lettuce_method: LettuceDetect method (transformer, embedding, etc.)
--lettuce_model: LettuceDetect model path
--llm_client: LLM client for judge (openai or groq)
--llm_model: LLM model name for judge evaluation

Setup:

For OpenAI: Set OPENAI_API_KEY in environment or .env
For Groq: Set GROQ_API_KEY in environment or .env

Output:

Training: datasets/train/train{N}_w_labels.jsonl
Test: datasets/test/test{N}_w_labels.jsonl

Step 4: Filter

Script: filter.py

Filter datasets based on LLM judge agreement and confidence thresholds.

# Basic usage
python scripts/preprocess/filter.py

# With custom parameters
python scripts/preprocess/filter.py \
  --train_path datasets/train/train3000_w_labels.jsonl \
  --test_path datasets/test/test1176_w_labels.jsonl \
  --output_dir datasets \
  --use_confidence_threshold \
  --confidence_threshold 0.8

Key Arguments:

--use_confidence_threshold: Enable confidence-based filtering
--confidence_threshold: Minimum confidence score (0.0-1.0)
--skip_train: Skip training dataset filtering
--skip_test: Skip test dataset filtering

Output:

Training: datasets/train/train{N}_w_labels_filtered.jsonl
Test: datasets/test/test{N}_w_labels_filtered.jsonl

Complete Pipeline Example

Run the entire pipeline from start to finish:

# Step 1: Preprocess (if needed)
python scripts/preprocess/preprocess.py

# Step 2: Generate responses using GPT
export OPENAI_API_KEY='your-key-here'
python scripts/preprocess/generate_response_gpt.py \
  --model_name "gpt-4.1-mini" \
  --train_samples 3000 \
  --test_samples 1176

# Step 3: Generate labels
export GROQ_API_KEY='your-groq-key-here'
python scripts/preprocess/generate_labels.py \
  --llm_client "groq" \
  --llm_model "llama-3.1-70b-versatile"

# Step 4: Filter datasets
python scripts/preprocess/filter.py \
  --use_confidence_threshold \
  --confidence_threshold 0.8

Environment Setup

Required Dependencies

# Install core dependencies
pip install -r ../../requirements.txt

API Keys

Create a .env file in the project root:

OPENAI_API_KEY=your-openai-key-here
GROQ_API_KEY=your-groq-key-here

Or export them in your shell:

export OPENAI_API_KEY='your-openai-key-here'
export GROQ_API_KEY='your-groq-key-here'

Output Files

After running the complete pipeline, you'll have:

datasets/
├── train/
│   ├── train.jsonl                           # Original data
│   ├── train3000_w_response.jsonl            # With responses
│   ├── train3000_w_labels.jsonl              # With labels
│   └── train3000_w_labels_filtered.jsonl     # Filtered (final)
└── test/
    ├── test.jsonl                            # Original data
    ├── test1176_w_response.jsonl             # With responses
    ├── test1176_w_labels.jsonl               # With labels
    └── test1176_w_labels_filtered.jsonl      # Filtered (final)

Additional Scripts

Helper Scripts

helper.py: Utility functions for text processing and similarity computation

For more information, see the main project README.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preprocessing Pipeline

Pipeline Overview

Step 1: Preprocess

Step 2: Generate Response

Option A: HuggingFace Models

Option B: OpenAI GPT Models

Step 3: Generate Labels

Step 4: Filter

Complete Pipeline Example

Environment Setup

Required Dependencies

API Keys

Output Files

Additional Scripts

Helper Scripts

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Preprocessing Pipeline

Pipeline Overview

Step 1: Preprocess

Step 2: Generate Response

Option A: HuggingFace Models

Option B: OpenAI GPT Models

Step 3: Generate Labels

Step 4: Filter

Complete Pipeline Example

Environment Setup

Required Dependencies

API Keys

Output Files

Additional Scripts

Helper Scripts