Feat/ragtruth by FrejaThoresen · Pull Request #20 · alexandrainst/faithful_eval

FrejaThoresen · 2026-05-13T13:08:44Z

No description provided.

Copilot

Pull request overview

This PR extends the hallucination-detection pipeline to optionally incorporate translated RAGTruth data alongside synthetic MultiWikiQA hallucinations, while also updating prompt templates and adding several supporting scripts/utilities for generation, evaluation, and development workflows.

Changes:

Add RAGTruth preprocessing + translation tooling and integrate translated RAGTruth into the training script as an optional data source.
Add new scripts for baseline evaluation, token-level ground-truth evaluation, and producing hallucination-span annotated datasets.
Update prompt templates + prompt formatting, introduce logging utilities, and adjust generation/dataset code (including parallelism and dependency updates).

Reviewed changes

Copilot reviewed 53 out of 56 changed files in this pull request and generated 13 comments.

File	Description
uv.lock	Updates locked dependencies (adds debugpy/tenacity/termcolor/timm, bumps openai, adds pillow/torchvision, etc.).
src/scripts/translate.py	New script to translate hallucination datasets while attempting to preserve span labels via `<HAL>` tags.
src/scripts/train_hallucination_detector.py	Adds optional mixing of translated RAGTruth with synthetic data; updates logging and model naming.
src/scripts/preprocess_ragtruth.py	New script to convert raw RAGTruth JSONL into the project’s hallucination-data JSON format.
src/scripts/generate_hallucination_dataset.py	New script that generates answers then exports hallucinated spans predicted by the detector.
src/scripts/generate_dataset.py	Adjusts output filename scheme and passes configurable `max_workers` to generation.
src/scripts/evaluate_ground_truth.py	New script to validate a detector against synthetic token-level ground-truth labels.
src/scripts/detect_hallucinations.py	Formats generated answers into ragtruth-like rows (prompt/answer) before detection.
src/scripts/baseline.py	New script running hallucination detection on gold answers as a baseline.
src/prompts/qa_prompt_bg.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_bs.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_ca.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_cs.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_da.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_de.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_el.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_en.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_es.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_et.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_fi.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_fo.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_fr.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_hr.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_hu.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_is.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_it.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_lt.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_lv.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_nl.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_no.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_pl.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_pt.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_ro.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_sk.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_sl.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_sq.txt	Adds Albanian QA prompt template using `${text}` + `${question}`.
src/prompts/qa_prompt_sr.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_sv.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/prompts/qa_prompt_uk.txt	Updates QA prompt template to use `${text}` + `${question}`.
src/factuality_eval/train.py	Adds Albanian language support and introduces `format_dataset_to_ragtruth_without_labels`.
src/factuality_eval/prompt_utils.py	Adds Albanian and changes QA prompt substitution schema to `question` + `text`.
src/factuality_eval/model_generation.py	Adds long-input guard, adjusts sampling params, lazy-loads local models, and adds GPU max-memory control.
src/factuality_eval/logging_utils.py	New logging utilities (stdio capture, colored headers, tqdm styling).
src/factuality_eval/hallucination_detection.py	Switches detection to prompt-based API, validates required columns, and guards divide-by-zero in metrics.
src/factuality_eval/dataset_generation.py	Adds parallel hallucination generation and refines hallucinated span extraction/normalization.
README.md	Replaces literature-review content with end-to-end project documentation and workflows.
pyproject.toml	Adds new dependencies and updates ruff ignore settings for long lines in new scripts.
mise.toml	Adds mise toolchain config (node + CLI tools).
docs/research.md	Adds research notes document replacing the previous literature study doc.
docs/literature_study.md	Removes prior literature-study markdown file.
Dockerfile	Replaces runtime container with a mise/uv-based dev container that launches `opencode`.
docker-compose.yml	Adds compose service for the new Docker-based dev workflow.
config/hallucination_detection.yaml	Adds `max_workers`, updates epochs/model defaults, and enables ragtruth + multiwikiqa with a ragtruth path.
.pre-commit-config.yaml	Bumps pre-commit hook revisions (ruff/nbstripout/mypy).
.gitignore	Ignores `data/ground_truth/*` while keeping a `.gitkeep`.
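The row for `src/factuality_eval/hallucination_detection.py` mentions a divide-by-zero guard in the metrics. The page does not show that code, so the following is only a minimal sketch of one common way to do it; the helper names `safe_div` and `precision_recall_f1` are hypothetical and not taken from the repository.

```python
def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from token-level counts.

    Each ratio falls back to 0.0 instead of raising ZeroDivisionError, which
    matters when a sample has no predicted or no gold hallucinated tokens.
    """
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1
```

Returning 0.0 for an empty denominator is a convention, not the only option; some evaluations skip such samples or report them separately.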

Comments suppressed due to low confidence (1)

src/scripts/translate.py:359

`labels` is reset to `[]` here, and span reconstruction later uses `sample.labels`, discarding the merged/sorted labels used to insert `<HAL>` tags. This breaks round-tripping when overlaps were merged (and can misalign tag order vs label order). Keep the merged/sorted label list and pass it into `find_hallucination_tags`.
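A minimal sketch of the idea in this comment, under stated assumptions: labels are taken to be dicts with `start` and `end` character offsets, and spans are wrapped in `<HAL>...</HAL>` with a closing tag. Neither detail is confirmed by this page, and `merge_labels` and `insert_hal_tags` are hypothetical names, not the functions in `translate.py`. The point is only that the merged, sorted list is returned to the caller so later span reconstruction sees the same labels that produced the tags.

```python
def merge_labels(labels: list[dict]) -> list[dict]:
    """Sort labels by offset and merge spans that overlap or touch."""
    merged: list[dict] = []
    for lab in sorted(labels, key=lambda lab: (lab["start"], lab["end"])):
        if merged and lab["start"] <= merged[-1]["end"]:
            # Overlap with the previous span: extend it instead of adding a new one.
            merged[-1]["end"] = max(merged[-1]["end"], lab["end"])
        else:
            merged.append({"start": lab["start"], "end": lab["end"]})
    return merged


def insert_hal_tags(text: str, labels: list[dict]) -> tuple[str, list[dict]]:
    """Wrap each merged span in <HAL> tags and return the merged labels too."""
    merged = merge_labels(labels)
    parts: list[str] = []
    cursor = 0
    for lab in merged:
        parts.append(text[cursor:lab["start"]])
        parts.append("<HAL>" + text[lab["start"]:lab["end"]] + "</HAL>")
        cursor = lab["end"]
    parts.append(text[cursor:])
    # Returning `merged` keeps tag order and label order aligned downstream.
    return "".join(parts), merged
```

With this shape, the caller would pass the returned list (rather than the original, unmerged labels) to whatever step maps tags back to spans after translation.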

FrejaThoresen · 2026-05-13T13:28:45Z

+    if input_length > 8192:
+        return ""


@copilot apply changes based on this feedback

+    target_dataset_name = f"{config.base_dataset.id}-synthetic-hallucinations"
+    hallucination_detector_hugging_face_path = (
+        f"{config.hub_organisation}/"
+        f"{config.models.hallu_detect_model}-{target_dataset_name}-{config.language}"
+    )


+def get_hallucinated_labels(
+    hallucinated_dict: dict,
+) -> tuple[list[dict], list[str]] | None:
    """Get the hallucinated labels from the generation result.



FrejaThoresen · 2026-05-13T13:30:14Z

+    api_key = os.getenv("OPENAI_API_KEY")
+    base_url = os.getenv("OPENAI_BASE_URL") or "https://api.openai.com/v1"
+    print(api_key)
+    print(base_url)
+    return {
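The hunk above prints the raw value of `OPENAI_API_KEY`, which would end up in any captured logs. The following is a minimal sketch of reading the same two environment variables while logging only a masked key. The function names `mask_secret` and `load_api_settings` and the returned dictionary keys are hypothetical, since the original `return {` is truncated on this page.

```python
from __future__ import annotations

import logging
import os

logger = logging.getLogger(__name__)


def mask_secret(value: str | None, visible: int = 4) -> str:
    """Return a loggable form of a secret that shows only the last characters."""
    if not value:
        return "<unset>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def load_api_settings() -> dict[str, str | None]:
    """Read API settings from the environment without echoing the key."""
    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL") or "https://api.openai.com/v1"
    logger.debug("OPENAI_API_KEY=%s", mask_secret(api_key))
    logger.debug("OPENAI_BASE_URL=%s", base_url)
    # Hypothetical keys; the real dictionary layout is not shown in the diff.
    return {"api_key": api_key, "base_url": base_url}
```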


@copilot apply changes based on this feedback

+def _build_detector_model_path(config: DictConfig) -> str:
+    target_dataset_name = f"{config.base_dataset.id}-synthetic-hallucinations"
+    return (
+        f"{config.hub_organisation}/"
+        f"{config.models.hallu_detect_model}-{target_dataset_name}-{config.language}"


        tmpl = PromptUtils.load_prompt(f"qa_prompt_{lang.lower()}.txt")
-        return tmpl.substitute(
-            question=question, num_passages=len(context), context=ctx_block
-        )
+        return tmpl.substitute(question=question, text=ctx_block)
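The hunk above switches the substitution keys to `question` and `text`, matching the `${text}` + `${question}` placeholders listed for the prompt files. A minimal sketch of how `string.Template` behaves here, assuming `PromptUtils.load_prompt` returns a `string.Template` (the `.substitute` call suggests so, but this page does not confirm it). The template text and the way `ctx_block` is joined are illustrative assumptions, not the contents of the real `qa_prompt_*.txt` files.

```python
from string import Template

# Hypothetical template text standing in for a qa_prompt_<lang>.txt file.
QA_TEMPLATE = Template("Context:\n${text}\n\nQuestion: ${question}\nAnswer:")


def build_prompt(question: str, context: list[str]) -> str:
    """Fill the QA template using the new `text` + `question` schema."""
    ctx_block = "\n\n".join(context)  # assumption: passages joined by blank lines
    return QA_TEMPLATE.substitute(question=question, text=ctx_block)


def uses_old_schema(template: Template) -> bool:
    """Return True if a template still needs a placeholder the new call omits."""
    try:
        template.substitute(question="q", text="t")
    except KeyError:
        # substitute() raises KeyError for any placeholder without a value,
        # e.g. a leftover ${context} or ${num_passages}.
        return True
    return False
```

Because `substitute` raises `KeyError` on a missing placeholder, a check like `uses_old_schema` could be run over every language file to catch templates that were not migrated.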


+    prompt: bool = False,
+) -> str:
+    """Translate text using OpenAI-compatible HTTP API with automatic retries.
+
+    :param text: Text to translate


+logging.basicConfig(level=logging.DEBUG)
+
+
+logger = logging.getLogger()
+logging.getLogger().setLevel(logging.DEBUG)
+
+
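In the hunk above, `logging.basicConfig(level=logging.DEBUG)` and the later `setLevel` call both act on the root logger, so third-party libraries are switched to DEBUG output as well. A minimal sketch of an alternative, assuming the goal is verbose output from this project only: the function name `configure_logging` is hypothetical, and the package name `factuality_eval` is taken from the `src/factuality_eval/` paths in the file table.

```python
import logging


def configure_logging(
    level: int = logging.INFO, package: str = "factuality_eval"
) -> logging.Logger:
    """Set up handlers once and raise verbosity only for the project logger."""
    # basicConfig is a no-op if the root logger already has handlers.
    logging.basicConfig(
        level=logging.WARNING,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    logger = logging.getLogger(package)
    logger.setLevel(level)
    return logger
```

A script would then call `configure_logging(logging.DEBUG)` once at startup, and modules would use `logging.getLogger(__name__)` so their records inherit the package level.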


+    if kwargs.get("ascii") in (None, True):
+        kwargs["ascii"] = _TQDM_ASCII
+    _tqdm_original_init(self, *args, **kwargs)
+
+


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Bump nltk from 3.9.1 to 3.9.4 (#17)

Bumps [nltk](https://github.com/nltk/nltk) from 3.9.1 to 3.9.4.
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](nltk/nltk@3.9.1...3.9.4)

---
updated-dependencies:
- dependency-name: nltk
  dependency-version: 3.9.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Feat/ragtruth (#20)

* Update docker file
* Scripts from KLabs for ragtruth processing
* Ground truth eval without ragtruth
* merge
* Files from ucloud
* Import fix
* Bugfix in prompt util
* Don't shuffle
* Shuffle after split
* Add switch for ragtruth
* Linting
* Linting
* Update readme
* Add summary of ground truth analysis
* Clean up ground truth scripts
* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

FrejaThoresen added 15 commits May 5, 2026 11:37

Update docker file

eb7c6d2

Scripts from KLabs for ragtruth processing

e5da66b

Ground truth eval without ragtruth

5dd643f

merge

48132a3

Files from ucloud

3739f72

Import fix

d92c7f7

Bugfix in prompt util

2c1be3d

Don't shuffle

d553b0a

Shuffle after split

fd7e2e4

Add switch for ragtruth

1105d8d

Linting

5974cf7

Linting

6d2c7d6

Update readme

0c1a310

Add summary of ground truth analysis

d853a87

Clean up ground truth scripts

dcbb284

Copilot AI review requested due to automatic review settings May 13, 2026 13:08

Copilot started reviewing on behalf of FrejaThoresen May 13, 2026 13:09

Copilot AI reviewed May 13, 2026

Potential fix for pull request finding

5cd95d2

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

FrejaThoresen merged commit 57854cc into main May 13, 2026
4 checks passed

FrejaThoresen mentioned this pull request May 26, 2026

RAGtruth integration (#25) #26

Merged

FrejaThoresen merged 16 commits into main from feat/ragtruth