Feat/ragtruth #20

Merged
FrejaThoresen merged 16 commits into main from feat/ragtruth
May 13, 2026
Conversation

@FrejaThoresen
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings May 13, 2026 13:08
Contributor

Copilot AI left a comment

Pull request overview

This PR extends the hallucination-detection pipeline to optionally incorporate translated RAGTruth data alongside synthetic MultiWikiQA hallucinations, while also updating prompt templates and adding several supporting scripts/utilities for generation, evaluation, and development workflows.

Changes:

  • Add RAGTruth preprocessing + translation tooling and integrate translated RAGTruth into the training script as an optional data source.
  • Add new scripts for baseline evaluation, token-level ground-truth evaluation, and producing hallucination-span annotated datasets.
  • Update prompt templates + prompt formatting, introduce logging utilities, and adjust generation/dataset code (including parallelism and dependency updates).

Reviewed changes

Copilot reviewed 53 out of 56 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| uv.lock | Updates locked dependencies (adds debugpy/tenacity/termcolor/timm, bumps openai, adds pillow/torchvision, etc.). |
| src/scripts/translate.py | New script to translate hallucination datasets while attempting to preserve span labels via `<HAL>` tags. |
| src/scripts/train_hallucination_detector.py | Adds optional mixing of translated RAGTruth with synthetic data; updates logging and model naming. |
| src/scripts/preprocess_ragtruth.py | New script to convert raw RAGTruth JSONL into the project’s hallucination-data JSON format. |
| src/scripts/generate_hallucination_dataset.py | New script that generates answers then exports hallucinated spans predicted by the detector. |
| src/scripts/generate_dataset.py | Adjusts output filename scheme and passes configurable max_workers to generation. |
| src/scripts/evaluate_ground_truth.py | New script to validate a detector against synthetic token-level ground-truth labels. |
| src/scripts/detect_hallucinations.py | Formats generated answers into ragtruth-like rows (prompt/answer) before detection. |
| src/scripts/baseline.py | New script running hallucination detection on gold answers as a baseline. |
| src/prompts/qa_prompt_bg.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_bs.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_ca.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_cs.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_da.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_de.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_el.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_en.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_es.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_et.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_fi.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_fo.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_fr.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_hr.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_hu.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_is.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_it.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_lt.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_lv.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_nl.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_no.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_pl.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_pt.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_ro.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_sk.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_sl.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_sq.txt | Adds Albanian QA prompt template using `${text}` + `${question}`. |
| src/prompts/qa_prompt_sr.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_sv.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/prompts/qa_prompt_uk.txt | Updates QA prompt template to use `${text}` + `${question}`. |
| src/factuality_eval/train.py | Adds Albanian language support and introduces format_dataset_to_ragtruth_without_labels. |
| src/factuality_eval/prompt_utils.py | Adds Albanian and changes QA prompt substitution schema to question + text. |
| src/factuality_eval/model_generation.py | Adds long-input guard, adjusts sampling params, lazy-loads local models, and adds GPU max-memory control. |
| src/factuality_eval/logging_utils.py | New logging utilities (stdio capture, colored headers, tqdm styling). |
| src/factuality_eval/hallucination_detection.py | Switches detection to prompt-based API, validates required columns, and guards divide-by-zero in metrics. |
| src/factuality_eval/dataset_generation.py | Adds parallel hallucination generation and refines hallucinated span extraction/normalization. |
| README.md | Replaces literature-review content with end-to-end project documentation and workflows. |
| pyproject.toml | Adds new dependencies and updates ruff ignore settings for long lines in new scripts. |
| mise.toml | Adds mise toolchain config (node + CLI tools). |
| docs/research.md | Adds research notes document replacing the previous literature study doc. |
| docs/literature_study.md | Removes prior literature-study markdown file. |
| Dockerfile | Replaces runtime container with a mise/uv-based dev container that launches opencode. |
| docker-compose.yml | Adds compose service for the new Docker-based dev workflow. |
| config/hallucination_detection.yaml | Adds max_workers, updates epochs/model defaults, and enables ragtruth + multiwikiqa with a ragtruth path. |
| .pre-commit-config.yaml | Bumps pre-commit hook revisions (ruff/nbstripout/mypy). |
| .gitignore | Ignores data/ground_truth/* while keeping a .gitkeep. |

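The divide-by-zero guard noted for src/factuality_eval/hallucination_detection.py can be illustrated with a minimal sketch. The function name `safe_prf` and its count-based signature are hypothetical and are not taken from the repository; the sketch only shows the general pattern of returning 0.0 when a denominator would be zero.

```python
def safe_prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1), using 0.0 whenever a denominator is zero.

    Hypothetical helper for illustration; not the project's actual metric code.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# A sample with no predicted and no gold spans would otherwise raise ZeroDivisionError.
empty_case = safe_prf(tp=0, fp=0, fn=0)
balanced_case = safe_prf(tp=1, fp=1, fn=1)
```
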
Comments suppressed due to low confidence (1)

src/scripts/translate.py:359

  • labels is reset to [] here, and span reconstruction later uses sample.labels, discarding the merged/sorted labels used to insert <HAL> tags. This breaks round-tripping when overlaps were merged (and can misalign tag order vs label order). Keep the merged/sorted label list and pass it into find_hallucination_tags.
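The round-trip problem described in this comment can be sketched as follows. This is an illustration under assumptions, not the code in src/scripts/translate.py: spans are modelled as plain `(start, end)` tuples, and the helpers `merge_spans`, `insert_tags`, and `extract_spans` are hypothetical names. The point is that the spans recovered from the tagged text line up with the merged list, not with the original overlapping labels.

```python
import re

OPEN, CLOSE = "<HAL>", "</HAL>"


def merge_spans(spans):
    """Sort (start, end) spans and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def insert_tags(text, spans):
    """Wrap each span in <HAL> tags, right to left so earlier offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + OPEN + text[start:end] + CLOSE + text[end:]
    return text


def extract_spans(tagged):
    """Strip the tags and return (plain_text, spans) in plain-text offsets."""
    spans, parts, pos, length = [], [], 0, 0
    for match in re.finditer(r"<HAL>(.*?)</HAL>", tagged, flags=re.DOTALL):
        before = tagged[pos:match.start()]
        inner = match.group(1)
        length += len(before)
        spans.append((length, length + len(inner)))
        length += len(inner)
        parts.extend([before, inner])
        pos = match.end()
    parts.append(tagged[pos:])
    return "".join(parts), spans


text = "The tower is 500 m tall and was built in 1700."
original = [(13, 18), (16, 23)]  # two overlapping labels
merged = merge_spans(original)   # one span covering "500 m tall"
tagged = insert_tags(text, merged)
plain, recovered = extract_spans(tagged)
```

Because `recovered` has one entry while `original` has two, pairing recovered tags with the unmerged labels by position would misalign them, which is why the merged list needs to be carried through to the reconstruction step.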

Comment thread src/factuality_eval/dataset_generation.py Outdated
Comment on lines +74 to +75
    if input_length > 8192:
        return ""
Collaborator Author

@copilot apply changes based on this feedback

Comment thread src/scripts/baseline.py
Comment on lines +62 to +66
target_dataset_name = f"{config.base_dataset.id}-synthetic-hallucinations"
hallucination_detector_hugging_face_path = (
    f"{config.hub_organisation}/"
    f"{config.models.hallu_detect_model}-{target_dataset_name}-{config.language}"
)
Comment on lines +298 to 302
def get_hallucinated_labels(
    hallucinated_dict: dict,
) -> tuple[list[dict], list[str]] | None:
    """Get the hallucinated labels from the generation result.

Comment thread src/scripts/translate.py
Comment on lines +107 to +111
api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_BASE_URL") or "https://api.openai.com/v1"
print(api_key)
print(base_url)
return {
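The hunk above prints the API key in plain text, which would expose it in console output and captured logs. A minimal sketch of one alternative is shown below; `mask_secret` is a hypothetical helper, and routing the output through `logging` at DEBUG level is an assumption for illustration, not what the merged code does.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)


def mask_secret(value: Optional[str], visible: int = 4) -> str:
    """Return a log-safe form of a secret showing only the last few characters."""
    if not value:
        return "<unset>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_BASE_URL") or "https://api.openai.com/v1"

# Log whether the key is configured without revealing its full value.
logger.debug("OpenAI API key: %s", mask_secret(api_key))
logger.debug("OpenAI base URL: %s", base_url)
```
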
Collaborator Author

@copilot apply changes based on this feedback

Comment on lines +34 to +38
def _build_detector_model_path(config: DictConfig) -> str:
    target_dataset_name = f"{config.base_dataset.id}-synthetic-hallucinations"
    return (
        f"{config.hub_organisation}/"
        f"{config.models.hallu_detect_model}-{target_dataset_name}-{config.language}"
Comment on lines 155 to +156
  tmpl = PromptUtils.load_prompt(f"qa_prompt_{lang.lower()}.txt")
- return tmpl.substitute(
-     question=question, num_passages=len(context), context=ctx_block
- )
+ return tmpl.substitute(question=question, text=ctx_block)
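The effect of switching the substitution keys to `question` and `text` can be sketched with the standard library's `string.Template`. This assumes `PromptUtils.load_prompt` returns a `string.Template` (suggested by the `${text}` placeholders and the `.substitute` call, but not confirmed here); the template text itself is made up for the example.

```python
from string import Template

# Hypothetical template body using the ${text} + ${question} placeholders.
tmpl = Template("Context:\n${text}\n\nQuestion: ${question}\nAnswer:")

prompt = tmpl.substitute(
    question="Who wrote Hamlet?",
    text="Hamlet is a play by William Shakespeare.",
)

# substitute() raises KeyError when a placeholder has no matching keyword,
# so passing the old `context=` key against a ${text} template fails loudly.
try:
    tmpl.substitute(question="Who wrote Hamlet?", context="...")
    missing = None
except KeyError as err:
    missing = err.args[0]
```

Extra keywords are ignored by `substitute`, so a template/keyword mismatch only surfaces through the missing placeholder, as captured in `missing` above.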
Comment thread src/scripts/translate.py
Comment on lines +137 to +141
    prompt: bool = False,
) -> str:
    """Translate text using OpenAI-compatible HTTP API with automatic retries.

    :param text: Text to translate
Comment on lines +13 to +19
logging.basicConfig(level=logging.DEBUG)


logger = logging.getLogger()
logging.getLogger().setLevel(logging.DEBUG)
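The hunk above sets the root logger to DEBUG twice, which also turns on debug output from every third-party library that logs through the root. A minimal sketch of a narrower setup follows; the helper name `get_project_logger` and the logger name `factuality_eval` are assumptions made for illustration.

```python
import logging


def get_project_logger(
    name: str = "factuality_eval", level: int = logging.INFO
) -> logging.Logger:
    """Return a named logger at the given level, leaving the root logger untouched."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep messages from being emitted again by root handlers
    return logger


logger = get_project_logger(level=logging.DEBUG)
logger.debug("Project logging configured")
```
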


Comment on lines +29 to +33
    if kwargs.get("ascii") in (None, True):
        kwargs["ascii"] = _TQDM_ASCII
    _tqdm_original_init(self, *args, **kwargs)


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@FrejaThoresen FrejaThoresen merged commit 57854cc into main May 13, 2026
4 checks passed
FrejaThoresen added a commit that referenced this pull request May 26, 2026
* Bump nltk from 3.9.1 to 3.9.4 (#17)

Bumps [nltk](https://github.com/nltk/nltk) from 3.9.1 to 3.9.4.
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](nltk/nltk@3.9.1...3.9.4)

---
updated-dependencies:
- dependency-name: nltk
  dependency-version: 3.9.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Feat/ragtruth (#20)

* Update docker file

* Scripts from KLabs for ragtruth processing

* Ground truth eval without ragtruth

* merge

* Files from ucloud

* Import fix

* Bugfix in prompt util

* Don't shuffle

* Shuffle after split

* Add switch for ragtruth

* Linting

* Linting

* Update readme

* Add summary of ground truth analysis

* Clean up ground truth scripts

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2 participants