Feat/ragtruth #20
Merged
Pull request overview
This PR extends the hallucination-detection pipeline to optionally incorporate translated RAGTruth data alongside synthetic MultiWikiQA hallucinations, while also updating prompt templates and adding several supporting scripts/utilities for generation, evaluation, and development workflows.
Changes:
- Add RAGTruth preprocessing + translation tooling and integrate translated RAGTruth into the training script as an optional data source.
- Add new scripts for baseline evaluation, token-level ground-truth evaluation, and producing hallucination-span annotated datasets.
- Update prompt templates + prompt formatting, introduce logging utilities, and adjust generation/dataset code (including parallelism and dependency updates).
Reviewed changes
Copilot reviewed 53 out of 56 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| uv.lock | Updates locked dependencies (adds debugpy/tenacity/termcolor/timm, bumps openai, adds pillow/torchvision, etc.). |
| src/scripts/translate.py | New script to translate hallucination datasets while attempting to preserve span labels via <HAL> tags. |
| src/scripts/train_hallucination_detector.py | Adds optional mixing of translated RAGTruth with synthetic data; updates logging and model naming. |
| src/scripts/preprocess_ragtruth.py | New script to convert raw RAGTruth JSONL into the project’s hallucination-data JSON format. |
| src/scripts/generate_hallucination_dataset.py | New script that generates answers then exports hallucinated spans predicted by the detector. |
| src/scripts/generate_dataset.py | Adjusts output filename scheme and passes configurable max_workers to generation. |
| src/scripts/evaluate_ground_truth.py | New script to validate a detector against synthetic token-level ground-truth labels. |
| src/scripts/detect_hallucinations.py | Formats generated answers into ragtruth-like rows (prompt/answer) before detection. |
| src/scripts/baseline.py | New script running hallucination detection on gold answers as a baseline. |
| src/prompts/qa_prompt_bg.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_bs.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_ca.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_cs.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_da.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_de.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_el.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_en.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_es.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_et.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_fi.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_fo.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_fr.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_hr.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_hu.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_is.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_it.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_lt.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_lv.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_nl.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_no.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_pl.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_pt.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_ro.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_sk.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_sl.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_sq.txt | Adds Albanian QA prompt template using ${text} + ${question}. |
| src/prompts/qa_prompt_sr.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_sv.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/prompts/qa_prompt_uk.txt | Updates QA prompt template to use ${text} + ${question}. |
| src/factuality_eval/train.py | Adds Albanian language support and introduces format_dataset_to_ragtruth_without_labels. |
| src/factuality_eval/prompt_utils.py | Adds Albanian and changes QA prompt substitution schema to question + text. |
| src/factuality_eval/model_generation.py | Adds long-input guard, adjusts sampling params, lazy-loads local models, and adds GPU max-memory control. |
| src/factuality_eval/logging_utils.py | New logging utilities (stdio capture, colored headers, tqdm styling). |
| src/factuality_eval/hallucination_detection.py | Switches detection to prompt-based API, validates required columns, and guards divide-by-zero in metrics. |
| src/factuality_eval/dataset_generation.py | Adds parallel hallucination generation and refines hallucinated span extraction/normalization. |
| README.md | Replaces literature-review content with end-to-end project documentation and workflows. |
| pyproject.toml | Adds new dependencies and updates ruff ignore settings for long lines in new scripts. |
| mise.toml | Adds mise toolchain config (node + CLI tools). |
| docs/research.md | Adds research notes document replacing the previous literature study doc. |
| docs/literature_study.md | Removes prior literature-study markdown file. |
| Dockerfile | Replaces runtime container with a mise/uv-based dev container that launches opencode. |
| docker-compose.yml | Adds compose service for the new Docker-based dev workflow. |
| config/hallucination_detection.yaml | Adds max_workers, updates epochs/model defaults, and enables ragtruth + multiwikiqa with a ragtruth path. |
| .pre-commit-config.yaml | Bumps pre-commit hook revisions (ruff/nbstripout/mypy). |
| .gitignore | Ignores data/ground_truth/* while keeping a .gitkeep. |
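For reference, the divide-by-zero guard mentioned for `src/factuality_eval/hallucination_detection.py` could look roughly like the sketch below. The helper names `safe_div` and `precision_recall_f1` are hypothetical and not taken from the PR; the sketch only assumes that the metrics are derived from true-positive, false-positive, and false-negative counts.

```python
from typing import Dict


def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 without raising on empty counts.

    Hypothetical helper: the real metric code in the PR may differ.
    """
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

With this shape, a sample with no predicted and no gold hallucinations yields zeros instead of a `ZeroDivisionError`. Whether zero is the right convention for that case is a project decision.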
Comments suppressed due to low confidence (1)
src/scripts/translate.py:359
`labels` is reset to `[]` here, and span reconstruction later uses `sample.labels`, discarding the merged/sorted labels used to insert `<HAL>` tags. This breaks round-tripping when overlaps were merged (and can misalign tag order vs label order). Keep the merged/sorted label list and pass it into `find_hallucination_tags`.
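To illustrate why the merged list matters, here is a minimal sketch of tagging and recovering spans. It is not the code in `translate.py`: `merge_spans`, `insert_tags`, and `extract_spans` are hypothetical names, and spans are assumed to be `(start, end)` character offsets with an exclusive end.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def insert_tags(text: str, spans: List[Span]) -> str:
    """Wrap each span in <HAL>...</HAL>, right to left so offsets stay valid."""
    for start, end in reversed(spans):
        text = text[:start] + "<HAL>" + text[start:end] + "</HAL>" + text[end:]
    return text


def extract_spans(tagged: str) -> List[Span]:
    """Recover span offsets, relative to the untagged text, from a tagged string."""
    spans: List[Span] = []
    clean_len = 0  # length of untagged text consumed so far
    pos = 0        # position in the tagged string
    for match in re.finditer(r"<HAL>(.*?)</HAL>", tagged, flags=re.DOTALL):
        clean_len += match.start() - pos
        start = clean_len
        clean_len += len(match.group(1))
        spans.append((start, clean_len))
        pos = match.end()
    return spans
```

In this sketch the round trip `extract_spans(insert_tags(text, merged))` returns `merged`, not the raw overlapping input. Any later step that compares tags against labels therefore needs the merged, sorted list.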
Comment on lines +74 to +75

    if input_length > 8192:
        return ""
@copilot apply changes based on this feedback
Comment on lines +62 to +66

    target_dataset_name = f"{config.base_dataset.id}-synthetic-hallucinations"
    hallucination_detector_hugging_face_path = (
        f"{config.hub_organisation}/"
        f"{config.models.hallu_detect_model}-{target_dataset_name}-{config.language}"
    )
Comment on lines +298 to 302

    def get_hallucinated_labels(
        hallucinated_dict: dict,
    ) -> tuple[list[dict], list[str]] | None:
        """Get the hallucinated labels from the generation result.
Comment on lines +107 to +111

    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL") or "https://api.openai.com/v1"
    print(api_key)
    print(base_url)
    return {
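The snippet above prints the API key in clear text, so it ends up in terminals and job logs. A hedged sketch of one alternative is to log only a masked form. `mask_secret` and `load_api_settings` are hypothetical names, and the returned dictionary keys are assumptions because the original return value is cut off in this view.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)


def mask_secret(value: Optional[str], visible: int = 4) -> str:
    """Return a loggable form of a secret that shows only its last characters."""
    if not value:
        return "<unset>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def load_api_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read API settings from the environment without echoing the raw key.

    The keys of the returned dict are assumptions made for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    base_url = env.get("OPENAI_BASE_URL") or "https://api.openai.com/v1"
    logger.debug("API key: %s, base URL: %s", mask_secret(api_key), base_url)
    return {"api_key": api_key, "base_url": base_url}
```

Passing `env` explicitly keeps the function easy to exercise in tests without touching the real environment.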
@copilot apply changes based on this feedback
Comment on lines +34 to +38

    def _build_detector_model_path(config: DictConfig) -> str:
        target_dataset_name = f"{config.base_dataset.id}-synthetic-hallucinations"
        return (
            f"{config.hub_organisation}/"
            f"{config.models.hallu_detect_model}-{target_dataset_name}-{config.language}"
Comment on lines 155 to +156

      tmpl = PromptUtils.load_prompt(f"qa_prompt_{lang.lower()}.txt")
    - return tmpl.substitute(
    -     question=question, num_passages=len(context), context=ctx_block
    - )
    + return tmpl.substitute(question=question, text=ctx_block)
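As a small illustration of why the call and the prompt files have to change together, the sketch below assumes that `PromptUtils.load_prompt` returns a `string.Template`, which the `${text}` and `${question}` placeholders and the `.substitute` call suggest but this view does not confirm. The template text and the `missing_key` helper are made up for the example.

```python
from string import Template
from typing import Optional

# Hypothetical template body using the new placeholder names.
tmpl = Template("Context:\n${text}\n\nQuestion: ${question}\nAnswer:")

prompt = tmpl.substitute(
    question="Who wrote it?",
    text="Passage one.\n\nPassage two.",
)


def missing_key(template: Template, **values: str) -> Optional[str]:
    """Return the first placeholder that has no value, or None if all are filled."""
    try:
        template.substitute(**values)
    except KeyError as exc:
        return exc.args[0]
    return None
```

`Template.substitute` ignores extra keyword arguments but raises `KeyError` for a placeholder without a value. Calling a `${text}` template with the old `context=` argument therefore fails, while a leftover `num_passages=` would pass silently.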
Comment on lines +137 to +141

        prompt: bool = False,
    ) -> str:
        """Translate text using OpenAI-compatible HTTP API with automatic retries.

        :param text: Text to translate
Comment on lines +13 to +19

    logging.basicConfig(level=logging.DEBUG)


    logger = logging.getLogger()
    logging.getLogger().setLevel(logging.DEBUG)
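Setting the root logger to `DEBUG` at import time, as in the snippet above, raises the verbosity of every library in the process. One common alternative, sketched here under the assumption that each script has a single entry point, is a named module logger plus a configuration function called once at startup. The logger name and `configure_logging` are hypothetical.

```python
import logging

# Module-level logger; the name is illustrative only.
logger = logging.getLogger("factuality_eval.example")


def configure_logging(verbose: bool = False) -> None:
    """Configure the root logger once, from the script entry point."""
    level = logging.DEBUG if verbose else logging.INFO
    # basicConfig is a no-op if the root logger already has handlers.
    logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    logging.getLogger().setLevel(level)


if __name__ == "__main__":
    configure_logging(verbose=False)
    logger.info("visible at the default level")
    logger.debug("hidden unless verbose=True")
```

Debug output then becomes opt-in per run instead of being fixed at import.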
Comment on lines +29 to +33

    if kwargs.get("ascii") in (None, True):
        kwargs["ascii"] = _TQDM_ASCII
    _tqdm_original_init(self, *args, **kwargs)
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot stopped work on behalf of FrejaThoresen due to an error (May 13, 2026 13:28)
Copilot stopped work on behalf of FrejaThoresen due to an error (May 13, 2026 13:30)
FrejaThoresen added a commit that referenced this pull request (May 26, 2026)

    * Bump nltk from 3.9.1 to 3.9.4 (#17)

    Bumps [nltk](https://github.com/nltk/nltk) from 3.9.1 to 3.9.4.
    - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
    - [Commits](nltk/nltk@3.9.1...3.9.4)

    ---
    updated-dependencies:
    - dependency-name: nltk
      dependency-version: 3.9.4
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

    * Feat/ragtruth (#20)

    * Update docker file
    * Scripts from KLabs for ragtruth processing
    * Ground truth eval without ragtruth
    * merge
    * Files from ucloud
    * Import fix
    * Bugfix in prompt util
    * Don't shuffle
    * Shuffle after split
    * Add switch for ragtruth
    * Linting
    * Linting
    * Update readme
    * Add summary of ground truth analysis
    * Clean up ground truth scripts
    * Potential fix for pull request finding

    Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

    ---------

    Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

    ---------

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
FrejaThoresen added a commit that referenced this pull request (May 26, 2026)

    * Bump nltk from 3.9.1 to 3.9.4 (#17)

    Bumps [nltk](https://github.com/nltk/nltk) from 3.9.1 to 3.9.4.
    - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
    - [Commits](nltk/nltk@3.9.1...3.9.4)

    ---
    updated-dependencies:
    - dependency-name: nltk
      dependency-version: 3.9.4
      dependency-type: direct:production
    ...

    * Feat/ragtruth (#20)

    * Update docker file
    * Scripts from KLabs for ragtruth processing
    * Ground truth eval without ragtruth
    * merge
    * Files from ucloud
    * Import fix
    * Bugfix in prompt util
    * Don't shuffle
    * Shuffle after split
    * Add switch for ragtruth
    * Linting
    * Linting
    * Update readme
    * Add summary of ground truth analysis
    * Clean up ground truth scripts
    * Potential fix for pull request finding

    ---------

    ---------

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>