Skip to content

fix: AutoRAG context correctness always 0 for .txt input documents#69

Merged
openshift-merge-bot[bot] merged 2 commits into
opendatahub-io:mainfrom
witold-nowogorski:main
May 22, 2026
Merged

fix: AutoRAG context correctness always 0 for .txt input documents#69
openshift-merge-bot[bot] merged 2 commits into
opendatahub-io:mainfrom
witold-nowogorski:main

Conversation

@witold-nowogorski
Copy link
Copy Markdown

@witold-nowogorski witold-nowogorski commented May 12, 2026

Related issue: https://redhat.atlassian.net/browse/RHOAIENG-60783

Description of your changes:

Two bugs caused context_correctness to score 0 when using .txt input documents:

  1. text_extraction appended .md to every output file, so hash_0.txt became hash_0.txt.md. ai4rag then added another .md internally, producing hash_0.txt.md.md as the document ID - mismatching the benchmark's hash_0.txt.
  2. documents_discovery sampling priority compared full S3 keys (e.g. docs/hash_0.txt) against bare benchmark filenames (hash_0.txt), so priority sorting never worked.

Changes

  • text_extraction: .txt files now keep their original extension; other formats still produce .md.
  • documents_indexing / search_space_preparation / rag_templates_optimization: document ID derivation updated to match - stem for .md files, name for .txt files.
  • documents_discovery: sampling sort now uses Path(c["Key"]).name to match against benchmark document IDs.
  • Tests: existing tests updated; 12 new sampling tests added.

Checklist:

Pre-Submission Checklist

Additional Checklist Items for New or Updated Components/Pipelines

  • metadata.yaml includes fresh lastVerified timestamp
  • All required files
    are present and complete
  • OWNERS file lists appropriate maintainers
  • README provides clear documentation with usage examples
  • Component follows snake_case naming convention
  • No security vulnerabilities in dependencies
  • Containerfile included if using a custom base image

Summary by CodeRabbit

  • New Features

    • Preserve plain-text (.txt) files in the document pipeline and discover Markdown files case‑insensitively.
  • Bug Fixes

    • Corrected document identifier handling so non‑Markdown filenames keep their extensions.
    • Document sampling now respects provided test-data ordering when prioritizing documents.
  • Tests

    • Expanded tests for sampling behavior, size caps, and test-data prioritization.

Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

The PR updates document discovery to derive basenames from S3 keys and sort discovered files by matching test-data basenames. Text extraction now fast-paths .txt inputs (writes original .txt name, returns) and tests reflect that naming. Indexing discovery uses iterdir() with case-insensitive .md filtering. Training helpers now set metadata["document_id"] = stem for .md and name (with extension) for other files. New unit tests validate sampling, prioritization, and extraction behaviors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security findings

  • Path traversal / untrusted S3 keys (CWE-22): code uses Path(c["Key"]).name to extract basenames from S3 keys. If S3 keys can contain crafted separators or traversal-like sequences, downstream filename-derived values may be unsafe. Action: validate or normalize S3 keys before use, and restrict allowed characters/patterns; enforce bucket/key policies to reject atypical keys.

  • Insufficient input validation for document_id (CWE-20): document_id is derived directly from filenames (stem or name) without sanitization. Action: normalize and validate document_id (e.g., whitelist characters, length limits, percent-encode or escape dangerous characters) before storing or using in queries/filepaths.

  • Case-sensitive extension checks (CWE-20): suffix comparisons rely on exact-case matches; mixed-case extensions may be misclassified. Action: perform suffix.lower() comparisons (e.g., path.suffix.lower() == ".md" / ".txt") wherever extension branching occurs.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main fix: resolving a bug where AutoRAG context_correctness scored 0 for .txt input documents.
Description check ✅ Passed The description thoroughly explains both bugs, their root causes, and all code changes across multiple components. It clearly links to the related issue and includes comprehensive technical detail.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
components/data_processing/autorag/documents_discovery/tests/test_component_unit.py (1)

305-395: ⚡ Quick win

Add a regression test for duplicate basenames across different prefixes.

Current priority tests don’t cover a/important.txt + b/important.txt with benchmark ID important.txt. That case is where basename matching becomes ambiguous and can break sampling correctness; lock this with an explicit expected behavior test (fail fast or deterministic tie-break).

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/data_processing/autorag/documents_discovery/tests/test_component_unit.py`
around lines 305 - 395, Add a regression test named something like
test_test_data_duplicate_basenames that uses the same pattern as the other tests
and verifies behavior when two S3 keys share the same basename (e.g.,
"a/important.txt" and "b/important.txt") while test_data references
"important.txt"; call the existing helper _run with those contents and
test_data_json=[{"question":"q","correct_answer_document_ids":["important.txt"]}]
and sampling_max_size such that only one doc is selected, then assert
payload["count"] == 1 and that the single selected
payload["documents"][0]["key"] is either "a/important.txt" or "b/important.txt"
(i.e., deterministic single selection), which ensures duplicate-basename
handling is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/data_processing/autorag/documents_discovery/component.py`:
- Line 109: The sort currently checks only Path(c["Key"]).name and does a list
membership inside the lambda which is both ambiguous (different S3 objects can
share the same basename) and inefficient; fix by precomputing the exact set of
S3 keys that should be treated as test hits (e.g. test_keys_set = {c["Key"] for
c in supported_files if Path(c["Key"]).name in test_data_docs_names}) and then
replace the sort key with supported_files.sort(key=lambda c: c["Key"] in
test_keys_set) so membership is O(1) and matching is done against the full S3
key (use the existing symbols supported_files, Path, c["Key"], and
test_data_docs_names).

In `@components/data_processing/autorag/documents_indexing/component.py`:
- Line 132: The current paths collection (variable paths in
documents_indexing/component.py) uses case-sensitive glob patterns
("*.md","*.txt") against Path(extracted_text.path) and thus skips files like
REPORT.TXT; update the logic that builds paths to include case-insensitive
suffix matching by enumerating files (e.g., iterate
Path(extracted_text.path).glob("*") or use rglob) and then filter by
p.suffix.lower() in (" .md",". .txt") (or equivalent) so any uppercase or
mixed-case extensions are accepted; ensure you keep the sorted(...) wrapper and
continue to reference extracted_text.path and the same paths variable so
downstream code is unchanged.

---

Nitpick comments:
In
`@components/data_processing/autorag/documents_discovery/tests/test_component_unit.py`:
- Around line 305-395: Add a regression test named something like
test_test_data_duplicate_basenames that uses the same pattern as the other tests
and verifies behavior when two S3 keys share the same basename (e.g.,
"a/important.txt" and "b/important.txt") while test_data references
"important.txt"; call the existing helper _run with those contents and
test_data_json=[{"question":"q","correct_answer_document_ids":["important.txt"]}]
and sampling_max_size such that only one doc is selected, then assert
payload["count"] == 1 and that the single selected
payload["documents"][0]["key"] is either "a/important.txt" or "b/important.txt"
(i.e., deterministic single selection), which ensures duplicate-basename
handling is covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 786d06a2-b1d2-4040-9a9e-f8233c5cc98f

📥 Commits

Reviewing files that changed from the base of the PR and between 06379b9 and 9f220bb.

📒 Files selected for processing (7)
  • components/data_processing/autorag/documents_discovery/component.py
  • components/data_processing/autorag/documents_discovery/tests/test_component_unit.py
  • components/data_processing/autorag/documents_indexing/component.py
  • components/data_processing/autorag/text_extraction/component.py
  • components/data_processing/autorag/text_extraction/tests/test_component_unit.py
  • components/training/autorag/rag_templates_optimization/component.py
  • components/training/autorag/search_space_preparation/component.py

Comment thread components/data_processing/autorag/documents_discovery/component.py Outdated
Comment thread components/data_processing/autorag/documents_indexing/component.py Outdated
Copy link
Copy Markdown

@DorotaDR DorotaDR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@witold-nowogorski Please respond to the comments from CodeRabbit; after that, I can approve the PR.

@DorotaDR
Copy link
Copy Markdown

Please also share the results of the integration / e2e AutoRAG tests showing that the context correctness is non-zero.

Copy link
Copy Markdown

@filip-komarzyniec filip-komarzyniec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR seems to be introducing many if clauses differentiating between txt and md files extensions.
I'm not really sure this is the most optimal approach. I've left some comments to think through -- maybe we can make it work without so many if-s as they complicate the overall logic and open doors for more bugs.

Please also link a successful test run (or any other sign that the proposed changes were tested).

test_data_docs_names = get_test_data_docs_names()
if test_data_docs_names:
supported_files.sort(key=lambda c: c["Key"] not in test_data_docs_names)
supported_files.sort(key=lambda c: Path(c["Key"]).name not in test_data_docs_names)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what CodeRabbit points out here is that the logic might be incorrect when there are files with the same base names (which in some cases will be valid in S3 storage).

How do we ensure we correctly process same-named files? This might be worth addressing somehow.

The example can be the following folder structure:

documents/
├─ sub_folder1/
│  ├─ document1.txt   # this one is included in the benchmark.json file
├─ sub_folder2/
│  ├─ document1.txt  # this one is NOT included in the benchmark.json file

)

paths = sorted(Path(extracted_text.path).glob("*.md"))
paths = sorted(p for ext in ("*.md", "*.txt") for p in Path(extracted_text.path).glob(ext))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this is not accurate or docstrings are out-of-date. In the extracted_text folder do we store only *md files or *.txt as well?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text_extraction writes all outputs as *.md (.txt files are copied as-is but saved with a .md extension). The docstring has been updated to reflect this.

Document(
page_content=p.read_text(encoding="utf-8", errors="replace"),
metadata={"document_id": p.stem},
metadata={"document_id": p.stem if p.suffix == ".md" else p.name},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly do we treat *.md files differently than *.txt ones?

If this is because some other component appends or not extensions then I'd say we change the other component so that it correctly decided whether to append extension or not.

The current approach is complicated and creates weird coupling dependant on the other component's internal logic.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause was text_extraction preserving the .txt extension for plain-text files while appending .md for everything else. I changed it to write all outputs uniformly as <original_name>.md, which removed the need for any conditional in documents_indexing.

Comment on lines +237 to +242
if input_file.suffix.lower() == ".txt":
output_file = output_dir / input_file.name
output_file.write_text(input_file.read_text(encoding="utf-8"), encoding="utf-8")
return True, None

output_file = output_dir / f"{input_file.name}.md"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we keep differentiating between *.txt and *.md files?

Wouldn't it be easier to write everything as *.md? What we wanted to achieve here is not to docling-process *.txt files as it makes no sense (and afaik is not really possible).

Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/data_processing/autorag/documents_indexing/component.py (1)

132-133: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

.txt artifacts are excluded from indexing.

This filter only includes .md; preserved .txt outputs won’t be indexed, which breaks txt benchmark matching and can drive context_correctness back to zero.

Proposed fix
-    paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md")
+    paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() in {".md", ".txt"})
@@
-                metadata={"document_id": p.stem},
+                metadata={"document_id": p.stem if p.suffix.lower() == ".md" else p.name},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/data_processing/autorag/documents_indexing/component.py` around
lines 132 - 133, The current file-filtering in the documents indexing code
(variables base and paths) only selects ".md" files, which excludes preserved
".txt" artifacts; update the filter used where paths is computed so it accepts
both ".md" and ".txt" (case-insensitive) — e.g., check p.suffix.lower() in a set
or tuple like (".md", ".txt") — ensuring files with either extension are
included and the rest of the logic that iterates over paths (the document
indexing flow) continues to work unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@components/data_processing/autorag/text_extraction/tests/test_component_unit.py`:
- Line 477: Update the unit test assertions that currently expect legacy
".txt.md" filenames (e.g. the assertion using output_files[0].name ==
"file1.txt.md") to validate the new mixed output behavior: assert that one of
the outputs is the Markdown derivative (endswith ".md") and that the original
text file is preserved with a ".txt" filename (e.g. check for any item in
output_files with name == "file1.txt" or .endswith(".txt")). Apply the same
change to the other occurrences noted around the test (the assertions at the
blocks near lines 663-665 and 674-675) so tests accept a set containing both the
.md and the .txt expected names rather than forcing ".txt.md".

---

Duplicate comments:
In `@components/data_processing/autorag/documents_indexing/component.py`:
- Around line 132-133: The current file-filtering in the documents indexing code
(variables base and paths) only selects ".md" files, which excludes preserved
".txt" artifacts; update the filter used where paths is computed so it accepts
both ".md" and ".txt" (case-insensitive) — e.g., check p.suffix.lower() in a set
or tuple like (".md", ".txt") — ensuring files with either extension are
included and the rest of the logic that iterates over paths (the document
indexing flow) continues to work unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3e191882-c0ab-46e7-bfcd-9df89bf40caf

📥 Commits

Reviewing files that changed from the base of the PR and between 9f220bb and cb97cbe.

📒 Files selected for processing (4)
  • components/data_processing/autorag/documents_discovery/component.py
  • components/data_processing/autorag/documents_indexing/component.py
  • components/data_processing/autorag/text_extraction/component.py
  • components/data_processing/autorag/text_extraction/tests/test_component_unit.py
💤 Files with no reviewable changes (1)
  • components/data_processing/autorag/text_extraction/component.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/data_processing/autorag/documents_discovery/component.py

@LukaszCmielowski
Copy link
Copy Markdown

@filip-komarzyniec please finalize the review.

test_data_docs_names = get_test_data_docs_names()
if test_data_docs_names:
supported_files.sort(key=lambda c: c["Key"] not in test_data_docs_names)
test_keys_set = {c["Key"] for c in supported_files if Path(c["Key"]).name in test_data_docs_names}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what CodeRabbit pointed out here is that the logic might be incorrect when there are files with the same base names (which in some cases will be valid in S3 storage).

How do we ensure we correctly process same-named files? This might be worth addressing somehow.

The example can be the following folder structure:

documents/
├─ sub_folder1/
│  ├─ document1.txt
├─ sub_folder2/
│  ├─ document1.txt


paths = sorted(Path(extracted_text.path).glob("*.md"))
base = Path(extracted_text.path)
paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If extracted_text directory stores only *.md files now then why do we need and p.suffix.lower() == ".md" check?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be redundant but I will preserve it just to make sure we are loading markdowns only in case something else would appear in the extracted_text directory, such as some OS artifacts, some metadata files etc.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense!

if path.is_dir():
for doc_path in path.iterdir():
with doc_path.open("r", encoding="utf-8") as doc:
doc_id = doc_path.stem if doc_path.suffix == ".md" else doc_path.name
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, why do we need the *.md checks now?

According to discussion in some previous comments all files in extracted_text should be markdown

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's redundant today, but cheap enough to keep as a guard against unexpected files appearing in extracted_text (OS artifacts, future changes or something dropped silently by text extraction process).

@filip-komarzyniec
Copy link
Copy Markdown

filip-komarzyniec commented May 20, 2026

a couple more comments but nothing really severe. Apart from that the changes /lgtm

@LukaszCmielowski
Copy link
Copy Markdown

@witold-nowogorski can we have those addressed so I could merge ?.

@LukaszCmielowski
Copy link
Copy Markdown

/lgtm

@LukaszCmielowski
Copy link
Copy Markdown

/ok-to-test

@LukaszCmielowski
Copy link
Copy Markdown

/retest

@witold-nowogorski
Copy link
Copy Markdown
Author

/retest

@LukaszCmielowski
Copy link
Copy Markdown

/ok-to-test cancel

@LukaszCmielowski
Copy link
Copy Markdown

/ok-to-test

@LukaszCmielowski
Copy link
Copy Markdown

/approve

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 22, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LukaszCmielowski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit a028d33 into opendatahub-io:main May 22, 2026
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants