fix: AutoRAG context correctness always 0 for .txt input documents by witold-nowogorski · Pull Request #69 · opendatahub-io/pipelines-components

witold-nowogorski · 2026-05-12T07:51:19Z

Related issue: https://redhat.atlassian.net/browse/RHOAIENG-60783

Description of your changes:

Two bugs caused context_correctness to score 0 when using .txt input documents:

text_extraction appended .md to every output file, so hash_0.txt became hash_0.txt.md. ai4rag then added another .md internally, producing hash_0.txt.md.md as the document ID, which did not match the benchmark's hash_0.txt.
documents_discovery sampling priority compared full S3 keys (e.g. docs/hash_0.txt) against bare benchmark filenames (hash_0.txt), so the comparison never matched and priority sorting had no effect.
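
Both mismatches can be reproduced with plain string and `pathlib` operations. The sketch below uses the file names from the description; the variable names are illustrative, and the second `.md` is modelled as a simple string append because ai4rag's internals are not shown in this PR.

```python
from pathlib import Path

# Document IDs as they appear in the benchmark (taken from the description).
benchmark_ids = {"hash_0.txt"}

# Bug 1: suffixes stack up, so the derived ID never equals the benchmark ID.
extracted_name = "hash_0.txt" + ".md"   # what text_extraction wrote
derived_id = extracted_name + ".md"     # assumption: a second ".md" appended downstream
id_matches = derived_id in benchmark_ids

# Bug 2: a full S3 key is compared against a bare file name.
s3_key = "docs/hash_0.txt"
full_key_matches = s3_key in benchmark_ids
basename_matches = Path(s3_key).name in benchmark_ids

print(derived_id, id_matches, full_key_matches, basename_matches)
```

Only the basename comparison finds the benchmark document, which is the behaviour the `Path(c["Key"]).name` change relies on.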

Changes

text_extraction: .txt files now keep their original extension; other formats still produce .md.
documents_indexing / search_space_preparation / rag_templates_optimization: document ID derivation updated to match - stem for .md files, name for .txt files.
documents_discovery: sampling sort now uses Path(c["Key"]).name to match against benchmark document IDs.
Tests: existing tests updated; 12 new sampling tests added.
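
The document ID rule listed above (stem for .md files, name for .txt files) could look roughly like the following. This is a minimal sketch, not the component code: `derive_document_id` is a hypothetical helper name and the example paths are made up.

```python
from pathlib import Path


def derive_document_id(path: Path) -> str:
    """Hypothetical helper: drop a trailing .md, keep any other extension."""
    # Path.stem removes only the last suffix, so "report.pdf.md" -> "report.pdf".
    return path.stem if path.suffix == ".md" else path.name


md_id = derive_document_id(Path("extracted/report.pdf.md"))
txt_id = derive_document_id(Path("extracted/hash_0.txt"))
print(md_id, txt_id)
```

Note that the suffix comparison here is case-sensitive, as in the first revision of the diff, so a file named `NOTES.MD` would keep its extension.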

Checklist:

Pre-Submission Checklist

All tests and CI checks pass
Pre-commit hooks pass without errors
You have signed off your commits
The title for your pull request (PR) should follow our title convention.
Learn more about the pull request title convention used in this repository.

Additional Checklist Items for New or Updated Components/Pipelines

metadata.yaml includes fresh lastVerified timestamp
All required files are present and complete
OWNERS file lists appropriate maintainers
README provides clear documentation with usage examples
Component follows snake_case naming convention
No security vulnerabilities in dependencies
Containerfile included if using a custom base image

Summary by CodeRabbit

New Features
- Preserve plain-text (.txt) files in the document pipeline and discover Markdown files case‑insensitively.
Bug Fixes
- Corrected document identifier handling so non‑Markdown filenames keep their extensions.
- Document sampling now respects provided test-data ordering when prioritizing documents.
Tests
- Expanded tests for sampling behavior, size caps, and test-data prioritization.

Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>

coderabbitai · 2026-05-12T07:51:49Z

Walkthrough

The PR updates document discovery to derive basenames from S3 keys and sort discovered files by matching test-data basenames. Text extraction now fast-paths .txt inputs (writes original .txt name, returns) and tests reflect that naming. Indexing discovery uses iterdir() with case-insensitive .md filtering. Training helpers now set metadata["document_id"] = stem for .md and name (with extension) for other files. New unit tests validate sampling, prioritization, and extraction behaviors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security findings

Path traversal / untrusted S3 keys (CWE-22): code uses Path(c["Key"]).name to extract basenames from S3 keys. If S3 keys can contain crafted separators or traversal-like sequences, downstream filename-derived values may be unsafe. Action: validate or normalize S3 keys before use, and restrict allowed characters/patterns; enforce bucket/key policies to reject atypical keys.
Insufficient input validation for document_id (CWE-20): document_id is derived directly from filenames (stem or name) without sanitization. Action: normalize and validate document_id (e.g., whitelist characters, length limits, percent-encode or escape dangerous characters) before storing or using in queries/filepaths.
Case-sensitive extension checks (CWE-20): suffix comparisons rely on exact-case matches; mixed-case extensions may be misclassified. Action: perform suffix.lower() comparisons (e.g., path.suffix.lower() == ".md" / ".txt") wherever extension branching occurs.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically identifies the main fix: resolving a bug where AutoRAG context_correctness scored 0 for .txt input documents.
Description check	✅ Passed	The description thoroughly explains both bugs, their root causes, and all code changes across multiple components. It clearly links to the related issue and includes comprehensive technical detail.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

components/data_processing/autorag/documents_discovery/tests/test_component_unit.py (1)
305-395: ⚡ Quick win

Add a regression test for duplicate basenames across different prefixes.

Current priority tests don’t cover a/important.txt + b/important.txt with benchmark ID important.txt. That case is where basename matching becomes ambiguous and can break sampling correctness; lock this with an explicit expected behavior test (fail fast or deterministic tie-break).

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/data_processing/autorag/documents_discovery/tests/test_component_unit.py`
around lines 305 - 395, Add a regression test named something like
test_test_data_duplicate_basenames that uses the same pattern as the other tests
and verifies behavior when two S3 keys share the same basename (e.g.,
"a/important.txt" and "b/important.txt") while test_data references
"important.txt"; call the existing helper _run with those contents and
test_data_json=[{"question":"q","correct_answer_document_ids":["important.txt"]}]
and sampling_max_size such that only one doc is selected, then assert
payload["count"] == 1 and that the single selected
payload["documents"][0]["key"] is either "a/important.txt" or "b/important.txt"
(i.e., deterministic single selection), which ensures duplicate-basename
handling is covered.
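
A self-contained version of the suggested regression test might look like this. It is only a sketch: `prioritize_and_cap` is a hypothetical stand-in for the component's sampling step (the real tests go through the `_run` helper, and the real cap is `sampling_max_size` rather than a document count).

```python
from pathlib import Path


def prioritize_and_cap(keys, test_doc_names, max_count):
    """Hypothetical stand-in for the component's sampling step."""
    names = set(test_doc_names)
    # False sorts before True, so benchmark-referenced files come first.
    # sorted() is stable, so files with equal priority keep their input order.
    ordered = sorted(keys, key=lambda k: Path(k).name not in names)
    return ordered[:max_count]


def test_duplicate_basenames_are_selected_deterministically():
    keys = ["other.txt", "a/important.txt", "b/important.txt"]
    selected = prioritize_and_cap(keys, ["important.txt"], max_count=1)
    # Both "important.txt" keys match by basename; the stable sort keeps
    # the one that was listed first.
    assert selected == ["a/important.txt"]


test_duplicate_basenames_are_selected_deterministically()
```

The tie-break shown here depends on the listing order of the input, so it is deterministic only as long as that order is.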

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/data_processing/autorag/documents_discovery/component.py`:
- Line 109: The sort currently checks only Path(c["Key"]).name and does a list
membership inside the lambda which is both ambiguous (different S3 objects can
share the same basename) and inefficient; fix by precomputing the exact set of
S3 keys that should be treated as test hits (e.g. test_keys_set = {c["Key"] for
c in supported_files if Path(c["Key"]).name in test_data_docs_names}) and then
replace the sort key with supported_files.sort(key=lambda c: c["Key"] in
test_keys_set) so membership is O(1) and matching is done against the full S3
key (use the existing symbols supported_files, Path, c["Key"], and
test_data_docs_names).

In `@components/data_processing/autorag/documents_indexing/component.py`:
- Line 132: The current paths collection (variable paths in
documents_indexing/component.py) uses case-sensitive glob patterns
("*.md","*.txt") against Path(extracted_text.path) and thus skips files like
REPORT.TXT; update the logic that builds paths to include case-insensitive
suffix matching by enumerating files (e.g., iterate
Path(extracted_text.path).glob("*") or use rglob) and then filter by
p.suffix.lower() in (".md", ".txt") (or equivalent) so any uppercase or
mixed-case extensions are accepted; ensure you keep the sorted(...) wrapper and
continue to reference extracted_text.path and the same paths variable so
downstream code is unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 786d06a2-b1d2-4040-9a9e-f8233c5cc98f

📥 Commits

Reviewing files that changed from the base of the PR and between 06379b9 and 9f220bb.

📒 Files selected for processing (7)

components/data_processing/autorag/documents_discovery/component.py
components/data_processing/autorag/documents_discovery/tests/test_component_unit.py
components/data_processing/autorag/documents_indexing/component.py
components/data_processing/autorag/text_extraction/component.py
components/data_processing/autorag/text_extraction/tests/test_component_unit.py
components/training/autorag/rag_templates_optimization/component.py
components/training/autorag/search_space_preparation/component.py

DorotaDR

@witold-nowogorski Please respond to the comments from CodeRabbit; after that, I can approve the PR.

DorotaDR · 2026-05-14T08:17:48Z

Please also share the results of the integration / e2e AutoRAG tests showing that the context correctness is non-zero.

filip-komarzyniec

The PR seems to introduce many if-clauses that differentiate between the .txt and .md file extensions.
I'm not sure this is the best approach. I've left some comments to think through. Maybe we can make it work without so many ifs, as they complicate the overall logic and open the door to more bugs.

Please also link a successful test run (or any other sign that the proposed changes were tested).

filip-komarzyniec · 2026-05-15T12:06:42Z

    test_data_docs_names = get_test_data_docs_names()
    if test_data_docs_names:
-        supported_files.sort(key=lambda c: c["Key"] not in test_data_docs_names)
+        supported_files.sort(key=lambda c: Path(c["Key"]).name not in test_data_docs_names)


I guess what CodeRabbit points out here is that the logic might be incorrect when there are files with the same base names (which in some cases will be valid in S3 storage).

How do we ensure we correctly process same-named files? This might be worth addressing somehow.

The example can be the following folder structure:

    documents/
    ├─ sub_folder1/
    │  ├─ document1.txt   # this one is included in the benchmark.json file
    ├─ sub_folder2/
    │  ├─ document1.txt   # this one is NOT included in the benchmark.json file
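
One way to make this case visible, rather than silently prioritizing both files, is to count basenames before sorting. The following is a hedged sketch of that fail-fast idea using the two keys from the folder structure above; it is not part of the PR, and the variable names are illustrative.

```python
from collections import Counter
from pathlib import Path

# The two S3 keys from the example folder structure.
keys = [
    "documents/sub_folder1/document1.txt",
    "documents/sub_folder2/document1.txt",
]
# Assumption: the benchmark references documents by bare file name.
benchmark_names = {"document1.txt"}

basename_counts = Counter(Path(k).name for k in keys)
# Benchmark names that map to more than one S3 object cannot be resolved
# by basename alone.
ambiguous = sorted(n for n in benchmark_names if basename_counts[n] > 1)

print(ambiguous)
```

A component could log or raise on a non-empty `ambiguous` list; which of the two is appropriate depends on whether same-named files are expected in the bucket.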

filip-komarzyniec · 2026-05-15T12:13:04Z

    )

-    paths = sorted(Path(extracted_text.path).glob("*.md"))
+    paths = sorted(p for ext in ("*.md", "*.txt") for p in Path(extracted_text.path).glob(ext))


Either this is not accurate or the docstrings are out of date. In the extracted_text folder, do we store only *.md files or *.txt as well?

text_extraction writes all outputs as *.md (.txt files are copied as-is but saved with a .md extension). The docstring has been updated to reflect this.

filip-komarzyniec · 2026-05-15T12:14:03Z

            Document(
                page_content=p.read_text(encoding="utf-8", errors="replace"),
-                metadata={"document_id": p.stem},
+                metadata={"document_id": p.stem if p.suffix == ".md" else p.name},


Why exactly do we treat *.md files differently than *.txt ones?

If this is because some other component does or does not append extensions, then I'd say we change the other component so that it correctly decides whether to append an extension or not.

The current approach is complicated and creates weird coupling dependent on the other component's internal logic.

The root cause was text_extraction preserving the .txt extension for plain-text files while appending .md for everything else. I changed it to write all outputs uniformly as <original_name>.md, which removed the need for any conditional in documents_indexing.
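
The uniform naming described in this reply can be sketched as follows. Both function names are hypothetical; the point is only that writing `<original_name>.md` for every input lets a plain `stem` recover the original file name without any per-extension branch.

```python
from pathlib import Path


def extracted_name(original_name: str) -> str:
    """Hypothetical: every output is written as <original_name>.md."""
    return f"{original_name}.md"


def document_id(extracted_path: Path) -> str:
    """Hypothetical: stripping the single trailing .md restores the original name."""
    return extracted_path.stem


originals = ["hash_0.txt", "report.pdf"]
ids = [document_id(Path(extracted_name(name))) for name in originals]
print(ids)
```

Because `Path.stem` removes only the last suffix, the round trip holds for any input extension.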

filip-komarzyniec · 2026-05-15T12:32:21Z

            if input_file.suffix.lower() == ".txt":
+                output_file = output_dir / input_file.name
                output_file.write_text(input_file.read_text(encoding="utf-8"), encoding="utf-8")
                return True, None

+            output_file = output_dir / f"{input_file.name}.md"


Is there any reason we keep differentiating between *.txt and *.md files?

Wouldn't it be easier to write everything as *.md? What we wanted to achieve here is not to docling-process *.txt files as it makes no sense (and afaik is not really possible).

Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

components/data_processing/autorag/documents_indexing/component.py (1)

132-133: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

.txt artifacts are excluded from indexing.

This filter only includes .md; preserved .txt outputs won’t be indexed, which breaks txt benchmark matching and can drive context_correctness back to zero.

Proposed fix

-    paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md")
+    paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() in {".md", ".txt"})
@@
-                metadata={"document_id": p.stem},
+                metadata={"document_id": p.stem if p.suffix.lower() == ".md" else p.name},

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/data_processing/autorag/documents_indexing/component.py` around
lines 132 - 133, The current file-filtering in the documents indexing code
(variables base and paths) only selects ".md" files, which excludes preserved
".txt" artifacts; update the filter used where paths is computed so it accepts
both ".md" and ".txt" (case-insensitive) — e.g., check p.suffix.lower() in a set
or tuple like (".md", ".txt") — ensuring files with either extension are
included and the rest of the logic that iterates over paths (the document
indexing flow) continues to work unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@components/data_processing/autorag/text_extraction/tests/test_component_unit.py`:
- Line 477: Update the unit test assertions that currently expect legacy
".txt.md" filenames (e.g. the assertion using output_files[0].name ==
"file1.txt.md") to validate the new mixed output behavior: assert that one of
the outputs is the Markdown derivative (endswith ".md") and that the original
text file is preserved with a ".txt" filename (e.g. check for any item in
output_files with name == "file1.txt" or .endswith(".txt")). Apply the same
change to the other occurrences noted around the test (the assertions at the
blocks near lines 663-665 and 674-675) so tests accept a set containing both the
.md and the .txt expected names rather than forcing ".txt.md".

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3e191882-c0ab-46e7-bfcd-9df89bf40caf

📥 Commits

Reviewing files that changed from the base of the PR and between 9f220bb and cb97cbe.

📒 Files selected for processing (4)

components/data_processing/autorag/documents_discovery/component.py
components/data_processing/autorag/documents_indexing/component.py
components/data_processing/autorag/text_extraction/component.py
components/data_processing/autorag/text_extraction/tests/test_component_unit.py

💤 Files with no reviewable changes (1)

components/data_processing/autorag/text_extraction/component.py

🚧 Files skipped from review as they are similar to previous changes (1)

components/data_processing/autorag/documents_discovery/component.py

LukaszCmielowski · 2026-05-20T08:49:06Z

@filip-komarzyniec please finalize the review.

filip-komarzyniec · 2026-05-20T09:18:45Z

    test_data_docs_names = get_test_data_docs_names()
    if test_data_docs_names:
-        supported_files.sort(key=lambda c: c["Key"] not in test_data_docs_names)
+        test_keys_set = {c["Key"] for c in supported_files if Path(c["Key"]).name in test_data_docs_names}


I guess what CodeRabbit pointed out here is that the logic might be incorrect when there are files with the same base names (which in some cases will be valid in S3 storage).

How do we ensure we correctly process same-named files? This might be worth addressing somehow.

The example can be the following folder structure:

    documents/
    ├─ sub_folder1/
    │  ├─ document1.txt
    ├─ sub_folder2/
    │  ├─ document1.txt

filip-komarzyniec · 2026-05-20T09:20:00Z


-    paths = sorted(Path(extracted_text.path).glob("*.md"))
+    base = Path(extracted_text.path)
+    paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md")


If the extracted_text directory stores only *.md files now, then why do we need the p.suffix.lower() == ".md" check?

It might be redundant, but I will keep it to make sure we load only Markdown files in case something else appears in the extracted_text directory, such as OS artifacts or metadata files.

makes sense!
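
The effect of keeping the guard can be checked in isolation. This is a small sketch with made-up file names in a temporary directory; only the filter expression is taken from the diff above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Two Markdown files (one with an upper-case extension) plus stray files.
    for name in ("a.md", "B.MD", ".DS_Store", "meta.json"):
        (base / name).write_text("x", encoding="utf-8")
    # A directory whose name merely looks like a Markdown file.
    (base / "nested.md").mkdir()

    # Same filter as in the diff: regular files with a case-insensitive .md suffix.
    paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md")
    names = {p.name for p in paths}

print(names)
```

The `is_file()` check skips the `nested.md` directory, and the suffix check skips the stray `.DS_Store` and `meta.json` files.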

filip-komarzyniec · 2026-05-20T09:21:17Z

        if path.is_dir():
            for doc_path in path.iterdir():
                with doc_path.open("r", encoding="utf-8") as doc:
+                    doc_id = doc_path.stem if doc_path.suffix == ".md" else doc_path.name


Again, why do we need the *.md checks now?

According to discussion in some previous comments all files in extracted_text should be markdown

Agreed, it's redundant today, but cheap enough to keep as a guard against unexpected files appearing in extracted_text (OS artifacts, future changes or something dropped silently by text extraction process).

filip-komarzyniec · 2026-05-20T09:23:59Z

a couple more comments but nothing really severe. Apart from that the changes /lgtm

LukaszCmielowski · 2026-05-20T10:55:25Z

@witold-nowogorski can we have those addressed so I can merge?

LukaszCmielowski · 2026-05-20T10:55:51Z

/lgtm

LukaszCmielowski · 2026-05-20T10:56:10Z

/ok-to-test

LukaszCmielowski · 2026-05-21T08:03:48Z

/retest

witold-nowogorski · 2026-05-21T13:22:53Z

/retest

LukaszCmielowski · 2026-05-22T07:48:24Z

/ok-to-test cancel

LukaszCmielowski · 2026-05-22T07:48:47Z

/ok-to-test

LukaszCmielowski · 2026-05-22T07:49:18Z

/approve

openshift-ci · 2026-05-22T07:49:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LukaszCmielowski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~components/data_processing/autorag/documents_discovery/OWNERS~~ [LukaszCmielowski]
~~components/data_processing/autorag/documents_indexing/OWNERS~~ [LukaszCmielowski]
~~components/data_processing/autorag/text_extraction/OWNERS~~ [LukaszCmielowski]
~~components/training/autorag/rag_templates_optimization/OWNERS~~ [LukaszCmielowski]
~~components/training/autorag/search_space_preparation/OWNERS~~ [LukaszCmielowski]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

preserve .txt extension in text extraction | unit tests for sampling

9f220bb

Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>

openshift-ci Bot requested review from DorotaDR and filip-komarzyniec May 12, 2026 07:51

coderabbitai Bot reviewed May 12, 2026

Comment thread components/data_processing/autorag/documents_discovery/component.py Outdated

Comment thread components/data_processing/autorag/documents_indexing/component.py Outdated

DorotaDR reviewed May 14, 2026

filip-komarzyniec reviewed May 15, 2026

update text_extraction

cb97cbe

Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>

coderabbitai Bot reviewed May 18, 2026

Comment thread components/data_processing/autorag/text_extraction/tests/test_component_unit.py

witold-nowogorski requested a review from filip-komarzyniec May 18, 2026 06:24

filip-komarzyniec reviewed May 20, 2026

openshift-ci Bot assigned LukaszCmielowski May 20, 2026

openshift-ci Bot added the lgtm label May 20, 2026

openshift-ci Bot added the ok-to-test label May 20, 2026

witold-nowogorski requested a review from filip-komarzyniec May 21, 2026 13:19

openshift-ci Bot added needs-ok-to-test and removed ok-to-test labels May 22, 2026

openshift-ci Bot added the ok-to-test label May 22, 2026

openshift-ci Bot removed the needs-ok-to-test label May 22, 2026

openshift-ci Bot added the approved label May 22, 2026

openshift-merge-bot Bot merged commit a028d33 into opendatahub-io:main May 22, 2026
26 of 28 checks passed
