fix: AutoRAG context correctness always 0 for .txt input documents#69
Conversation
Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
📝 WalkthroughWalkthroughThe PR updates document discovery to derive basenames from S3 keys and sort discovered files by matching test-data basenames. Text extraction now fast-paths .txt inputs (writes original .txt name, returns) and tests reflect that naming. Indexing discovery uses iterdir() with case-insensitive .md filtering. Training helpers now set metadata["document_id"] = stem for .md and name (with extension) for other files. New unit tests validate sampling, prioritization, and extraction behaviors. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Security findings
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
components/data_processing/autorag/documents_discovery/tests/test_component_unit.py (1)
305-395: ⚡ Quick winAdd a regression test for duplicate basenames across different prefixes.
Current priority tests don’t cover
a/important.txt+b/important.txtwith benchmark IDimportant.txt. That case is where basename matching becomes ambiguous and can break sampling correctness; lock this with an explicit expected behavior test (fail fast or deterministic tie-break).As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/data_processing/autorag/documents_discovery/tests/test_component_unit.py` around lines 305 - 395, Add a regression test named something like test_test_data_duplicate_basenames that uses the same pattern as the other tests and verifies behavior when two S3 keys share the same basename (e.g., "a/important.txt" and "b/important.txt") while test_data references "important.txt"; call the existing helper _run with those contents and test_data_json=[{"question":"q","correct_answer_document_ids":["important.txt"]}] and sampling_max_size such that only one doc is selected, then assert payload["count"] == 1 and that the single selected payload["documents"][0]["key"] is either "a/important.txt" or "b/important.txt" (i.e., deterministic single selection), which ensures duplicate-basename handling is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/data_processing/autorag/documents_discovery/component.py`:
- Line 109: The sort currently checks only Path(c["Key"]).name and does a list
membership inside the lambda which is both ambiguous (different S3 objects can
share the same basename) and inefficient; fix by precomputing the exact set of
S3 keys that should be treated as test hits (e.g. test_keys_set = {c["Key"] for
c in supported_files if Path(c["Key"]).name in test_data_docs_names}) and then
replace the sort key with supported_files.sort(key=lambda c: c["Key"] in
test_keys_set) so membership is O(1) and matching is done against the full S3
key (use the existing symbols supported_files, Path, c["Key"], and
test_data_docs_names).
In `@components/data_processing/autorag/documents_indexing/component.py`:
- Line 132: The current paths collection (variable paths in
documents_indexing/component.py) uses case-sensitive glob patterns
("*.md","*.txt") against Path(extracted_text.path) and thus skips files like
REPORT.TXT; update the logic that builds paths to include case-insensitive
suffix matching by enumerating files (e.g., iterate
Path(extracted_text.path).glob("*") or use rglob) and then filter by
p.suffix.lower() in (" .md",". .txt") (or equivalent) so any uppercase or
mixed-case extensions are accepted; ensure you keep the sorted(...) wrapper and
continue to reference extracted_text.path and the same paths variable so
downstream code is unchanged.
---
Nitpick comments:
In
`@components/data_processing/autorag/documents_discovery/tests/test_component_unit.py`:
- Around line 305-395: Add a regression test named something like
test_test_data_duplicate_basenames that uses the same pattern as the other tests
and verifies behavior when two S3 keys share the same basename (e.g.,
"a/important.txt" and "b/important.txt") while test_data references
"important.txt"; call the existing helper _run with those contents and
test_data_json=[{"question":"q","correct_answer_document_ids":["important.txt"]}]
and sampling_max_size such that only one doc is selected, then assert
payload["count"] == 1 and that the single selected
payload["documents"][0]["key"] is either "a/important.txt" or "b/important.txt"
(i.e., deterministic single selection), which ensures duplicate-basename
handling is covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 786d06a2-b1d2-4040-9a9e-f8233c5cc98f
📒 Files selected for processing (7)
components/data_processing/autorag/documents_discovery/component.pycomponents/data_processing/autorag/documents_discovery/tests/test_component_unit.pycomponents/data_processing/autorag/documents_indexing/component.pycomponents/data_processing/autorag/text_extraction/component.pycomponents/data_processing/autorag/text_extraction/tests/test_component_unit.pycomponents/training/autorag/rag_templates_optimization/component.pycomponents/training/autorag/search_space_preparation/component.py
DorotaDR
left a comment
There was a problem hiding this comment.
@witold-nowogorski Please respond to the comments from CodeRabbit; after that, I can approve the PR.
|
Please also share the results of the integration / e2e AutoRAG tests showing that the context correctness is non-zero. |
filip-komarzyniec
left a comment
There was a problem hiding this comment.
The PR seems to be introducing many if clauses differentiating between txt and md files extensions.
I'm not really sure this is the most optimal approach. I've left some comments to think through -- maybe we can make it work without so many if-s as they complicate the overall logic and open doors for more bugs.
Please also link a successful test run (or any other sign that the proposed changes were tested).
| test_data_docs_names = get_test_data_docs_names() | ||
| if test_data_docs_names: | ||
| supported_files.sort(key=lambda c: c["Key"] not in test_data_docs_names) | ||
| supported_files.sort(key=lambda c: Path(c["Key"]).name not in test_data_docs_names) |
There was a problem hiding this comment.
I guess what CodeRabbit points out here is that the logic might be incorrect when there are files with the same base names (which in some cases will be valid in S3 storage).
How do we ensure we correctly process same-named files? This might be worth addressing somehow.
The example can be the following folder structure:
documents/
├─ sub_folder1/
│ ├─ document1.txt # this one is included in the benchmark.json file
├─ sub_folder2/
│ ├─ document1.txt # this one is NOT included in the benchmark.json file
| ) | ||
|
|
||
| paths = sorted(Path(extracted_text.path).glob("*.md")) | ||
| paths = sorted(p for ext in ("*.md", "*.txt") for p in Path(extracted_text.path).glob(ext)) |
There was a problem hiding this comment.
Either this is not accurate or docstrings are out-of-date. In the extracted_text folder do we store only *md files or *.txt as well?
There was a problem hiding this comment.
text_extraction writes all outputs as *.md (.txt files are copied as-is but saved with a .md extension). The docstring has been updated to reflect this.
| Document( | ||
| page_content=p.read_text(encoding="utf-8", errors="replace"), | ||
| metadata={"document_id": p.stem}, | ||
| metadata={"document_id": p.stem if p.suffix == ".md" else p.name}, |
There was a problem hiding this comment.
Why exactly do we treat *.md files differently than *.txt ones?
If this is because some other component appends or not extensions then I'd say we change the other component so that it correctly decided whether to append extension or not.
The current approach is complicated and creates weird coupling dependant on the other component's internal logic.
There was a problem hiding this comment.
The root cause was text_extraction preserving the .txt extension for plain-text files while appending .md for everything else. I changed it to write all outputs uniformly as <original_name>.md, which removed the need for any conditional in documents_indexing.
| if input_file.suffix.lower() == ".txt": | ||
| output_file = output_dir / input_file.name | ||
| output_file.write_text(input_file.read_text(encoding="utf-8"), encoding="utf-8") | ||
| return True, None | ||
|
|
||
| output_file = output_dir / f"{input_file.name}.md" |
There was a problem hiding this comment.
Is there any reason we keep differentiating between *.txt and *.md files?
Wouldn't it be easier to write everything as *.md? What we wanted to achieve here is not to docling-process *.txt files as it makes no sense (and afaik is not really possible).
Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/data_processing/autorag/documents_indexing/component.py (1)
132-133:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
.txtartifacts are excluded from indexing.This filter only includes
.md; preserved.txtoutputs won’t be indexed, which breaks txt benchmark matching and can drivecontext_correctnessback to zero.Proposed fix
- paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md") + paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() in {".md", ".txt"}) @@ - metadata={"document_id": p.stem}, + metadata={"document_id": p.stem if p.suffix.lower() == ".md" else p.name},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/data_processing/autorag/documents_indexing/component.py` around lines 132 - 133, The current file-filtering in the documents indexing code (variables base and paths) only selects ".md" files, which excludes preserved ".txt" artifacts; update the filter used where paths is computed so it accepts both ".md" and ".txt" (case-insensitive) — e.g., check p.suffix.lower() in a set or tuple like (".md", ".txt") — ensuring files with either extension are included and the rest of the logic that iterates over paths (the document indexing flow) continues to work unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@components/data_processing/autorag/text_extraction/tests/test_component_unit.py`:
- Line 477: Update the unit test assertions that currently expect legacy
".txt.md" filenames (e.g. the assertion using output_files[0].name ==
"file1.txt.md") to validate the new mixed output behavior: assert that one of
the outputs is the Markdown derivative (endswith ".md") and that the original
text file is preserved with a ".txt" filename (e.g. check for any item in
output_files with name == "file1.txt" or .endswith(".txt")). Apply the same
change to the other occurrences noted around the test (the assertions at the
blocks near lines 663-665 and 674-675) so tests accept a set containing both the
.md and the .txt expected names rather than forcing ".txt.md".
---
Duplicate comments:
In `@components/data_processing/autorag/documents_indexing/component.py`:
- Around line 132-133: The current file-filtering in the documents indexing code
(variables base and paths) only selects ".md" files, which excludes preserved
".txt" artifacts; update the filter used where paths is computed so it accepts
both ".md" and ".txt" (case-insensitive) — e.g., check p.suffix.lower() in a set
or tuple like (".md", ".txt") — ensuring files with either extension are
included and the rest of the logic that iterates over paths (the document
indexing flow) continues to work unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3e191882-c0ab-46e7-bfcd-9df89bf40caf
📒 Files selected for processing (4)
components/data_processing/autorag/documents_discovery/component.pycomponents/data_processing/autorag/documents_indexing/component.pycomponents/data_processing/autorag/text_extraction/component.pycomponents/data_processing/autorag/text_extraction/tests/test_component_unit.py
💤 Files with no reviewable changes (1)
- components/data_processing/autorag/text_extraction/component.py
🚧 Files skipped from review as they are similar to previous changes (1)
- components/data_processing/autorag/documents_discovery/component.py
|
@filip-komarzyniec please finalize the review. |
| test_data_docs_names = get_test_data_docs_names() | ||
| if test_data_docs_names: | ||
| supported_files.sort(key=lambda c: c["Key"] not in test_data_docs_names) | ||
| test_keys_set = {c["Key"] for c in supported_files if Path(c["Key"]).name in test_data_docs_names} |
There was a problem hiding this comment.
I guess what CodeRabbit pointed out here is that the logic might be incorrect when there are files with the same base names (which in some cases will be valid in S3 storage).
How do we ensure we correctly process same-named files? This might be worth addressing somehow.
The example can be the following folder structure:
documents/
├─ sub_folder1/
│ ├─ document1.txt
├─ sub_folder2/
│ ├─ document1.txt
|
|
||
| paths = sorted(Path(extracted_text.path).glob("*.md")) | ||
| base = Path(extracted_text.path) | ||
| paths = sorted(p for p in base.iterdir() if p.is_file() and p.suffix.lower() == ".md") |
There was a problem hiding this comment.
If extracted_text directory stores only *.md files now then why do we need and p.suffix.lower() == ".md" check?
There was a problem hiding this comment.
It might be redundant but I will preserve it just to make sure we are loading markdowns only in case something else would appear in the extracted_text directory, such as some OS artifacts, some metadata files etc.
| if path.is_dir(): | ||
| for doc_path in path.iterdir(): | ||
| with doc_path.open("r", encoding="utf-8") as doc: | ||
| doc_id = doc_path.stem if doc_path.suffix == ".md" else doc_path.name |
There was a problem hiding this comment.
Again, why do we need the *.md checks now?
According to discussion in some previous comments all files in extracted_text should be markdown
There was a problem hiding this comment.
Agreed, it's redundant today, but cheap enough to keep as a guard against unexpected files appearing in extracted_text (OS artifacts, future changes or something dropped silently by text extraction process).
|
a couple more comments but nothing really severe. Apart from that the changes /lgtm |
|
@witold-nowogorski can we have those addressed so I could merge ?. |
|
/lgtm |
|
/ok-to-test |
|
/retest |
|
/retest |
|
/ok-to-test cancel |
|
/ok-to-test |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LukaszCmielowski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
a028d33
into
opendatahub-io:main
Related issue: https://redhat.atlassian.net/browse/RHOAIENG-60783
Description of your changes:
Two bugs caused
context_correctnessto score 0 when using .txt input documents:hash_0.txtbecamehash_0.txt.md. ai4rag then added another .md internally, producinghash_0.txt.md.mdas the document ID - mismatching the benchmark'shash_0.txt.docs/hash_0.txt) against bare benchmark filenames (hash_0.txt), so priority sorting never worked.Changes
Path(c["Key"]).nameto match against benchmark document IDs.Checklist:
Pre-Submission Checklist
Learn more about the pull request title convention used in this repository.
Additional Checklist Items for New or Updated Components/Pipelines
metadata.yamlincludes freshlastVerifiedtimestampare present and complete
snake_casenaming conventionSummary by CodeRabbit
New Features
Bug Fixes
Tests