feat: AutoRAG - text extraction - enable table structure detection#111
Conversation
Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
📝 WalkthroughWalkthroughThis PR enables table structure extraction in the Docling PDF text processing pipeline by setting Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/data_processing/autorag/text_extraction/component.py`:
- Line 168: The test expectation is stale: the pipeline contract in
components/data_processing/autorag/text_extraction/component.py now sets
do_table_structure=True, but the unit test in
components/data_processing/autorag/text_extraction/tests/test_component_unit.py
still asserts False. Update the test assertion(s) that reference
do_table_structure (look for occurrences like assert ...['do_table_structure']
or assert component.do_table_structure is False) to expect True, and run the
tests to ensure no other assertions depend on the old value.
In `@pyproject.toml`:
- Line 16: The dependency "opencv-python-headless" in pyproject.toml is
unpinned; update its requirement to a constrained version range to avoid
drifting into vulnerable OpenCV builds (e.g., change "opencv-python-headless" to
something like "opencv-python-headless>=4.1.1.26,<5" or pin an exact safe
version/hash in your lockfile); modify the dependency entry in pyproject.toml
accordingly and ensure the lockfile/hashes are regenerated so installations are
deterministic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ca8a22e7-613b-4c2a-92bf-f1966c937c35
📒 Files selected for processing (2)
components/data_processing/autorag/text_extraction/component.pypyproject.toml
| artifacts_path=ap, | ||
| do_ocr=False, | ||
| do_table_structure=False, | ||
| do_table_structure=True, |
There was a problem hiding this comment.
Update downstream unit test expectation for do_table_structure.
Line 168 flips the PDF pipeline contract to do_table_structure=True, but components/data_processing/autorag/text_extraction/tests/test_component_unit.py (around Line 333-404 in the provided snippet) still asserts False. This will leave the test suite validating stale behavior.
Proposed test fix
- assert call_kwargs["do_table_structure"] is False
+ assert call_kwargs["do_table_structure"] is True🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/data_processing/autorag/text_extraction/component.py` at line 168,
The test expectation is stale: the pipeline contract in
components/data_processing/autorag/text_extraction/component.py now sets
do_table_structure=True, but the unit test in
components/data_processing/autorag/text_extraction/tests/test_component_unit.py
still asserts False. Update the test assertion(s) that reference
do_table_structure (look for occurrences like assert ...['do_table_structure']
or assert component.do_table_structure is False) to expect True, and run the
tests to ensure no other assertions depend on the old value.
| dependencies = [ | ||
| "kfp>=2.16.1", | ||
| "kfp-kubernetes>=2.16.1", | ||
| "opencv-python-headless", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify available versions and recent vulnerabilities before choosing exact bounds.
curl -s https://pypi.org/pypi/opencv-python-headless/json | jq -r '.info.version, (.releases | keys[] )' | tail -n 20
# OSV vulnerability lookup for the package
curl -s https://api.osv.dev/v1/query -H 'Content-Type: application/json' \
-d '{"package":{"name":"opencv-python-headless","ecosystem":"PyPI"}}' | jqRepository: opendatahub-io/pipelines-components
Length of output: 50391
Pin opencv-python-headless in pyproject.toml (line 16).
- Unversioned dependency makes installs non-deterministic and can drift into OpenCV builds with known memory-safety bugs (e.g., CVE-2019-14493 / CWE-476, CVE-2019-9423 / CWE-787, CVE-2019-14491 / CWE-125), which OSV reports fixed starting at 4.1.1.26.
- Add explicit bounds or central constraints, e.g.
opencv-python-headless>=4.1.1.26,<5(or pin exact version in your lockfile + hashes).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pyproject.toml` at line 16, The dependency "opencv-python-headless" in
pyproject.toml is unpinned; update its requirement to a constrained version
range to avoid drifting into vulnerable OpenCV builds (e.g., change
"opencv-python-headless" to something like "opencv-python-headless>=4.1.1.26,<5"
or pin an exact safe version/hash in your lockfile); modify the dependency entry
in pyproject.toml accordingly and ensure the lockfile/hashes are regenerated so
installations are deterministic.
| dependencies = [ | ||
| "kfp>=2.16.1", | ||
| "kfp-kubernetes>=2.16.1", | ||
| "opencv-python-headless", |
There was a problem hiding this comment.
Along with pyproject.toml, we also need to lock the new dependencies (and all their transitive dependencies) in pipelines/training/autorag/documents_rag_optimization_pipeline/requirements.txt.
I recommend running the installation in a fresh Linux environment and generating a new requirements file using pip freeze. This is required for the downstream hermetic build, which pre-fetches all dependencies for the offline image build step.
For example:
podman run --rm --user root -v "${WORK}:/work" registry.access.redhat.com/ubi9/python-312:1-1779945122 bash -c "pip install -r /work/requirements.txt && pip freeze > /tmp/req.txt && cp /tmp/req.txt /work/requirements.txt"
|
Before merge - it would be good to test the change on a disconnected cluster. |
|
@filip-komarzyniec could you please address remaining comments ?. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Description of your changes:
Checklist:
Pre-Submission Checklist
Learn more about the pull request title convention used in this repository.
Additional Checklist Items for New or Updated Components/Pipelines
metadata.yamlincludes freshlastVerifiedtimestampare present and complete
snake_casenaming conventionSummary by CodeRabbit
New Features
Chores