
feat: AutoRAG - text extraction - enable table structure detection#111

Open
witold-nowogorski wants to merge 4 commits into
opendatahub-io:mainfrom
witold-nowogorski:autorag-text-extraction-enable-table-structure-detection

Conversation

@witold-nowogorski

@witold-nowogorski witold-nowogorski commented May 29, 2026


Description of your changes:

Checklist:

Pre-Submission Checklist

Additional Checklist Items for New or Updated Components/Pipelines

  • metadata.yaml includes fresh lastVerified timestamp
  • All required files are present and complete
  • OWNERS file lists appropriate maintainers
  • README provides clear documentation with usage examples
  • Component follows snake_case naming convention
  • No security vulnerabilities in dependencies
  • Containerfile included if using a custom base image

Summary by CodeRabbit

  • New Features

    • PDF table structures are now processed during document conversion for improved table extraction.
  • Chores

    • Added a new image processing library dependency.

@coderabbitai

coderabbitai Bot commented May 29, 2026

📝 Walkthrough


This PR enables table structure extraction in the Docling PDF text-processing pipeline by setting do_table_structure=True, and adds opencv-python-headless as a Python dependency in pyproject.toml. With the option enabled, Docling's DocumentConverter parses and preserves table layouts when converting PDFs to Markdown. The new dependency provides the image processing support that this feature needs.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Description check | ⚠️ Warning | The PR description is a blank template with unchecked checklists and no actual description of the changes, missing the required 'Description of your changes' section. | Fill in the 'Description of your changes' section with details about what was changed, why, and any relevant context or testing performed.

✅ Passed checks (3 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically summarizes the main change: enabling table structure detection in AutoRAG's text extraction component.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Signed-off-by: Witold Nowogorski <wnowogor@redhat.com>
@openshift-ci

openshift-ci Bot commented May 29, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mprahl for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@witold-nowogorski witold-nowogorski changed the title from "Autorag text extraction enable table structure detection" to "feat: AutoRAG - text extraction - enable table structure detection" May 29, 2026
@witold-nowogorski witold-nowogorski marked this pull request as ready for review May 29, 2026 07:27

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/data_processing/autorag/text_extraction/component.py`:
- Line 168: The test expectation is stale: the pipeline contract in
components/data_processing/autorag/text_extraction/component.py now sets
do_table_structure=True, but the unit test in
components/data_processing/autorag/text_extraction/tests/test_component_unit.py
still asserts False. Update the test assertion(s) that reference
do_table_structure (look for occurrences like assert ...['do_table_structure']
or assert component.do_table_structure is False) to expect True, and run the
tests to ensure no other assertions depend on the old value.

In `@pyproject.toml`:
- Line 16: The dependency "opencv-python-headless" in pyproject.toml is
unpinned; update its requirement to a constrained version range to avoid
drifting into vulnerable OpenCV builds (e.g., change "opencv-python-headless" to
something like "opencv-python-headless>=4.1.1.26,<5" or pin an exact safe
version/hash in your lockfile); modify the dependency entry in pyproject.toml
accordingly and ensure the lockfile/hashes are regenerated so installations are
deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ca8a22e7-613b-4c2a-92bf-f1966c937c35

📥 Commits

Reviewing files that changed from the base of the PR and between 1c1e7ae and 5ed7a75.

📒 Files selected for processing (2)
  • components/data_processing/autorag/text_extraction/component.py
  • pyproject.toml

     artifacts_path=ap,
     do_ocr=False,
-    do_table_structure=False,
+    do_table_structure=True,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update downstream unit test expectation for do_table_structure.

Line 168 flips the PDF pipeline contract to do_table_structure=True, but components/data_processing/autorag/text_extraction/tests/test_component_unit.py (around lines 333-404 in the provided snippet) still asserts False. Unless that assertion is updated, the test suite keeps validating the stale behavior.

Proposed test fix
-        assert call_kwargs["do_table_structure"] is False
+        assert call_kwargs["do_table_structure"] is True
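The assertion change above can be illustrated with a minimal, self-contained sketch. It assumes the component passes its settings as keyword arguments to an options class; `build_pdf_pipeline_options` is a hypothetical helper written for this illustration, not the function in component.py, and a `Mock` stands in for the real options class so that nothing needs to be installed.

```python
from unittest import mock


def build_pdf_pipeline_options(options_cls, artifacts_path):
    """Hypothetical helper: build PDF pipeline options with table structure on."""
    return options_cls(
        artifacts_path=artifacts_path,
        do_ocr=False,
        do_table_structure=True,
    )


def check_table_structure_enabled():
    # The Mock replaces the real options class (an assumption for this sketch).
    options_cls = mock.Mock(name="PdfPipelineOptions")
    build_pdf_pipeline_options(options_cls, artifacts_path="/tmp/models")

    # call_args.kwargs holds the keyword arguments of the most recent call.
    call_kwargs = options_cls.call_args.kwargs
    assert call_kwargs["do_ocr"] is False
    assert call_kwargs["do_table_structure"] is True
    return call_kwargs


call_kwargs = check_table_structure_enabled()
```

Asserting on the captured keyword arguments, rather than on the returned object, keeps the test independent of how the real options class stores its fields.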

Comment thread pyproject.toml
 dependencies = [
     "kfp>=2.16.1",
     "kfp-kubernetes>=2.16.1",
+    "opencv-python-headless",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify available versions and recent vulnerabilities before choosing exact bounds.
curl -s https://pypi.org/pypi/opencv-python-headless/json | jq -r '.info.version, (.releases | keys[] )' | tail -n 20

# OSV vulnerability lookup for the package
curl -s https://api.osv.dev/v1/query -H 'Content-Type: application/json' \
  -d '{"package":{"name":"opencv-python-headless","ecosystem":"PyPI"}}' | jq

Repository: opendatahub-io/pipelines-components

Length of output: 50391


Pin opencv-python-headless in pyproject.toml (line 16).

  • Unversioned dependency makes installs non-deterministic and can drift into OpenCV builds with known memory-safety bugs (e.g., CVE-2019-14493 / CWE-476, CVE-2019-9423 / CWE-787, CVE-2019-14491 / CWE-125), which OSV reports fixed starting at 4.1.1.26.
  • Add explicit bounds or central constraints, e.g. opencv-python-headless>=4.1.1.26,<5 (or pin exact version in your lockfile + hashes).
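A check like the one described could be automated. The following is a rough sketch, not part of this repository: `unconstrained` is a hypothetical helper that flags dependency strings carrying no version specifier, and it treats anything after the package name (a specifier or a direct URL reference) as a constraint.

```python
import re

# Package name at the start of a requirement string, with optional extras.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*")


def unconstrained(dependencies):
    """Return names of dependencies that have no version constraint at all."""
    result = []
    for dep in dependencies:
        # Drop environment markers such as '; python_version < "3.12"'.
        requirement = dep.split(";", 1)[0]
        match = _NAME.match(requirement)
        if match is None:
            continue  # not a plain requirement string; ignored in this sketch
        remainder = requirement[match.end():].strip()
        if not remainder:
            result.append(match.group(1))
    return result


# The dependency list as quoted in this review thread.
deps = ["kfp>=2.16.1", "kfp-kubernetes>=2.16.1", "opencv-python-headless"]
flagged = unconstrained(deps)

# The bounded form suggested above is no longer flagged.
flagged_after_fix = unconstrained(["opencv-python-headless>=4.1.1.26,<5"])
```

Here `flagged` contains only the new dependency, and `flagged_after_fix` is empty once the suggested bounds are applied.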

Comment thread pyproject.toml
dependencies = [
"kfp>=2.16.1",
"kfp-kubernetes>=2.16.1",
"opencv-python-headless",


Along with pyproject.toml, we also need to lock the new dependencies (and all their transitive dependencies) in pipelines/training/autorag/documents_rag_optimization_pipeline/requirements.txt.

I recommend running the installation in a fresh Linux environment and generating a new requirements file using pip freeze. This is required for the downstream hermetic build, which pre-fetches all dependencies for the offline image build step.

For example:
podman run --rm --user root -v "${WORK}:/work" registry.access.redhat.com/ubi9/python-312:1-1779945122 bash -c "pip install -r /work/requirements.txt && pip freeze > /tmp/req.txt && cp /tmp/req.txt /work/requirements.txt"
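After regenerating the file, one way to confirm that every direct dependency actually landed in the frozen requirements is a small comparison script. This is a sketch under assumptions: `missing_from_lock` is a hypothetical helper, the frozen text below is sample data with placeholder versions, and only `name==version` lines (the form pip freeze emits for index packages) are treated as locked.

```python
import re


def normalize(name):
    """Normalize a project name: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def locked_names(requirements_text):
    """Collect normalized names of all 'name==version' lines."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        # Skip blank lines, comments, and pip options such as --hash.
        if not line or line.startswith("-"):
            continue
        if "==" in line:
            names.add(normalize(line.split("==", 1)[0].strip()))
    return names


def missing_from_lock(direct_dependencies, requirements_text):
    """Return direct dependencies that have no pinned entry in the lock text."""
    locked = locked_names(requirements_text)
    missing = []
    for dep in direct_dependencies:
        # Cut the requirement string at the first specifier, extra, or marker.
        name = re.split(r"[\s\[<>=!~;@]", dep.strip(), maxsplit=1)[0]
        if normalize(name) not in locked:
            missing.append(name)
    return missing


deps = ["kfp>=2.16.1", "kfp-kubernetes>=2.16.1", "opencv-python-headless"]

# Sample frozen files with placeholder versions (not the real lock contents).
frozen_before = "kfp==2.16.1\nkfp-kubernetes==2.16.1\n"
frozen_after = frozen_before + "opencv-python-headless==4.1.1.26\n"

missing_before = missing_from_lock(deps, frozen_before)
missing_after = missing_from_lock(deps, frozen_after)
```

The script only covers direct dependencies; transitive packages still need the full pip freeze output, since the hermetic build pre-fetches everything it installs.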

@DorotaDR


Before merging, it would be good to test the change on a disconnected cluster.

@LukaszCmielowski


@filip-komarzyniec could you please address the remaining comments?

@openshift-ci

openshift-ci Bot commented Jun 16, 2026


PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
