Closed
@@ -165,7 +165,7 @@ def _build_docling_format_options():
pdf_pipeline_options = ThreadedPdfPipelineOptions(
artifacts_path=ap,
do_ocr=False,
-do_table_structure=False,
+do_table_structure=True,

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update downstream unit test expectation for do_table_structure.

Line 168 changes the PDF pipeline option to do_table_structure=True, but components/data_processing/autorag/text_extraction/tests/test_component_unit.py (around lines 333-404 in the provided snippet) still asserts False, so the test suite is still checking the old behavior.

Proposed test fix
-        assert call_kwargs["do_table_structure"] is False
+        assert call_kwargs["do_table_structure"] is True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/data_processing/autorag/text_extraction/component.py` at line 168,
The test expectation is stale: the pipeline contract in
components/data_processing/autorag/text_extraction/component.py now sets
do_table_structure=True, but the unit test in
components/data_processing/autorag/text_extraction/tests/test_component_unit.py
still asserts False. Update the test assertion(s) that reference
do_table_structure (look for occurrences like assert ...['do_table_structure']
or assert component.do_table_structure is False) to expect True, and run the
tests to ensure no other assertions depend on the old value.
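
As a rough illustration of the assertion pattern discussed above, here is a minimal, self-contained sketch. It does not import the real component or docling: `build_pdf_pipeline_options` is a hypothetical stand-in for the builder in component.py, and the options class is replaced by a `MagicMock`, so only the keyword-argument check itself is shown.

```python
from unittest.mock import MagicMock


def build_pdf_pipeline_options(options_cls, artifacts_path):
    """Hypothetical stand-in for the builder in component.py (an assumption)."""
    return options_cls(
        artifacts_path=artifacts_path,
        do_ocr=False,
        do_table_structure=True,  # the value introduced by this change
    )


def test_table_structure_enabled():
    # The mock records how it was called, so no real options class is needed.
    options_cls = MagicMock(name="ThreadedPdfPipelineOptions")
    build_pdf_pipeline_options(options_cls, artifacts_path="/tmp/models")

    call_kwargs = options_cls.call_args.kwargs
    assert call_kwargs["do_table_structure"] is True
    assert call_kwargs["do_ocr"] is False


test_table_structure_enabled()
```

The real test presumably patches the options class where component.py imports it; the patch target and fixture names are not visible in this diff, so they are left out here.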

accelerator_options=AcceleratorOptions(device="cpu", num_threads=2),
)
paginated_pipeline_options = PaginatedPipelineOptions(
1 change: 1 addition & 0 deletions pyproject.toml
@@ -13,6 +13,7 @@ requires-python = ">=3.11"
dependencies = [
"kfp>=2.16.1",
"kfp-kubernetes>=2.16.1",
+"opencv-python-headless",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify available versions and recent vulnerabilities before choosing exact bounds.
curl -s https://pypi.org/pypi/opencv-python-headless/json | jq -r '.info.version, (.releases | keys[] )' | tail -n 20

# OSV vulnerability lookup for the package
curl -s https://api.osv.dev/v1/query -H 'Content-Type: application/json' \
  -d '{"package":{"name":"opencv-python-headless","ecosystem":"PyPI"}}' | jq

Repository: opendatahub-io/pipelines-components

Length of output: 50391


Pin opencv-python-headless in pyproject.toml (line 16).

  • Unversioned dependency makes installs non-deterministic and can drift into OpenCV builds with known memory-safety bugs (e.g., CVE-2019-14493 / CWE-476, CVE-2019-9423 / CWE-787, CVE-2019-14491 / CWE-125), which OSV reports fixed starting at 4.1.1.26.
  • Add explicit bounds or central constraints, e.g. opencv-python-headless>=4.1.1.26,<5 (or pin exact version in your lockfile + hashes).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` at line 16, The dependency "opencv-python-headless" in
pyproject.toml is unpinned; update its requirement to a constrained version
range to avoid drifting into vulnerable OpenCV builds (e.g., change
"opencv-python-headless" to something like "opencv-python-headless>=4.1.1.26,<5"
or pin an exact safe version/hash in your lockfile); modify the dependency entry
in pyproject.toml accordingly and ensure the lockfile/hashes are regenerated so
installations are deterministic.

Along with pyproject.toml, we also need to lock the new dependencies (and all their transitive dependencies) in pipelines/training/autorag/documents_rag_optimization_pipeline/requirements.txt.

I recommend running the installation in a fresh Linux environment and generating a new requirements file using pip freeze. This is required for the downstream hermetic build, which pre-fetches all dependencies for the offline image build step.

For example:
podman run --rm --user root -v "${WORK}:/work" registry.access.redhat.com/ubi9/python-312:1-1779945122 bash -c "pip install -r /work/requirements.txt && pip freeze > /tmp/req.txt && cp /tmp/req.txt /work/requirements.txt"
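
After regenerating the file, a quick sanity check can confirm that every entry is an exact `name==version` pin, which is what a hermetic pre-fetch step typically relies on. The following is a minimal sketch under assumptions: `non_exact_pins` is a hypothetical helper, the regex is a simplified approximation of the requirement syntax, and the sample text is invented for illustration rather than taken from the real requirements.txt.

```python
import re

# Simplified pattern for "name==version" (optionally with extras); wildcards
# such as "==2.1.*" are deliberately rejected because they are not exact pins.
_EXACT_PIN = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?==[A-Za-z0-9][A-Za-z0-9.!+_-]*$"
)


def non_exact_pins(requirements_text):
    """Return requirement entries that are not pinned with an exact '=='."""
    offenders = []
    for raw in requirements_text.splitlines():
        # Drop surrounding whitespace and a trailing line-continuation backslash.
        line = raw.strip().rstrip("\\").strip()
        # Skip blanks, comments, and option lines such as --hash=... or -r file.
        if not line or line.startswith(("#", "-")):
            continue
        # Remove environment markers and trailing comments before matching.
        requirement = line.split(";", 1)[0].split(" #", 1)[0].strip()
        if not _EXACT_PIN.match(requirement):
            offenders.append(requirement)
    return offenders


# Invented sample content, for illustration only.
SAMPLE = """\
kfp==2.16.1
# a comment
opencv-python-headless>=4.1.1.26

kfp-kubernetes==2.16.1 ; python_version >= '3.11'
"""

print(non_exact_pins(SAMPLE))  # ['opencv-python-headless>=4.1.1.26']
```

An empty result means every entry is an exact pin; it does not verify that the transitive dependency set is complete.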

]

[project.optional-dependencies]