Skip to content

feat(automl): Aggregate and align progress stages for AutoML and AutoRAG pipelines#137

Open
LukaszCmielowski wants to merge 10 commits into
opendatahub-io:mainfrom
LukaszCmielowski:autox_clean_up_stages
Open

feat(automl): Aggregate and align progress stages for AutoML and AutoRAG pipelines#137
LukaszCmielowski wants to merge 10 commits into
opendatahub-io:mainfrom
LukaszCmielowski:autox_clean_up_stages

Conversation

@LukaszCmielowski

@LukaszCmielowski LukaszCmielowski commented Jun 17, 2026

Copy link
Copy Markdown

Description of your changes:

Summary

  • Consolidates dashboard progress stages across tabular/time-series AutoML and AutoRAG pipelines by merging related substages into fewer, meaningful stages (e.g. data loaders: prepare_data + split_and_export; training: load_datamodel_selectionrefit_and_evaluate).
  • Updates run_status_templates JSON manifests and all affected components to record the new stage IDs; aligns model_selection sub-steps (feature_engineering, model_training, stacking, evaluation) across tabular and time-series training.
  • Aligns component_status tracking between AutoML and AutoRAG: unified UTC Z timestamps, improved stage tracker behavior, display_name on AutoRAG status artifacts, and pipeline README sections documenting component_stage_map / component_status artifacts.
  • Restores kfp>=2.16.1,<2.17 upper bound alongside protobuf<6.32 for s390x build compatibility.

Stage changes (high level)

Area Before After
AutoML data loaders (tabular + time series) 5 stages 2 stages (prepare_data, split_and_export)
AutoML training (tabular + time series) 4 stages 3 stages (load_data, model_selection, refit_and_evaluate)
AutoRAG pipeline 15 stages 6 stages (one primary stage per component)

Checklist:

Pre-Submission Checklist

Additional Checklist Items for New or Updated Components/Pipelines

  • metadata.yaml includes fresh lastVerified timestamp
  • All required files
    are present and complete
  • OWNERS file lists appropriate maintainers
  • README provides clear documentation with usage examples
  • Component follows snake_case naming convention
  • No security vulnerabilities in dependencies
  • Containerfile included if using a custom base image

Summary by CodeRabbit

Release Notes

  • New Features

    • Improved component status reporting, including clearer stage display names and consistent UTC timestamps.
    • Expanded progress/dashboard artifacts to better reflect stage progress.
  • Breaking Changes

    • Updated workflow stage identifiers across multiple data loading/training components and pipeline templates.
    • Several components now require component_status as an output artifact (no longer optional).
  • Documentation

    • Refreshed README and pipeline templates to match the new stage/step structures.
    • Added/updated guidance for correlating progress artifacts with dashboard views.
  • Bug Fixes

    • More robust handling of missing/corrupted status files and improved persisted error messaging.

Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@LukaszCmielowski, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 10 minutes and 48 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: da59b8eb-5d50-4b20-8148-4537a43d7e68

📥 Commits

Reviewing files that changed from the base of the PR and between d2f9dc7 and 974f684.

📒 Files selected for processing (12)
  • components/data_processing/automl/tabular_data_loader/component.py
  • components/data_processing/automl/timeseries_data_loader/component.py
  • components/data_processing/autorag/documents_discovery/component.py
  • components/data_processing/autorag/test_data_loader/component.py
  • components/data_processing/autorag/text_extraction/component.py
  • components/training/automl/autogluon_leaderboard_evaluation/component.py
  • components/training/automl/autogluon_models_training/component.py
  • components/training/automl/autogluon_timeseries_models_training/component.py
  • components/training/autorag/leaderboard_evaluation/component.py
  • components/training/autorag/rag_templates_optimization/component.py
  • components/training/autorag/search_space_preparation/component.py
  • components/training/autorag/shared/tests/test_component_status.py
📝 Walkthrough

Walkthrough

The PR consolidates stage identifiers across automl and autorag pipeline components. Data loader stages transition from validate_inputs/read_and_sample/cleanse/split/write_outputs to prepare_data and split_and_export. Model training stages merge refit_full and evaluate_models into refit_and_evaluate, with model_evaluation renamed to evaluation. AutoRAG components consolidate into single-stage workflows: discover_documents, load_benchmark, extract_documents, prepare_search_space, and optimize_templates. All components now set component_status.metadata["display_name"] and require component_status as a mandatory output. The automl shared ComponentStatusTracker adds ComponentStatusEncoder to handle bytes, Path, datetime, and set serialization; introduces utc_now_z() for ISO-8601 UTC timestamps with Z suffix; extends mark_active_failed to accept exceptions and formats errors as ExceptionType: message; updates stage() to catch BaseException; and hardens load_component_status() with JSON error handling. AutoRAG's ComponentStatusTracker removes optional artifact_path=None behavior, requiring path as mandatory string. Pipeline JSON templates and all tests are updated to reflect new stage hierarchies and enhanced tracking functionality.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks | ✅ 10
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed Title accurately describes the main change: aggregating/aligning progress stages across AutoML and AutoRAG pipelines with clear scope.
Description check ✅ Passed Description comprehensively addresses all required template sections: detailed summary, stage reduction table, technical updates, completed checklists, and sign-off confirmation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection ✅ Passed Zero spam signals detected across all categories; PR shows coordinated architectural refactoring of 32 files with deep codebase knowledge, comprehensive test updates, and responsive engagement with...
No Hardcoded Secrets ✅ Passed No hardcoded secrets detected. PR changes consolidate status tracking stages across AutoML/AutoRAG pipelines. All credentials properly read from environment variables; no API keys, tokens, password...
No Weak Cryptography ✅ Passed PR introduces no banned crypto primitives (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons. base64 used only for JSON serialization...
No Injection Vectors ✅ Passed No injection vectors detected. No SQL string concatenation, shell=True with user input, eval/exec, unsafe pickle/yaml, or XSS patterns found. Exception handling safely converts BaseException to str...
No Privileged Containers ✅ Passed No Kubernetes manifests, Helm templates, or Dockerfiles were modified. All ~23 files changed are documentation (.md), Python code (.py), or JSON pipeline templates (.json). Comprehensive search fou...
No Sensitive Data In Logs ✅ Passed No logging statements expose sensitive data. Metadata logged contains only operational metrics (row counts, model counts, step names); no credentials, tokens, API keys, PII, or raw request/response...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@LukaszCmielowski LukaszCmielowski marked this pull request as ready for review June 17, 2026 11:27
@LukaszCmielowski

Copy link
Copy Markdown
Author

@nickmazzi fyi.

@LukaszCmielowski

Copy link
Copy Markdown
Author

/cc @Mateusz-Switala

@openshift-ci openshift-ci Bot requested a review from Mateusz-Switala June 17, 2026 11:28
@LukaszCmielowski

Copy link
Copy Markdown
Author

/cc @dryszka

@openshift-ci openshift-ci Bot requested a review from dryszka June 17, 2026 11:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/data_processing/automl/timeseries_data_loader/component.py (1)

389-394: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

split_and_export is marked completed before export side effects finish.

Line 390 records "split_and_export" as "completed" before writing sampled_test_dataset.path, selection_path, and extra_path. If any write fails afterward, status can falsely report completion.

Proposed fix
-        status.record(
-            "split_and_export",
-            "completed",
-            test_size=test_size,
-            selection_train_size=selection_train_size,
-        )
-
         # Save test dataset to artifact
         test_df.to_csv(sampled_test_dataset.path, index=False)
@@
         selection_train_df.to_csv(selection_path, index=False)
         extra_train_df.to_csv(extra_path, index=False)
+
+        status.record(
+            "split_and_export",
+            "completed",
+            test_size=test_size,
+            selection_train_size=selection_train_size,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/data_processing/automl/timeseries_data_loader/component.py` around
lines 389 - 394, The status.record() call marking split_and_export as completed
is executed before the actual export side effects (writing
sampled_test_dataset.path, selection_path, and extra_path) have finished. Move
the status.record() call that records split_and_export as completed to after all
the dataset write operations have successfully completed to ensure the status
accurately reflects whether the export operations succeeded or failed.
🧹 Nitpick comments (2)
components/data_processing/automl/tabular_data_loader/tests/test_component_unit.py (1)

214-216: ⚡ Quick win

Stage-ID migration test is too permissive.

Current assertions only check inclusion. The test should fail if deprecated stage IDs are still emitted.

Proposed test hardening
         stage_ids = [stage["id"] for stage in payload["stages"]]
-        assert "prepare_data" in stage_ids
-        assert "split_and_export" in stage_ids
+        assert set(stage_ids) == {"prepare_data", "split_and_export"}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@components/data_processing/automl/tabular_data_loader/tests/test_component_unit.py`
around lines 214 - 216, The stage_ids assertions in the test are currently too
permissive as they only verify that expected stage IDs are included in the list
using the `in` operator. Replace the individual `assert` statements for
"prepare_data" and "split_and_export" with a single assertion that strictly
validates the stage_ids list contains exactly and only these expected stage IDs.
Use set comparison or list equality to ensure no deprecated stage IDs are
present in the payload, so the test will fail if any unwanted stage IDs are
emitted.
components/training/automl/shared/tests/test_component_status.py (1)

86-97: ⚡ Quick win

Add a BaseException regression test for AutoML stage() behavior.

The changed implementation catches BaseException, but this file only validates Exception subclasses. Add a KeyboardInterrupt case to prevent regressions in failure recording.

Suggested patch
 class TestComponentStatusTracker:
@@
     def test_stage_context_manager_records_failed(self, tmp_path: Path) -> None:
         """stage() records failed when an exception escapes the block."""
         tracker = ComponentStatusTracker(str(tmp_path), "automl_data_loader")
         with pytest.raises(ValueError, match="bad split"):
             with tracker.stage("split_and_export"):
                 raise ValueError("bad split")
         tracker.save()

         data = json.loads((tmp_path / COMPONENT_STATUS_FILENAME).read_text(encoding="utf-8"))
         assert data["stages"][-1]["status"] == "failed"
         assert "bad split" in data["stages"][-1]["error"]
+
+    def test_stage_marks_failed_on_base_exception_subclass(self, tmp_path: Path) -> None:
+        """stage() records failed when a BaseException subclass escapes the block."""
+        with pytest.raises(KeyboardInterrupt):
+            with ComponentStatusTracker(str(tmp_path), "automl_data_loader") as tracker:
+                with tracker.stage("split_and_export"):
+                    raise KeyboardInterrupt
+
+        data = load_component_status(str(tmp_path))
+        assert data["stages"][-1]["status"] == "failed"
+        assert data["stages"][-1]["error"] == "KeyboardInterrupt"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/training/automl/shared/tests/test_component_status.py` around
lines 86 - 97, The test_stage_context_manager_records_failed method in the test
file only validates that Exception subclasses are properly recorded as failed
stages, but the implementation catches BaseException which includes
non-Exception subclasses like KeyboardInterrupt. Add a new test method that
verifies the stage() context manager properly records a failed status when a
BaseException subclass (use KeyboardInterrupt as an example) escapes the block,
following the same pattern as the existing test: create a
ComponentStatusTracker, use pytest.raises to catch KeyboardInterrupt within the
stage context manager, save the tracker, and assert that the stage is recorded
as failed with the appropriate error message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/data_processing/autorag/documents_discovery/component.py`:
- Around line 195-197: The assignment of display_name to component_status
metadata is currently positioned after the with status: block exits, which means
if the status context fails, this metadata is never set and failed runs lose the
display label. Move the entire if block that checks if component_status is not
None and assigns component_status.metadata["display_name"] = "Documents
Discovery Status" to execute before the with status: context manager block,
ensuring the display name is set regardless of whether the status block succeeds
or fails.

In `@components/data_processing/autorag/test_data_loader/component.py`:
- Around line 166-168: The display_name metadata is being set after entering the
with status context manager, which means if an exception occurs within that
block, the metadata will not be set since that line will be skipped. Move the
line that sets component_status.metadata["display_name"] to "Test Data Loader
Status" to before the with status: statement so the metadata is always applied
regardless of whether an exception is raised during execution.

In `@components/data_processing/autorag/text_extraction/component.py`:
- Around line 467-469: The metadata assignment for display_name on
component_status is currently positioned after the with status context manager
block, which means it will not execute if an exception is raised inside the
block during failed runs. Move the if statement that checks component_status is
not None and assigns component_status.metadata["display_name"] = "Text
Extraction Status" to the beginning of the with status: block to ensure this
metadata is set regardless of whether an exception occurs during processing.

In `@components/training/automl/shared/tests/test_component_status.py`:
- Around line 98-103: The test_disabled_tracker_skips_save test only verifies
that no file is created in tmp_path, but since the ComponentStatusTracker
fallback path uses the current working directory (Path(".")), the test cannot
detect if save() accidentally writes there. Change the test to also verify that
the status file is not created in the current working directory by either
changing the working directory context to tmp_path before calling save() or by
explicitly asserting that the file does not exist at the current directory path
in addition to checking tmp_path.

In `@components/training/autorag/leaderboard_evaluation/component.py`:
- Around line 521-523: The metadata assignment for the display_name in the
component_status object is currently placed after the tracked block (the with
statement), which means exceptions occurring within the tracked block will
prevent this metadata from being set on failed runs. Move the line that sets
component_status.metadata["display_name"] to "Leaderboard Evaluation Status" to
occur before the with status block begins, ensuring this metadata is always set
regardless of whether an exception occurs during execution.

In `@components/training/autorag/search_space_preparation/component.py`:
- Around line 392-394: The display_name metadata assignment to component_status
is currently placed after the with status: block, causing it to only be set on
successful execution. Move the if component_status is not None: block containing
the metadata assignment for display_name to before the with status: block
begins, so that the metadata is initialized regardless of whether the stage
execution succeeds or fails.

In `@components/training/autorag/shared/tests/test_component_status.py`:
- Around line 36-49: The tests test_disabled_tracker_skips_save and
test_component_status_tracker_from_none only assert that
COMPONENT_STATUS_FILENAME does not exist in tmp_path, but they do not validate
that it was not written to the fallback location (Path(".")). If save()
regresses and writes to the current working directory instead of checking
artifact_path, these tests would still pass. Add an additional assertion in both
test methods to verify that COMPONENT_STATUS_FILENAME does not exist at the
fallback write location (Path(".") or the actual fallback path used by save()
when artifact_path is None) to ensure the disabled tracker truly does not write
to disk anywhere.

---

Outside diff comments:
In `@components/data_processing/automl/timeseries_data_loader/component.py`:
- Around line 389-394: The status.record() call marking split_and_export as
completed is executed before the actual export side effects (writing
sampled_test_dataset.path, selection_path, and extra_path) have finished. Move
the status.record() call that records split_and_export as completed to after all
the dataset write operations have successfully completed to ensure the status
accurately reflects whether the export operations succeeded or failed.

---

Nitpick comments:
In
`@components/data_processing/automl/tabular_data_loader/tests/test_component_unit.py`:
- Around line 214-216: The stage_ids assertions in the test are currently too
permissive as they only verify that expected stage IDs are included in the list
using the `in` operator. Replace the individual `assert` statements for
"prepare_data" and "split_and_export" with a single assertion that strictly
validates the stage_ids list contains exactly and only these expected stage IDs.
Use set comparison or list equality to ensure no deprecated stage IDs are
present in the payload, so the test will fail if any unwanted stage IDs are
emitted.

In `@components/training/automl/shared/tests/test_component_status.py`:
- Around line 86-97: The test_stage_context_manager_records_failed method in the
test file only validates that Exception subclasses are properly recorded as
failed stages, but the implementation catches BaseException which includes
non-Exception subclasses like KeyboardInterrupt. Add a new test method that
verifies the stage() context manager properly records a failed status when a
BaseException subclass (use KeyboardInterrupt as an example) escapes the block,
following the same pattern as the existing test: create a
ComponentStatusTracker, use pytest.raises to catch KeyboardInterrupt within the
stage context manager, save the tracker, and assert that the stage is recorded
as failed with the appropriate error message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d6e05db3-4683-491c-8975-bed82c9c33e6

📥 Commits

Reviewing files that changed from the base of the PR and between b3cd7ef and 72fbfb5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (28)
  • components/data_processing/automl/tabular_data_loader/README.md
  • components/data_processing/automl/tabular_data_loader/component.py
  • components/data_processing/automl/tabular_data_loader/tests/test_component_unit.py
  • components/data_processing/automl/timeseries_data_loader/README.md
  • components/data_processing/automl/timeseries_data_loader/component.py
  • components/data_processing/autorag/documents_discovery/component.py
  • components/data_processing/autorag/test_data_loader/component.py
  • components/data_processing/autorag/text_extraction/component.py
  • components/training/automl/autogluon_models_training/README.md
  • components/training/automl/autogluon_models_training/component.py
  • components/training/automl/autogluon_timeseries_models_training/README.md
  • components/training/automl/autogluon_timeseries_models_training/component.py
  • components/training/automl/shared/component_status.py
  • components/training/automl/shared/run_status_templates/pipelines/autogluon-tabular-training-pipeline.json
  • components/training/automl/shared/run_status_templates/pipelines/autogluon-timeseries-training-pipeline.json
  • components/training/automl/shared/tests/test_component_status.py
  • components/training/automl/shared/tests/test_run_status.py
  • components/training/autorag/leaderboard_evaluation/component.py
  • components/training/autorag/rag_templates_optimization/component.py
  • components/training/autorag/rag_templates_optimization/tests/test_component_unit.py
  • components/training/autorag/search_space_preparation/component.py
  • components/training/autorag/shared/component_status.py
  • components/training/autorag/shared/run_status_templates/pipelines/documents-rag-optimization-pipeline.json
  • components/training/autorag/shared/tests/test_component_status.py
  • pipelines/training/automl/autogluon_timeseries_training_pipeline/README.md
  • pipelines/training/automl/autogluon_timeseries_training_pipeline/pipeline.py
  • pipelines/training/autorag/documents_rag_optimization_pipeline/README.md
  • pyproject.toml

Comment thread components/data_processing/autorag/documents_discovery/component.py Outdated
Comment thread components/data_processing/autorag/test_data_loader/component.py Outdated
Comment thread components/data_processing/autorag/text_extraction/component.py Outdated
Comment thread components/training/automl/shared/tests/test_component_status.py Outdated
Comment thread components/training/autorag/leaderboard_evaluation/component.py Outdated
Comment thread components/training/autorag/search_space_preparation/component.py Outdated
Comment thread components/training/autorag/shared/tests/test_component_status.py Outdated
Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Comment thread pyproject.toml Outdated
Comment thread components/training/automl/shared/component_status.py Outdated

@dryszka dryszka left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the comments, mainly pyproject.toml one.

Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from dryszka. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Resolved conflict in timeseries_data_loader/component.py:
- Kept enhanced sample_rows logic with ISO timestamp conversion (from PR opendatahub-io#132)
- Kept display_name metadata at start of context (from this PR)
- Kept write_outputs status tracking (from PR opendatahub-io#132)

Merged changes:
- PR opendatahub-io#132: AutoML timeseries notebook backtesting charts
- PR opendatahub-io#138: ai4rag 0.6.4 and ogx-client 1.1.0 updates

Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Address Daniel feedback: component_status should never be None
in production KFP pipelines. Remove unnecessary None handling to simplify
code and enforce proper usage.

Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/training/autorag/shared/component_status.py (1)

222-232: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Code execution from file path via importlib — document trust boundary.

spec.loader.exec_module(module) executes arbitrary Python from embedded_artifact.path. CWE-94 applies if that path can be influenced by untrusted input. In KFP embedded artifacts, the path is framework-controlled, but this trust assumption should be explicit.

Add a docstring note:

 def load_embedded_component_status_module(embedded_artifact: Any) -> ModuleType:
-    """Load ``component_status`` from a KFP embedded artifact path via importlib."""
+    """Load ``component_status`` from a KFP embedded artifact path via importlib.
+
+    Warning:
+        This function executes code from the filesystem. The caller must ensure
+        ``embedded_artifact.path`` originates from a trusted KFP artifact source.
+    """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/training/autorag/shared/component_status.py` around lines 222 -
232, The function load_embedded_component_status_module executes arbitrary
Python code from the embedded_artifact path via spec.loader.exec_module(module),
which is a potential security concern if untrusted input influences the path.
Add a docstring note to the function explicitly documenting the trust boundary
assumption - specifically that the embedded_artifact.path is
framework-controlled by KFP and therefore trusted, making this safe from CWE-94
attacks.

Source: Coding guidelines

components/data_processing/automl/timeseries_data_loader/component.py (1)

144-151: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

CWE-295: SSL verification bypass enables MITM attacks.

Retrying with verify=False disables certificate validation, exposing AWS credentials and dataset contents to interception. An attacker controlling network infrastructure could present a fraudulent certificate, intercept AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, and exfiltrate training data.

Remove the SSLError catch block entirely. SSL errors indicate misconfigured certificates, expired CA bundles, or active attacks—fail fast rather than silently downgrade security.

🔒 Remediation
-        s3_client = get_s3_client()
-        try:
-            response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
-        except SSLError:
-            logger.warning(
-                "SSL error when downloading s3://%s/%s, retrying with verify=False",
-                bucket_name,
-                file_key,
-            )
-            no_verify_client = get_s3_client(verify=False)
-            response = no_verify_client.get_object(Bucket=bucket_name, Key=file_key)
+        s3_client = get_s3_client()
+        response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/data_processing/automl/timeseries_data_loader/component.py` around
lines 144 - 151, The SSLError exception handler at line 144-151 disables SSL
certificate verification by retrying with verify=False in the get_s3_client
call, which is a security vulnerability that could expose AWS credentials and
training data to MITM attacks. Remove the entire SSLError except block
completely, including the logger.warning call and the no_verify_client retry
logic. This will allow SSL certificate errors to fail immediately rather than
silently downgrading security by bypassing certificate validation.

Source: Coding guidelines

🧹 Nitpick comments (3)
components/training/autorag/shared/component_status.py (2)

142-160: 💤 Low value

Same redundant re-record pattern as AutoML tracker.

Line 156: condition != STATUS_COMPLETED will re-record FAILED on already-failed stages.

-        if self.stages and self.stages[-1].get("status") != STATUS_COMPLETED:
+        if self.stages and self.stages[-1].get("status") not in (STATUS_COMPLETED, STATUS_FAILED):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/training/autorag/shared/component_status.py` around lines 142 -
160, The condition in the `mark_active_failed` method that checks `if
self.stages and self.stages[-1].get("status") != STATUS_COMPLETED:` will cause a
stage to be re-recorded as FAILED if it is already in a FAILED state, since the
condition only excludes STATUS_COMPLETED. Modify this condition to also exclude
STATUS_FAILED, so that we only record failure on the last stage if it is neither
completed nor already failed.

37-54: ⚖️ Poor tradeoff

Near-duplicate implementation with automl/shared/component_status.py.

ComponentStatusEncoder, utc_now_z(), ComponentStatusTracker, and load_component_status() are nearly identical across AutoML and AutoRAG shared modules. Consider extracting a common base module under a shared location (e.g., components/training/common/component_status_base.py) to reduce divergence risk.

If pipeline isolation is intentional, add a comment documenting why duplication is preferred.

Also applies to: 65-187, 268-282

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/training/autorag/shared/component_status.py` around lines 37 - 54,
The ComponentStatusEncoder class, along with utc_now_z(),
ComponentStatusTracker, and load_component_status(), are duplicated across both
automl/shared/component_status.py and autorag/shared/component_status.py
modules, which creates divergence risk. Either consolidate these implementations
by extracting them into a common base module (such as
components/training/common/component_status_base.py) and importing from there in
both locations, or if the duplication is intentional due to pipeline isolation
requirements, add a clear comment in each module explaining why the duplication
is preferred and cannot be shared.
components/training/automl/shared/component_status.py (1)

163-180: 💤 Low value

Condition at line 176 re-records FAILED on already-failed stages.

When no active stage is found and self.stages[-1].get("status") is STATUS_FAILED, the condition != STATUS_COMPLETED evaluates true, causing a redundant FAILED record. Consider:

-        if self.stages and self.stages[-1].get("status") != STATUS_COMPLETED:
+        if self.stages and self.stages[-1].get("status") not in (STATUS_COMPLETED, STATUS_FAILED):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/training/automl/shared/component_status.py` around lines 163 -
180, The condition in the mark_active_failed method that checks
self.stages[-1].get("status") != STATUS_COMPLETED on line 176 will redundantly
record a FAILED status on a stage that is already in a FAILED state. Modify the
condition to exclude both STATUS_COMPLETED and STATUS_FAILED statuses, so that
the method only calls record with STATUS_FAILED for stages that are neither
completed nor already failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/data_processing/automl/timeseries_data_loader/component.py`:
- Around line 425-426: Remove the two consecutive status.record calls that
reference "write_outputs" stage at lines 425-426. These orphaned write_outputs
status records (both the "started" and "completed" entries) contradict the PR's
consolidation objective of reducing stages from 5 to 2 (prepare_data and
split_and_export only). Since split_and_export is already completed at line 391,
these write_outputs records should be deleted entirely to maintain consistency
with the 2-stage design.

In `@components/data_processing/autorag/test_data_loader/component.py`:
- Around line 79-82: The display_name is being set on component_status.metadata
on line 81, but the shared tracker serializes status.metadata when saving, which
means display_name never gets persisted to the JSON artifact. Change the
assignment from component_status.metadata["display_name"] to
status.metadata["display_name"] so that the display_name is properly included
when the status object is serialized and saved.

In `@components/training/autorag/leaderboard_evaluation/component.py`:
- Around line 339-342: The display_name metadata is being written to
component_status.metadata instead of status.metadata. Since
ComponentStatusTracker persists status.metadata to the component_status.json
file, writing only to component_status.metadata will cause the display_name to
be missing from serialized stage artifacts. Change line 341 from writing to
component_status.metadata["display_name"] to status.metadata["display_name"] to
ensure the display_name is properly persisted when the status context manager
saves the metadata.

In `@components/training/autorag/search_space_preparation/component.py`:
- Around line 337-340: The display_name metadata is being set on
component_status.metadata instead of status.metadata on line 339. Since the
shared tracker object (status) is what gets serialized to JSON, you need to set
the metadata on the status object instead. Change the line that sets
component_status.metadata["display_name"] to set status.metadata["display_name"]
instead, so that the display_name is included when the status artifact is
serialized.

---

Outside diff comments:
In `@components/data_processing/automl/timeseries_data_loader/component.py`:
- Around line 144-151: The SSLError exception handler at line 144-151 disables
SSL certificate verification by retrying with verify=False in the get_s3_client
call, which is a security vulnerability that could expose AWS credentials and
training data to MITM attacks. Remove the entire SSLError except block
completely, including the logger.warning call and the no_verify_client retry
logic. This will allow SSL certificate errors to fail immediately rather than
silently downgrading security by bypassing certificate validation.

In `@components/training/autorag/shared/component_status.py`:
- Around line 222-232: The function load_embedded_component_status_module
executes arbitrary Python code from the embedded_artifact path via
spec.loader.exec_module(module), which is a potential security concern if
untrusted input influences the path. Add a docstring note to the function
explicitly documenting the trust boundary assumption - specifically that the
embedded_artifact.path is framework-controlled by KFP and therefore trusted,
making this safe from CWE-94 attacks.

---

Nitpick comments:
In `@components/training/automl/shared/component_status.py`:
- Around line 163-180: The condition in the mark_active_failed method that
checks self.stages[-1].get("status") != STATUS_COMPLETED on line 176 will
redundantly record a FAILED status on a stage that is already in a FAILED state.
Modify the condition to exclude both STATUS_COMPLETED and STATUS_FAILED
statuses, so that the method only calls record with STATUS_FAILED for stages
that are neither completed nor already failed.

In `@components/training/autorag/shared/component_status.py`:
- Around line 142-160: The condition in the `mark_active_failed` method that
checks `if self.stages and self.stages[-1].get("status") != STATUS_COMPLETED:`
will cause a stage to be re-recorded as FAILED if it is already in a FAILED
state, since the condition only excludes STATUS_COMPLETED. Modify this condition
to also exclude STATUS_FAILED, so that we only record failure on the last stage
if it is neither completed nor already failed.
- Around line 37-54: The ComponentStatusEncoder class, along with utc_now_z(),
ComponentStatusTracker, and load_component_status(), are duplicated across both
automl/shared/component_status.py and autorag/shared/component_status.py
modules, which creates divergence risk. Either consolidate these implementations
by extracting them into a common base module (such as
components/training/common/component_status_base.py) and importing from there in
both locations, or if the duplication is intentional due to pipeline isolation
requirements, add a clear comment in each module explaining why the duplication
is preferred and cannot be shared.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c1d1772e-67e7-45f2-b13f-68d75429b606

📥 Commits

Reviewing files that changed from the base of the PR and between 01c4d67 and d2f9dc7.

📒 Files selected for processing (17)
  • components/data_processing/automl/timeseries_data_loader/component.py
  • components/data_processing/autorag/documents_discovery/README.md
  • components/data_processing/autorag/documents_discovery/component.py
  • components/data_processing/autorag/test_data_loader/README.md
  • components/data_processing/autorag/test_data_loader/component.py
  • components/data_processing/autorag/text_extraction/component.py
  • components/training/automl/autogluon_timeseries_models_training/README.md
  • components/training/automl/autogluon_timeseries_models_training/component.py
  • components/training/automl/shared/component_status.py
  • components/training/automl/shared/tests/test_component_status.py
  • components/training/autorag/leaderboard_evaluation/component.py
  • components/training/autorag/rag_templates_optimization/README.md
  • components/training/autorag/rag_templates_optimization/component.py
  • components/training/autorag/search_space_preparation/component.py
  • components/training/autorag/shared/component_status.py
  • components/training/autorag/shared/tests/test_component_status.py
  • pipelines/training/autorag/documents_rag_optimization_pipeline/README.md
💤 Files with no reviewable changes (2)
  • components/training/autorag/shared/tests/test_component_status.py
  • components/training/automl/shared/tests/test_component_status.py
✅ Files skipped from review due to trivial changes (5)
  • components/training/autorag/rag_templates_optimization/README.md
  • components/data_processing/autorag/test_data_loader/README.md
  • components/data_processing/autorag/documents_discovery/README.md
  • pipelines/training/autorag/documents_rag_optimization_pipeline/README.md
  • components/training/automl/autogluon_timeseries_models_training/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • components/data_processing/autorag/text_extraction/component.py
  • components/data_processing/autorag/documents_discovery/component.py
  • components/training/autorag/rag_templates_optimization/component.py
  • components/training/automl/autogluon_timeseries_models_training/component.py

Comment thread components/data_processing/automl/timeseries_data_loader/component.py Outdated
Comment thread components/data_processing/autorag/test_data_loader/component.py
Comment thread components/training/autorag/leaderboard_evaluation/component.py
Comment thread components/training/autorag/search_space_preparation/component.py
Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Signed-off-by: Lukasz Cmielowski <lcmielow@redhat.com>
Assisted-by: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants