fix: harden workflow archive DB retries #38170

Open
zhaohao1004 wants to merge 2 commits into main from zhaohao1004/archive-db-resilience
@zhaohao1004

Summary

This PR hardens the workflow run archive command against transient DB disconnects during archive planning and bundle processing.

What changed

  • Added a shared workflow archive DB retry helper that only retries SQLAlchemy/psycopg operational disconnects.
  • Switched archive plan and tenant-prefix resolution to explicit short-lived sessions with bounded retry.
  • Changed tenant plan resolution failure to raise ClickException, so Kubernetes CronJobs do not mark failed planning as success.
  • Kept post-archive scoped-session cleanup best-effort, so that a failure while cleaning up a stale session does not turn a completed archive into a non-zero exit.
  • Added focused tests for DB retry, non-DB error classification, plan/count retry, and ClickException behavior.
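
The bounded-retry behavior described in the list above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the helper added in db_retry.py, whose code is not shown on this page. The names `run_with_db_retry`, `FakeDisconnect`, `max_attempts`, and `base_delay` are hypothetical; in the real code the retryable type would presumably be a SQLAlchemy/psycopg operational error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeDisconnect(Exception):
    """Hypothetical stand-in for a DB operational disconnect."""


def run_with_db_retry(
    operation: Callable[[], T],
    *,
    is_retryable: Callable[[BaseException], bool],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> T:
    """Call `operation`, retrying only errors that `is_retryable` accepts.

    The number of attempts is bounded; the final error is re-raised unchanged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on any non-retryable error.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            time.sleep(base_delay * attempt)  # simple linear backoff
    raise AssertionError("unreachable")


# Usage: an operation that fails twice with a disconnect, then succeeds.
calls = {"n": 0}


def flaky_count() -> int:
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeDisconnect("server closed the connection unexpectedly")
    return 42


result = run_with_db_retry(
    flaky_count,
    is_retryable=lambda exc: isinstance(exc, FakeDisconnect),
)
```

Passing the classifier in as `is_retryable` keeps the retry loop independent of any particular driver, which is one way to make the "only retry DB disconnects" rule testable without a live database.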

Root cause

The archive command could still fail on stale DB connections in its planning paths. Some planning failures were also only printed before the Click command returned normally, so the process exited successfully despite the failure. In addition, the retry classification matched generic network-looking messages without first verifying that the exception was DB-related, so non-DB errors could be retried.
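
The classification issue can be illustrated with a small sketch: check the exception type first, and only then look at the message. This is an assumption-laden example, not the PR's code. `FakeOperationalError`, `DISCONNECT_MARKERS`, and `is_retryable_db_disconnect` are hypothetical names, and the marker strings are illustrative rather than taken from the change.

```python
from typing import Tuple, Type


class FakeOperationalError(Exception):
    """Hypothetical stand-in for a DB driver's operational error type."""


# Illustrative message fragments only; the real list is not shown in this PR page.
DISCONNECT_MARKERS = (
    "server closed the connection",
    "connection reset",
    "broken pipe",
)


def is_retryable_db_disconnect(
    exc: BaseException,
    db_error_types: Tuple[Type[BaseException], ...] = (FakeOperationalError,),
) -> bool:
    """Return True only for DB errors whose message looks like a disconnect."""
    # Step 1: the exception must be DB-related at all.
    if not isinstance(exc, db_error_types):
        return False
    # Step 2: only then inspect the message for disconnect wording.
    message = str(exc).lower()
    return any(marker in message for marker in DISCONNECT_MARKERS)


db_disconnect = FakeOperationalError("server closed the connection unexpectedly")
db_other = FakeOperationalError("syntax error at or near SELECT")
non_db = ConnectionResetError("Connection reset by peer")
```

With this ordering, `non_db` is rejected even though its message contains "connection reset", which is the case the message-only matching got wrong.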

Validation

  • uv run --project api --group dev ruff check api/commands/retention.py api/services/retention/workflow_run/archive_paid_plan_workflow_run.py api/services/retention/workflow_run/db_retry.py api/tests/integration_tests/services/retention/test_workflow_run_archiver.py api/tests/unit_tests/commands/test_archive_workflow_runs.py
  • uv run --project api --group dev pytest api/tests/integration_tests/services/retention/test_workflow_run_archiver.py api/tests/unit_tests/commands/test_archive_workflow_runs.py
  • uv run --project api python -m py_compile api/commands/retention.py api/services/retention/workflow_run/archive_paid_plan_workflow_run.py api/services/retention/workflow_run/db_retry.py api/tests/integration_tests/services/retention/test_workflow_run_archiver.py api/tests/unit_tests/commands/test_archive_workflow_runs.py

@zhaohao1004 force-pushed the zhaohao1004/archive-db-resilience branch from d42741a to 510408e on June 29, 2026 12:26
@github-actions

github-actions Bot commented Jun 29, 2026

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-06-29 12:31:34.117644145 +0000
+++ /tmp/pyrefly_pr.txt	2026-06-29 12:31:19.039572357 +0000
@@ -559,9 +559,9 @@
 ERROR `unpatch` may be uninitialized [unbound-name]
   --> tests/integration_tests/plugin/__mock/http.py:67:9
 ERROR Argument `FakeArchiveStorage` is not assignable to parameter `storage` with type `ArchiveStorage | None` in function `services.retention.workflow_run.archive_paid_plan_workflow_run.WorkflowRunArchiver._archive_bundle` [bad-argument-type]
-   --> tests/integration_tests/services/retention/test_workflow_run_archiver.py:412:56
+   --> tests/integration_tests/services/retention/test_workflow_run_archiver.py:530:56
 ERROR Argument `FakeArchiveStorage` is not assignable to parameter `storage` with type `ArchiveStorage | None` in function `services.retention.workflow_run.archive_paid_plan_workflow_run.WorkflowRunArchiver._archive_bundle` [bad-argument-type]
-   --> tests/integration_tests/services/retention/test_workflow_run_archiver.py:437:60
+   --> tests/integration_tests/services/retention/test_workflow_run_archiver.py:555:60
 ERROR Argument `_stub_resolver._Resolver` is not assignable to parameter `binding_resolver` with type `WorkflowAgentBindingResolver | None` in function `services.workflow.node_output_inspector_service.NodeOutputInspectorService.__init__` [bad-argument-type]
    --> tests/integration_tests/services/test_node_output_inspector_service.py:231:59
 ERROR Argument `_stub_resolver._Resolver` is not assignable to parameter `binding_resolver` with type `WorkflowAgentBindingResolver | None` in function `services.workflow.node_output_inspector_service.NodeOutputInspectorService.__init__` [bad-argument-type]
@@ -1925,6 +1925,8 @@
 ERROR `None` is not subscriptable [unsupported-operation]
   --> tests/unit_tests/clients/agent_backend/test_event_adapter.py:53:12
 ERROR Expected a callable, got `None` [not-callable]
+   --> tests/unit_tests/commands/test_archive_workflow_runs.py:124:9
+ERROR Expected a callable, got `None` [not-callable]
   --> tests/unit_tests/commands/test_clean_expired_messages.py:33:9
 ERROR Expected a callable, got `None` [not-callable]
   --> tests/unit_tests/commands/test_clean_expired_messages.py:63:9

@github-actions

Pyrefly Type Coverage

Metric            Base      PR        Delta
Type coverage     51.49%    51.49%    -0.01%
Strict coverage   51.01%    51.00%    -0.01%
Typed symbols     30,846    30,854    +8
Untyped symbols   29,332    29,347    +15
Modules           2931      2933      +2

@codecov

codecov Bot commented Jun 29, 2026

Codecov Report

❌ Patch coverage is 55.81395% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.35%. Comparing base (fa1ac75) to head (3860264).
⚠️ Report is 1 commit behind head on main.

Files with missing lines                                 Patch %   Lines
...ion/workflow_run/archive_paid_plan_workflow_run.py    35.13%    24 Missing ⚠️
api/commands/retention.py                                67.21%    18 Missing and 2 partials ⚠️
api/services/retention/workflow_run/db_retry.py          58.06%    10 Missing and 3 partials ⚠️

Additional details and impacted files
@@           Coverage Diff            @@
##             main   #38170    +/-   ##
========================================
  Coverage   85.34%   85.35%            
========================================
  Files        4967     4968     +1     
  Lines      258448   258554   +106     
  Branches    49042    49050     +8     
========================================
+ Hits       220580   220679    +99     
+ Misses      33575    33561    -14     
- Partials     4293     4314    +21     

Flag   Coverage Δ
api    85.49% <55.81%> (+<0.01%) ⬆️

@zhaohao1004 zhaohao1004 marked this pull request as ready for review June 29, 2026 13:09
@dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and refactor labels Jun 29, 2026
@zhaohao1004 changed the title from "[codex] Harden workflow archive DB retries" to "fix: harden workflow archive DB retries" Jun 29, 2026