@scylladbbot
Description

Previously, when provision_resources failed with TEST_ERROR status, the failure reason was not sent to Argus, which made root-cause analysis of capacity or configuration issues difficult.

Changes

  • Error event submission: Modified the provision_resources() exception handler to create and submit an error event to Argus before setting TEST_ERROR status
  • Direct RawEventPayload creation: Creates RawEventPayload directly without using TestFrameworkEvent to avoid event machinery dependency
  • Event payload: Includes backend type, exception type and details, timestamp, and CRITICAL severity
  • Provision log upload: Uploads the full provision stage log file (hydra.log) to S3 and submits the URL to Argus via submit_sct_logs()
  • Graceful degradation: Argus submission and log upload failures are logged but don't block the error flow
# On provision failure, creates event with informative message
error_message = f"Failed to provision {backend} resources: {type(exc).__name__}: {exc}"
event_payload: RawEventPayload = {
    "run_id": str(test_config.test_id()),
    "severity": "CRITICAL",
    "ts": time.time(),
    "message": error_message,
    "event_type": "TestFrameworkEvent",
    # ... additional required fields
}

# Submits directly to Argus
test_config.argus_client().submit_event(event_payload)

# Upload provision log to S3 and submit to Argus
log_url = s3.upload_file(log_file_path, s3_path, public=False)
test_config.argus_client().submit_sct_logs([LogLink(log_name=log_name, log_link=log_url)])
  • Unit test: Added test_provision_error_event.py to verify event submission with correct payload structure
  • Helper function update: Modified add_file_logger() to return the log file path for upload purposes
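
The graceful-degradation behaviour described above can be sketched as follows. This is a minimal illustration, not the code from this PR: report_provision_failure and its three callable parameters (submit_event, upload_log, submit_log_link) are hypothetical stand-ins for the Argus client and S3 helper, and the payload carries only the fields shown in the description.

```python
from __future__ import annotations

import logging
import time
from typing import Any, Callable

LOGGER = logging.getLogger(__name__)


def report_provision_failure(
    exc: Exception,
    backend: str,
    run_id: str,
    submit_event: Callable[[dict[str, Any]], None],  # hypothetical: wraps the Argus event call
    upload_log: Callable[[], str],                   # hypothetical: uploads the log, returns its URL
    submit_log_link: Callable[[str], None],          # hypothetical: sends the log URL to Argus
) -> dict[str, Any]:
    """Report a provision failure on a best-effort basis.

    Reporting errors are logged and swallowed, so the caller can always
    go on to set TEST_ERROR status.
    """
    payload = {
        "run_id": run_id,
        "severity": "CRITICAL",
        "ts": time.time(),
        "message": f"Failed to provision {backend} resources: {type(exc).__name__}: {exc}",
        "event_type": "TestFrameworkEvent",
    }
    # Step 1: submit the error event; a failure here must not stop the log upload.
    try:
        submit_event(payload)
    except Exception:
        LOGGER.exception("Could not submit provision error event")
    # Step 2: upload the provision log and submit its link, independently of step 1.
    try:
        submit_log_link(upload_log())
    except Exception:
        LOGGER.exception("Could not upload or submit provision log")
    return payload
```

Keeping the two steps in separate try blocks means an unreachable Argus endpoint for events does not prevent the log link from being attempted, and vice versa.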

Testing

  • Example email: scylla-staging/lukasz/2025/gemini-3h-with-nemesis-test#11: 16/01/2026

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

  • (cherry picked from commit 76c9eff)

  • (cherry picked from commit 6ea4ca3)

Parent PR: #13246

Copilot AI added 2 commits January 20, 2026 14:37
When provision fails we miss details in Argus. This commit adds error
event submission to Argus when provision fails. This way we'll have it
in email reports and Argus event tab (new one, experimental).

fixes: scylladb#13245
(cherry picked from commit 76c9eff)

When the provision step fails, we skip log collection.

Adjusted pipelines to collect these even if provision fails.

fixes: scylladb#12072
(cherry picked from commit 6ea4ca3)
@scylladbbot force-pushed the backport/13246/to-2025.3 branch from 504a7ff to ec7c24d on January 20, 2026 14:37