improvement(gce-utils): surface GCP audit log errors in operation failures #13672

Draft
Copilot wants to merge 2 commits into master from copilot/extend-wait-for-extended-operation

Copilot AI commented Feb 19, 2026

Description

GCP operations fail with generic timeout/error messages while the critical diagnostics sit in audit logs. For example, a BACKEND_ERROR with status code 13 in cloudaudit.googleapis.com/activity never surfaces to developers, who then have to inspect the GCP console manually.

Changes:

  • Added GceLoggingClient.get_operation_audit_logs() to query activity logs by operation ID
  • Enhanced wait_for_extended_operation() to automatically fetch and include audit log errors when operations fail/timeout
  • Updated create_instance() to pass instance context for audit log queries
  • Gracefully degrades if audit log fetch fails (doesn't break main operation)
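
The graceful-degradation behaviour in the last bullet could look roughly like the sketch below. It is not the PR's actual code: `wait_with_audit_context` and both callables are hypothetical stand-ins for the real wait and audit log query, and the sketch assumes the original exception type can be rebuilt from a single message string.

```python
from typing import Callable, Iterable


def wait_with_audit_context(
    wait: Callable[[], object],
    fetch_audit_errors: Callable[[], Iterable[str]],
) -> object:
    """Run `wait`; on failure, re-raise with audit log errors appended.

    Both callables are hypothetical stand-ins, not the PR's real signatures.
    """
    try:
        return wait()
    except Exception as exc:
        try:
            audit_errors = list(fetch_audit_errors())
        except Exception:
            # Graceful degradation: a failed log query must never mask
            # the original operation failure.
            audit_errors = []
        if not audit_errors:
            # Nothing to add, so re-raise the original error unchanged.
            raise
        details = "\n".join(f"  - {line}" for line in audit_errors)
        # Assumes the exception class accepts a single message argument.
        raise type(exc)(f"{exc}\n\nAudit Log Errors:\n{details}") from exc
```

Keeping the log query inside its own `try` block means the caller always sees the operation's own error type, with or without the extra audit section.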

Before:

TimeoutError: instance creation timed out after 300 seconds
Operation ID: operation-1771459766465-64b221e1d3334

After:

TimeoutError: instance creation timed out after 300 seconds
Operation ID: operation-1771459766465-64b221e1d3334

Audit Log Errors:
  - v1.compute.instances.insert: Code 13, Message: BACKEND_ERROR

Surfaces GCP-specific errors (BACKEND_ERROR, QUOTA_EXCEEDED, RESOURCE_EXHAUSTED) immediately instead of requiring audit log inspection.
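
As an illustration of how the "Audit Log Errors" section shown above might be rendered, here is a minimal sketch. It assumes each audit log entry is available as a plain dict shaped like a Cloud Audit Logs `protoPayload` (with `methodName` and a `status` carrying `code` and `message`); the helper name `format_audit_errors` is hypothetical and not taken from the PR.

```python
def format_audit_errors(entries):
    """Render failed audit log entries as an 'Audit Log Errors' section.

    `entries` is assumed to be an iterable of dicts shaped like audit log
    entries; entries without a non-zero status code are skipped.
    """
    lines = []
    for entry in entries:
        payload = entry.get("protoPayload") or {}
        status = payload.get("status") or {}
        code = status.get("code", 0)
        if not code:
            # A missing or zero code means the call did not report an error.
            continue
        method = payload.get("methodName", "unknown")
        message = status.get("message", "")
        lines.append(f"  - {method}: Code {code}, Message: {message}")
    if not lines:
        return ""
    return "Audit Log Errors:\n" + "\n".join(lines)


sample = [
    {"protoPayload": {"methodName": "v1.compute.instances.insert",
                      "status": {"code": 13, "message": "BACKEND_ERROR"}}},
    {"protoPayload": {"methodName": "v1.compute.instances.get"}},
]
print(format_audit_errors(sample))
```

Returning an empty string when no entry carries an error lets the caller append the section unconditionally without changing the original message.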

Testing

  • Added 5 unit tests covering audit log integration scenarios
  • CodeQL scan: 0 vulnerabilities

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

Copilot AI changed the title from "[WIP] Enhance wait_for_extended_operation for better error reporting" to "improvement(gce-utils): extract operation metadata on timeout for better diagnostics" Feb 19, 2026
Copilot AI requested a review from fruch February 19, 2026 10:12
@fruch added the backport/none (Backport is not required) and P3 Medium Priority labels Feb 19, 2026
Copilot AI changed the title from "improvement(gce-utils): extract operation metadata on timeout for better diagnostics" to "improvement(gce-utils): surface GCP audit log errors in operation failures" Feb 19, 2026
@fruch added the test-provision-gce (Run provision test on GCE) label Feb 19, 2026
Copilot AI and others added 2 commits February 19, 2026 13:10
…iled timeout error information

Enhanced the wait_for_extended_operation function to provide more
detailed error information when operations timeout. When a timeout
occurs, the function now extracts and includes operation details such
as operation ID, status, error codes, error messages, target links,
and operation types in the error message and logs.

This helps diagnose GCP instance creation failures by providing more
context about why the operation timed out instead of just saying it
timed out without additional information.

Co-authored-by: fruch <340979+fruch@users.noreply.github.com>
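
A rough sketch of the kind of detail extraction this commit message describes follows. The attribute names (`name`, `status`, `operation_type`, `target_link`, `error_code`, `error_message`) are assumptions based on the fields listed above, and `SimpleNamespace` stands in for a real operation object; the PR's actual implementation may differ.

```python
from types import SimpleNamespace

# Attribute names are assumptions; a real operation object may expose
# a different set of fields.
OPERATION_FIELDS = (
    "name", "status", "operation_type",
    "target_link", "error_code", "error_message",
)


def describe_operation(operation) -> str:
    """Collect whichever diagnostic fields are set into one summary line."""
    parts = []
    for field in OPERATION_FIELDS:
        value = getattr(operation, field, None)
        if value not in (None, ""):
            parts.append(f"{field}={value}")
    return ", ".join(parts)


# Stand-in object for illustration only.
op = SimpleNamespace(
    name="operation-1771459766465-64b221e1d3334",
    status="RUNNING",
    operation_type="insert",
    error_message="",
)
print(describe_operation(op))
```

Using `getattr` with a default keeps the summary from failing when a field is missing, which matters because the summary is built while already handling a timeout.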
Enhanced wait_for_extended_operation to query GCP audit logs when
operations fail or timeout. When instance creation fails, the function
now fetches relevant audit log entries to provide additional context
about the failure (e.g., BACKEND_ERROR, QUOTA_EXCEEDED).

Added get_operation_audit_logs method to GceLoggingClient to query
activity logs by operation ID. The method queries cloudaudit activity
logs which contain error details like status codes and messages that
aren't available in the operation object itself.

This helps diagnose instance creation failures by surfacing errors from
GCP's audit logs, such as backend errors, quota issues, or resource
exhaustion that may not be clearly indicated in the operation response.

Co-authored-by: fruch <340979+fruch@users.noreply.github.com>
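
To make the "query activity logs by operation ID" step concrete, here is a hedged sketch of how such a Cloud Logging filter might be built. The function name `build_operation_audit_filter` is hypothetical, and matching on the `operation.id` field is an assumption about how the relevant entries are tagged; the filter actually used by `get_operation_audit_logs` is not shown on this page.

```python
def build_operation_audit_filter(project: str, operation_id: str) -> str:
    """Build a Cloud Logging filter for one operation's activity audit logs.

    Assumes the audit entries carry the operation ID in `operation.id`.
    """
    # The "/" in the log ID must be URL-encoded as %2F inside logName.
    log_name = f"projects/{project}/logs/cloudaudit.googleapis.com%2Factivity"
    clauses = [
        f'logName="{log_name}"',
        f'operation.id="{operation_id}"',
    ]
    return " AND ".join(clauses)


print(build_operation_audit_filter(
    "my-project", "operation-1771459766465-64b221e1d3334"
))
```

The resulting string could then be handed to a Cloud Logging client as its entry filter; `my-project` is a placeholder project ID.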
@fruch force-pushed the copilot/extend-wait-for-extended-operation branch from bf3dfaa to 633ce75 February 19, 2026 11:11