Add OS-level resource metric collection to flow run subprocesses #21071
Add OpenTelemetry-based CPU and memory metric collection inside flow run subprocesses, exporting via OTLP HTTP to the Cloud telemetry endpoint.

- Add TelemetrySettings model with enable_resource_metrics and resource_metrics_interval_seconds settings
- Add opentelemetry-instrumentation-system-metrics to otel extra and dev deps
- Add RunMetrics context manager that starts SystemMetricsInstrumentor filtered to process.cpu.utilization, process.memory.usage, process.memory.virtual with proper resource attributes
- Wrap run_flow() in engine.py __main__ block with RunMetrics

Closes: OSS-7694

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merging this PR will not alter performance
…and type annotation

- Add PREFECT_TELEMETRY_ENABLE_RESOURCE_METRICS and PREFECT_TELEMETRY_RESOURCE_METRICS_INTERVAL_SECONDS to SUPPORTED_SETTINGS
- Fix test_noop_when_import_fails to use builtins module instead of __builtins__ dict
- Fix test_instruments_and_shuts_down to patch OTel classes at their source modules instead of at the import site
- Add type annotation to logger and use get_logger()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
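To make the two environment variables concrete, here is a minimal stdlib-only sketch of how they might map onto settings. It is not the PR's implementation: `TelemetrySettingsSketch`, `load_settings`, and the accepted truthy strings are hypothetical. The defaults (`True`, `10`) are the ones stated in the PR summary, and the lower bound of 1 mirrors the `ge=1` validation that a later commit in this PR adds.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: which strings count as "true" is a choice made for this sketch.
TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TelemetrySettingsSketch:
    """Hypothetical stand-in for the PR's TelemetrySettings model."""

    enable_resource_metrics: bool = True
    resource_metrics_interval_seconds: int = 10

    def __post_init__(self) -> None:
        # Mirrors the ge=1 constraint mentioned later in this PR.
        if self.resource_metrics_interval_seconds < 1:
            raise ValueError("resource_metrics_interval_seconds must be >= 1")


def load_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> TelemetrySettingsSketch:
    """Read the two PREFECT_TELEMETRY_* variables, falling back to defaults."""
    env = os.environ if environ is None else environ
    enabled_raw = env.get("PREFECT_TELEMETRY_ENABLE_RESOURCE_METRICS")
    interval_raw = env.get("PREFECT_TELEMETRY_RESOURCE_METRICS_INTERVAL_SECONDS")

    enabled = True if enabled_raw is None else enabled_raw.strip().lower() in TRUE_VALUES
    interval = 10 if interval_raw is None else int(interval_raw)
    return TelemetrySettingsSketch(enabled, interval)
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise in tests without mutating the process environment.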
…collection-to-flow-run
chrisguidry left a comment
🥵 This is going to be real!
The OTLPMetricExporter needs the Prefect API key to authenticate with Cloud's telemetry ingestion endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents a 30s stall on shutdown when the endpoint is unreachable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 219e28a67c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…tion

- Only send API key to Cloud-derived endpoints, not user-overridden ones
- Pass settings into _resolve_metrics_endpoint to avoid double call
- Set export_timeout_millis=5000 on the reader to prevent 30s shutdown stall
- Add ge=1 validation on resource_metrics_interval_seconds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
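The reader timeout mentioned in this commit can be sketched with the OpenTelemetry SDK's `PeriodicExportingMetricReader`, which accepts both `export_interval_millis` and `export_timeout_millis`. The sketch below is an illustration only: it swaps in `ConsoleMetricExporter` for the PR's `OTLPMetricExporter` so it needs no network, and `build_meter_provider` is a hypothetical helper name. It also returns `None` when the SDK is not installed, in the spirit of the PR's no-op-on-import-failure behavior.

```python
from typing import Any, Optional


def build_meter_provider(interval_seconds: int = 10) -> Optional[Any]:
    """Return a MeterProvider with a bounded export timeout, or None without OTel."""
    try:
        from opentelemetry.sdk.metrics import MeterProvider
        from opentelemetry.sdk.metrics.export import (
            ConsoleMetricExporter,
            PeriodicExportingMetricReader,
        )
    except ImportError:
        # The otel extra is optional; degrade to "no metrics".
        return None

    reader = PeriodicExportingMetricReader(
        # Assumption: console exporter used here in place of OTLPMetricExporter.
        ConsoleMetricExporter(),
        export_interval_millis=interval_seconds * 1000,
        # Value taken from the commit message above.
        export_timeout_millis=5000,
    )
    return MeterProvider(metric_readers=[reader])


provider = build_meter_provider()
if provider is not None:
    # shutdown() flushes through the reader, which is where the timeout matters.
    provider.shutdown()
```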
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3453cf9dd0
    with RunMetrics(flow_run, flow):
        if flow.isasync:
            run_coro_as_sync(
                run_flow(flow, flow_run=flow_run, error_logger=run_logger)
            )
        else:
            run_flow(flow, flow_run=flow_run, error_logger=run_logger)
…rt, cloud auth on override

- Wrap telemetry setup in try/except so initialization errors degrade to no-op instead of aborting the flow run
- Honor standard OTEL_EXPORTER_OTLP_ENDPOINT env var as fallback when the metrics-specific variable is not set
- Preserve Cloud auth headers when endpoint is overridden via env var (is_cloud now derived from connected_to_cloud, not endpoint source)
- Only pass headers kwarg to OTLPMetricExporter for Cloud endpoints so non-cloud exporters can use OTEL_EXPORTER_OTLP_HEADERS env vars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
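The "degrade to no-op" behavior in the first bullet can be illustrated with a small context manager. This is a sketch of the pattern, not the PR's `RunMetrics` class: `run_metrics_sketch` and its `start` callback (which returns a `stop` callable) are hypothetical names, and logging at debug level is an assumption.

```python
import logging
from contextlib import contextmanager
from typing import Callable, Iterator, Optional

logger = logging.getLogger("run_metrics_sketch")


@contextmanager
def run_metrics_sketch(start: Callable[[], Callable[[], None]]) -> Iterator[None]:
    """Run the wrapped block even if telemetry setup or teardown fails."""
    stop: Optional[Callable[[], None]] = None
    try:
        # Setup errors must not abort the flow run.
        stop = start()
    except Exception:
        logger.debug("metrics setup failed; continuing without metrics", exc_info=True)

    try:
        yield  # The flow run executes here; its own exceptions still propagate.
    finally:
        if stop is not None:
            try:
                stop()
            except Exception:
                logger.debug("metrics shutdown failed", exc_info=True)
```

Keeping setup outside the `try` that surrounds `yield` means a telemetry failure is never confused with a failure of the flow itself.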
User-overridden endpoints (OTEL_EXPORTER_OTLP_METRICS_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT) now always return is_cloud=False, so the Prefect API key is only attached to the auto-derived Cloud endpoint. This prevents leaking credentials to third-party collectors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
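The rule described here (overrides are never treated as Cloud, so they never receive the API key) can be sketched as follows. The function names, the `ResolvedEndpoint` tuple, and the `Authorization: Bearer` header format are assumptions made for illustration; the PR only states that the key is attached to the auto-derived Cloud endpoint. The override value is used verbatim here, and whether the real code appends a signal-specific path to it is not stated in the PR.

```python
from typing import Dict, Mapping, NamedTuple, Optional


class ResolvedEndpoint(NamedTuple):
    url: Optional[str]  # None means metric export is disabled
    is_cloud: bool


def resolve_metrics_endpoint(
    environ: Mapping[str, str], cloud_endpoint: Optional[str]
) -> ResolvedEndpoint:
    """Precedence: metrics-specific env var, generic env var, Cloud-derived, disabled."""
    for name in ("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "OTEL_EXPORTER_OTLP_ENDPOINT"):
        value = environ.get(name)
        if value:
            # User-supplied endpoints are never considered Cloud.
            return ResolvedEndpoint(value, False)
    if cloud_endpoint:
        return ResolvedEndpoint(cloud_endpoint, True)
    return ResolvedEndpoint(None, False)


def exporter_kwargs(resolved: ResolvedEndpoint, api_key: Optional[str]) -> Dict[str, object]:
    """Build exporter keyword arguments, attaching credentials only for Cloud."""
    kwargs: Dict[str, object] = {"endpoint": resolved.url}
    if resolved.is_cloud and api_key:
        # Assumption: bearer-token header; the PR does not show the header format.
        kwargs["headers"] = {"Authorization": f"Bearer {api_key}"}
    return kwargs
```

Omitting the `headers` key entirely for non-Cloud endpoints leaves room for the exporter to pick up `OTEL_EXPORTER_OTLP_HEADERS` on its own, as the earlier commit message describes.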
Prevents a double slash in the derived endpoint when PREFECT_API_URL ends with /, which can cause export failures behind strict proxies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
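A minimal sketch of the trailing-slash fix, assuming the endpoint is built by string concatenation. The helper name `derive_endpoint` and the `telemetry/v1/metrics` suffix are hypothetical; the PR does not show the actual path.

```python
def derive_endpoint(api_url: str, suffix: str = "telemetry/v1/metrics") -> str:
    """Join a base URL and a path suffix with exactly one slash between them."""
    # Assumption: the suffix is illustrative only, not the real Cloud path.
    return f"{api_url.rstrip('/')}/{suffix.lstrip('/')}"


# Both spellings of the base URL yield the same endpoint.
print(derive_endpoint("https://api.example.test/api/"))
print(derive_endpoint("https://api.example.test/api"))
```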
Summary

- Add TelemetrySettings model with enable_resource_metrics (default: True) and resource_metrics_interval_seconds (default: 10) settings under the PREFECT_TELEMETRY_ env prefix
- Add opentelemetry-instrumentation-system-metrics to the otel optional extra and dev dependency group
- Add RunMetrics context manager that creates an OTel MeterProvider with OTLPMetricExporter, starts SystemMetricsInstrumentor filtered to process.cpu.utilization, process.memory.usage, process.memory.virtual with flow run resource attributes
- Wrap run_flow() in engine.py __main__ block with RunMetrics so metrics are collected for the lifetime of flow run subprocesses
- Endpoint resolution order: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT env var > auto-derived from Cloud API URL > disabled

Closes: OSS-7694
🤖 Generated with Claude Code