GitHub Runners Observability Guidelines

Note

These guidelines are an unreleased DRAFT.

Overview

PyTorch's CI infrastructure serves as the backbone for continuous integration across multiple cloud providers, supporting thousands of developers and contributors worldwide. The reliability of this infrastructure directly impacts PyTorch's development velocity, code quality, and release cadence. Without comprehensive monitoring, issues such as runner failures, performance degradation, or capacity bottlenecks can go undetected, leading to delayed builds, frustrated contributors, and potential release delays. Monitoring enables proactive identification of problems, ensures optimal resource utilization, and provides the data necessary for capacity planning and infrastructure optimization. This is especially critical given PyTorch's position as a leading machine learning framework where build reliability directly affects the broader AI/ML ecosystem.

This document defines the mandatory monitoring and observability requirements for GitHub runners added to the PyTorch multi-cloud CI infrastructure. All runners must comply with these guidelines to ensure proper tracking of health, performance, and availability. This document is split into two parts --

  1. Requirements : guidelines on what is required to onboard a new runner system, and
  2. Implementation : guidelines on how to fulfill those requirements in a manner consistent with the rest of the PyTorch CI infrastructure

Requirements

Runner Pool Stability

A candidate runner pool must:

  • Undergo stability assessment before deployment in critical CI/CD workflows
  • Maintain performance metrics during test jobs
  • Track resource utilization and stability patterns
  • Document baseline performance metrics for each runner type

Incident Management

Runner pools must:

  • Implement real-time status monitoring
  • Configure automated alerts for:
    • Runner pool offline events
    • Capacity reduction incidents
    • Performance degradation
    • Resource exhaustion
  • Establish alert routing to:
    • CI infrastructure team
    • Community maintainers
    • System administrators

Metrics Requirements

All runners must collect and expose the following metrics on hud.pytorch.org/metrics.

Lifecycle Metrics

Runners must track:

  • Registration/unregistration events
  • Job start/completion times
  • Queue wait times
  • Job execution duration
  • Resource utilization during jobs
  • Error rates and types

Health Metrics

Runners must monitor:

  • Heartbeat status
  • System resource usage (CPU, Memory, Disk)
  • Network connectivity
  • GitHub API response times
  • Runner process health
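A minimal health sample can be gathered with the standard library alone; the sketch below covers only the heartbeat and disk portion of the list above, since CPU and memory usage would normally come from a host-metrics agent rather than hand-rolled code. The `health_snapshot` function and its key names are assumptions for illustration:

```python
import shutil
import time


def health_snapshot(path: str = "/") -> dict:
    """Collect a minimal health sample for a runner host.

    Only the heartbeat timestamp and disk usage are gathered here;
    CPU, memory, and network metrics would typically be collected by
    a monitoring agent (e.g. an OpenTelemetry collector) instead.
    """
    usage = shutil.disk_usage(path)  # (total, used, free) in bytes
    return {
        "heartbeat_ts": time.time(),  # liveness signal for the runner
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_free_bytes": usage.free,
        "disk_used_pct": 100.0 * usage.used / usage.total,
    }
```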

Technical Requirements

OpenTelemetry Integration

All monitoring implementations must:

  • Expose metrics in OpenTelemetry format
  • Follow standardized metric naming conventions
  • Use consistent labeling across all runners
  • Implement proper metric aggregation and sampling
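One way to enforce consistent naming and labeling is to validate metrics at the point of emission. The sketch below assumes a hypothetical dotted, lowercase naming convention with a `pytorch.ci.runner.` prefix and a required label set; the actual conventions are defined by the PyTorch CI team, so treat every name here as a placeholder:

```python
import re

# Hypothetical convention, for illustration only: dot-separated lowercase
# names, prefixed with "pytorch.ci.runner." when exported.
_METRIC_PREFIX = "pytorch.ci.runner."
_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$")

# Assumed required labels so metrics aggregate consistently across pools.
REQUIRED_LABELS = {"runner_pool", "cloud_provider", "runner_type"}


def validate_metric(name: str, labels: dict) -> str:
    """Check a metric name and label set against the conventions above,
    returning the fully-qualified metric name on success."""
    if not _NAME_RE.match(name):
        raise ValueError(f"metric name not in dotted lowercase form: {name!r}")
    missing = REQUIRED_LABELS - labels.keys()
    if missing:
        raise ValueError(f"missing required labels: {sorted(missing)}")
    return _METRIC_PREFIX + name
```

A runner would call this once per instrument at setup time, before registering the instrument with the OpenTelemetry SDK.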

Service Level Requirements

Production runners must maintain:

  • Minimum uptime of 99.9%
  • Maximum job queue time of 5 minutes
  • Job execution time variance within ±10% of baseline
  • Response time to critical alerts within 15 minutes
  • Maximum capacity reduction of 10%
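The service levels above translate directly into a compliance check. The following is a sketch under the stated thresholds (99.9% uptime, 5-minute queue, ±10% variance, 10% capacity reduction); the function name and return shape are assumptions:

```python
def check_slos(uptime_pct: float, queue_minutes: float,
               duration_s: float, baseline_s: float,
               capacity_drop_pct: float) -> list[str]:
    """Evaluate a runner pool's measurements against the service level
    requirements; returns the violated requirements (empty = compliant)."""
    violations = []
    if uptime_pct < 99.9:
        violations.append("uptime below 99.9%")
    if queue_minutes > 5:
        violations.append("job queue time above 5 minutes")
    if baseline_s > 0 and abs(duration_s - baseline_s) / baseline_s > 0.10:
        violations.append("execution time outside ±10% of baseline")
    if capacity_drop_pct > 10:
        violations.append("capacity reduction above 10%")
    return violations
```

Note that the 15-minute alert-response requirement is a human process target, not something this kind of automated check can measure.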

Dashboard Requirements

HUD Integration

The PyTorch CI HUD is a dashboard that consolidates metrics for tracking the Continuous Integration (CI) system of PyTorch, including metrics related to runners. The HUD provides a centralized view of these metrics, with dashboards such as the Metrics Dashboard, Flambeau, and Benchmark Metrics offering insights into runner performance and CI health.

Teams providing runners to the pool must:

  • Implement OpenTelemetry data source integration with HUD
  • Support real-time status overview
  • Support resource utilization graphs
  • Provide alert history and status
  • Provide runner pool capacity visualization

Alternative Dashboards

Teams may implement:

  • Grafana dashboard implementation
  • Custom metrics visualization
  • Alert management interface
  • Performance reporting

Documentation Requirements

Teams must:

  • Maintain up-to-date monitoring documentation
  • Document an architecture diagram detailing their runner CI infrastructure setup
  • Document all custom metrics, monitoring endpoints, and escalation routes
  • Document thresholds for raising and resolving alerts
  • Document alert response procedures and playbooks for internal SREs/maintainers to follow when resolving alerts
  • Review and update the documentation regularly to prevent it from becoming outdated

Maintenance Requirements

Teams must:

  • Conduct regular metric reviews
  • Tune alert thresholds
  • Optimize runner performance
  • Perform capacity planning

Compliance Requirements

Teams must:

  • Conduct regular review of monitoring effectiveness
  • Perform quarterly metric analysis
  • Update monitoring strategy annually
  • Implement continuous improvement process

Implementation

System Architecture

In order to provide a clear separation between the PyTorch Foundation runners and community/partner runners, the following guidelines must be followed. For details on getting started with onboarding a new runner, please refer to the Partners PyTorch CI Runners guide.

PyTorch Runners

Must implement:

  • Dedicated monitoring namespace
  • Resource quotas and limits
  • Custom metrics for PyTorch-specific workloads
  • Integration with existing PyTorch monitoring infrastructure

Community Runners

Must implement:

  • Separate monitoring namespace
  • Basic resource monitoring
  • Job execution metrics
  • Error tracking and reporting

Alerting

All CI runners should post alerts to the #pytorch-infra-alerts channel in case of service degradation.

Teams must define clear alert thresholds as part of the runner documentation requirements.

Alerts may be of three types --

  1. Raise warning alerts when observed values degrade beyond the P50 nominal threshold
  2. Raise error alerts when observed values degrade beyond the P90 nominal threshold
  3. Raise critical alerts when observed values degrade beyond the P99 nominal threshold

Alerts need to have a raise threshold and a clear threshold defined.

In general, for a metric that alerts on high values, the raise threshold must be greater than the clear threshold; for a metric that alerts on low values, the raise threshold must be less than the clear threshold. A typical high-raise-threshold metric is the runner HTTP 5XX error rate; a typical low-raise-threshold metric is runner node disk space.
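The raise/clear threshold pair described above is a standard hysteresis pattern, which prevents an alert from flapping while a metric hovers near a single threshold. A minimal sketch (class and parameter names are illustrative, not part of any PyTorch tooling):

```python
class HysteresisAlert:
    """Alert with separate raise and clear thresholds to avoid flapping.

    direction="high": raise when value >= raise_at, clear when value <= clear_at
                      (raise_at must exceed clear_at), e.g. HTTP 5XX error rate.
    direction="low":  raise when value <= raise_at, clear when value >= clear_at
                      (raise_at must be below clear_at), e.g. free disk space.
    """

    def __init__(self, raise_at: float, clear_at: float, direction: str = "high"):
        if direction == "high" and not raise_at > clear_at:
            raise ValueError("high alerts need raise_at > clear_at")
        if direction == "low" and not raise_at < clear_at:
            raise ValueError("low alerts need raise_at < clear_at")
        self.raise_at, self.clear_at, self.direction = raise_at, clear_at, direction
        self.active = False

    def observe(self, value: float) -> bool:
        """Feed one sample; returns whether the alert is active afterwards.
        Values between the two thresholds leave the state unchanged."""
        if self.direction == "high":
            if value >= self.raise_at:
                self.active = True
            elif value <= self.clear_at:
                self.active = False
        else:
            if value <= self.raise_at:
                self.active = True
            elif value >= self.clear_at:
                self.active = False
        return self.active
```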

In addition to the above, for successfully managing alerts, teams must:

  • Implement best-effort alert deduplication to reduce redundant posts in the channel
  • Establish proper escalation paths for tagging maintainers.

Metric Collection

PyTorch HUD uses a ClickHouse Cloud database as the data source for the dashboards, whose schema is defined here. All runners must publish the metrics marked mandatory in the tables below.

Note

TODO :: Fill in this section based on current state of metrics from WG meeting.