Releases: pytorch/test-infra
v20260605-234020
autorevert: split job signals by test config (preserve config in base…
v20260604-153951
[CRCR] Set default HUD_API_URL and ensure it ends with /oot/results f…
v20260604-011414
[crcr] Send HUD bot key via x-hud-internal-bot header (#8143)
## Summary
The cross-repo CI relay forwards each downstream callback to the HUD as a separate POST. Under bursty workloads (`in_progress` + `completed` per job, amplified by matrix and edge-case fan-out), these requests were rejected by the HUD's rate-limit layer with HTTP 429, surfacing in the callback Lambda logs as:
```
[ERROR] HUD rejected callback: HTTP 429: Too Many Requests
  File "/var/task/utils/hud.py", line 58, in forward_to_hud
```
**Root cause:** the HUD exempts internal-bot traffic from rate limiting based on the `x-hud-internal-bot` request header, but `forward_to_hud` was sending `HUD_BOT_KEY` under `X-OOT-Relay-Token`. Without the recognized header the relay's callbacks were treated as public traffic and throttled.
**Fix:** send `HUD_BOT_KEY` as the value of the `x-hud-internal-bot` header so the HUD identifies the relay as an internal bot and skips rate limiting.
## Test Plan
Added a unit test asserting the bot key is sent under `x-hud-internal-bot`, and ran the relay's HUD test suite:
```
cd aws/lambda/cross_repo_ci_relay
python3 -m unittest tests.test_hud -v
```
All 6 tests pass.
This PR was authored with the assistance of an AI coding assistant.
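The header change can be illustrated with a minimal sketch. The helper name `build_hud_headers` is hypothetical and the real `forward_to_hud` may be structured differently; only the two header names and `HUD_BOT_KEY` come from the description above.

```python
# Header the HUD recognizes for internal-bot traffic (named in the PR description).
INTERNAL_BOT_HEADER = "x-hud-internal-bot"
# Header the relay used before the fix; the HUD did not treat it as internal-bot traffic.
LEGACY_HEADER = "X-OOT-Relay-Token"


def build_hud_headers(bot_key: str) -> dict[str, str]:
    """Return request headers for a HUD callback POST.

    Hypothetical helper: it only shows which header carries the bot key
    after the fix, not how the relay actually sends the request.
    """
    return {
        "Content-Type": "application/json",
        INTERNAL_BOT_HEADER: bot_key,
    }
```

In this sketch the key would be read from the relay's configuration (for example the `HUD_BOT_KEY` value) and passed in, so the function itself never touches secrets storage.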
v20260530-093439
OOT HUD: Add API endpoint, PR page integration, and replicator mappin…
v20260529-182245
Log full GH API registration token fail response (#8126)
When `createRegistrationTokenForRepo`/`Org` throws an `HttpError`, log the full response body (`e.response.data`) alongside the error so the root cause is visible in CloudWatch without needing a separate debug script. Previously only "HttpError: Forbidden" was logged, hiding actionable details like "Repository level self-hosted runners are disabled on this repository".
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
v20260522-142343
[CRCR] Initial implementation of L2 (#7967)
## Author
- @KarhouTam
- @can-gaa-hou
- @fffrog
## Summary
- This PR implements the L2 level of the cross-repository CI relay
described in https://github.com/pytorch/rfcs/pull/90.
- For the previous L1 implementation, please refer to
https://github.com/pytorch/test-infra/pull/7847.
- Please refer to
https://github.com/pytorch/rfcs/pull/90#issuecomment-4148056447 for the
overall implementation.
- Please refer to https://github.com/pytorch/rfcs/pull/96 for the design
of the HUD side.
- Please refer to https://github.com/pytorch/test-infra/pull/8069 for
the implementation of the HUD side.
Higher-level behaviors for `L3` and `L4` are intentionally left for
follow-up work.
## Architecture
The relay is split into two AWS Lambda functions:
- `webhook` lambda function (Updated)
- [x] receives GitHub webhook PR and push events from the upstream repo
- [x] validates webhook signatures and authenticates with AWS Secrets
Manager
- [x] reads the downstream whitelist from the URL and stores it in Redis
- [x] for `opened`/`reopened`/`synchronize`/`closed` actions, forwards
`repository_dispatch` events to downstream repos
- `callback` lambda function (Added)
- [x] receives downstream callback payload through a public lambda
function URL
- [x] validates callback payload with OIDC
- [x] reads the downstream whitelist from the URL and stores it in Redis
- [x] extracts CI result information from the payload and uploads to
PyTorch HUD
- [x] records `queue time` and `execute time`, for use when a repo evolves to `L3`
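The webhook signature check listed above can be sketched as follows. GitHub signs each delivery with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`; the function name and the way the secret is obtained are assumptions, not the relay's actual code.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True when signature_header matches the HMAC-SHA256 of body.

    Hypothetical helper: `secret` is the webhook secret (assumed to be
    fetched elsewhere, e.g. from a secrets store), `body` is the raw
    request body, and `signature_header` is the X-Hub-Signature-256 value.
    """
    if not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the comparison does not
    # reveal how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

The check has to run against the raw bytes of the request body; re-serializing parsed JSON can change whitespace or key order and invalidate the digest.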
## Changes
```md
.github/
├── workflows/
│   └── _lambda-do-release-runners.yml # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
│
└── actions/
    └── cross-repo-ci-relay-callback/
        └── action.yml # Composite action used by downstream workflows to report status back to the relay/result endpoint
aws/lambda/cross_repo_ci_relay/
├── tests/ # Unit tests for allowlist/config/webhook/result/redis behavior
├── README.md # Project overview, local development, callback flow, and result-side validation steps
├── Makefile # Top-level local developer entrypoint for test / deploy / clean
├── local_server.py # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
├── requirements.txt # Python dependencies required by the relay Lambdas
│
├── utils/
│   ├── allowlist.py # Loads, parses, and queries the downstream allowlist by rollout level
│   ├── config.py # Shared runtime config loading and cached get_config() helper
│   ├── gh_helper.py # GitHub App, repository_dispatch, and GitHub file access helpers
│   ├── hud.py # HUD write helpers for downstream result reporting
│   ├── jwt_helper.py # Helpers for minting/verifying relay callback tokens
│   ├── redis_helper.py # Redis helpers for allowlist cache, OOT state, and timing data
│   └── misc.py # Shared TypedDict definitions and HTTPException
│
├── webhook/
│   ├── Makefile # Build/package/deploy commands for the webhook Lambda
│   ├── lambda_function.py # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
│   └── event_handler.py # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
│
└── callback/
    ├── Makefile # Build/package/deploy commands for the result Lambda
    ├── lambda_function.py # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
    └── callback_handler.py # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD
```
## Usage
See
[README.md](https://github.com/KarhouTam/test-infra/blob/crcr-L2/aws/lambda/cross_repo_ci_relay/README.md)
for more details.
## Verification
We performed the following scenario verification on our AWS Lambda
instance:
- [x] Tested upstream PR create/reopen/synchronize and push events
triggering the webhook, which then redispatches to the downstream CI
workflow (in a different organization).
- [x] Tested a downstream workflow sending a callback payload through the
added action to the result lambda, which then extracts CI result
information and sends it to PyTorch HUD.
## Terraform configuration
- https://github.com/pytorch/ci-infra/pull/446
## Unit Tests
- [x] Unit Tests (Mock)
## Security
- **Callback payload carries full upstream webhook data back to HUD** —
`action.yml` builds the callback body by mutating
`github.event.client_payload` (which contains the entire original
webhook payload: PR metadata, commits, author info) and adding
`status`/`conclusion`/`workflow_name`/`workflow_url` on top. This full
blob is forwarded verbatim by `hud.py` to HUD with no relay-side
filtering. HUD receives both relay-trusted `verified_repo` and an
unvalidated body — if HUD trusts self-reported fields inside the body
over `verified_repo`, a manipulated dispatch payload could tamper with
HUD records.
- **Lambda callback URL is public and hardcoded** — The endpoint is
hardcoded in `action.yml` and exposed in a public action, making it
trivially discoverable. OIDC verification blocks unauthorized HUD
writes, but the endpoint has no rate limiting; request flooding can
cause Lambda concurrency exhaustion or Redis connection saturation.
- **Only OIDC is used for verification** — The callback lambda relies
solely on GitHub OIDC token verification for authentication, without
additional application-level secrets or signatures. If an attacker
compromises a downstream repo's GitHub Actions permissions, they could
forge authenticated requests to the callback endpoint. In addition, OIDC has
its own limitations (e.g., token expiration, potential
misconfigurations) that could lead to unauthorized access if not
carefully managed.
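The first concern above (self-reported fields overriding `verified_repo`) could be addressed on the receiving side with a check along these lines. This is a sketch under assumptions: the function name and the `repository` body field are hypothetical, and the actual HUD validation may look quite different.

```python
def resolve_repo(verified_repo: str, body: dict) -> str:
    """Return the repo to record, trusting only the OIDC-verified value.

    `verified_repo` is the repository established by OIDC verification;
    `body` is the unvalidated callback payload. The "repository" key is a
    hypothetical self-reported field used only for illustration.
    """
    claimed = body.get("repository")
    if claimed is not None and claimed != verified_repo:
        # A mismatch suggests a manipulated dispatch payload: reject it
        # rather than letting the body override the verified identity.
        raise ValueError(
            f"body claims {claimed!r} but OIDC verified {verified_repo!r}"
        )
    return verified_repo
```

The design choice here is that the body can never widen what the caller is allowed to write: it is either consistent with the verified identity or the request is rejected.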
## HUD Interaction
- **Design Principle: Transparent Relay & Decoupling**
The Relay Server acts as a **lightweight data passthrough layer**. It
does not define or parse specific CI data formats; instead, it offloads
data interpretation and validation to the HUD. This ensures complete
decoupling between the relay infrastructure and business-specific data.
- **Security & Risk Mitigation**
The relay uses **OIDC authentication** to guarantee the authenticity of
the data source (**Verified Repo**). Its core responsibility is to
ensure the data originates from the claimed repository, while security
filtering and content compliance are enforced at the HUD level.
---------
Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>
v20260515-222645
[autorevert] dispatch untargeted signals without tests-to-include fil…
v20260511-182214
autorevert: stop creating duplicate workflow runs on GitHub 5xx durin…
v20260511-174055
autorevert: treat skipped jobs as missing, not pending (#8056)
## Summary
A `workflow_job` with `conclusion=skipped` (e.g. `if:` gate, required-check skip on a cancelled/failed dependency) was recorded as **PENDING** in `misc.autorevert_state` with the real `started_at` / `job_id` and stuck around until the ~16h lookback dropped it.
Cause: `JobMeta.status` had no `skipped` predicate and fell through to its `return SignalStatus.PENDING` default.
Repro: job [74727873795](https://github.com/pytorch/pytorch/actions/runs/25468257317/job/74727873795):

| source | conclusion / status |
|---|---|
| GitHub + `default.workflow_job` | `skipped` |
| `misc.autorevert_state` (ts `2026-05-07 16:05:06`) | **`pending`** ❌ |

## Fix
- `JobRow.is_skipped` (`conclusion == 'skipped'`)
- `JobMeta.is_skipped = all(r.is_skipped for r in jrows)`: `all()` is used so that a real attempt running alongside a skipped shard still drives the verdict
- Treat `is_skipped` like `is_cancelled` in `JobMeta.status` → return `None` (missing)
## Test plan
- New `test_skipped_attempt_yields_no_event` mirrors the cancelled-attempt test; verified it fails against pre-fix code, passes after.
- Fed the real CH-sourced `JobRow` for job 74727873795 (the affected one) through the patched extractor. `JobMeta.status` returns `None` → `_build_non_test_signals` emits no event for the cell, matching the intended "missing" semantics.
- Full signal-track suite (97 tests) green; `ruff format` + `ruff check` clean.
- Ran autorevert locally, manually verified behavior before / after
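The fix can be sketched with simplified stand-ins for the classes named above. Only the `is_skipped` predicates and the "return `None` for skipped, like cancelled" rule come from the description; the fields, the `is_cancelled` aggregation, and the success/failure logic are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SignalStatus(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass(frozen=True)
class JobRow:
    conclusion: str  # assumed: empty string while the attempt is still running

    @property
    def is_skipped(self) -> bool:
        return self.conclusion == "skipped"

    @property
    def is_cancelled(self) -> bool:
        return self.conclusion == "cancelled"


@dataclass(frozen=True)
class JobMeta:
    jrows: List[JobRow]

    @property
    def is_skipped(self) -> bool:
        # all(): a real attempt next to a skipped shard still drives the verdict.
        return all(r.is_skipped for r in self.jrows)

    @property
    def is_cancelled(self) -> bool:
        # Assumption: aggregated the same way as is_skipped.
        return all(r.is_cancelled for r in self.jrows)

    @property
    def status(self) -> Optional[SignalStatus]:
        if self.is_cancelled or self.is_skipped:
            return None  # "missing": no event is emitted for this cell
        # Simplified verdict logic, not the real implementation.
        if any(r.conclusion == "failure" for r in self.jrows):
            return SignalStatus.FAILURE
        if any(r.conclusion == "success" for r in self.jrows):
            return SignalStatus.SUCCESS
        return SignalStatus.PENDING
```

Without the `is_skipped` branch, a job whose only attempt was skipped would fall through to the `PENDING` default, which is the behavior the PR describes as the bug.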
v20260510-191038
[autorevert] Add Signal.replace / SignalCommit.replace for safe recon…