Releases: pytorch/test-infra

v20260605-234020

05 Jun 23:42
35cd0d7

autorevert: split job signals by test config (preserve config in base…

v20260604-153951

04 Jun 15:41
b9b7ab8

[CRCR] Set default HUD_API_URL and ensure it ends with /oot/results f…

v20260604-011414

04 Jun 01:15
4a2a9c4

[crcr] Send HUD bot key via x-hud-internal-bot header (#8143)

## Summary

The cross-repo CI relay forwards each downstream callback to the HUD as
a separate POST. Under bursty workloads (`in_progress` + `completed` per
job, amplified by matrix and edge-case fan-out), these requests were
rejected by the HUD's rate-limit layer with HTTP 429, surfacing in the
callback Lambda logs as:

```
[ERROR] HUD rejected callback: HTTP 429: Too Many Requests
  File "/var/task/utils/hud.py", line 58, in forward_to_hud
```

**Root cause:** the HUD exempts internal-bot traffic from rate limiting
based on the `x-hud-internal-bot` request header, but `forward_to_hud`
was sending `HUD_BOT_KEY` under `X-OOT-Relay-Token`. Without the
recognized header the relay's callbacks were treated as public traffic
and throttled.

**Fix:** send `HUD_BOT_KEY` as the value of the `x-hud-internal-bot`
header so the HUD identifies the relay as an internal bot and skips rate
limiting.
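
As a minimal sketch of the fix described above (not the actual `forward_to_hud` code), the header construction might look like the following. The helper name `build_hud_headers` and the `Content-Type` entry are assumptions made for illustration; only the `x-hud-internal-bot` header name and its `HUD_BOT_KEY` value come from this description.

```python
# Header the HUD is described as checking to exempt internal-bot traffic.
HUD_INTERNAL_BOT_HEADER = "x-hud-internal-bot"


def build_hud_headers(bot_key: str) -> dict[str, str]:
    """Return headers for a HUD callback POST (hypothetical helper)."""
    if not bot_key:
        # Fail loudly rather than silently sending unauthenticated traffic.
        raise ValueError("HUD_BOT_KEY must be set")
    return {
        "Content-Type": "application/json",  # assumption: JSON request body
        HUD_INTERNAL_BOT_HEADER: bot_key,    # previously sent as X-OOT-Relay-Token
    }
```

A unit test in the spirit of the one mentioned in the test plan would then assert that the key appears under `x-hud-internal-bot` and not under `X-OOT-Relay-Token`.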

## Test Plan

Added a unit test asserting the bot key is sent under
`x-hud-internal-bot`, and ran the relay's HUD test suite:

```
cd aws/lambda/cross_repo_ci_relay
python3 -m unittest tests.test_hud -v
```

All 6 tests pass.

This PR was authored with the assistance of an AI coding assistant.

v20260530-093439

30 May 09:36
c53a8dc

OOT HUD: Add API endpoint, PR page integration, and replicator mappin…

v20260529-182245

29 May 18:25
90efa7e

Log full GH API registration token fail response (#8126)

When createRegistrationTokenForRepo/Org throws an HttpError, log the
full response body (e.response.data) alongside the error so the root
cause is visible in CloudWatch without needing a separate debug script.
Previously only "HttpError: Forbidden" was logged, hiding actionable
details like "Repository level self-hosted runners are disabled on this
repository".

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

v20260522-142343

22 May 14:25
5f8dd50

[CRCR] Initial implementation of L2 (#7967)

## Author

- @KarhouTam 
- @can-gaa-hou
- @fffrog 

## Summary
- This PR implements the L2 level of the cross-repository CI relay
described in https://github.com/pytorch/rfcs/pull/90.
- For the previous L1 implementation, please refer to
https://github.com/pytorch/test-infra/pull/7847.
- Please refer to
https://github.com/pytorch/rfcs/pull/90#issuecomment-4148056447 for the
overall implementation.
- Please refer to https://github.com/pytorch/rfcs/pull/96 for the
HUD-side design.
- Please refer to https://github.com/pytorch/test-infra/pull/8069 for
the HUD-side implementation.

Higher-level behaviors for `L3` and `L4` are intentionally left for
follow-up work.

## Architecture

The relay is split into two AWS Lambda functions:

- `webhook` lambda function (Updated)
  - [x] receives GitHub webhook PR and push events from the upstream repo
  - [x] validates webhook signatures and authenticates with AWS Secrets
    Manager
  - [x] reads the downstream allowlist from the URL and stores it in Redis
  - [x] for `opened`/`reopened`/`synchronize`/`closed` actions, forwards
    `repository_dispatch` events to downstream repos

- `callback` lambda function (Added)
  - [x] receives the downstream callback payload through a public lambda
    function URL
    - [x] validates the callback payload with OIDC
  - [x] reads the downstream allowlist from the URL and stores it in Redis
  - [x] extracts CI result information from the payload and uploads it to
    PyTorch HUD
  - [x] records `queue time` and `execute time` to support a repo's later
    evolution to `L3`
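
The signature-validation step of the `webhook` lambda can be sketched as follows. This is an illustration of GitHub's documented `X-Hub-Signature-256` scheme (HMAC-SHA256 over the raw request body), not the code in `lambda_function.py`; the function name `verify_github_signature` is hypothetical.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check an `X-Hub-Signature-256` header value against the raw body."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The check must run on the raw bytes of the request body, before any JSON parsing, since re-serializing the payload can change the bytes and invalidate the signature.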

## Changes
```md
.github/
├── workflows/
│   └── _lambda-do-release-runners.yml     # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
│
└── actions/
    └── cross-repo-ci-relay-callback/
        └── action.yml                     # Composite action used by downstream workflows to report status back to the relay/result endpoint

aws/lambda/cross_repo_ci_relay/
├── tests/                                 # Unit tests for allowlist/config/webhook/result/redis behavior
├── README.md                              # Project overview, local development, callback flow, and result-side validation steps
├── Makefile                               # Top-level local developer entrypoint for test / deploy / clean
├── local_server.py                        # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
├── requirements.txt                       # Python dependencies required by the relay Lambdas
│
├── utils/
│   ├── allowlist.py                       # Loads, parses, and queries the downstream allowlist by rollout level
│   ├── config.py                          # Shared runtime config loading and cached get_config() helper
│   ├── gh_helper.py                       # GitHub App, repository_dispatch, and GitHub file access helpers
│   ├── hud.py                             # HUD write helpers for downstream result reporting
│   ├── jwt_helper.py                      # Helpers for minting/verifying relay callback tokens
│   ├── redis_helper.py                    # Redis helpers for allowlist cache, OOT state, and timing data
│   └── misc.py                            # Shared TypedDict definitions and HTTPException
│
├── webhook/
│   ├── Makefile                           # Build/package/deploy commands for the webhook Lambda
│   ├── lambda_function.py                 # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
│   └── event_handler.py                   # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
│
└── callback/
    ├── Makefile                           # Build/package/deploy commands for the result Lambda
    ├── lambda_function.py                 # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
    └── callback_handler.py                # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD
```

## Usage

See
[README.md](https://github.com/KarhouTam/test-infra/blob/crcr-L2/aws/lambda/cross_repo_ci_relay/README.md)
for more details.

## Verification

We performed the following scenario verification on our AWS Lambda
instance:

- [x] Upstream PR create/reopen/synchronize and push events trigger the
webhook, which redispatches them to the downstream CI workflow (in a
different organization).
- [x] The downstream workflow sends its callback payload through the
added action to the result lambda, which extracts the CI result
information and sends it to PyTorch HUD.

## Terraform configuration

- https://github.com/pytorch/ci-infra/pull/446

## Unit Tests

- [x] Unit Tests (Mock)

## Security

- **Callback payload carries full upstream webhook data back to HUD** —
`action.yml` builds the callback body by mutating
`github.event.client_payload` (which contains the entire original
webhook payload: PR metadata, commits, author info) and adding
`status`/`conclusion`/`workflow_name`/`workflow_url` on top. This full
blob is forwarded verbatim by `hud.py` to HUD with no relay-side
filtering. HUD receives both relay-trusted `verified_repo` and an
unvalidated body — if HUD trusts self-reported fields inside the body
over `verified_repo`, a manipulated dispatch payload could tamper with
HUD records.

- **Lambda callback URL is public and hardcoded** — The endpoint is
hardcoded in `action.yml` and exposed in a public action, making it
trivially discoverable. OIDC verification blocks unauthorized HUD
writes, but the endpoint has no rate limiting; request flooding can
cause Lambda concurrency exhaustion or Redis connection saturation.

- **Only OIDC is used for verification** — The callback lambda relies
solely on GitHub OIDC token verification for authentication, without
additional application-level secrets or signatures. If an attacker
compromises a downstream repo's GitHub Actions permissions, they could
forge authenticated requests to the callback endpoint. Besides, OIDC has
its own limitations (e.g., token expiration, potential
misconfigurations) that could lead to unauthorized access if not
carefully managed.

## HUD Interaction

- **Design Principle: Transparent Relay & Decoupling**
The Relay Server acts as a **lightweight data passthrough layer**. It
does not define or parse specific CI data formats; instead, it offloads
data interpretation and validation to the HUD. This ensures complete
decoupling between the relay infrastructure and business-specific data.

- **Security & Risk Mitigation**
The relay uses **OIDC authentication** to guarantee the authenticity of
the data source (**Verified Repo**). Its core responsibility is to
ensure the data originates from the claimed repository, while security
filtering and content compliance are enforced at the HUD level.
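
To make the passthrough principle concrete, here is a minimal sketch of how a relay could forward a callback body unmodified while attaching the OIDC-verified repository next to it. The envelope shape (`verified_repo` alongside a `payload` key) and the function name are assumptions for illustration; the real request format is defined by `hud.py` and the HUD API.

```python
import json
from typing import Any


def build_hud_request_body(verified_repo: str, callback_body: dict[str, Any]) -> str:
    """Wrap the untouched callback body next to the verified repo name."""
    envelope = {
        # Taken from the verified OIDC token, never from the callback body.
        "verified_repo": verified_repo,
        # Forwarded as-is; interpretation and validation are left to the HUD.
        "payload": callback_body,
    }
    return json.dumps(envelope)
```

Keeping the two fields separate lets the HUD prefer `verified_repo` over any self-reported repository field inside the body, which is the concern raised in the Security section.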

---------

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>

v20260515-222645

15 May 22:28
33f82bb

[autorevert] dispatch untargeted signals without tests-to-include fil…

v20260511-182214

11 May 18:24
fcebb92

autorevert: stop creating duplicate workflow runs on GitHub 5xx durin…

v20260511-174055

11 May 17:42
5c94ff9

autorevert: treat skipped jobs as missing, not pending (#8056)

## Summary

A `workflow_job` with `conclusion=skipped` (e.g. `if:` gate,
required-check skip on a cancelled/failed dependency) was recorded as
**PENDING** in `misc.autorevert_state` with the real `started_at` /
`job_id` and stuck around until the ~16h lookback dropped it.

Cause: `JobMeta.status` had no `skipped` predicate and fell through to
its `return SignalStatus.PENDING` default.

Repro — job
[74727873795](https://github.com/pytorch/pytorch/actions/runs/25468257317/job/74727873795):

| source | conclusion / status |
|---|---|
| GitHub + `default.workflow_job` | `skipped` |
| `misc.autorevert_state` (ts `2026-05-07 16:05:06`) | **`pending`** ❌ |

## Fix

- `JobRow.is_skipped` (`conclusion == 'skipped'`)
- `JobMeta.is_skipped = all(r.is_skipped for r in jrows)` — `all()` so a
real attempt running alongside a skipped shard still drives the verdict
- Treat `is_skipped` like `is_cancelled` in `JobMeta.status` → return
`None` (missing)
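
The three points above can be sketched with simplified stand-ins for `JobRow` and `JobMeta`. Only `is_skipped`, the `all()` aggregation, and the `None` result for skipped jobs come from this description; the `is_cancelled` aggregation, the `SUCCESS`/`FAILURE` members, and the success/failure branches are assumptions added so the example runs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SignalStatus(Enum):
    PENDING = "pending"
    SUCCESS = "success"  # assumption: not named in the description
    FAILURE = "failure"  # assumption: not named in the description


@dataclass
class JobRow:
    conclusion: str  # "" while the attempt is still running

    @property
    def is_skipped(self) -> bool:
        return self.conclusion == "skipped"

    @property
    def is_cancelled(self) -> bool:
        return self.conclusion == "cancelled"


@dataclass
class JobMeta:
    jrows: list[JobRow]

    @property
    def is_skipped(self) -> bool:
        # all(): a real attempt next to a skipped shard still drives the verdict.
        return all(r.is_skipped for r in self.jrows)

    @property
    def is_cancelled(self) -> bool:
        # Assumption: aggregated the same way as is_skipped.
        return all(r.is_cancelled for r in self.jrows)

    @property
    def status(self) -> Optional[SignalStatus]:
        if self.is_cancelled or self.is_skipped:
            return None  # treated as missing, not pending
        real = [r for r in self.jrows if not r.is_skipped]
        if any(r.conclusion == "failure" for r in real):
            return SignalStatus.FAILURE
        if all(r.conclusion == "success" for r in real):
            return SignalStatus.SUCCESS
        return SignalStatus.PENDING
```

With this shape, a job whose only row is `skipped` yields no status at all, while a skipped shard next to a running attempt still reports `PENDING`.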

## Test plan

- New `test_skipped_attempt_yields_no_event` mirrors the
cancelled-attempt test; verified it fails against pre-fix code, passes
after.
- Fed the real CH-sourced `JobRow` for job 74727873795 (the affected
one) through the patched extractor. `JobMeta.status` returns `None` →
`_build_non_test_signals` emits no event for the cell, matching the
intended "missing" semantics.
- Full signal-track suite (97 tests) green; `ruff format` + `ruff check`
clean.
- Ran autorevert locally and manually verified behavior before / after.

v20260510-191038

10 May 19:12
a6bae5e

[autorevert] Add Signal.replace / SignalCommit.replace for safe recon…