Skip to content

[EXPERIMENT] cache repo-local .dotnet toolsets (bootstrap lever)#18163

Closed
radical wants to merge 26 commits into
microsoft:mainfrom
radical:ankj/exp-dotnet-cache
Closed

[EXPERIMENT] cache repo-local .dotnet toolsets (bootstrap lever)#18163
radical wants to merge 26 commits into
microsoft:mainfrom
radical:ankj/exp-dotnet-cache

Conversation

@radical

@radical radical commented Jun 12, 2026

Copy link
Copy Markdown
Member

[EXPERIMENT] Perf probe, not for merge. Stacked on the source-partition (#18147) + tests-only-restore (#18162) changes. Targets the bootstrap portion of Restore.

Context

Measured setup_for_tests so far:

enumerate Restore whole job
Baseline 67s 111s ~202s
+ source-partition enumerate 25–33s ~107s ~180s
+ tests-only restore 25–34s 99s ~179s

Restricting ProjectToBuild only shaved ~8s — the restore wall is the ~40s bootstrap: arcade installs the SDK + 8 runtime toolsets (global.json tools.runtimes: dotnet/aspnetcore × x64+arm64 × net8/net9) into repo-local .dotnet, plus ~55s restoring the test→src closure (tests reference most of src, so that can't shrink much).

Change

Cache repo-local .dotnet (~450MB) keyed on global.json + eng/Versions.props. On a warm hit, arcade finds the SDK + runtimes present and skips the installs.

Contrast with the rejected global-NuGet-cache experiment (#18145): that folder is multi-GB, so its tar-extract + restore-graph work exceeded the download saved (net loss). .dotnet is small enough that the math should flip.

How to read

  • First run = cold (miss): bootstrap unchanged; post-step saves the cache.
  • Warm re-run: bootstrap install skipped; measure Restore drop minus cache-extract.

Honest expectation

Even in the best case there is a cold-runner floor: ~55s test-graph restore + cache-extract + select (~25s) + enumerate (~30s). This experiment measures how close the whole job gets to 60s; hitting 60s exactly on a cold runner is unlikely while a real restore happens.

radical and others added 26 commits June 10, 2026 13:53
Adds a descriptive map of which CI targets must run when a given repo path
changes, covering the .NET test projects and the validation/polyglot jobs in
tests.yml. Intended to audit/validate a future selective-CI implementation;
nothing consumes it yet (CI still runs the full matrix, gated only by
eng/testing/github-ci-trigger-patterns.txt).

- docs/ci/test-trigger-map.yml: machine-readable rules (aliases, test_self,
  run_all, test_hubs, shared_source, core_source, curated_jobs, loose_file_deps,
  leaf_source, shared_compiled_source, gaps)
- docs/ci/test-trigger-map.md: methodology, target vocabulary, caveats

Derivation: ProjectReference closure over src->src edges (so leaf integrations
are not inflated by the shared Aspire.Hosting.Tests hub), plus file-granular
<Compile Include> coupling for shared source link-compiled across projects.
Curated rules layer on for runtime/loose-file deps and the non-.NET jobs
(polyglot/typescript/extension/cli/api/deployment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Design for a tool that maps a PR's changed files to the subset of test
projects and CI jobs to run, so PR CI runs a relevant subset instead of
the full matrix. Companion to the descriptive test-trigger-map.{md,yml}.

Two layers, split by who can know the dependency:

- Derived (zero maintenance): dotnet-affected over the MSBuild graph,
  scoped to Aspire.slnx, covering ProjectReference reverse closure,
  foreign linked <Compile Include>, CPM, and Directory.Build.* changes.
- Curated (~80 lines): the non-.NET jobs and runtime loose-file reads
  the project graph cannot see.

Records design decisions and the measured selectivity profile: any
hosting-integration or data-component change fans out to ~36 hosting
tests via two structural edges (the Aspire.Hosting.Tests hub and the
tests/testproject mega-apphost). This fan-out is accepted deliberately
-- still far cheaper than the full matrix -- and the edges are not
pruned. CLI (1), Dashboard (7), and component-to-component isolation
stay tightly scoped.

Also specifies audit mode (full matrix still runs; a GitHub step summary
shows the would-have-been-skipped set), a verifier test for the curated
layer, the pipeline insertion point, and operational constraints proven
by probes (Aspire.slnx scoping, DOTNET_ROOT, internal-feed mirroring).

Proposal only; nothing consumes it yet.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The curated test-trigger-map listed three run_all_globs entries that match
no tracked file (eng/Tools.props, eng/Packaging.targets, eng/targets/**), so
they could never trigger anything. Replaced them with eng/*.props and
eng/*.targets so present and future root build-infra files fail-open to ALL.

Also added a shared_compiled_source entry for
src/Components/Common/ConfigurationSchemaAttributes.cs ->
ConfigurationSchemaGenerator.Tests: that test link-compiles the file but was
covered by no rule, so a change to it would have skipped a test that compiles
it.

Updated the design doc to use dotnet-affected's current --filter-file-path
flag (--solution-path is obsolete in 6.2.0).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Infrastructure.Tests coverage that keeps the hand-maintained
docs/ci/test-trigger-map.yml honest against repo reality: referential
integrity of test:/job: targets, alias resolution, every glob matches a
tracked file, every src project is reachable by some rule, structural
hygiene, and that every <Compile Include> of a shared file into a test
project is selected by the map. Plus an outerloop, self-skipping oracle
cross-check against dotnet-affected.

Parses the map with YamlDotNet (test-only) and matches globs with
Microsoft.Extensions.FileSystemGlobbing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extracts the dependency kinds the curated test-trigger-map describes into a
behavior spec for the planned SelectTests selector: leaf/core ProjectReference
closure, component isolation, file-granular <Compile Include>, test hubs,
shared source dirs, run_all catch-all + fail-open, test_self, curated jobs,
runtime/loose-file deps, additive union, alias expansion, Layer-1 union, and
the kill switch.

The 25 tests are intentionally failing: TestSelector.Select throws until the
selector is implemented. Implementing it to make these pass is the definition
of done. Adds the tools/SelectTests skeleton (contract types + throwing impl)
so the tests compile, and a map rule (tools/SelectTests/** -> Infrastructure.Tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TestSelector.Select threw NotImplementedException, so the 25 acceptance
tests in SelectTestsAcceptanceTests (the behavior spec for the selective-CI
tool) were RED.

Implement the Layer 2 resolution that consumes docs/ci/test-trigger-map.yml:

- Match each changed file against every rule section (run_all_globs,
  test_self, test_hubs, shared/core/leaf/shared_compiled source,
  loose_file_deps, curated_jobs); union the targets.
- Expand ALL_HOSTING_TESTS / ALL_COMPONENT_TESTS aliases to concrete test
  projects; route test: targets to TestProjects and job: targets to Jobs.
- Fail open: a src/** file matched by no rule escalates to the full matrix
  (a missed test is a silent regression; an extra run is just slower).
- Kill switch (ForceAll) and any run_all match escalate to the full matrix
  with an EscalationReason; ALL expands TestProjects to the matrix passed in
  and Jobs to every curated job token.
- Union the injected Layer 1 (graph-derived) affected projects.

The tool gets its own internal YamlDotNet parse model (TriggerMap) rather
than sharing the verifier's model: the test project references the tool, so
sharing would be a circular dependency. The model is internal, so it adds no
public API surface.

YamlDotNet is needed by the tool under central package management, so its
PackageVersion moves from tests/Directory.Packages.props (which imports the
root) up to the root Directory.Packages.props.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Program.cs was a stub. Turn it into the real entry point that runs the
two-layer selection the design describes, in audit mode by default.

Flow:

- Read the enumerate-tests all_tests matrix JSON; the projectName values are
  the universe an ALL selection expands to.
- Resolve changed files via `git diff --name-only <from> <to>` (or
  --changed-files). Layer 2 globs match these directly.
- Layer 1: run dotnet-affected over the MSBuild ProjectGraph and union its
  reported test projects in. It reports the union of changed + affected
  projects (one affected.json: [{Name, FilePath}]); we keep the names that
  are matrix test projects.
- Call TestSelector (Layer 2), then emit the matrix, per-job booleans
  (run_polyglot, run_extension_e2e, ...), and a step summary.

Audit vs enforce: without --enforce the full matrix is emitted unchanged
(audit rollout — CI keeps running everything while the map is validated);
--enforce emits the filtered matrix. --force-all is the kill switch.

dotnet-affected is invoked as a local tool ('dotnet tool run
dotnet-affected') so CI only needs 'dotnet tool restore'; it is added to
.config/dotnet-tools.json and resolves from the dotnet-public feed (its '*'
package-source mapping), so no NuGet.config change is needed.
--dotnet-affected overrides with a standalone executable.

Layer 1 failure handling matches the rollout: audit mode warns and falls
back to Layer 2 only; --enforce fails loudly, since under-selecting would
silently skip real tests.

The step summary now reports the options, the incoming changed files, and
the final selection (selected projects, would-have-been-skipped, jobs) so an
audit reader can see exactly what drove the result.

Verified end-to-end against a normal checkout: dotnet-affected reports the
correct affected projects and the pipeline produces the filtered matrix.
Note: dotnet-affected's --from/--to reads git blobs through libgit2 and
throws inside a git worktree (where .git is a file); CI uses a normal
checkout, and a code comment documents using --skip-layer1 for local
worktree dev.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dotnet-affected silently ignores changed files it can't map to a project
(extension/**, docs/**, loose tooling dirs, ...) — it returns no signal for
them. Those are exactly the files the curated map (Layer 2) must cover, so a
newly added loose-file dependency could otherwise select nothing and go
untested with no warning.

Add SelectionResult.UnmatchedFiles: the changed files that matched no Layer 2
rule. Because the verifier keeps every src project rule-reachable, in a green
repo these are precisely the loose, non-project files Layer 1 also cannot
attribute — i.e. the "neither layer" set. The audit summary now lists them
under "Unattributed changed files" as an early warning to add a curated rule.

Tested: loose unmapped file is reported; mapped files (leaf rule, curated job,
test_self, run_all) are not; the set unions across changed files; and an
unmapped src file is both reported and fails open to ALL.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The map's leaf_source / core_source / test_hubs sections duplicated the
MSBuild project-graph closure that dotnet-affected (Layer 1) computes live, so
they were a maintenance trap. Delete them and keep only what the graph
provably cannot see (Layer 2): run_all_globs, curated_jobs, loose_file_deps,
shared_source, shared_compiled_source, test_self, and named groups. The
selector unions Layer 1's affected projects with these rules, so the deleted
fan-out is reproduced at runtime and can never drift. The yml drops from ~430
to ~160 lines.

Named groups: generalize the old test-only aliases into "groups" that can hold
both test: and job: members, expanded by routing each member by prefix. Use it
to map eng/Bundle.proj (CLI bundle assembly) to a CLI_BUNDLE group
(Aspire.Cli.EndToEnd.Tests + job:cli-starter + job:extension-e2e) rather than
the full matrix. .proj traversal files are invisible to dotnet-affected (not in
Aspire.slnx, and not a supported project type), so they must be curated;
eng/OuterPreBuild.proj (build-wide validation) goes to run_all.

Fail-open change: the selector no longer escalates an unmapped src file to ALL.
Post-trim most src files match no Layer 2 rule on purpose (Layer 1 owns the
closure), so that would over-escalate. Src safety is now "Layer 1 must run"
(the Program entry point fails closed in enforce mode, falls back to the full
matrix in audit mode) plus the unmatched-files audit signal.

Verifier reworked to match: "every src project is reachable by a rule" becomes
"every src project is in Aspire.slnx (so dotnet-affected sees it) or covered by
a curated rule" — the ~16 deliberately out-of-slnx src projects (template
placeholders, tools) are covered by loose_file_deps. The alias check becomes a
groups check that allows job: members. The link-compile coverage check still
guards shared_compiled_source. The oracle test (project-level subset vs
dotnet-affected) is removed: with the graph sections gone it no longer has
anything to cross-check.

Acceptance tests rewritten to a pure input -> SelectionResult contract: given
changed files (and the injected Layer 1 set where the closure is now Layer 1's
job), assert on selected projects / jobs / SelectsAll / UnmatchedFiles, instead
of asserting on map section or group names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The curated test-trigger map hand-maintained edges the MSBuild graph already
sees, and split "glob -> targets" across several keys that the selector treated
identically. Validated with dotnet-affected 6.2.0 over Aspire.slnx:

- a link-compiled `.cs` file reports its consuming test projects in the
  *changed* set (so a `shared_compiled_source` section was redundant);
- a `src/Components/Common/*` change reports 71 `.Tests` (so a
  `Components/Common -> all-component-tests` rule was redundant);
- the vendored OTel instrumentation dirs are graph-covered, and
  `src/Vendoring/OpenTelemetry.Shared/**` is compiled by nothing (inert).

The curated layer now has exactly four matchers -- a section is its own key only
when the selector treats it differently:

- `conventions`: a `<name>`-capture pattern -> target template, additive and
  existence-guarded (emitted only if the derived test exists in the matrix).
  Covers a test's own folder (`tests/<name>/**`) and the Hosting/Components
  integration dirs -- a backstop for non-MSBuild files (data/resource,
  compile-excluded) the graph can't attribute.
- `ignore`: shared dirs Layer 1 already covers (`src/Components/Common`, the
  vendored OTel instrumentation dirs) or that are inert, listed so they don't
  trip the run-all fallback.
- `path_rules`: the one general glob -> targets matcher (targets may be
  `test:` / `job:` / a group / `ALL`). The catch-all-to-ALL, the
  `Azure.<service>` convention misses, the non-.NET jobs, and the loose-file
  reads all live here under comment headers.
- `derived_targets`: "if any of these selected tests, also run these targets",
  applied to the union of both layers to a cycle-safe fixpoint (CLI tests ->
  cli-starter; acquisition -> the winget/homebrew installer jobs).

Selector: a changed `src/**` file is "Layer-1-owned" when it lives under a
project dir in Aspire.slnx (dotnet-affected emits no per-file map, so this is the
proxy). Only a leftover `src/**` file that is neither owned, matched, nor ignored
forces the full matrix; non-src leftovers are audit-only. Groups expand
recursively (cycle-safe).

Verifier: referential integrity (every test:/job:/derived target resolves; every
convention pattern substitutes its `<name>`), `src`-project reachability, and
dead-glob checks. Acceptance tests drive the engine with synthetic maps so they
assert mechanisms, not the real map's contents.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The curated test-trigger map repeated the same src/<Project>/** path globs
across many job rules (src/Aspire.Cli/** on 7 rules, src/Aspire.Hosting*/**
on 3, src/Aspire.TypeSystem/** on 4, ...). The duplication was a maintenance
trap and, worse, under-selected: a literal path glob only fires on files
inside that directory, so a change to a library the CLI depends on would not
trigger the CLI jobs even though the CLI build is affected.

Root cause: Layer 1 (dotnet-affected) already computes the full transitive
affected set, but RunLayer1 intersected it down to test-project names and
discarded the production-project identity. Layer 2 then had to re-derive
"is the CLI affected" from raw file paths.

Keep the full affected set from Layer 1 and add a project_rules matcher: an
affected production project (matched by project-NAME glob against the affected
set) selects jobs/tests. This dedupes the repeated globs (Aspire.Hosting*
collapses 66 projects to one line), survives a project moving directories,
and follows the graph's transitive closure so a dependency change triggers
the dependent's jobs. It is additive and inert when Layer 1 is skipped; the
loose-file path_rules still cover those triggers.

Only genuine slnx project-root globs migrated. Paths the graph cannot
attribute stay in path_rules: loose files, narrow sub-paths
(src/Aspire.Cli/Templating/Templates/ts-starter/**), the non-compiled
*.ats.txt / *.tscompat.suppression.txt baselines, and src/Aspire/Cli/**
(link-compiled into Aspire.Hosting.CodeGeneration.TypeScript, with no owning
project) which keeps an explicit rule carrying its old targets.

Add tests/Infrastructure.Tests and tests/Aspire.Hosting.Maui.Tests to
Aspire.slnx. Both were absent, so Layer 1 could never fan their
ProjectReference dependencies into them: Infrastructure.Tests would miss
changes to the tools it tests (ExtractTestPartitions, GenerateCITimeline,
CreateFailingTestIssue, GenerateTestSummary, TypeScriptApiCompat), and
Maui.Tests would miss core Aspire.Hosting changes. Both are plain
net10.0 projects with no MAUI workload dependency. Infrastructure.Tests still
needs its path_rules because it reads workflows/eng files as loose inputs the
graph cannot see.

Verifier gains a check that every project_rules name glob resolves to a real
Aspire.slnx project; acceptance tests cover the project_rules mechanism
(name globs, additive union, group/ALL expansion, feeding derived_targets).
Docs updated to five matchers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two follow-ups to the selective-CI review.

Audit mode emitted the full test matrix but still wrote the *selective*
per-job booleans. Once those gate the non-.NET jobs via `if:`, audit mode
would run every test yet skip unselected jobs -- contradicting "audit changes
nothing; CI runs everything". This got sharper with project_rules, which only
fire when Layer 1 ran: a skipped or failed Layer 1 in audit would switch the
project-rule-driven job booleans off. Force every job boolean true in audit
mode, matching the full matrix WriteMatrix already emits. The step summary
still reports the real (selective) computation, so audit stays advisory:
compute and show what enforcing would do, then run everything.

Add a verifier asserting every tests/<Name>/<Name>.csproj is in Aspire.slnx
(allow-list for deliberate exclusions, empty today). A test project absent
from the solution is invisible to Layer 1, so a change to a production
dependency could never fan into it and its tests would silently never run in
enforcing mode -- the exact gap the Infrastructure.Tests and
Aspire.Hosting.Maui.Tests slnx additions closed. This catches the next such
regression at PR time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Run SelectTests as part of tests.yml setup_for_tests and rename project_rules
to affected_project_rules so the section name says what it keys off.

Wiring (audit by default, enforce is a one-line flip):

- setup_for_tests gains a select_tests step: after enumerate-tests it runs
  SelectTests over the all_tests matrix with --from/--to (PR base..head) and
  Layer 1 (dotnet-affected, restored via `dotnet tool restore`). It writes the
  selected matrix, the run_* job booleans, and the advisory summary.
- A new setup_for_tests output, selection_enforced, defaults to 'false'. While
  false, SelectTests emits the full matrix and run_* = true, and every gate reads
  `selection_enforced != 'true' || run_X == 'true'`, so the full matrix and all
  jobs run -- behavior is unchanged during the audit phase. Flipping it to 'true'
  (and adding --enforce) makes the matrix and gates selective.
- The step falls back to the full enumerated matrix on any failure, so coverage
  is never silently reduced. The [full ci] PR-body token and run-all-tests label
  pass --force-all; untrusted PR fields go through env vars, never interpolated
  into the script, to avoid shell injection.

Job gates:

- Drop the hand-rolled extension_e2e_changes regex job; gate extension_e2e_tests
  on run_extension_e2e instead. The map's extension-e2e rule now carries the same
  path set the regex had, so it is faithfully replaced.
- Gate the previously-unconditional typescript_sdk_tests and
  typescript_api_compat on their run_* outputs, and teach the results gate to
  tolerate an enforce-mode skip of those jobs.

extension-e2e is intentionally double-routed in the map: its production triggers
live in affected_project_rules (transitive coverage when Layer 1 runs) AND as
path_rules globs, so a src/Aspire.Cli|Hosting*|Dashboard* change still fires it by
pure path match when Layer 1 is unavailable. Other job gates stay project-only.

Note: the tests.yml logic cannot be validated locally (GitHub Actions) -- it needs
a real CI run. dotnet-affected adds an MSBuild ProjectGraph evaluation to the
critical-path setup_for_tests job and must be available on the CI feeds (dnceng).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the selector wiring so every non-.NET job is gated by its run_* output,
not just extension-e2e and the typescript jobs. Now also gated: polyglot_validation,
cli_starter_validation_windows, extension_tests_win, extension_bootstrap_linux, and
the WinGet/Homebrew installer-prepare jobs. Each gate is ANDed with the job's
existing event conditions and reads `selection_enforced != 'true' || run_X ==
'true'`, so audit keeps running everything and enforce makes them selective.

Adds the run_winget_installer / run_homebrew_installer outputs, and wraps each
newly-gated job's skip-as-failure check in the results gate so an enforce-mode skip
is not treated as a failure (audit behavior is unchanged).

The extension-unit jobs gate on run_extension_unit OR run_extension_e2e:
extension_e2e_tests `needs:` them, so gating them off while e2e is selected would
skip e2e through need-propagation. Base builds and the .NET test jobs stay ungated
-- base builds are upstream needs, and the .NET jobs are already gated by their
matrix bucket emptying once SelectTests filters the matrix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…enforced

The previous wiring gated every job on `selection_enforced != 'true' || run_X ==
'true'`, duplicating the audit/enforce decision into every gate. That was
redundant: in audit the tool already returns run-all (run_* = true + the full
matrix), so the gates only ever need to read what the selector returns.

Collapse the audit/enforce decision to a single knob, ENFORCE_SELECTION, in the
select_tests step (it just controls whether SelectTests gets --enforce). Every job
gate now reads plain `run_X == 'true'` (combined with each job's existing event
conditions), and the results skip-checks read `result == 'skipped' && run_Y ==
'true'`. Remove the selection_enforced job output entirely.

Fail-open is now explicit: if SelectTests fails, the step emits the full matrix and
sets every run_* to true, so a tool failure runs everything rather than skipping
gated jobs (the role selection_enforced incidentally played before).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The "Select relevant tests" step failed at startup with:

    An error occurred trying to start process '/usr/bin/bash' with
    working directory '/home/runner/work/aspire/aspire'.
    Argument list too long

Root cause: the step received the full expanded test matrix through the
ALL_TESTS environment variable. The canonical matrix is ~148KB and the
OS-expanded form is larger, so the single env string exceeds Linux's
MAX_ARG_STRLEN (~128KB). When the runner calls execve to start bash for
the step, the oversized environment makes the call fail with E2BIG,
surfaced as "Argument list too long" — the script never runs.

Fix: write the matrix to all_tests.json in a preceding pwsh step. pwsh
writes its run-script to a file, so interpolating the (trusted,
build-generated) matrix into a literal here-string keeps it out of the
process environment entirely. The bash step then reads the file it
already expected via --matrix all_tests.json, so its logic is unchanged.

PR_BODY stays in env: GitHub caps PR bodies well under the limit, and
keeping it out of the script body avoids shell injection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rash

The "Select relevant tests" step ran the selector but it threw every time:

    Unhandled exception: System.InvalidOperationException:
    Provide either --changed-files or --from (with optional --to).

Two defects:

1. SelectTests crash. Selection.Run resolved the changed-file set before
   the --force-all short-circuit, so --force-all (whose job is to run the
   full matrix *regardless* of the diff) still demanded a --from or
   --changed-files input and threw. The workflow reaches --force-all on
   the [full ci] kill switch or when no diff base is available — i.e. the
   exact case where there are deliberately no diff inputs. Skip
   ResolveChangedFiles (and Layer 1) when ForceAll; the selector already
   returns the full matrix in that mode.

2. The crash was invisible. The step wrapped the tool in
   `if ! ...; then <fallback>; fi`, which swallowed *any* non-zero exit,
   emitted the full matrix + run-all, and exited 0 — so the job went
   green and the broken selector went unnoticed. The tool already fails
   OPEN by design when it *decides* to (force-all / unavailable base ->
   full matrix + run-all), so a non-zero exit means an actual crash. Let
   it fail the step (and the run); surfacing it is the point of the audit
   phase.

Regression test drives Selection.Run with --force-all and no diff inputs
and asserts exit 0 + the full matrix passes through. Reverting fix #1
reproduces the original throw at ResolveChangedFiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a PR has a base SHA but the shallow `git fetch` of that commit fails
(or it's otherwise unavailable), the selector step used to fall back to
`--force-all` and run everything. That masks a real problem: `base.sha`
is always reachable on origin, so a fetch failure means something is
wrong (a bad fetch, a rewritten base), and silently running run-all
teaches the audit nothing about what selective CI would have picked.

Fail the step with a clear error instead. The remaining `--force-all`
paths are deliberate: the `[full ci]`/`run-all-tests` kill switch, and
non-PR events that have no base SHA at all.

Also update the design doc: the prior "fails open on any SelectTests
failure" note is stale (the swallow was removed), and document that an
unfetchable PR base now fails rather than forcing run-all.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The selective-CI selector's Layer 1 (affected-projects computation) used
the dotnet-affected dotnet tool, which is unusable in this repo:

- It crashes whenever a PR's diff touches Directory.Packages.props. Its
  package-diff step loads from-commit projects through a libgit2-backed
  MSBuild virtual filesystem, which eager-loads global.json as MSBuild
  XML and throws InvalidProjectFileException
  (leonardochaia/dotnet-affected#155, open upstream).
- It cannot run inside a git worktree (libgit2 sees .git as a file).

Replace it with a self-owned, in-process Layer 1
(tools/SelectTests/GraphAffectedProjects.cs) that builds an MSBuild
ProjectGraph from Aspire.slnx at the PR head (HEAD-only -- it never
evaluates from-commit content, so both failures disappear by
construction).

File -> project attribution uses the evaluated ProjectInstance items
(resolved FullPath, so cross-project linked/shared files map to every
linking project) plus ImportPaths (which captures repo hook files
imported through SDK/Arcade targets in the NuGet cache, e.g.
eng/Versions.props) and AvailableItemName types (e.g. Protobuf). Changed
files come from `git diff --name-status -M`, so deletes and both sides
of a rename are captured; anything the index misses (deleted files, the
old side of a rename, project-owned non-item files) falls back to
longest-prefix project-directory containment. The transitive reverse
ProjectReference closure then yields downstream dependents.

Layer 1 is not optional: any failure to compute the graph now hard-fails
(under-selecting would silently skip real tests); the prior audit-mode
warn-tolerance is removed.

Notable:
- Microsoft.Build.Prediction is intentionally not used: an item+import
  index was measured equal-or-superset of a prediction-based index on
  every changed-file scenario (linked .cs, .proto, linked .json, .resx),
  with zero under-selection, and strictly better on deleted-file cases.
- SelectTests and Infrastructure.Tests move to net10.0: Microsoft.Build
  18.3.x (matching the repo SDK's MSBuild line) ships only net10.0/net472
  managed assets; the engine is loaded from the repo-local SDK via
  MSBuildLocator (ExcludeAssets=runtime), not shipped.
- Removes dotnet-affected from .config/dotnet-tools.json and the
  --dotnet-affected option; the tests.yml Select step no longer needs
  `dotnet tool restore`. Design docs updated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…elected projects

SelectTests ran after enumerate-tests and filtered the already-built full matrix.
But enumerate-tests does a full `./build.sh -test` to shard every test project,
so the expensive work happened regardless of the selection.

Invert the order. SelectTests now runs first (after checkout+restore), computes
the affected test projects from the diff (Layer 1 graph + curated map), and in
enforce mode writes an OverrideProjectToBuild props file. enumerate-tests then
reuses the job's checkout+restore and builds/enumerates ONLY the selected
projects via $(BeforeBuildPropsPath) -- the same hook the quarantine/outerloop
runners use. In audit mode no props are written and enumerate-tests produces the
full matrix unchanged.

SelectTests contract:

- The test-project universe (ALL-expansion + existence guard) now comes from
  Aspire.slnx (tests/*.Tests), not an enumerated matrix, since the selector runs
  before enumeration.
- Dropped --matrix/--output; added --before-build-props. Emits a before_build_props
  output so the workflow passes /p:BeforeBuildPropsPath to enumerate-tests.
- run_* job booleans are unchanged and still drive the non-.NET job gates.

enumerate-tests action gains checkout/restore/beforeBuildPropsPath inputs; defaults
preserve deployment-tests.yml. tests.yml shares one checkout+restore so the props
file survives (a fresh checkout's git clean would remove it).

Also:

- Move the curated map docs/ci/test-trigger-map.yml -> eng/test-trigger-map.yml
  and update all references.
- Ignore **/*.md and the map file itself, so doc/map edits no longer appear as
  unattributed changed files.
- Flip ENFORCE_SELECTION to 'true'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The selector engine (TestSelector) had 49 acceptance tests, but the boundary
that actually gates CI in enforce mode -- Selection.Run's GITHUB_OUTPUT
contract, change resolution, and the Layer 1 graph integration -- had almost
no coverage. That boundary is the sole gate once ENFORCE_SELECTION flips to
true, so an untested run_* mistranslation or under-selection would silently
skip tests or jobs.

Add tests that each pin a concrete regression, using synthetic maps (no
coupling to eng/test-trigger-map.yml):

SelectTestsCliTests (run_* + matrix contract):
- audit forces every run_* true even on a subset; enforce emits real per-job
  values and the job: -> run_ name mapping; every map job appears as a key
  (including group-only and derived-only jobs); ALL via a path rule runs
  everything with no restriction props.
- no --from/--changed-files/--force-all throws; --from/--to is a two-dot diff
  (diverging git repo so a three-dot regression drops a file); --from alone
  diffs the working tree; --changed-files trims whitespace/blank lines.
- docs-only PR selects nothing -> empty OverrideProjectToBuild (build nothing,
  the intended outcome); a production project name in the selection produces
  no override item.

GraphAffectedProjectsTests (Layer 1 git/import edges):
- a change to an imported Directory.Build.props fans out to importers
  (ImportPaths index); empty change set returns empty; deepest-directory
  attribution prefers a nested project over its parent; a cross-project git
  rename attributes both old and new owners (first test on the git diff path).

SelectTestsLayer1IntegrationTests (new): Selection.Run end-to-end with Layer 1
enabled -- a production-source change flows through the reverse closure into
the enforce-mode props.

Serialize the env-mutating classes into the existing GraphAffectedProjects
collection. GITHUB_OUTPUT/GITHUB_STEP_SUMMARY are process-global env vars, so
running these classes in parallel let one test's temp dir be deleted while
another wrote its side-channel files -- a latent flaky-test hazard for any
future Selection.Run test.

Delete docs/ci/test-trigger-selector-test-plan.md: its Layer 1 section stubbed
a dotnet-affected executable and affected.json parsing that no longer exist
(Layer 1 is now the in-process GraphAffectedProjects graph).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…losure compile)

The enumerate-tests step (~67s baseline, the second-biggest cost in
setup_for_tests after restore) is dominated by compiling the dependency
closures of the 4 SplitTestsOnCI projects, just to reflect their split metadata
off the built assemblies. Aspire.Hosting.Tests is the worst: it references the
hosting "god-edge" closure (~40 integration projects), so its closure compile
is the bulk of the step. This recurs for any PR whose selection includes
Aspire.Hosting.Tests, not only the full-matrix case.

Partition-mode discovery only needs the [Trait("Partition","<n>")] values, which
are compile-time string literals on the test classes. This change reads them
straight from source via eng/scripts/scan-test-partitions-from-source.ps1, so
the partition list is produced with no compile. build-test-matrix.ps1 already
appends an "uncollected:*" entry (--filter-not-trait "Partition=*"), so any
class the scan misses (or one with no trait) still runs.

The scan writes .tests-partitions.json only when it finds partitions; class-mode
projects (Aspire.Cli.EndToEnd.Tests, Aspire.Templates.Tests — no Partition
traits in source) leave the file absent and fall back to the original
Build + --list-tests path. GenerateTestPartitionsForCI is untouched, so the
AzDO/Helix prepare path that calls it directly is unaffected.

Verified locally: the scan emits collection:1..6 + uncollected:* for
Aspire.Hosting.Tests (identical to the compiled enumeration's 7 entries) and
writes nothing for the class-mode projects.

Caveat (experiment): a partition value written as a non-literal (const/computed)
would be missed by the source scan and, because it still carries a Partition
trait, would not fall under uncollected:* — those tests would be dropped. All
current Aspire.Hosting.Tests partitions are string literals, so this does not
occur today; productionizing should add a CI parity check vs the compiled list.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stacked on the source-partition enumerate change. After that, the whole
setup_for_tests job (~180s) is dominated by Restore (~107s). The restore pulls
all 415 repo projects, but test selection + enumeration only need the
test-project graph. Heavy items that are pure waste here:

  - src/Aspire.Cli (native-AOT managed front) ~10s
  - src/Aspire.TerminalHost ~10s
  - eng/dcppack + eng/dashboardpack per-RID packaging (win/osx x64+arm64)
  - eng/clipack native-AOT projects
  - playground/** sample apps

BuildTestsOnly restricts eng/Build.props ProjectToBuild to tests/** (still
pulling their src ProjectReferences, so the class-mode enumerate build still
works); SkipNativeBuild drops the clipack native projects (not gated by
BuildTestsOnly). Both are passed to restore.sh and to the enumerate -test build
so the restored set and the build set stay consistent.

Open question this run answers: SelectTests builds an MSBuild ProjectGraph over
all of Aspire.slnx (incl. the now-unrestored native/packaging/standalone-src
projects). If graph evaluation hard-requires their restore assets, this run
fails and tells us the restore floor; if evaluation tolerates the missing
package-provided imports (most are Exists-guarded), selection stays correct and
restore shrinks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stacked on the source-partition + tests-only-restore changes. After those, the
whole setup_for_tests job is still dominated by Restore (~99s), of which ~40s is
arcade installing the SDK + the 8 runtime toolsets from global.json tools.runtimes
(dotnet/aspnetcore x64+arm64, net8+net9) into the repo-local .dotnet.

This caches .dotnet (~450MB) keyed on global.json + eng/Versions.props. Unlike
the global NuGet packages folder (multi-GB; a separate experiment showed caching
it is a net loss because tar extract + restore-graph work exceeds the download
saved), .dotnet is small enough that skipping the runtime/SDK installs on a warm
hit should beat the extract cost.

First run is a cold cache (populates on the post-step); the warm re-run measures
the bootstrap saving.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 16:32
@github-actions

Copy link
Copy Markdown
Contributor

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18163

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18163"

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This is a large experimental PR that introduces a selective CI system (SelectTests tool) for the Aspire repository. It computes which test projects and non-.NET CI jobs are relevant to a PR's changed files, using a two-layer approach: Layer 1 derives affected projects from an in-process MSBuild ProjectGraph (replacing dotnet-affected), and Layer 2 applies hand-curated rules from eng/test-trigger-map.yml. The system is integrated into tests.yml with an audit/enforce knob and replaces the extension_e2e_changes job with centralized selection logic. This PR is stacked on two prior experiment PRs (source-partition enumerate and tests-only restore) and adds a .dotnet toolset cache.

Changes:

  • Adds tools/SelectTests — a new C# console tool (Layer 1 graph + Layer 2 curated map resolver) that runs before enumerate-tests to select the relevant CI subset
  • Integrates selective CI into tests.yml: adds .dotnet caching, restricted restore, the SelectTests step, and per-job run_* output booleans that gate non-.NET jobs
  • Adds comprehensive tests (Infrastructure.Tests/TestTriggerMap/) covering the selector engine, CLI wiring, Layer 1 graph, and trigger-map verifiers

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tools/SelectTests/TriggerMap.cs Strongly-typed model for eng/test-trigger-map.yml with glob/convention matching
tools/SelectTests/TestSelector.cs Core selection engine: conventions, path rules, ignore, Layer 1, derived targets, groups
tools/SelectTests/GraphAffectedProjects.cs Layer 1: MSBuild ProjectGraph → reverse-dependency closure of changed files
tools/SelectTests/Program.cs CLI entry point: argument parsing, git diff, props writing, GITHUB_OUTPUT
tools/SelectTests/SelectTests.csproj Project file targeting net10.0 with MSBuild/YAML dependencies
eng/test-trigger-map.yml The curated Layer 2 map: conventions, ignore, path_rules, affected_project_rules, derived_targets
eng/scripts/scan-test-partitions-from-source.ps1 PowerShell script scanning [Trait("Partition",...)] from source to skip closure builds
eng/TestEnumerationRunsheetBuilder/TestEnumerationRunsheetBuilder.targets MSBuild integration: source-partition scan before fallback to build
.github/workflows/tests.yml CI integration: cache, restricted restore, SelectTests step, per-job gates, removal of extension_e2e_changes
.github/actions/enumerate-tests/action.yml New inputs: checkout, restore, beforeBuildPropsPath for reuse in same job
tests/Infrastructure.Tests/TestTriggerMap/SelectTestsAcceptanceTests.cs Behavioral spec for the selector engine with synthetic maps
tests/Infrastructure.Tests/TestTriggerMap/SelectTestsCliTests.cs CLI wiring tests (side-channel outputs, enforce/audit, git diff)
tests/Infrastructure.Tests/TestTriggerMap/GraphAffectedProjectsTests.cs Layer 1 graph tests (reverse closure, shared files, renames)
tests/Infrastructure.Tests/TestTriggerMap/SelectTestsLayer1IntegrationTests.cs End-to-end Layer 1 integration with real MSBuild graph
tests/Infrastructure.Tests/TestTriggerMap/TestTriggerMapTests.cs Verifier: trigger-map integrity against repo reality
tests/Infrastructure.Tests/TestTriggerMap/TestTriggerMapModel.cs Parallel YAML model for the verifier tests
tests/Infrastructure.Tests/Infrastructure.Tests.csproj Target framework bump to net10.0, new dependencies
docs/ci/test-trigger-selector-design.md Design document for the selective CI system
docs/ci/test-trigger-map.md Reference documentation for the curated map
Directory.Packages.props New package versions: Microsoft.Build 18.3.3, YamlDotNet 16.3.0, MSBuildLocator 1.9.1
Aspire.slnx Adds Infrastructure.Tests and Aspire.Hosting.Maui.Tests to the solution

Comment on lines +106 to +115
private static TestSelector Selector(IEnumerable<string>? projectDirs = null)
{
var dir = Directory.CreateTempSubdirectory("selecttests");
var path = Path.Combine(dir.FullName, "map.yml");
File.WriteAllText(path, SyntheticMap);
return new TestSelector(
path,
s_matrix.ToHashSet(StringComparer.Ordinal),
(projectDirs ?? []).ToHashSet(StringComparer.Ordinal));
}
private static void WriteJobBooleans(RunOptions options, SelectionResult result)
{
var githubOutput = Environment.GetEnvironmentVariable("GITHUB_OUTPUT");
var allJobs = TriggerMap.Load(options.MapPath).AllJobTokens().ToHashSet(StringComparer.Ordinal);
@radical

radical commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Superseded: .dotnet caching deprioritized in favor of the minimal-restore architecture in PR #18166. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants