Conversation
Pull request overview
This PR extends the eng/skill-validator tooling to support a lightweight “selectivity” validation mode (probing whether a skill activates for near-topic prompts), and updates skill token accounting to use BPE tokenization (from the referenced PR #309).
Changes:
- Add `--selectivity-test` (with thresholds) and YAML schema support for `selectivity.should_activate` / `selectivity.should_not_activate`.
- Implement an agent-run probe to detect whether a skill was invoked, and report selectivity results in console/JSON.
- Switch skill profiling thresholds/warnings from the `chars/4` approximation to BPE token counting.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/dotnet-msbuild/build-perf-diagnostics/eval.yaml | Adds selectivity prompts for one skill eval. |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates test data generation to reliably exceed BPE thresholds. |
| eng/skill-validator/src/SkillValidatorYamlContext.cs | Adds YAML source-gen type registration for RawSelectivity. |
| eng/skill-validator/src/SkillValidatorJsonContext.cs | Adds JSON source-gen type registration for selectivity result models. |
| eng/skill-validator/src/SkillValidator.csproj | Adds Microsoft.ML.Tokenizers packages; fixes RunArguments quoting. |
| eng/skill-validator/src/Services/SkillProfiler.cs | Adds BPE tokenizer + BPE-based complexity/warnings and output formatting. |
| eng/skill-validator/src/Services/Reporter.cs | Renders selectivity-only verdicts and selectivity prompt breakdowns. |
| eng/skill-validator/src/Services/EvalSchema.cs | Parses selectivity section into EvalConfig. |
| eng/skill-validator/src/Services/AgentRunner.cs | Adds ProbeSkillActivation lightweight session runner. |
| eng/skill-validator/src/Models/Models.cs | Adds selectivity models + config flags/thresholds. |
| eng/skill-validator/src/Commands/ValidateCommand.cs | Adds CLI options and selectivity-only execution path. |
```csharp
// Launch all probes in parallel
var tasks = new List<Task<SelectivityPromptResult>>();

if (skill.EvalConfig!.ShouldActivatePrompts is { } activatePrompts)
{
    foreach (var prompt in activatePrompts)
    {
        log($"Testing should_activate: \"{Truncate(prompt, 60)}\"");
        tasks.Add(ProbeAndLog(skill, prompt, expectedActivation: true, config, log));
    }
}

if (skill.EvalConfig.ShouldNotActivatePrompts is { } deactivatePrompts)
{
    foreach (var prompt in deactivatePrompts)
    {
        log($"Testing should_not_activate: \"{Truncate(prompt, 60)}\"");
        tasks.Add(ProbeAndLog(skill, prompt, expectedActivation: false, config, log));
    }
}

var promptResults = (await Task.WhenAll(tasks)).ToList();
```
ExecuteSelectivityTest launches one agent session per prompt and awaits them all at once; with larger prompt lists this can create a burst of concurrent sessions and hit rate limits / overwhelm the host. Consider adding a concurrency limiter (similar to scenarios/runs) or reusing an existing Parallel* setting to cap parallel probes.
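A minimal sketch of such a limiter, assuming a `SemaphoreSlim` gate around each probe; `RunThrottledAsync` and `RunProbeAsync` are hypothetical stand-ins for the PR's probe plumbing:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Usage: probes run with at most 2 in flight at once.
var results = await RunThrottledAsync(new[] { "fix build perf", "add a nuget package" }, maxParallel: 2);
Console.WriteLine(string.Join(",", results)); // prints "True,False"

// Hypothetical sketch: cap concurrent probe sessions with SemaphoreSlim,
// similar to the scenarios/runs limiter the comment suggests reusing.
static async Task<bool[]> RunThrottledAsync(IEnumerable<string> prompts, int maxParallel)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var tasks = prompts.Select(async prompt =>
    {
        await gate.WaitAsync();
        try { return await RunProbeAsync(prompt); }
        finally { gate.Release(); }
    });
    return await Task.WhenAll(tasks);
}

// Stand-in for the real agent probe (ProbeAndLog in the PR).
static async Task<bool> RunProbeAsync(string prompt)
{
    await Task.Delay(10); // placeholder for the agent session round-trip
    return prompt.Contains("build");
}
```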
```csharp
// 30s timeout — enough for the agent to reach the skill-loading decision
using var cts = new CancellationTokenSource(30_000);
cts.Token.Register(() => done.TrySetResult(skillActivated));

session.On(evt =>
{
    switch (evt)
    {
        // Skill loaded → we have our answer, bail immediately
        case SkillInvokedEvent:
            skillActivated = true;
            done.TrySetResult(true);
            break;

        // Session finished without loading the skill → not activated
        case SessionIdleEvent:
            done.TrySetResult(skillActivated);
            break;

        case SessionErrorEvent err:
            done.TrySetException(new InvalidOperationException(err.Data.Message ?? "Session error"));
            break;
    }

    if (options.Verbose && evt is SkillInvokedEvent si)
    {
        var write = options.Log ?? (m => Console.Error.WriteLine(m));
        write($"  📘 Skill invoked: {si.Data.Name}");
    }
    if (options.Verbose && evt is ToolExecutionStartEvent ts)
    {
        var write = options.Log ?? (m => Console.Error.WriteLine(m));
        write($"  🔧 {ts.Data.ToolName}");
    }
});

await session.SendAsync(new MessageOptions { Prompt = options.Scenario.Prompt });
return await done.Task;
```
ProbeSkillActivation hard-codes a 30s timeout (new CancellationTokenSource(30_000)) even though the caller creates an EvalScenario with Timeout: 15. This makes probe duration inconsistent with scenario configuration. Consider using options.Scenario.Timeout (or a dedicated config value) and passing the cancellation token through to SendAsync/session operations so probes reliably terminate on timeout.
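A sketch of the suggested shape, assuming the scenario timeout is in seconds; `ProbeWithTimeoutAsync` is a hypothetical wrapper, and the `Task.Delay` stands in for the session call that would receive the token:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Usage: a probe that never completes is cut off by a 1-second scenario timeout.
var activated = await ProbeWithTimeoutAsync(1, async ct =>
{
    await Task.Delay(Timeout.Infinite, ct); // stands in for session.SendAsync(..., ct)
    return true;
});
Console.WriteLine(activated); // prints "False"

// Hypothetical sketch: derive the probe timeout from scenario config
// (e.g. options.Scenario.Timeout) instead of hard-coding 30s, and pass the
// token into the probe so it actually terminates on timeout.
static async Task<bool> ProbeWithTimeoutAsync(int timeoutSeconds, Func<CancellationToken, Task<bool>> probe)
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds));
    try
    {
        return await probe(cts.Token);
    }
    catch (OperationCanceledException)
    {
        return false; // timed out before the skill-loading decision → not activated
    }
}
```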
```csharp
catch
{
    return skillActivated;
}
```
The blanket catch { return skillActivated; } will silently treat session creation/SendAsync failures as “not activated”, which can produce misleading selectivity results and hide infrastructure/model errors. Consider letting exceptions propagate (so selectivity fails with a clear error) or returning a richer result that distinguishes “not activated” from “probe failed”.
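One possible shape for the richer result, sketched with hypothetical names (`ProbeOutcome`, `RunProbe`) rather than the PR's actual types:

```csharp
using System;

var ok = RunProbe(() => true);
var failed = RunProbe(() => throw new InvalidOperationException("session creation failed"));
Console.WriteLine($"{ok.Activated} {failed.Failed}"); // prints "True True"

// Hypothetical sketch: capture failures instead of folding them into false.
static ProbeOutcome RunProbe(Func<bool> probe)
{
    try { return new ProbeOutcome(probe(), null); }
    catch (Exception ex) { return new ProbeOutcome(false, ex.Message); }
}

// A result that distinguishes "did not activate" from "the probe itself failed".
record ProbeOutcome(bool Activated, string? Error)
{
    public bool Failed => Error is not null;
}
```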
```diff
 throw new InvalidOperationException("Eval config must have at least one scenario");

-return new EvalConfig(scenarios);
+return new EvalConfig(scenarios, raw.Selectivity?.ShouldActivate, raw.Selectivity?.ShouldNotActivate);
```
No tests currently cover parsing/round-tripping the new selectivity.should_activate / selectivity.should_not_activate fields. Adding a unit test in EvalSchemaTests that asserts these lists are parsed into EvalConfig (and validates expected behavior when scenarios are empty/missing, if supported) would help prevent regressions.
```diff
 var scenarios = raw.Scenarios?.Select(ParseScenario).ToList();

 if (scenarios is not { Count: > 0 })
     throw new InvalidOperationException("Eval config must have at least one scenario");

-return new EvalConfig(scenarios);
+return new EvalConfig(scenarios, raw.Selectivity?.ShouldActivate, raw.Selectivity?.ShouldNotActivate);
```
ParseEvalConfig still throws when scenarios is missing/empty, which prevents using --selectivity-test with an eval.yaml that only defines selectivity.should_activate/should_not_activate (as described in the PR). Consider allowing zero scenarios when selectivity prompts are present (or relaxing this only when running selectivity mode) so the lightweight probe can run without requiring full scenarios.
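A sketch of the relaxed guard, assuming the parameter types as stand-ins for the PR's raw config fields (`Validate` is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;

// Usage: a selectivity-only config (no scenarios) passes the relaxed guard.
Validate(null, new List<string> { "My build is slow" }, null);
Console.WriteLine("selectivity-only config accepted");

// Hypothetical sketch: require scenarios only when no selectivity prompts
// are present, so a selectivity-only eval.yaml can run.
static void Validate(List<string>? scenarios, List<string>? shouldActivate, List<string>? shouldNotActivate)
{
    bool hasSelectivity = shouldActivate is { Count: > 0 } || shouldNotActivate is { Count: > 0 };
    if (scenarios is not { Count: > 0 } && !hasSelectivity)
        throw new InvalidOperationException("Eval config must have at least one scenario or selectivity prompts");
}
```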
```csharp
// Selectivity-only mode: skip full evaluation, just probe skill activation
if (config.SelectivityTest)
{
    if (skill.EvalConfig is not null
        && (skill.EvalConfig.ShouldActivatePrompts is { Count: > 0 } || skill.EvalConfig.ShouldNotActivatePrompts is { Count: > 0 }))
    {
        log("🎯 Running selectivity test (standalone)...");
        var selectivityResult = await ExecuteSelectivityTest(skill, config, spinner);
        log($"🎯 Selectivity: recall={selectivityResult.Recall:P0}, precision={selectivityResult.Precision:P0} — {(selectivityResult.Passed ? "PASSED" : "FAILED")}");

        return new SkillVerdict
        {
            SkillName = skill.Name,
            SkillPath = skill.Path,
            Passed = selectivityResult.Passed,
            Scenarios = [],
            OverallImprovementScore = 0,
            Reason = selectivityResult.Passed
                ? "Selectivity test passed"
                : $"Selectivity test failed: {selectivityResult.Reason}",
            FailureKind = selectivityResult.Passed ? null : "selectivity_failure",
            ProfileWarnings = profile.Warnings,
            SelectivityResult = selectivityResult,
        };
    }

    log("⏭ Skipping (no selectivity prompts in eval.yaml)");
    return null;
}
```
In --selectivity-test mode this code runs only after the earlier if (skill.EvalConfig.Scenarios.Count == 0) ... return null; guard, so selectivity testing is currently impossible for a skill that provides only selectivity prompts (or has an empty scenarios: list). If selectivity-only eval.yaml files are intended, move the scenario-count skip below the selectivity branch (or only enforce scenarios when not in selectivity mode).
```csharp
// Calculate precision: fraction of should_not_activate prompts that correctly did NOT activate
var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
double precision = shouldNotActivateResults.Count > 0
    ? (double)shouldNotActivateResults.Count(r => !r.SkillActivated) / shouldNotActivateResults.Count
```
The reported precision is currently computed as the fraction of should_not_activate prompts that did not activate (i.e., true negative rate/specificity), not standard precision (TP/(TP+FP)). Either rename the metric/flags (--selectivity-min-*) to match what’s actually measured, or change the calculation to true precision to avoid confusing results and thresholds.
Suggested change:

```diff
-// Calculate precision: fraction of should_not_activate prompts that correctly did NOT activate
-var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
-double precision = shouldNotActivateResults.Count > 0
-    ? (double)shouldNotActivateResults.Count(r => !r.SkillActivated) / shouldNotActivateResults.Count
+// Calculate precision: fraction of activations that were expected (TP / (TP + FP))
+var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
+var truePositives = shouldActivateResults.Count(r => r.SkillActivated);
+var falsePositives = shouldNotActivateResults.Count(r => r.SkillActivated);
+double precision = (truePositives + falsePositives) > 0
+    ? (double)truePositives / (truePositives + falsePositives)
```
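To make the difference concrete, a toy calculation with made-up numbers where the two metrics diverge:

```csharp
using System;
using System.Linq;

// Made-up toy data: 2 should_activate prompts that both activated,
// 2 should_not_activate prompts of which 1 wrongly activated.
var results = new (bool Expected, bool Activated)[]
{
    (true, true), (true, true),
    (false, true), (false, false),
};

int tp = results.Count(r => r.Expected && r.Activated);   // true positives = 2
int fp = results.Count(r => !r.Expected && r.Activated);  // false positives = 1
int tn = results.Count(r => !r.Expected && !r.Activated); // true negatives = 1

double specificity = (double)tn / (tn + fp); // what the PR currently computes
double precision = (double)tp / (tp + fp);   // standard precision

Console.WriteLine($"specificity={specificity:F2} precision={precision:F2}");
// prints "specificity=0.50 precision=0.67"
```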
Skill Validation Results
[1] Quality unchanged but weighted score is -23.8% due to: judgment, tokens (11337 → 52185), quality, tool calls (0 → 3), time (16.3s → 23.7s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
I don't think that

```yaml
scenarios:
  - name: "Analyze analyzer perf"
    prompt: "Build this project..."
    assertions: [...]
    rubric: [...]
  - name: "Activate on slow build query"
    prompt: "My .NET build takes 5 minutes, how can I speed it up?"
    expect_activation: true # (default)
  - name: "Decline NuGet question"
    prompt: "How do I add a NuGet package reference?"
    expect_activation: false
  - name: "Decline Unit test failure"
    prompt: "My unit tests are failing with a NullReferenceException"
    expect_activation: false
```

against how laconic it is with the selectivity definitions:

```yaml
selectivity:
  should_activate:
    - "My .NET build takes over 5 minutes, how can I speed it up?"
  should_not_activate:
    - "How do I add a NuGet package reference to my project?"
    - "My unit tests are failing with a NullReferenceException"
```

Why can't we leave both?
Thanks for sharing your perspective. I now better understand the intent. I'm in favor of using the existing
Skill Validation Results
[1] Quality unchanged but weighted score is -31.6% due to: judgment, quality, tokens (22888 → 106660), tool calls (3 → 6), time (25.0s → 50.2s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
```csharp
var prefix = $"[{skill.Name}/selectivity]";
var log = (string msg) => spinner.Log($"{prefix} {msg}");

// Launch all probes in parallel
```
This can quickly lead to throttling/rejections from the inference API.
@ViktorHofer How about if we flip entirely to the new suggested format? So far there are just 3 usages. Having more expressive and faster scenarios for activation sounds like a good benefit. We might just need to figure out the visualisation in the report and if/how to have this in the dashboards: https://dotnet.github.io/skills/
Sorry, I didn't see your ping. Yes, I'm fine with any solution that doesn't introduce an additional schema.
Note: includes #309
Problem
It is not free to load skills. Each loaded skill eats up LLM context, so it is crucial to make a skill very specific to solving a concrete problem. We should be able to validate whether a skill will be loaded or not based on its frontmatter.
Solution
I've added the `--selectivity-test` option, a new lightweight mode that probes skill activation without running the full evaluation. Each `eval.yaml` should have `should_activate` and `should_not_activate` prompts that are close to the topic of the skill. Validation runs a lightweight agent session and asserts that the skill was loaded for every `should_activate` prompt and was not loaded for any `should_not_activate` prompt.

Usage & Example
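A minimal sketch of the selectivity section in an `eval.yaml`, based on the schema described above (the prompt text is illustrative):

```yaml
selectivity:
  should_activate:
    - "My .NET build takes over 5 minutes, how can I speed it up?"
  should_not_activate:
    - "How do I add a NuGet package reference to my project?"
    - "My unit tests are failing with a NullReferenceException"
```

Running the validator with `--selectivity-test` then probes each prompt with a lightweight agent session and reports recall/precision against the configured thresholds.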