feat: skill loading tests #311

Open
DeagleGross wants to merge 3 commits into main from dmkorolev/skills-loading

@DeagleGross
Member

Note: includes #309

Problem

Loading skills is not free. Each loaded skill consumes part of the LLM's context, so it is crucial that each skill is specific to solving a concrete problem. We should be able to validate, based on its frontmatter, whether a skill will be loaded.

Solution

I've added a --selectivity-test option, a new lightweight mode that probes skill activation without running the full evaluation. Each eval.yaml should define should_activate and should_not_activate prompts under a selectivity section, and those prompts should be close to the topic of the skill. Validation performs a lightweight agent run and asserts that the skill was loaded for all should_activate prompts and was not loaded for any should_not_activate prompt.

Usage & Example

dotnet run --project eng/skill-validator/src -- \
  plugins/dotnet-msbuild/skills/build-perf-diagnostics \
  --tests-dir tests/dotnet-msbuild \
  --selectivity-test --runs 1 --verbose

[build-perf-diagnostics] 🔍 Evaluating...
[build-perf-diagnostics] 📊 📊 build-perf-diagnostics: 1,560 BPE tokens [chars/4: 1,649] (detailed ✓), 12 sections, 10 code blocks
[build-perf-diagnostics] 🎯 Running selectivity test (standalone)...
[build-perf-diagnostics/selectivity] Testing should_activate: "My .NET build takes over 5 minutes, how can I speed it up?"
⠹ Evaluating 1 skill(s)... 📂 C:\Users\dmkorolev\AppData\Local\Temp\sv-6e86045eebd84007834facc17361ee54 (skilled)
[build-perf-diagnostics/selectivity] Testing should_activate: "How do I analyze a binlog to find slow targets in MSBuild?"
[build-perf-diagnostics/selectivity] → ✅ activated: "Our CI builds are fast but local dev builds are p…"
...
[build-perf-diagnostics/selectivity] → ✅ correctly NOT activated: "What's the difference between .NET 8 and .NET 9?"
[build-perf-diagnostics/selectivity] → ✅ correctly NOT activated: "My unit tests are failing with a NullReferenceExc…"
[build-perf-diagnostics] 🎯 Selectivity: recall=100%, precision=100% — PASSED

═══ Skill Validation Results ═══

🎯 Selectivity: recall=100%, precision=100% ✅
✓ "My .NET build takes over 5 minutes, how can I speed it up?" — should activate → activated
✓ "How do I analyze a binlog to find slow targets in MSBuild?" — should activate → activated
✓ "Roslyn analyzers are making my compilation really slow, wha…" — should activate → activated
✓ "I want to profile my MSBuild build to understand where time…" — should activate → activated
✓ "Our CI builds are fast but local dev builds are painfully s…" — should activate → activated
✓ "How do I add a NuGet package reference to my project?" — should NOT activate → not activated
✓ "My unit tests are failing with a NullReferenceException" — should NOT activate → not activated
✓ "How do I configure Docker for my .NET application?" — should NOT activate → not activated
✓ "What's the difference between .NET 8 and .NET 9?" — should NOT activate → not activated
✓ "How do I set up Entity Framework Core migrations?" — should NOT activate → not activated

Copilot AI review requested due to automatic review settings March 10, 2026 12:32
Contributor

Copilot AI left a comment

Pull request overview

This PR extends the eng/skill-validator tooling to support a lightweight “selectivity” validation mode (probing whether a skill activates for near-topic prompts), and updates skill token accounting to use BPE tokenization (from the referenced PR #309).

Changes:

  • Add --selectivity-test (with thresholds) and YAML schema support for selectivity.should_activate / selectivity.should_not_activate.
  • Implement an agent-run probe to detect whether a skill was invoked, and report selectivity results in console/JSON.
  • Switch skill profiling thresholds/warnings from chars/4 approximation to BPE token counting.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/dotnet-msbuild/build-perf-diagnostics/eval.yaml | Adds selectivity prompts for one skill eval. |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates test data generation to reliably exceed BPE thresholds. |
| eng/skill-validator/src/SkillValidatorYamlContext.cs | Adds YAML source-gen type registration for RawSelectivity. |
| eng/skill-validator/src/SkillValidatorJsonContext.cs | Adds JSON source-gen type registration for selectivity result models. |
| eng/skill-validator/src/SkillValidator.csproj | Adds Microsoft.ML.Tokenizers packages; fixes RunArguments quoting. |
| eng/skill-validator/src/Services/SkillProfiler.cs | Adds BPE tokenizer + BPE-based complexity/warnings and output formatting. |
| eng/skill-validator/src/Services/Reporter.cs | Renders selectivity-only verdicts and selectivity prompt breakdowns. |
| eng/skill-validator/src/Services/EvalSchema.cs | Parses selectivity section into EvalConfig. |
| eng/skill-validator/src/Services/AgentRunner.cs | Adds ProbeSkillActivation lightweight session runner. |
| eng/skill-validator/src/Models/Models.cs | Adds selectivity models + config flags/thresholds. |
| eng/skill-validator/src/Commands/ValidateCommand.cs | Adds CLI options and selectivity-only execution path. |

Comment on lines +690 to +712
// Launch all probes in parallel
var tasks = new List<Task<SelectivityPromptResult>>();

if (skill.EvalConfig!.ShouldActivatePrompts is { } activatePrompts)
{
    foreach (var prompt in activatePrompts)
    {
        log($"Testing should_activate: \"{Truncate(prompt, 60)}\"");
        tasks.Add(ProbeAndLog(skill, prompt, expectedActivation: true, config, log));
    }
}

if (skill.EvalConfig.ShouldNotActivatePrompts is { } deactivatePrompts)
{
    foreach (var prompt in deactivatePrompts)
    {
        log($"Testing should_not_activate: \"{Truncate(prompt, 60)}\"");
        tasks.Add(ProbeAndLog(skill, prompt, expectedActivation: false, config, log));
    }
}

var promptResults = (await Task.WhenAll(tasks)).ToList();

Copilot AI Mar 10, 2026

ExecuteSelectivityTest launches one agent session per prompt and awaits them all at once; with larger prompt lists this can create a burst of concurrent sessions and hit rate limits / overwhelm the host. Consider adding a concurrency limiter (similar to scenarios/runs) or reusing an existing Parallel* setting to cap parallel probes.
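
A minimal sketch of such a limiter, assuming a `SemaphoreSlim` gate is acceptable here. `ProbeThrottle`, `RunAsync`, and `maxParallel` are hypothetical names for illustration, not part of the PR, and the probe delegate stands in for something like `ProbeAndLog`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not from the PR): caps how many probes run at once.
public static class ProbeThrottle
{
    // Runs `probe` for every prompt with at most `maxParallel` probes in flight.
    // Results are returned in the same order as `prompts`.
    public static async Task<IReadOnlyList<TResult>> RunAsync<TResult>(
        IReadOnlyList<string> prompts,
        int maxParallel,
        Func<string, Task<TResult>> probe)
    {
        if (maxParallel < 1)
            throw new ArgumentOutOfRangeException(nameof(maxParallel));

        using var gate = new SemaphoreSlim(maxParallel);

        async Task<TResult> RunOne(string prompt)
        {
            await gate.WaitAsync();            // wait for a free slot
            try { return await probe(prompt); }
            finally { gate.Release(); }        // free the slot even if the probe throws
        }

        // All tasks are created up front, but only `maxParallel` get past the gate at a time.
        return await Task.WhenAll(prompts.Select(p => RunOne(p)));
    }
}
```

With this shape, the two `foreach` loops would only collect prompts, and the cap could come from whichever existing parallelism setting the validator already exposes.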

Comment on lines +316 to +353
// 30s timeout: enough for the agent to reach the skill-loading decision
using var cts = new CancellationTokenSource(30_000);
cts.Token.Register(() => done.TrySetResult(skillActivated));

session.On(evt =>
{
    switch (evt)
    {
        // Skill loaded → we have our answer, bail immediately
        case SkillInvokedEvent:
            skillActivated = true;
            done.TrySetResult(true);
            break;

        // Session finished without loading the skill → not activated
        case SessionIdleEvent:
            done.TrySetResult(skillActivated);
            break;

        case SessionErrorEvent err:
            done.TrySetException(new InvalidOperationException(err.Data.Message ?? "Session error"));
            break;
    }

    if (options.Verbose && evt is SkillInvokedEvent si)
    {
        var write = options.Log ?? (m => Console.Error.WriteLine(m));
        write($" 📘 Skill invoked: {si.Data.Name}");
    }
    if (options.Verbose && evt is ToolExecutionStartEvent ts)
    {
        var write = options.Log ?? (m => Console.Error.WriteLine(m));
        write($" 🔧 {ts.Data.ToolName}");
    }
});

await session.SendAsync(new MessageOptions { Prompt = options.Scenario.Prompt });
return await done.Task;

Copilot AI Mar 10, 2026

ProbeSkillActivation hard-codes a 30s timeout (new CancellationTokenSource(30_000)) even though the caller creates an EvalScenario with Timeout: 15. This makes probe duration inconsistent with scenario configuration. Consider using options.Scenario.Timeout (or a dedicated config value) and passing the cancellation token through to SendAsync/session operations so probes reliably terminate on timeout.
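
One way this could look, as a sketch only: the timeout is passed in instead of hard-coded, and an outer cancellation token is linked in. `ProbeTimeout` and `WaitForActivationAsync` are hypothetical names, and the assumption that the scenario `Timeout` is expressed in seconds is not confirmed by the PR:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not from the PR): waits for the probe's completion source
// using a caller-supplied timeout instead of a hard-coded 30 seconds.
public static class ProbeTimeout
{
    // Completes with the value set on `done`, or with `fallback()` if the timeout
    // elapses (or `outer` is cancelled) before anything completes `done`.
    public static async Task<bool> WaitForActivationAsync(
        TaskCompletionSource<bool> done,
        TimeSpan timeout,
        Func<bool> fallback,
        CancellationToken outer = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(timeout);

        // On timeout or cancellation, resolve with whatever was observed so far.
        cts.Token.Register(() => done.TrySetResult(fallback()));

        return await done.Task;
    }
}
```

A caller could then pass something like `TimeSpan.FromSeconds(options.Scenario.Timeout)`, assuming that value is in seconds, so the probe duration follows the scenario configuration.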

Comment on lines +355 to +358
catch
{
    return skillActivated;
}

Copilot AI Mar 10, 2026

The blanket catch { return skillActivated; } will silently treat session creation/SendAsync failures as “not activated”, which can produce misleading selectivity results and hide infrastructure/model errors. Consider letting exceptions propagate (so selectivity fails with a clear error) or returning a richer result that distinguishes “not activated” from “probe failed”.
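
The "richer result" option might be sketched as below. `ProbeOutcome`, `ProbeResult`, and `SafeProbe` are hypothetical types introduced for illustration and do not exist in the PR:

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical result model (not from the PR): keeps "probe failed" separate
// from "skill was not activated".
public enum ProbeOutcome { Activated, NotActivated, Failed }

public sealed record ProbeResult(ProbeOutcome Outcome, string? Error = null);

public static class SafeProbe
{
    // Wraps a probe that returns true when the skill was activated.
    // Exceptions become a Failed outcome instead of being reported as NotActivated.
    public static async Task<ProbeResult> RunAsync(Func<Task<bool>> probe)
    {
        try
        {
            bool activated = await probe();
            return new ProbeResult(activated ? ProbeOutcome.Activated : ProbeOutcome.NotActivated);
        }
        catch (Exception ex)
        {
            return new ProbeResult(ProbeOutcome.Failed, ex.Message);
        }
    }
}
```

The reporter could then show failed probes separately and leave them out of the recall and precision figures, which is a design choice the PR would still need to make.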

      throw new InvalidOperationException("Eval config must have at least one scenario");

- return new EvalConfig(scenarios);
+ return new EvalConfig(scenarios, raw.Selectivity?.ShouldActivate, raw.Selectivity?.ShouldNotActivate);

Copilot AI Mar 10, 2026

No tests currently cover parsing/round-tripping the new selectivity.should_activate / selectivity.should_not_activate fields. Adding a unit test in EvalSchemaTests that asserts these lists are parsed into EvalConfig (and validates expected behavior when scenarios are empty/missing, if supported) would help prevent regressions.

Comment on lines 19 to +24
  var scenarios = raw.Scenarios?.Select(ParseScenario).ToList();

  if (scenarios is not { Count: > 0 })
      throw new InvalidOperationException("Eval config must have at least one scenario");

- return new EvalConfig(scenarios);
+ return new EvalConfig(scenarios, raw.Selectivity?.ShouldActivate, raw.Selectivity?.ShouldNotActivate);

Copilot AI Mar 10, 2026

ParseEvalConfig still throws when scenarios is missing/empty, which prevents using --selectivity-test with an eval.yaml that only defines selectivity.should_activate/should_not_activate (as described in the PR). Consider allowing zero scenarios when selectivity prompts are present (or relaxing this only when running selectivity mode) so the lightweight probe can run without requiring full scenarios.
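
A sketch of the relaxed rule, under the assumption that a config is acceptable when it has either scenarios or selectivity prompts. `EvalConfigValidation.Validate` is a hypothetical helper, and plain string collections stand in for the real scenario and prompt types:

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical validation helper (not from the PR).
public static class EvalConfigValidation
{
    // Returns null when the config is acceptable, otherwise an error message.
    public static string? Validate(
        IReadOnlyCollection<string>? scenarios,
        IReadOnlyCollection<string>? shouldActivate,
        IReadOnlyCollection<string>? shouldNotActivate)
    {
        bool hasScenarios = scenarios is { Count: > 0 };
        bool hasSelectivity = shouldActivate is { Count: > 0 }
                              || shouldNotActivate is { Count: > 0 };

        // Assumed rule: a selectivity-only eval.yaml is valid; an entirely empty one is not.
        return hasScenarios || hasSelectivity
            ? null
            : "Eval config must have at least one scenario or selectivity prompt";
    }
}
```

The alternative mentioned above, enforcing scenarios only outside selectivity mode, would instead require the parser to know which mode is active.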

Comment on lines +345 to +373
// Selectivity-only mode: skip full evaluation, just probe skill activation
if (config.SelectivityTest)
{
    if (skill.EvalConfig is not null
        && (skill.EvalConfig.ShouldActivatePrompts is { Count: > 0 } || skill.EvalConfig.ShouldNotActivatePrompts is { Count: > 0 }))
    {
        log("🎯 Running selectivity test (standalone)...");
        var selectivityResult = await ExecuteSelectivityTest(skill, config, spinner);
        log($"🎯 Selectivity: recall={selectivityResult.Recall:P0}, precision={selectivityResult.Precision:P0} — {(selectivityResult.Passed ? "PASSED" : "FAILED")}");

        return new SkillVerdict
        {
            SkillName = skill.Name,
            SkillPath = skill.Path,
            Passed = selectivityResult.Passed,
            Scenarios = [],
            OverallImprovementScore = 0,
            Reason = selectivityResult.Passed
                ? "Selectivity test passed"
                : $"Selectivity test failed: {selectivityResult.Reason}",
            FailureKind = selectivityResult.Passed ? null : "selectivity_failure",
            ProfileWarnings = profile.Warnings,
            SelectivityResult = selectivityResult,
        };
    }

    log("⏭ Skipping (no selectivity prompts in eval.yaml)");
    return null;
}

Copilot AI Mar 10, 2026

In --selectivity-test mode this code runs only after the earlier if (skill.EvalConfig.Scenarios.Count == 0) ... return null; guard, so selectivity testing is currently impossible for a skill that provides only selectivity prompts (or has an empty scenarios: list). If selectivity-only eval.yaml files are intended, move the scenario-count skip below the selectivity branch (or only enforce scenarios when not in selectivity mode).

Comment on lines +719 to +722
// Calculate precision: fraction of should_not_activate prompts that correctly did NOT activate
var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
double precision = shouldNotActivateResults.Count > 0
    ? (double)shouldNotActivateResults.Count(r => !r.SkillActivated) / shouldNotActivateResults.Count

Copilot AI Mar 10, 2026

The reported precision is currently computed as the fraction of should_not_activate prompts that did not activate (i.e., true negative rate/specificity), not standard precision (TP/(TP+FP)). Either rename the metric/flags (--selectivity-min-*) to match what’s actually measured, or change the calculation to true precision to avoid confusing results and thresholds.

Suggested change
- // Calculate precision: fraction of should_not_activate prompts that correctly did NOT activate
- var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
- double precision = shouldNotActivateResults.Count > 0
-     ? (double)shouldNotActivateResults.Count(r => !r.SkillActivated) / shouldNotActivateResults.Count
+ // Calculate precision: fraction of activations that were expected (TP / (TP + FP))
+ var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
+ var truePositives = shouldActivateResults.Count(r => r.SkillActivated);
+ var falsePositives = shouldNotActivateResults.Count(r => r.SkillActivated);
+ double precision = (truePositives + falsePositives) > 0
+     ? (double)truePositives / (truePositives + falsePositives)
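
To make the difference between the two metrics concrete, here is a small self-contained sketch that computes recall, precision, and specificity side by side. `PromptResult`, `SelectivityMetrics`, and `Selectivity.Compute` are hypothetical stand-ins for the PR's models, and returning 1.0 for an empty denominator is an assumption, not the PR's behavior:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the PR's selectivity models.
public readonly record struct PromptResult(bool ExpectedActivation, bool SkillActivated);

public sealed record SelectivityMetrics(double Recall, double Precision, double Specificity);

public static class Selectivity
{
    public static SelectivityMetrics Compute(IReadOnlyCollection<PromptResult> results)
    {
        int tp = results.Count(r => r.ExpectedActivation && r.SkillActivated);   // expected and activated
        int fn = results.Count(r => r.ExpectedActivation && !r.SkillActivated);  // expected, not activated
        int fp = results.Count(r => !r.ExpectedActivation && r.SkillActivated);  // unexpected activation
        int tn = results.Count(r => !r.ExpectedActivation && !r.SkillActivated); // correctly not activated

        // Assumption: an empty denominator counts as a perfect score.
        static double Ratio(int num, int den) => den > 0 ? (double)num / den : 1.0;

        return new SelectivityMetrics(
            Recall: Ratio(tp, tp + fn),        // TP / (TP + FN)
            Precision: Ratio(tp, tp + fp),     // TP / (TP + FP)
            Specificity: Ratio(tn, tn + fp));  // TN / (TN + FP), what the PR currently reports as precision
    }
}
```

For example, with two should_activate prompts that both activate and eight should_not_activate prompts of which two activate, recall is 1.0, precision is 2/4 = 0.5, and specificity is 6/8 = 0.75, so the same run can pass or fail a threshold depending on which definition is used.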

@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
csharp-scripts Test a C# language feature with a script 3.0/5 → 4.0/5 🟢 ✅ csharp-scripts; tools: skill, create 🟡 0.33
nuget-trusted-publishing Set up trusted publishing for a new NuGet library 3.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill ✅ 0.15
nuget-trusted-publishing Set up NuGet publishing without mentioning trusted publishing 2.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, report_intent, glob, view, bash, stop_bash ✅ 0.15
nuget-trusted-publishing Migrate existing workflow from API key to trusted publishing 3.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill ✅ 0.15
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET 8+) 4.0/5 → 5.0/5 🟢 ✅ dotnet-pinvoke; tools: skill ✅ 0.05
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET Framework) 4.0/5 → 5.0/5 🟢 ✅ dotnet-pinvoke; tools: skill ✅ 0.05
dotnet-trace-collect High CPU in Kubernetes on Linux (.NET 8) 4.0/5 → 4.0/5 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.13
dotnet-trace-collect .NET Framework on Windows without admin privileges 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.13
dotnet-trace-collect .NET 10 on Linux with root access and native call stacks 1.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.13
dotnet-trace-collect Memory leak on Linux (.NET 8) 2.0/5 → 3.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.13
dotnet-trace-collect Slow requests on Windows with PerfView 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.13 [1]
dotnet-trace-collect Excessive GC on Linux (.NET 8) 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill ✅ 0.13 [2]
dotnet-trace-collect Hang or deadlock diagnosis on Linux 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.13
dotnet-trace-collect Windows container high CPU with PerfView 1.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view, glob ✅ 0.13
dotnet-trace-collect Long-running intermittent issue with PerfView triggers 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.13
dotnet-trace-collect Linux pre-.NET 10 needing native call stacks 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.13
dotnet-trace-collect Windows modern .NET with admin high CPU 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.13
dotnet-trace-collect Memory leak on .NET Framework Windows 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.13
dotnet-trace-collect Kubernetes with console access prefers console tools 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill ✅ 0.13 [3]
dotnet-trace-collect Container installation without .NET SDK 4.0/5 → 4.0/5 ✅ dotnet-trace-collect; tools: skill ✅ 0.13
dotnet-trace-collect HTTP 500s from downstream service on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.13
dotnet-trace-collect Networking timeouts on Windows with admin (.NET 8) 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.13
microbenchmarking Investigate runtime upgrade performance impact 3.0/5 → 5.0/5 🟢 ✅ microbenchmarking; tools: skill, glob, stop_bash ✅ 0.10
clr-activation-debugging Diagnose unexpected FOD dialog from native build tool 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.09
clr-activation-debugging Diagnose FOD suppressed but activation still failing 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.09
clr-activation-debugging Explain why same binary behaves differently under different launch methods 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.09
clr-activation-debugging Analyze healthy managed EXE activation 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.09
clr-activation-debugging Identify multiple activation sequences in a single log 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.09
clr-activation-debugging Explain useLegacyV2RuntimeActivationPolicy in activation log 2.0/5 → 3.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.09
clr-activation-debugging Decline non-CLR-activation issue 1.0/5 → 5.0/5 🟢 ℹ️ not activated (expected) ✅ 0.09
analyzing-dotnet-performance Detects compiled regex startup budget and regex chain allocations 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Detects CurrentCulture comparer and compiled regex budget in inflection rules 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Finds per-call Dictionary allocation not hoisted to static 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Catches compound allocations in recursive number converter with ToLower 1.0/5 → 4.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Finds StringComparison.Ordinal missing and FrozenDictionary opportunities 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Detects Aggregate+Replace chain and struct missing IEquatable 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Finds branched Replace chain in format string manipulation 1.0/5 → 1.0/5 ⏰ ✅ analyzing-dotnet-performance; tools: skill, read_bash, stop_bash ✅ 0.12
analyzing-dotnet-performance Catches LINQ on hot-path string processing and All(char.IsUpper) 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Detects LINQ pipeline in TimeSpan formatting and collection processing 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Flags Span inconsistencies and compound method chains in truncation library 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
analyzing-dotnet-performance Identifies unsealed leaf classes and locale hierarchy patterns 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.12
android-tombstone-symbolication Symbolicate .NET frames in an Android tombstone 4.0/5 → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.19
android-tombstone-symbolication Recognize tombstone with no .NET frames 5.0/5 → 5.0/5 ✅ android-tombstone-symbolication; tools: skill ✅ 0.19 [4]
android-tombstone-symbolication Symbolicate CoreCLR frames in an Android tombstone 3.0/5 → 3.0/5 ✅ android-tombstone-symbolication; tools: skill, stop_bash ✅ 0.19
android-tombstone-symbolication Recognize NativeAOT tombstone with app binary and libSystem.Native.so 3.0/5 → 4.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, bash, glob ✅ 0.19
android-tombstone-symbolication Symbolicate multi-thread tombstone 4.0/5 → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.19
android-tombstone-symbolication Handle .NET frames with no BuildId metadata 5.0/5 → 4.0/5 🔴 ✅ android-tombstone-symbolication; tools: skill, bash, glob ✅ 0.19 [5]
android-tombstone-symbolication Symbolicate tombstone with multiple .NET libraries and different BuildIds 4.0/5 → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.19
android-tombstone-symbolication Reject iOS crash log as wrong format 4.0/5 → 5.0/5 🟢 ℹ️ not activated (expected) ✅ 0.19
dump-collect Configure automatic crash dumps for CoreCLR app on Linux 5.0/5 → 5.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.25
dump-collect Set up NativeAOT crash dumps with createdump in Kubernetes 2.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.25
dump-collect Recover crash dump from macOS NativeAOT without createdump 2.0/5 → 4.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.25
dump-collect Configure CoreCLR dump collection in Alpine Docker as non-root 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.25
dump-collect Advisory: macOS NativeAOT crash dump recovery steps 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.25
dump-collect Advisory: CoreCLR Alpine Docker non-root configuration 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.25
dump-collect Advisory: NativeAOT Kubernetes dump collection setup 3.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.25
dump-collect Detect runtime and configure crash dumps for unknown .NET app on Linux 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill 🟡 0.25
dump-collect Decline dump analysis request 2.0/5 → 4.0/5 🟢 ℹ️ not activated (expected) 🟡 0.25
optimizing-ef-core-queries Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete 4.0/5 → 4.0/5 ✅ optimizing-ef-core-queries; tools: skill ✅ 0.16
build-parallelism Analyze build parallelism bottlenecks 4.0/5 → 5.0/5 🟢 ✅ build-parallelism; binlog-generation; tools: skill, task, glob ✅ 0.14
including-generated-files Diagnose generated file inclusion failure 3.0/5 → 5.0/5 🟢 ✅ including-generated-files; tools: skill 🟡 0.24
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 5.0/5 → 5.0/5 ✅ msbuild-antipatterns; tools: skill ✅ 0.06
build-perf-baseline Establish build performance baseline and recommend optimizations 3.0/5 → 4.0/5 🟢 ✅ build-perf-baseline; build-perf-diagnostics; tools: skill 🟡 0.30
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 → 5.0/5 ✅ msbuild-modernization; tools: skill ✅ 0.04
directory-build-organization Organize build infrastructure for a multi-project repo 3.0/5 → 5.0/5 🟢 ✅ directory-build-organization; msbuild-antipatterns; tools: skill ✅ 0.13
check-bin-obj-clash Diagnose bin/obj output path clashes 5.0/5 → 5.0/5 ✅ check-bin-obj-clash; binlog-generation; tools: skill, glob, edit ✅ 0.15 [6]
incremental-build Analyze incremental build issues 3.0/5 → 4.0/5 🟢 ✅ incremental-build; tools: skill ✅ 0.12
eval-performance Analyze MSBuild evaluation performance issues 4.0/5 → 5.0/5 🟢 ✅ eval-performance; tools: skill ✅ 0.11
build-perf-diagnostics Analyze analyzer performance impact on builds 5.0/5 → 5.0/5 ✅ binlog-generation; binlog-failure-analysis; build-perf-diagnostics; tools: skill, edit 🟡 0.25 [7]
binlog-generation Build project with /bl flag 1.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill 🟡 0.40
binlog-generation Build with /bl in PowerShell 3.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill 🟡 0.40
binlog-generation Build multiple configurations with unique binlogs 2.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill 🟡 0.40
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 3.0/5 → 5.0/5 🟢 ✅ binlog-failure-analysis; tools: skill ✅ 0.05
dotnet-maui-doctor Plan macOS MAUI setup with Xcode 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.23
dotnet-maui-doctor Plan Linux MAUI environment for Android 3.0/5 → 3.0/5 ⏰ ✅ dotnet-maui-doctor; tools: skill, view, glob 🟡 0.23
dotnet-maui-doctor Guardrail against workload update and repair 1.0/5 → 3.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.23
dotnet-maui-doctor Diagnose non-Microsoft JDK causing build failure 2.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.23
dotnet-maui-doctor Plan complete MAUI setup on Windows 3.0/5 → 3.0/5 ✅ dotnet-maui-doctor; tools: skill, bash 🟡 0.23 [8]
dotnet-maui-doctor Prevent incorrect JAVA_HOME configuration 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill 🟡 0.23
dotnet-maui-doctor Determine required Android SDK packages for specific .NET version 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view, bash, glob 🟡 0.23
dotnet-maui-doctor Fix stale MAUI workloads after SDK update 2.0/5 → 4.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill, glob 🟡 0.23
thread-abort-migration Worker thread with abort-based cancellation 4.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration Timeout enforcement via Thread.Abort 4.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration Blocking WaitHandle with Thread.Interrupt 3.0/5 → 4.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration ASP.NET Response.End and Response.Redirect with Thread.Abort 4.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration Thread.Join and Thread.Sleep only — should not migrate 3.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
migrate-nullable-references Enable NRT in a small library with mixed nullability 5.0/5 → 5.0/5 ✅ migrate-nullable-references; tools: skill ✅ 0.14 [9]
migrate-nullable-references File-by-file migration: only modify the targeted file 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.14
migrate-nullable-references Enable NRT in ASP.NET Core Web API with EF Core 4.0/5 → 3.0/5 🔴 ✅ migrate-nullable-references; tools: skill ✅ 0.14
dotnet-aot-compat Make Azure.ResourceManager AOT-compatible 1.0/5 → 3.0/5 ⏰ 🟢 ✅ dotnet-aot-compat; tools: skill, create, read_agent, stop_bash ✅ 0.12

[1] Quality unchanged but weighted score is -23.8% due to: judgment, tokens (11337 → 52185), quality, tool calls (0 → 3), time (16.3s → 23.7s)
[2] Quality unchanged but weighted score is -2.2% due to: tokens (33609 → 50757)
[3] Quality unchanged but weighted score is -8.5% due to: tokens (11141 → 30808), tool calls (0 → 1), time (12.3s → 17.4s)
[4] Quality unchanged but weighted score is -7.9% due to: tokens (23024 → 43425), tool calls (2 → 4), time (11.0s → 15.4s)
[5] Quality dropped but weighted score is +8.4% due to: efficiency metrics
[6] Quality unchanged but weighted score is -29.8% due to: judgment, quality, tokens (94044 → 320434), tool calls (15 → 25), time (61.4s → 98.7s)
[7] Quality unchanged but weighted score is -0.9% due to: tokens (219911 → 267999)
[8] Quality unchanged but weighted score is -2.4% due to: completion (✓ → ✗), tokens (34906 → 94171), tool calls (3 → 12), time (39.3s → 69.6s)
[9] Quality unchanged but weighted score is -7.0% due to: tokens (119804 → 272015), time (68.3s → 98.9s), tool calls (25 → 34)

timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

@JanKrivanek
Member

There is an expect_activation option in eval.yaml for this case. Can you reuse/expand that one?

@DeagleGross
Member Author

DeagleGross commented Mar 10, 2026

> There is an expect_activation option in eval.yaml for this case. Can you reuse/expand that one?

I don't think expect_activation is suitable for such a "lightweight" definition of tests, since the eval.yaml schema becomes "heavy". Consider:

scenarios:
  - name: "Analyze analyzer perf"
    prompt: "Build this project..."
    assertions: [...]
    rubric: [...]
  
  - name: "Activate on slow build query"
    prompt: "My .NET build takes 5 minutes, how can I speed it up?"
    expect_activation: true  # default
  
  - name: "Decline NuGet question"  
    prompt: "How do I add a NuGet package reference?"
    expect_activation: false

  - name: "Decline Unit test failure"  
    prompt: "My unit tests are failing with a NullReferenceException"
    expect_activation: false

Compare that with how concise the selectivity definition is:

selectivity:
  should_activate:
    - "My .NET build takes over 5 minutes, how can I speed it up?"
  should_not_activate:
    - "How do I add a NuGet package reference to my project?"
    - "My unit tests are failing with a NullReferenceException"

Why can't we keep both expect_activation and selectivity? The first can be used when you also need a rubric to assert something (for example, with expect_activation: false, that the agent did not take 15 minutes to run the query or did not invoke other tools). Selectivity can be used when only this prompt validation is needed.

@ViktorHofer
Copy link
Copy Markdown
Member

Thanks for sharing your perspective. I now better understand the intent. I'm in favor of using the existing expect_activation parameter. It's used in the existing eval tests and people are familiar with it. I would also rather not support both to keep the eval.yaml schema minimal.

@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| csharp-scripts | Test a C# language feature with a script | 3.0/5 → 5.0/5 🟢 | ✅ csharp-scripts; tools: skill, create | 🟡 0.34 | |
| nuget-trusted-publishing | Set up trusted publishing for a new NuGet library | 3.0/5 → 5.0/5 🟢 | ✅ nuget-trusted-publishing; tools: skill | ✅ 0.10 | |
| nuget-trusted-publishing | Set up NuGet publishing without mentioning trusted publishing | 3.0/5 → 5.0/5 🟢 | ✅ nuget-trusted-publishing; tools: skill, report_intent, glob, view, bash, create | ✅ 0.10 | |
| nuget-trusted-publishing | Migrate existing workflow from API key to trusted publishing | 2.0/5 → 4.0/5 🟢 | ✅ nuget-trusted-publishing; tools: skill, glob | ✅ 0.10 | |
| dotnet-pinvoke | Generate LibraryImport declaration from C header (.NET 8+) | 4.0/5 → 5.0/5 🟢 | ✅ dotnet-pinvoke; tools: skill | ✅ 0.08 | |
| dotnet-pinvoke | Generate LibraryImport declaration from C header (.NET Framework) | 4.0/5 → 5.0/5 🟢 | ✅ dotnet-pinvoke; tools: skill | ✅ 0.08 | |
| dotnet-trace-collect | High CPU in Kubernetes on Linux (.NET 8) | 4.0/5 → 4.0/5 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| dotnet-trace-collect | .NET Framework on Windows without admin privileges | 2.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill | ✅ 0.10 | |
| dotnet-trace-collect | .NET 10 on Linux with root access and native call stacks | 1.0/5 → 4.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill | ✅ 0.10 | |
| dotnet-trace-collect | Memory leak on Linux (.NET 8) | 3.0/5 → 3.0/5 | ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob | ✅ 0.10 | |
| dotnet-trace-collect | Slow requests on Windows with PerfView | 5.0/5 → 5.0/5 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| dotnet-trace-collect | Excessive GC on Linux (.NET 8) | 5.0/5 → 5.0/5 | ✅ dotnet-trace-collect; tools: skill | ✅ 0.10 | [1] |
| dotnet-trace-collect | Hang or deadlock diagnosis on Linux | 3.0/5 → 4.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| dotnet-trace-collect | Windows container high CPU with PerfView | 2.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: report_intent, skill, view | ✅ 0.10 | |
| dotnet-trace-collect | Long-running intermittent issue with PerfView triggers | 3.0/5 → 4.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| dotnet-trace-collect | Linux pre-.NET 10 needing native call stacks | 2.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob | ✅ 0.10 | |
| dotnet-trace-collect | Windows modern .NET with admin high CPU | 2.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob | ✅ 0.10 | |
| dotnet-trace-collect | Memory leak on .NET Framework Windows | 3.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| dotnet-trace-collect | Kubernetes with console access prefers console tools | 5.0/5 → 5.0/5 | ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob | ✅ 0.10 | [2] |
| dotnet-trace-collect | Container installation without .NET SDK | 4.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill | ✅ 0.10 | |
| dotnet-trace-collect | HTTP 500s from downstream service on Linux (.NET 8) | 5.0/5 → 5.0/5 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| dotnet-trace-collect | Networking timeouts on Windows with admin (.NET 8) | 2.0/5 → 5.0/5 🟢 | ✅ dotnet-trace-collect; tools: skill, report_intent, view | ✅ 0.10 | |
| microbenchmarking | Investigate runtime upgrade performance impact | 4.0/5 → 5.0/5 🟢 | ✅ microbenchmarking; tools: skill, glob | ✅ 0.12 | |
| clr-activation-debugging | Diagnose unexpected FOD dialog from native build tool | 1.0/5 → 5.0/5 🟢 | ✅ clr-activation-debugging; tools: skill | ✅ 0.07 | |
| clr-activation-debugging | Diagnose FOD suppressed but activation still failing | 1.0/5 → 5.0/5 🟢 | ✅ clr-activation-debugging; tools: skill | ✅ 0.07 | |
| clr-activation-debugging | Explain why same binary behaves differently under different launch methods | 1.0/5 → 5.0/5 🟢 | ✅ clr-activation-debugging; tools: skill | ✅ 0.07 | |
| clr-activation-debugging | Analyze healthy managed EXE activation | 1.0/5 → 2.0/5 ⏰ 🟢 | ✅ clr-activation-debugging; tools: skill | ✅ 0.07 | |
| clr-activation-debugging | Identify multiple activation sequences in a single log | 1.0/5 → 5.0/5 🟢 | ✅ clr-activation-debugging; tools: skill, task, glob | ✅ 0.07 | |
| clr-activation-debugging | Explain useLegacyV2RuntimeActivationPolicy in activation log | 3.0/5 → 3.0/5 | ✅ clr-activation-debugging; tools: skill | ✅ 0.07 | |
| clr-activation-debugging | Decline non-CLR-activation issue | 1.0/5 → 5.0/5 🟢 | ℹ️ not activated (expected) | ✅ 0.07 | |
| analyzing-dotnet-performance | Detects compiled regex startup budget and regex chain allocations | 1.0/5 → 3.0/5 ⏰ 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Detects CurrentCulture comparer and compiled regex budget in inflection rules | 1.0/5 → 5.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Finds per-call Dictionary allocation not hoisted to static | 1.0/5 → 1.0/5 ⏰ | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | [3] |
| analyzing-dotnet-performance | Catches compound allocations in recursive number converter with ToLower | 1.0/5 → 4.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Finds StringComparison.Ordinal missing and FrozenDictionary opportunities | 1.0/5 → 4.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Detects Aggregate+Replace chain and struct missing IEquatable | 1.0/5 → 5.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Finds branched Replace chain in format string manipulation | 1.0/5 → 3.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Catches LINQ on hot-path string processing and All(char.IsUpper) | 1.0/5 → 5.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Detects LINQ pipeline in TimeSpan formatting and collection processing | 1.0/5 → 5.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill, read_bash | ✅ 0.13 | |
| analyzing-dotnet-performance | Flags Span inconsistencies and compound method chains in truncation library | 1.0/5 → 5.0/5 🟢 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| analyzing-dotnet-performance | Identifies unsealed leaf classes and locale hierarchy patterns | 1.0/5 → 1.0/5 ⏰ | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.13 | |
| android-tombstone-symbolication | Symbolicate .NET frames in an Android tombstone | 4.0/5 → 5.0/5 🟢 | ✅ android-tombstone-symbolication; tools: skill | 🟡 0.22 | |
| android-tombstone-symbolication | Recognize tombstone with no .NET frames | 5.0/5 → 5.0/5 | ✅ android-tombstone-symbolication; tools: skill | 🟡 0.22 | |
| android-tombstone-symbolication | Symbolicate CoreCLR frames in an Android tombstone | 3.0/5 → 5.0/5 🟢 | ✅ android-tombstone-symbolication; tools: skill | 🟡 0.22 | |
| android-tombstone-symbolication | Recognize NativeAOT tombstone with app binary and libSystem.Native.so | 3.0/5 → 4.0/5 🟢 | ✅ android-tombstone-symbolication; tools: skill, stop_bash | 🟡 0.22 | |
| android-tombstone-symbolication | Symbolicate multi-thread tombstone | 4.0/5 → 4.0/5 | ✅ android-tombstone-symbolication; tools: skill, glob | 🟡 0.22 | |
| android-tombstone-symbolication | Handle .NET frames with no BuildId metadata | 5.0/5 → 5.0/5 | ✅ android-tombstone-symbolication; tools: skill, bash, glob | 🟡 0.22 | |
| android-tombstone-symbolication | Symbolicate tombstone with multiple .NET libraries and different BuildIds | 3.0/5 → 5.0/5 🟢 | ✅ android-tombstone-symbolication; tools: skill, glob | 🟡 0.22 | |
| android-tombstone-symbolication | Reject iOS crash log as wrong format | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | 🟡 0.22 | [4] |
| dump-collect | Configure automatic crash dumps for CoreCLR app on Linux | 5.0/5 → 5.0/5 | ✅ dump-collect; tools: skill, report_intent, view, glob | 🟡 0.24 | |
| dump-collect | Set up NativeAOT crash dumps with createdump in Kubernetes | 2.0/5 → 5.0/5 🟢 | ✅ dump-collect; tools: skill | 🟡 0.24 | |
| dump-collect | Recover crash dump from macOS NativeAOT without createdump | 2.0/5 → 5.0/5 🟢 | ✅ dump-collect; tools: skill, report_intent, view, glob, bash | 🟡 0.24 | |
| dump-collect | Configure CoreCLR dump collection in Alpine Docker as non-root | 4.0/5 → 4.0/5 | ✅ dump-collect; tools: skill, view | 🟡 0.24 | |
| dump-collect | Advisory: macOS NativeAOT crash dump recovery steps | 4.0/5 → 5.0/5 🟢 | ✅ dump-collect; tools: skill | 🟡 0.24 | |
| dump-collect | Advisory: CoreCLR Alpine Docker non-root configuration | 4.0/5 → 5.0/5 🟢 | ✅ dump-collect; tools: skill, report_intent, view | 🟡 0.24 | |
| dump-collect | Advisory: NativeAOT Kubernetes dump collection setup | 2.0/5 → 5.0/5 🟢 | ✅ dump-collect; tools: skill | 🟡 0.24 | |
| dump-collect | Detect runtime and configure crash dumps for unknown .NET app on Linux | 3.0/5 → 4.0/5 🟢 | ✅ dump-collect; tools: skill | 🟡 0.24 | [5] |
| dump-collect | Decline dump analysis request | 2.0/5 → 4.0/5 🟢 | ℹ️ not activated (expected) | 🟡 0.24 | |
| optimizing-ef-core-queries | Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete | 5.0/5 → 5.0/5 | ✅ optimizing-ef-core-queries; tools: skill | 🟡 0.25 | [6] |
| build-parallelism | Analyze build parallelism bottlenecks | 1.0/5 ⏰ → 3.0/5 🟢 | ✅ build-parallelism; binlog-generation; tools: skill | ✅ 0.14 | |
| including-generated-files | Diagnose generated file inclusion failure | 3.0/5 → 5.0/5 🟢 | ✅ including-generated-files; tools: skill | 🟡 0.23 | |
| msbuild-antipatterns | Review MSBuild files for anti-patterns and style issues | 5.0/5 → 5.0/5 | ✅ msbuild-antipatterns; tools: skill, glob | ✅ 0.06 | |
| build-perf-baseline | Establish build performance baseline and recommend optimizations | 3.0/5 → 4.0/5 🟢 | ✅ build-perf-baseline; tools: skill | 🟡 0.21 | |
| msbuild-modernization | Modernize legacy project to SDK-style | 5.0/5 → 5.0/5 | ✅ msbuild-modernization; tools: skill | ✅ 0.05 | |
| directory-build-organization | Organize build infrastructure for a multi-project repo | 3.0/5 → 5.0/5 🟢 | ✅ directory-build-organization; msbuild-antipatterns; tools: skill | ✅ 0.15 | |
| check-bin-obj-clash | Diagnose bin/obj output path clashes | 5.0/5 → 5.0/5 | ✅ check-bin-obj-clash; binlog-generation; tools: skill, glob | ✅ 0.14 | [7] |
| incremental-build | Analyze incremental build issues | 3.0/5 → 5.0/5 🟢 | ✅ incremental-build; tools: skill, bash | ✅ 0.14 | |
| eval-performance | Analyze MSBuild evaluation performance issues | 4.0/5 → 5.0/5 🟢 | ✅ eval-performance; tools: skill | ✅ 0.11 | |
| build-perf-diagnostics | Analyze analyzer performance impact on builds | 1.0/5 ⏰ → 5.0/5 🟢 | ✅ build-perf-diagnostics; tools: skill | 🟡 0.31 | |
| binlog-generation | Build project with /bl flag | 2.0/5 → 5.0/5 🟢 | ✅ binlog-generation; tools: skill | 🟡 0.46 | |
| binlog-generation | Build with /bl in PowerShell | 3.0/5 → 5.0/5 🟢 | ✅ binlog-generation; tools: skill | 🟡 0.46 | |
| binlog-generation | Build multiple configurations with unique binlogs | 5.0/5 → 5.0/5 | ✅ binlog-generation; tools: skill | 🟡 0.46 | |
| binlog-failure-analysis | Diagnose build failures from binlog only (no source files) | 5.0/5 → 5.0/5 | ✅ binlog-failure-analysis; tools: skill | ✅ 0.05 | [8] |
| dotnet-maui-doctor | Plan macOS MAUI setup with Xcode | 3.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill | 🟡 0.21 | |
| dotnet-maui-doctor | Plan Linux MAUI environment for Android | 3.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill, bash | 🟡 0.21 | |
| dotnet-maui-doctor | Guardrail against workload update and repair | 1.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill | 🟡 0.21 | |
| dotnet-maui-doctor | Diagnose non-Microsoft JDK causing build failure | 1.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill | 🟡 0.21 | |
| dotnet-maui-doctor | Plan complete MAUI setup on Windows | 3.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill | 🟡 0.21 | [9] |
| dotnet-maui-doctor | Prevent incorrect JAVA_HOME configuration | 3.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill | 🟡 0.21 | |
| dotnet-maui-doctor | Determine required Android SDK packages for specific .NET version | 3.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill, report_intent, view, bash, glob | 🟡 0.21 | |
| dotnet-maui-doctor | Fix stale MAUI workloads after SDK update | 2.0/5 → 5.0/5 🟢 | ✅ dotnet-maui-doctor; tools: skill, glob | 🟡 0.21 | |
| thread-abort-migration | Worker thread with abort-based cancellation | 5.0/5 → 5.0/5 | ✅ thread-abort-migration; tools: skill | ✅ 0.10 | |
| thread-abort-migration | Timeout enforcement via Thread.Abort | 4.0/5 → 5.0/5 🟢 | ✅ thread-abort-migration; tools: skill | ✅ 0.10 | |
| thread-abort-migration | Blocking WaitHandle with Thread.Interrupt | 4.0/5 → 5.0/5 🟢 | ✅ thread-abort-migration; tools: skill | ✅ 0.10 | |
| thread-abort-migration | ASP.NET Response.End and Response.Redirect with Thread.Abort | 4.0/5 → 5.0/5 🟢 | ✅ thread-abort-migration; tools: skill | ✅ 0.10 | |
| thread-abort-migration | Thread.Join and Thread.Sleep only — should not migrate | 4.0/5 → 5.0/5 🟢 | ✅ thread-abort-migration; tools: skill | ✅ 0.10 | |
| migrate-nullable-references | Enable NRT in a small library with mixed nullability | 5.0/5 → 5.0/5 | ✅ migrate-nullable-references; tools: skill | ✅ 0.13 | [10] |
| migrate-nullable-references | File-by-file migration: only modify the targeted file | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.13 | |
| migrate-nullable-references | Enable NRT in ASP.NET Core Web API with EF Core | 3.0/5 → 3.0/5 | ✅ migrate-nullable-references; tools: skill | ✅ 0.13 | [11] |
| dotnet-aot-compat | Make Azure.ResourceManager AOT-compatible | 3.0/5 → 1.0/5 ⏰ 🔴 | ✅ dotnet-aot-compat; tools: skill, create | ✅ 0.13 | |

[1] Quality unchanged but weighted score is -31.6% due to: judgment, quality, tokens (22888 → 106660), tool calls (3 → 6), time (25.0s → 50.2s)
[2] Quality unchanged but weighted score is -10.0% due to: tokens (11218 → 105974), tool calls (0 → 6), time (12.7s → 46.8s)
[3] Quality unchanged but weighted score is -3.0% due to: tokens (33051 → 89420), errors (0 → 1), tool calls (3 → 9), time (20.9s → 120.1s)
[4] Quality unchanged but weighted score is -5.5% due to: quality, time (20.0s → 27.6s), tokens (23436 → 26090)
[5] Quality improved but weighted score is -1.6% due to: tokens (45508 → 86029), tool calls (4 → 9), time (38.0s → 54.2s)
[6] Quality unchanged but weighted score is -9.1% due to: tokens (11311 → 25199), tool calls (0 → 1), time (10.5s → 17.4s)
[7] Quality unchanged but weighted score is -6.8% due to: tokens (79354 → 392453), tool calls (12 → 28), time (48.6s → 131.5s)
[8] Quality unchanged but weighted score is -1.5% due to: tokens (386341 → 677126), tool calls (19 → 34)
[9] Quality improved but weighted score is -3.1% due to: completion (✓ → ✗), tokens (34628 → 52518), tool calls (3 → 8), time (39.8s → 56.1s)
[10] Quality unchanged but weighted score is -1.3% due to: tokens (118171 → 158369)
[11] Quality unchanged but weighted score is -6.1% due to: tokens (86274 → 188835), tool calls (19 → 25)

⏰ timeout: the run hit the scenario timeout limit; scoring may be affected because model execution was aborted before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

var prefix = $"[{skill.Name}/selectivity]";
var log = (string msg) => spinner.Log($"{prefix} {msg}");

// Launch all probes in parallel
Member

Launching every probe at once can quickly lead to throttling or rejected requests from the inference API.
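One common way to address this kind of concern is to cap the number of probes in flight with a `SemaphoreSlim`. The sketch below is only an illustration, not the code in this PR: `RunBoundedAsync`, `FakeProbe`, the prompt list, and the limit of 2 are all hypothetical names and values chosen for the example, and the fake probe merely stands in for a real agent run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: starts one probe per prompt, but lets at most
// maxConcurrency of them run at the same time.
static async Task<TResult[]> RunBoundedAsync<TResult>(
    IReadOnlyList<string> prompts,
    int maxConcurrency,
    Func<string, Task<TResult>> probe)
{
    using var gate = new SemaphoreSlim(maxConcurrency);

    var tasks = prompts.Select(async prompt =>
    {
        // Wait for a free slot before calling the (assumed) inference-backed probe.
        await gate.WaitAsync();
        try
        {
            return await probe(prompt);
        }
        finally
        {
            // Always free the slot, even if the probe throws.
            gate.Release();
        }
    }).ToArray();

    // Results come back in the same order as the prompts.
    return await Task.WhenAll(tasks);
}

// Demo: a fake probe that records the peak number of concurrent calls.
int inFlight = 0, peak = 0;
var sync = new object();

async Task<bool> FakeProbe(string prompt)
{
    lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
    await Task.Delay(20); // stands in for a network round trip
    lock (sync) { inFlight--; }
    return prompt.Contains("build"); // stand-in for "skill was activated"
}

var prompts = new[] { "slow build", "binlog build", "what is .NET 9?", "build cache" };
var results = await RunBoundedAsync<bool>(prompts, 2, FakeProbe);
Console.WriteLine($"peak concurrency: {peak}, activated: {results.Count(r => r)}");
```

The limit could be exposed as a command-line option so that CI and local runs can choose different values; whether that fits the validator's existing options is left open here.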

@JanKrivanek
Member

@ViktorHofer How about we flip entirely to the newly suggested format? So far there are just 3 usages of expect_activation, so if this PR removes the duplication and updates those usages as well, we'll have a clean state.

Having more expressive and faster scenarios for activation sounds like a good benefit.

We might just need to figure out the 'visualisation' in the report and if/how to surface this in the dashboards at https://dotnet.github.io/skills/

@ViktorHofer
Member

Sorry I didn't see your ping. Yes, I'm fine with any solution that doesn't introduce an additional schema.
