Conversation
Pull request overview
This PR extends the eng/skill-validator tooling to support a lightweight “selectivity” validation mode (probing whether a skill activates for near-topic prompts), and updates skill token accounting to use BPE tokenization (from the referenced PR #309).
Changes:
- Add `--selectivity-test` (with thresholds) and YAML schema support for `selectivity.should_activate` / `selectivity.should_not_activate`.
- Implement an agent-run probe to detect whether a skill was invoked, and report selectivity results in console/JSON.
- Switch skill profiling thresholds/warnings from the `chars/4` approximation to BPE token counting.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/dotnet-msbuild/build-perf-diagnostics/eval.yaml | Adds selectivity prompts for one skill eval. |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates test data generation to reliably exceed BPE thresholds. |
| eng/skill-validator/src/SkillValidatorYamlContext.cs | Adds YAML source-gen type registration for RawSelectivity. |
| eng/skill-validator/src/SkillValidatorJsonContext.cs | Adds JSON source-gen type registration for selectivity result models. |
| eng/skill-validator/src/SkillValidator.csproj | Adds Microsoft.ML.Tokenizers packages; fixes RunArguments quoting. |
| eng/skill-validator/src/Services/SkillProfiler.cs | Adds BPE tokenizer + BPE-based complexity/warnings and output formatting. |
| eng/skill-validator/src/Services/Reporter.cs | Renders selectivity-only verdicts and selectivity prompt breakdowns. |
| eng/skill-validator/src/Services/EvalSchema.cs | Parses selectivity section into EvalConfig. |
| eng/skill-validator/src/Services/AgentRunner.cs | Adds ProbeSkillActivation lightweight session runner. |
| eng/skill-validator/src/Models/Models.cs | Adds selectivity models + config flags/thresholds. |
| eng/skill-validator/src/Commands/ValidateCommand.cs | Adds CLI options and selectivity-only execution path. |
```csharp
// Launch all probes in parallel
var tasks = new List<Task<SelectivityPromptResult>>();

if (skill.EvalConfig!.ShouldActivatePrompts is { } activatePrompts)
{
    foreach (var prompt in activatePrompts)
    {
        log($"Testing should_activate: \"{Truncate(prompt, 60)}\"");
        tasks.Add(ProbeAndLog(skill, prompt, expectedActivation: true, config, log));
    }
}

if (skill.EvalConfig.ShouldNotActivatePrompts is { } deactivatePrompts)
{
    foreach (var prompt in deactivatePrompts)
    {
        log($"Testing should_not_activate: \"{Truncate(prompt, 60)}\"");
        tasks.Add(ProbeAndLog(skill, prompt, expectedActivation: false, config, log));
    }
}

var promptResults = (await Task.WhenAll(tasks)).ToList();
```
ExecuteSelectivityTest launches one agent session per prompt and awaits them all at once; with larger prompt lists this can create a burst of concurrent sessions and hit rate limits / overwhelm the host. Consider adding a concurrency limiter (similar to scenarios/runs) or reusing an existing Parallel* setting to cap parallel probes.
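A minimal sketch of such a limiter, assuming a `SemaphoreSlim` gate around each probe; `RunThrottledAsync` and `RunProbeAsync` are hypothetical stand-ins for the PR's probe plumbing:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Usage: probes run with at most 2 in flight at once.
var results = await RunThrottledAsync(new[] { "fix build perf", "add a nuget package" }, maxParallel: 2);
Console.WriteLine(string.Join(",", results)); // prints "True,False"

// Hypothetical sketch: cap concurrent probe sessions with SemaphoreSlim,
// similar to the scenarios/runs limiter the comment suggests reusing.
static async Task<bool[]> RunThrottledAsync(IEnumerable<string> prompts, int maxParallel)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var tasks = prompts.Select(async prompt =>
    {
        await gate.WaitAsync();
        try { return await RunProbeAsync(prompt); }
        finally { gate.Release(); }
    });
    return await Task.WhenAll(tasks);
}

// Stand-in for the real agent probe (ProbeAndLog in the PR).
static async Task<bool> RunProbeAsync(string prompt)
{
    await Task.Delay(10); // placeholder for the agent session round-trip
    return prompt.Contains("build");
}
```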
```csharp
// 30s timeout — enough for the agent to reach the skill-loading decision
using var cts = new CancellationTokenSource(30_000);
cts.Token.Register(() => done.TrySetResult(skillActivated));

session.On(evt =>
{
    switch (evt)
    {
        // Skill loaded → we have our answer, bail immediately
        case SkillInvokedEvent:
            skillActivated = true;
            done.TrySetResult(true);
            break;

        // Session finished without loading the skill → not activated
        case SessionIdleEvent:
            done.TrySetResult(skillActivated);
            break;

        case SessionErrorEvent err:
            done.TrySetException(new InvalidOperationException(err.Data.Message ?? "Session error"));
            break;
    }

    if (options.Verbose && evt is SkillInvokedEvent si)
    {
        var write = options.Log ?? (m => Console.Error.WriteLine(m));
        write($"  📘 Skill invoked: {si.Data.Name}");
    }
    if (options.Verbose && evt is ToolExecutionStartEvent ts)
    {
        var write = options.Log ?? (m => Console.Error.WriteLine(m));
        write($"  🔧 {ts.Data.ToolName}");
    }
});

await session.SendAsync(new MessageOptions { Prompt = options.Scenario.Prompt });
return await done.Task;
```
ProbeSkillActivation hard-codes a 30s timeout (new CancellationTokenSource(30_000)) even though the caller creates an EvalScenario with Timeout: 15. This makes probe duration inconsistent with scenario configuration. Consider using options.Scenario.Timeout (or a dedicated config value) and passing the cancellation token through to SendAsync/session operations so probes reliably terminate on timeout.
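A sketch of the suggested shape, assuming the scenario timeout is in seconds; `ProbeWithTimeoutAsync` is a hypothetical wrapper, and the `Task.Delay` stands in for the session call that would receive the token:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Usage: a probe that never completes is cut off by a 1-second scenario timeout.
var activated = await ProbeWithTimeoutAsync(1, async ct =>
{
    await Task.Delay(Timeout.Infinite, ct); // stands in for session.SendAsync(..., ct)
    return true;
});
Console.WriteLine(activated); // prints "False"

// Hypothetical sketch: derive the probe timeout from scenario config
// (e.g. options.Scenario.Timeout) instead of hard-coding 30s, and pass the
// token into the probe so it actually terminates on timeout.
static async Task<bool> ProbeWithTimeoutAsync(int timeoutSeconds, Func<CancellationToken, Task<bool>> probe)
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds));
    try
    {
        return await probe(cts.Token);
    }
    catch (OperationCanceledException)
    {
        return false; // timed out before the skill-loading decision → not activated
    }
}
```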
```csharp
catch
{
    return skillActivated;
}
```
The blanket catch { return skillActivated; } will silently treat session creation/SendAsync failures as “not activated”, which can produce misleading selectivity results and hide infrastructure/model errors. Consider letting exceptions propagate (so selectivity fails with a clear error) or returning a richer result that distinguishes “not activated” from “probe failed”.
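One possible shape for the richer result, sketched with hypothetical names (`ProbeOutcome`, `RunProbe`) rather than the PR's actual types:

```csharp
using System;

var ok = RunProbe(() => true);
var failed = RunProbe(() => throw new InvalidOperationException("session creation failed"));
Console.WriteLine($"{ok.Activated} {failed.Failed}"); // prints "True True"

// Hypothetical sketch: capture failures instead of folding them into false.
static ProbeOutcome RunProbe(Func<bool> probe)
{
    try { return new ProbeOutcome(probe(), null); }
    catch (Exception ex) { return new ProbeOutcome(false, ex.Message); }
}

// A result that distinguishes "did not activate" from "the probe itself failed".
record ProbeOutcome(bool Activated, string? Error)
{
    public bool Failed => Error is not null;
}
```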
```diff
 throw new InvalidOperationException("Eval config must have at least one scenario");

-return new EvalConfig(scenarios);
+return new EvalConfig(scenarios, raw.Selectivity?.ShouldActivate, raw.Selectivity?.ShouldNotActivate);
```
No tests currently cover parsing/round-tripping the new selectivity.should_activate / selectivity.should_not_activate fields. Adding a unit test in EvalSchemaTests that asserts these lists are parsed into EvalConfig (and validates expected behavior when scenarios are empty/missing, if supported) would help prevent regressions.
```diff
 var scenarios = raw.Scenarios?.Select(ParseScenario).ToList();

 if (scenarios is not { Count: > 0 })
     throw new InvalidOperationException("Eval config must have at least one scenario");

-return new EvalConfig(scenarios);
+return new EvalConfig(scenarios, raw.Selectivity?.ShouldActivate, raw.Selectivity?.ShouldNotActivate);
```
ParseEvalConfig still throws when scenarios is missing/empty, which prevents using --selectivity-test with an eval.yaml that only defines selectivity.should_activate/should_not_activate (as described in the PR). Consider allowing zero scenarios when selectivity prompts are present (or relaxing this only when running selectivity mode) so the lightweight probe can run without requiring full scenarios.
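A sketch of the relaxed guard, assuming the parameter types as stand-ins for the PR's raw config fields (`Validate` is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;

// Usage: a selectivity-only config (no scenarios) passes the relaxed guard.
Validate(null, new List<string> { "My build is slow" }, null);
Console.WriteLine("selectivity-only config accepted");

// Hypothetical sketch: require scenarios only when no selectivity prompts
// are present, so a selectivity-only eval.yaml can run.
static void Validate(List<string>? scenarios, List<string>? shouldActivate, List<string>? shouldNotActivate)
{
    bool hasSelectivity = shouldActivate is { Count: > 0 } || shouldNotActivate is { Count: > 0 };
    if (scenarios is not { Count: > 0 } && !hasSelectivity)
        throw new InvalidOperationException("Eval config must have at least one scenario or selectivity prompts");
}
```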
```csharp
// Selectivity-only mode: skip full evaluation, just probe skill activation
if (config.SelectivityTest)
{
    if (skill.EvalConfig is not null
        && (skill.EvalConfig.ShouldActivatePrompts is { Count: > 0 } || skill.EvalConfig.ShouldNotActivatePrompts is { Count: > 0 }))
    {
        log("🎯 Running selectivity test (standalone)...");
        var selectivityResult = await ExecuteSelectivityTest(skill, config, spinner);
        log($"🎯 Selectivity: recall={selectivityResult.Recall:P0}, precision={selectivityResult.Precision:P0} — {(selectivityResult.Passed ? "PASSED" : "FAILED")}");

        return new SkillVerdict
        {
            SkillName = skill.Name,
            SkillPath = skill.Path,
            Passed = selectivityResult.Passed,
            Scenarios = [],
            OverallImprovementScore = 0,
            Reason = selectivityResult.Passed
                ? "Selectivity test passed"
                : $"Selectivity test failed: {selectivityResult.Reason}",
            FailureKind = selectivityResult.Passed ? null : "selectivity_failure",
            ProfileWarnings = profile.Warnings,
            SelectivityResult = selectivityResult,
        };
    }

    log("⏭ Skipping (no selectivity prompts in eval.yaml)");
    return null;
}
```
In --selectivity-test mode this code runs only after the earlier if (skill.EvalConfig.Scenarios.Count == 0) ... return null; guard, so selectivity testing is currently impossible for a skill that provides only selectivity prompts (or has an empty scenarios: list). If selectivity-only eval.yaml files are intended, move the scenario-count skip below the selectivity branch (or only enforce scenarios when not in selectivity mode).
```csharp
// Calculate precision: fraction of should_not_activate prompts that correctly did NOT activate
var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
double precision = shouldNotActivateResults.Count > 0
    ? (double)shouldNotActivateResults.Count(r => !r.SkillActivated) / shouldNotActivateResults.Count
```
The reported precision is currently computed as the fraction of should_not_activate prompts that did not activate (i.e., true negative rate/specificity), not standard precision (TP/(TP+FP)). Either rename the metric/flags (--selectivity-min-*) to match what’s actually measured, or change the calculation to true precision to avoid confusing results and thresholds.
Suggested change:

```diff
-// Calculate precision: fraction of should_not_activate prompts that correctly did NOT activate
-var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
-double precision = shouldNotActivateResults.Count > 0
-    ? (double)shouldNotActivateResults.Count(r => !r.SkillActivated) / shouldNotActivateResults.Count
+// Calculate precision: fraction of activations that were expected (TP / (TP + FP))
+var shouldNotActivateResults = promptResults.Where(r => !r.ExpectedActivation).ToList();
+var truePositives = shouldActivateResults.Count(r => r.SkillActivated);
+var falsePositives = shouldNotActivateResults.Count(r => r.SkillActivated);
+double precision = (truePositives + falsePositives) > 0
+    ? (double)truePositives / (truePositives + falsePositives)
```
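To make the difference concrete, a toy calculation with made-up numbers where the two metrics diverge:

```csharp
using System;
using System.Linq;

// Made-up toy data: 2 should_activate prompts that both activated,
// 2 should_not_activate prompts of which 1 wrongly activated.
var results = new (bool Expected, bool Activated)[]
{
    (true, true), (true, true),
    (false, true), (false, false),
};

int tp = results.Count(r => r.Expected && r.Activated);   // true positives = 2
int fp = results.Count(r => !r.Expected && r.Activated);  // false positives = 1
int tn = results.Count(r => !r.Expected && !r.Activated); // true negatives = 1

double specificity = (double)tn / (tn + fp); // what the PR currently computes
double precision = (double)tp / (tp + fp);   // standard precision

Console.WriteLine($"specificity={specificity:F2} precision={precision:F2}");
// prints "specificity=0.50 precision=0.67"
```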
Skill Validation Results
[1] Quality unchanged but weighted score is -23.8% due to: judgment, tokens (11337 → 52185), quality, tool calls (0 → 3), time (16.3s → 23.7s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
I don't think that

```yaml
scenarios:
  - name: "Analyze analyzer perf"
    prompt: "Build this project..."
    assertions: [...]
    rubric: [...]
  - name: "Activate on slow build query"
    prompt: "My .NET build takes 5 minutes, how can I speed it up?"
    expect_activation: true # (default)
  - name: "Decline NuGet question"
    prompt: "How do I add a NuGet package reference?"
    expect_activation: false
  - name: "Decline Unit test failure"
    prompt: "My unit tests are failing with a NullReferenceException"
    expect_activation: false
```

against how laconic it is with the selectivity definitions:

```yaml
selectivity:
  should_activate:
    - "My .NET build takes over 5 minutes, how can I speed it up?"
  should_not_activate:
    - "How do I add a NuGet package reference to my project?"
    - "My unit tests are failing with a NullReferenceException"
```

Why can't we leave both?
Thanks for sharing your perspective. I now better understand the intent. I'm in favor of using the existing
Skill Validation Results
[1] Quality unchanged but weighted score is -31.6% due to: judgment, quality, tokens (22888 → 106660), tool calls (3 → 6), time (25.0s → 50.2s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
```csharp
var prefix = $"[{skill.Name}/selectivity]";
var log = (string msg) => spinner.Log($"{prefix} {msg}");

// Launch all probes in parallel
```
This can quickly lead to throttling/rejections from the inference API.
@ViktorHofer How about if we flip entirely to the new suggested format? So far there are just 3 usages. Having more expressive and faster scenarios for activation sounds like a good benefit. We might just need to figure out the visualisation in the report and if/how to have this in the dashboards: https://dotnet.github.io/skills/
Sorry, I didn't see your ping. Yes, I'm fine with any solution that doesn't introduce an additional schema.
Note: includes #309
Problem
It is not free to load skills. Each loaded skill eats up LLM context, so it is crucial to make a skill very specific to solving a concrete problem. We should be able to validate whether a skill will be loaded or not based on its frontmatter.
Solution
I've added the `--selectivity-test` option, a new lightweight mode that probes skill activation without running the full evaluation. Each `eval.yaml` should have `should_activate` and `should_not_activate` prompts that are close to the topic of the skill. Validation runs a lightweight agent session and asserts that the skill was loaded for every `should_activate` prompt and was not loaded for any `should_not_activate` prompt.

Usage & Example
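A minimal sketch of the selectivity section in an `eval.yaml`, based on the schema described above (the prompt text is illustrative):

```yaml
selectivity:
  should_activate:
    - "My .NET build takes over 5 minutes, how can I speed it up?"
  should_not_activate:
    - "How do I add a NuGet package reference to my project?"
    - "My unit tests are failing with a NullReferenceException"
```

Running the validator with `--selectivity-test` then probes each prompt with a lightweight agent session and reports recall/precision against the configured thresholds.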