Closes #540: forced-rank INFERRED confidence scores by saxster · Pull Request #546 · safishamsi/graphify

saxster · 2026-04-25T05:03:53Z

Closes #540.

Summary

Replaces the continuous confidence-range guidance (`0.4-0.5`, `0.6-0.7`, `0.8-0.9`) with a forced-rank discrete set ({0.55, 0.65, 0.75, 0.85, 0.95}) across all 10 skill-*.md files.

Why

Production audit on a 10,129-INFERRED-edge graph showed the confidence_score distribution is bimodal, not graded:

| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4 | 0 | 0% |
| 0.4-0.6 | 5,807 | 57% |
| 0.6-0.8 | 14 | 0.1% |
| 0.8+ | 4,308 | 42% |

Subagents collapse the continuous `0.4-0.9` guidance to a binary: 0.5 for "uncertain", 0.85+ for "confident", almost nothing in between. The 0.6-0.8 bucket, which would let downstream filtering act as a continuum, holds only 14 edges (0.1%), so the calibration the prompt promises does not materialize.
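As a quick sanity check on the audit figures, the bucket shares can be recomputed from the counts alone. This is a minimal sketch; the `counts` dict and the `shares` helper are illustrative names, not part of graphify.

```python
# Counts copied from the audit table; the dict layout is an assumption for illustration.
counts = {"<0.4": 0, "0.4-0.6": 5807, "0.6-0.8": 14, "0.8+": 4308}


def shares(bucket_counts):
    """Return each bucket's share of the total, in percent."""
    total = sum(bucket_counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in bucket_counts}
    return {bucket: 100 * n / total for bucket, n in bucket_counts.items()}


total = sum(counts.values())  # 10129, matching the edge count quoted in the audit
pct = shares(counts)
for bucket, n in counts.items():
    print(f"{bucket:>8}  {n:>6}  {pct[bucket]:.1f}%")
```

The two outer buckets together hold all but 14 of the edges, which is the binary collapse described above.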

Approach

Models follow discrete rubrics far better than continuous ranges (well-documented in calibration literature; same reason MCQ rubrics outperform 0-100 numeric scales). The new set has 5 anchor points at non-round midpoints (0.55, 0.65, 0.75, 0.85, 0.95) — non-round to discourage 0.5 as a fallback default, 5 levels to give meaningful gradation without being too granular.

New rubric:

```
INFERRED edges: pick exactly ONE value from this set — never 0.5:
    0.95  direct structural evidence (shared data structure, named cross-file reference).
    0.85  strong inference (clear functional alignment, no direct symbol link).
    0.75  reasonable inference (shared problem domain + similar shape, requires interpretation).
    0.65  weak inference (thematically related, no shape evidence).
    0.55  speculative but plausible (surface-level co-occurrence only).
  Models follow discrete rubrics better than continuous ranges; the bimodal
  distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
  range guidance is being collapsed to a binary. If no value above fits, mark
  the edge AMBIGUOUS rather than picking 0.4 or below.
```
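This PR changes prompts only, but the rubric could also be enforced downstream. The sketch below is one possible validator, assuming each edge is a dict with a `confidence` label and a `confidence_score` value; the `confidence` key name and the `check_edge` helper are hypothetical and may not match graphify's actual edge schema. The EXTRACTED and AMBIGUOUS bounds come from the short-form prompt line (`EXTRACTED=1.0`, `AMBIGUOUS=0.1-0.3`).

```python
import math

# The forced-rank set from the rubric above.
ALLOWED_INFERRED = (0.55, 0.65, 0.75, 0.85, 0.95)


def check_edge(edge):
    """Return None if the edge's score fits its label, else a short error string.

    Assumes hypothetical keys: edge["confidence"] in {EXTRACTED, INFERRED, AMBIGUOUS}
    and edge["confidence_score"] as a float.
    """
    label = edge.get("confidence")
    score = edge.get("confidence_score")
    if score is None:
        return "missing confidence_score"
    if label == "EXTRACTED":
        ok = score == 1.0
    elif label == "INFERRED":
        # Tolerant comparison so 0.75 parsed from JSON still matches.
        ok = any(math.isclose(score, v, abs_tol=1e-9) for v in ALLOWED_INFERRED)
    elif label == "AMBIGUOUS":
        ok = 0.1 <= score <= 0.3
    else:
        return f"unknown confidence label: {label!r}"
    return None if ok else f"{label} edge has off-rubric score {score}"
```

A check like this would flag any INFERRED edge that still arrives with 0.5, or with any other value outside the five-value set.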

Files changed

- 7 long-form skill files (skill.md, skill-codex.md, skill-copilot.md, skill-droid.md, skill-opencode.md, skill-windows.md, skill-trae.md): full forced-rank table.
- 3 short-form skill files (skill-claw.md, skill-aider.md, skill-kiro.md): inline set notation `INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95}`.

Test plan

- Re-run extraction on the same 10,129-edge corpus that produced the bimodal distribution; expect the histogram to fill the 0.55-0.85 buckets, with the previous 0.5 mass redistributed across them.
- Pure prompt edit: no Python code changes, no test suite impact.
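The re-extraction check in the test plan could be scripted roughly as follows. This is a sketch under assumptions: `scores` stands in for the INFERRED `confidence_score` values read from a re-extracted graph, and `level_distribution` is a hypothetical helper, not an existing graphify function.

```python
from collections import Counter

# The five rubric levels introduced by this PR.
LEVELS = (0.55, 0.65, 0.75, 0.85, 0.95)


def level_distribution(scores):
    """Count scores per rubric level and tally values that fall outside the set."""
    counts = Counter(round(s, 2) for s in scores)
    on_rubric = {level: counts.get(level, 0) for level in LEVELS}
    off_rubric = sum(n for value, n in counts.items() if value not in LEVELS)
    return on_rubric, off_rubric


# Hypothetical sample; a real run would load scores from the extracted graph.
scores = [0.55, 0.65, 0.65, 0.5, 0.95]
on_rubric, off_rubric = level_distribution(scores)
```

If the prompt change works as intended, `off_rubric` should be at or near zero and the 0.65 and 0.75 levels should no longer be empty.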

Closes safishamsi#540.

Production audit on a 10,129-edge graph showed the INFERRED confidence_score distribution is bimodal, not graded:

| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4 | 0 | 0% |
| 0.4-0.6 | 5,807 | 57% |
| 0.6-0.8 | 14 | 0.1% |
| 0.8+ | 4,308 | 42% |

Subagents collapse the continuous "0.4-0.9" guidance to a binary: 0.5 for "uncertain", 0.85+ for "confident", almost nothing in between. Downstream filtering by confidence is therefore an on/off switch, not the gradient the prompt promises.

Replace continuous ranges with a forced-rank discrete set:

0.95 direct structural evidence
0.85 strong inference
0.75 reasonable inference
0.65 weak inference
0.55 speculative but plausible

Models follow discrete rubrics far better than continuous ranges (documented in calibration literature; same reason MCQ rubrics outperform 0-100 scales). The set is anchored at non-round midpoints to discourage 0.5 as a default.

Applied uniformly across all 10 skill-*.md files:

- 7 long-form (skill.md, skill-codex.md, skill-copilot.md, skill-droid.md, skill-opencode.md, skill-windows.md, skill-trae.md): full forced-rank table.
- 3 short-form (skill-claw.md, skill-aider.md, skill-kiro.md): inline set notation INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95}.

Pure prompt edit: no code changes, no test impact. Effect is observable only via re-extraction and inspection of the new confidence_score distribution.

Copilot

Pull request overview

Updates Graphify’s extraction skill prompts to improve calibration of INFERRED edge confidence_score by replacing continuous score ranges with a forced-rank discrete rubric, aligning with observed bimodal production behavior and aiming to produce a more useful distribution for downstream filtering.

Changes:

- Replaced INFERRED confidence_score continuous guidance with a discrete forced-rank set {0.55, 0.65, 0.75, 0.85, 0.95} (and “never 0.5”) in all skill prompt variants.
- Added a short rationale in long-form skill files explaining why discrete rubrics are preferred and when to mark edges AMBIGUOUS instead.
- Updated short-form skill files to express the discrete set inline.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| graphify/skill.md | Replaces INFERRED confidence scoring ranges with discrete forced-rank rubric + rationale. |
| graphify/skill-codex.md | Same forced-rank INFERRED confidence rubric for Codex skill prompt. |
| graphify/skill-copilot.md | Same forced-rank INFERRED confidence rubric for Copilot skill prompt. |
| graphify/skill-droid.md | Same forced-rank INFERRED confidence rubric for Droid skill prompt. |
| graphify/skill-opencode.md | Same forced-rank INFERRED confidence rubric for OpenCode skill prompt. |
| graphify/skill-windows.md | Same forced-rank INFERRED confidence rubric for Windows skill prompt. |
| graphify/skill-trae.md | Same forced-rank INFERRED confidence rubric for Trae skill prompt. |
| graphify/skill-claw.md | Inline discrete-set notation for INFERRED confidence scoring. |
| graphify/skill-aider.md | Inline discrete-set notation for INFERRED confidence scoring. |
| graphify/skill-kiro.md | Inline discrete-set notation for INFERRED confidence scoring. |

Copilot · 2026-04-25T05:07:11Z

-  Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
-  Reasonable inference with some uncertainty: 0.6-0.7.
-  Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
+- INFERRED edges: pick exactly ONE value from this set — never 0.5:


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change

-- INFERRED edges: pick exactly ONE value from this set — never 0.5:
+- INFERRED edges: pick exactly ONE value from this set — never 0.5. This
+  discrete set applies to ALL inferred edge types, including semantically_similar_to;
+  do not use continuous ranges such as 0.6-0.95 for any inferred edge:
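A contradiction of this kind could also be caught mechanically. The following is a rough sketch of a lint pass over a prompt file's text that reports leftover continuous ranges such as `0.6-0.95`, while skipping the `AMBIGUOUS=0.1-0.3` range that the prompts keep on purpose. The regex and the `find_stale_ranges` helper are assumptions for illustration; nothing like this is part of the PR.

```python
import re

# Matches decimal ranges like "0.6-0.95" or "0.6–0.95" (hyphen or en dash).
RANGE_RE = re.compile(r"(?<![\d.])(\d\.\d+)\s*[-\u2013]\s*(\d\.\d+)(?!\d)")


def find_stale_ranges(prompt_text):
    """Return (line_number, range_text) pairs for ranges outside AMBIGUOUS guidance."""
    hits = []
    for lineno, line in enumerate(prompt_text.splitlines(), start=1):
        for match in RANGE_RE.finditer(line):
            prefix = line[:match.start()].rstrip()
            if prefix.endswith("AMBIGUOUS="):
                continue  # the AMBIGUOUS range is intentional, not stale
            hits.append((lineno, match.group(0)))
    return hits
```

Run against each skill-*.md file, a non-empty result would point at lines like the `semantically_similar_to` guidance that still carry a range.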

Copilot · 2026-04-25T05:07:11Z

+    0.85  strong inference (clear functional alignment, no direct symbol link).
+    0.75  reasonable inference (shared problem domain + similar shape, requires interpretation).
+    0.65  weak inference (thematically related, no shape evidence).
+    0.55  speculative but plausible (surface-level co-occurrence only).


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change

-    0.55  speculative but plausible (surface-level co-occurrence only).
+    0.55  speculative but plausible (surface-level co-occurrence only).
+  This discrete set applies to ALL INFERRED edge types, including
+  semantically_similar_to. Do not emit continuous confidence ranges (for
+  example, 0.6-0.95); map them to the single closest rubric value above.

Copilot · 2026-04-25T05:07:11Z

+- INFERRED edges: pick exactly ONE value from this set — never 0.5:
+    0.95  direct structural evidence (shared data structure, named cross-file reference).
+    0.85  strong inference (clear functional alignment, no direct symbol link).
+    0.75  reasonable inference (shared problem domain + similar shape, requires interpretation).
+    0.65  weak inference (thematically related, no shape evidence).
+    0.55  speculative but plausible (surface-level co-occurrence only).


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change

-- INFERRED edges: pick exactly ONE value from this set — never 0.5:
-    0.95  direct structural evidence (shared data structure, named cross-file reference).
-    0.85  strong inference (clear functional alignment, no direct symbol link).
-    0.75  reasonable inference (shared problem domain + similar shape, requires interpretation).
-    0.65  weak inference (thematically related, no shape evidence).
-    0.55  speculative but plausible (surface-level co-occurrence only).
+- INFERRED edges: pick exactly ONE value from this set — never 0.5. This forced-rank
+  discrete rubric applies to ALL inferred edge types, including semantically_similar_to:
+    0.95  direct structural evidence (shared data structure, named cross-file reference).
+    0.85  strong inference (clear functional alignment, no direct symbol link).
+    0.75  reasonable inference (shared problem domain + similar shape, requires interpretation).
+    0.65  weak inference (thematically related, no shape evidence).
+    0.55  speculative but plausible (surface-level co-occurrence only).
+  Do not use continuous ranges such as 0.6-0.95 for any INFERRED edge; if earlier
+  edge-specific guidance suggests a range, map it to the nearest value in this set.

Copilot · 2026-04-25T05:07:12Z

   - DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
   - Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
   - Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
-   - confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
+   - confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3


The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.
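The midpoint problem can be shown concretely. The sketch below uses `decimal.Decimal` so the tie at 0.60 is exact; with binary floats the two distances can differ by rounding error, which would pick a winner silently. `nearest_levels` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal

# Rubric levels as exact decimals, built from strings to avoid float rounding.
LEVELS = [Decimal(v) for v in ("0.55", "0.65", "0.75", "0.85", "0.95")]


def nearest_levels(score):
    """Return every rubric value at minimal distance from score (str or Decimal)."""
    score = Decimal(score)
    best = min(abs(score - level) for level in LEVELS)
    return [level for level in LEVELS if abs(score - level) == best]


tie = nearest_levels("0.60")     # two candidates: 0.55 and 0.65
single = nearest_levels("0.62")  # one candidate: 0.65
```

Because "closest value" returns two answers at every midpoint, selecting by the rubric definitions, as the comment suggests, avoids the ambiguity altogether.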

Copilot · 2026-04-25T05:07:12Z

+  Models follow discrete rubrics better than continuous ranges; the bimodal
+  distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
+  range guidance is being collapsed to a binary. If no value above fits, mark
+  the edge AMBIGUOUS rather than picking 0.4 or below.


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change

-  Models follow discrete rubrics better than continuous ranges; the bimodal
-  distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
-  range guidance is being collapsed to a binary. If no value above fits, mark
-  the edge AMBIGUOUS rather than picking 0.4 or below.
+  This discrete rubric also applies to semantically_similar_to edges: treat them
+  as INFERRED and choose exactly one value from the set above, not a continuous
+  range such as 0.6-0.95. Models follow discrete rubrics better than continuous
+  ranges; the bimodal distribution observed in production (>50% at 0.5, >40%
+  at 0.85+) shows the range guidance is being collapsed to a binary. If no
+  value above fits, mark the edge AMBIGUOUS rather than picking 0.4 or below.

Copilot · 2026-04-25T05:07:12Z

-  Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
-  Reasonable inference with some uncertainty: 0.6-0.7.
-  Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
+- INFERRED edges: pick exactly ONE value from this set — never 0.5:


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change

-- INFERRED edges: pick exactly ONE value from this set — never 0.5:
+- INFERRED edges: pick exactly ONE value from this set — never 0.5. This
+  discrete rubric applies to every inferred relation, including
+  `semantically_similar_to`; do not use continuous ranges such as 0.6-0.95:

Copilot · 2026-04-25T05:07:12Z

 confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
 - EXTRACTED edges: confidence_score = 1.0 always
- INFERRED edges: reason about each edge individually.
-  Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
-  Reasonable inference with some uncertainty: 0.6-0.7.
-  Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
+- INFERRED edges: pick exactly ONE value from this set — never 0.5:
+    0.95  direct structural evidence (shared data structure, named cross-file reference).
+    0.85  strong inference (clear functional alignment, no direct symbol link).


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Copilot · 2026-04-25T05:07:12Z

-  Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
-  Reasonable inference with some uncertainty: 0.6-0.7.
-  Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
+- INFERRED edges: pick exactly ONE value from this set — never 0.5:


This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change

-- INFERRED edges: pick exactly ONE value from this set — never 0.5:
+- INFERRED edges: pick exactly ONE value from this set — never 0.5. This
+  discrete rubric applies to ALL inferred edge types, including
+  semantically_similar_to; ignore any earlier range-based guidance such as
+  0.6-0.95 for semantic similarity.

Copilot · 2026-04-25T05:07:13Z

   - DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
   - Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
   - Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
-   - confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
+   - confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3


The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.

Copilot · 2026-04-25T05:07:13Z

   - DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
   - Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
   - Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
-   - confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
+   - confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3


The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.

Copilot AI review requested due to automatic review settings April 25, 2026 05:03

Copilot started reviewing on behalf of saxster April 25, 2026 05:04

saxster wants to merge 1 commit into safishamsi:v5 from saxster:fix/540-confidence-calibration
