Closes #540: forced-rank INFERRED confidence scores #546
saxster wants to merge 1 commit into safishamsi:v5
Closes safishamsi#540.

Production audit on a 10,129-edge graph showed the INFERRED confidence_score distribution is bimodal, not graded:

| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4 | 0 | 0% |
| 0.4-0.6 | 5,807 | 57% |
| 0.6-0.8 | 14 | 0.1% |
| 0.8+ | 4,308 | 42% |

Subagents collapse the continuous "0.4-0.9" guidance to a binary: 0.5 for "uncertain", 0.85+ for "confident", almost nothing in between. Downstream filtering by confidence is therefore an on/off switch, not the gradient the prompt promises.

Replace continuous ranges with a forced-rank discrete set:

- 0.95 direct structural evidence
- 0.85 strong inference
- 0.75 reasonable inference
- 0.65 weak inference
- 0.55 speculative but plausible

Models follow discrete rubrics far better than continuous ranges (documented in calibration literature; same reason MCQ rubrics outperform 0-100 scales). The set is anchored at non-round midpoints to discourage 0.5 as a default.

Applied uniformly across all 10 skill-*.md files:

- 7 long-form (skill.md, skill-codex.md, skill-copilot.md, skill-droid.md, skill-opencode.md, skill-windows.md, skill-trae.md): full forced-rank table.
- 3 short-form (skill-claw.md, skill-aider.md, skill-kiro.md): inline set notation INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95}.

Pure prompt edit: no code changes, no test impact. Effect is observable only via re-extraction and inspection of the new confidence_score distribution.
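For anyone re-running the audit after re-extraction, a minimal sketch of the bucketing follows. It assumes each edge is a dict with a `confidence` label and a numeric `confidence_score`; that schema and the function names are illustrative assumptions, not Graphify's actual export format.

```python
from collections import Counter

# Buckets mirror the audit table: <0.4, 0.4-0.6, 0.6-0.8, 0.8+.
# Each entry is (inclusive lower bound, exclusive upper bound, label).
BUCKETS = [
    (0.0, 0.4, "<0.4"),
    (0.4, 0.6, "0.4-0.6"),
    (0.6, 0.8, "0.6-0.8"),
    (0.8, 1.01, "0.8+"),
]


def bucket_label(score):
    """Map a confidence score to the label of the bucket that contains it."""
    for low, high, label in BUCKETS:
        if low <= score < high:
            return label
    raise ValueError(f"score out of range: {score}")


def inferred_distribution(edges):
    """Return {label: (count, percent)} over INFERRED edges only.

    Assumed (hypothetical) edge shape:
    {"confidence": "INFERRED", "confidence_score": 0.85}
    """
    counts = Counter(
        bucket_label(edge["confidence_score"])
        for edge in edges
        if edge.get("confidence") == "INFERRED"
    )
    total = sum(counts.values())
    return {
        label: (counts[label], round(100 * counts[label] / total, 1) if total else 0.0)
        for _, _, label in BUCKETS
    }
```

A healthy post-change distribution would show non-trivial mass in the 0.6-0.8 bucket, which the audit above found almost empty.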
Pull request overview
Updates Graphify’s extraction skill prompts to improve calibration of INFERRED edge confidence_score by replacing continuous score ranges with a forced-rank discrete rubric, aligning with observed bimodal production behavior and aiming to produce a more useful distribution for downstream filtering.
Changes:
- Replaced INFERRED `confidence_score` continuous guidance with a discrete forced-rank set `{0.55, 0.65, 0.75, 0.85, 0.95}` (and “never 0.5”) in all skill prompt variants.
- Added a short rationale in long-form skill files explaining why discrete rubrics are preferred and when to mark edges AMBIGUOUS instead.
- Updated short-form skill files to express the discrete set inline.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| graphify/skill.md | Replaces INFERRED confidence scoring ranges with discrete forced-rank rubric + rationale. |
| graphify/skill-codex.md | Same forced-rank INFERRED confidence rubric for Codex skill prompt. |
| graphify/skill-copilot.md | Same forced-rank INFERRED confidence rubric for Copilot skill prompt. |
| graphify/skill-droid.md | Same forced-rank INFERRED confidence rubric for Droid skill prompt. |
| graphify/skill-opencode.md | Same forced-rank INFERRED confidence rubric for OpenCode skill prompt. |
| graphify/skill-windows.md | Same forced-rank INFERRED confidence rubric for Windows skill prompt. |
| graphify/skill-trae.md | Same forced-rank INFERRED confidence rubric for Trae skill prompt. |
| graphify/skill-claw.md | Inline discrete-set notation for INFERRED confidence scoring. |
| graphify/skill-aider.md | Inline discrete-set notation for INFERRED confidence scoring. |
| graphify/skill-kiro.md | Inline discrete-set notation for INFERRED confidence scoring. |
| Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. | ||
| Reasonable inference with some uncertainty: 0.6-0.7. | ||
| Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5. | ||
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: | |
| - INFERRED edges: pick exactly ONE value from this set — never 0.5. This | |
| discrete set applies to ALL inferred edge types, including semantically_similar_to; | |
| do not use continuous ranges such as 0.6-0.95 for any inferred edge: |
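The inconsistency flagged in this comment could in principle be caught mechanically. The sketch below is a reviewer aid, not part of this PR: it scans prompt text for leftover numeric ranges such as `0.6-0.95`. The helper name and the cutoff heuristic (ignore ranges whose upper bound is at most 0.3, since AMBIGUOUS keeps `0.1-0.3`) are assumptions made for illustration.

```python
import re

# Matches numeric ranges such as "0.6-0.95" or "0.6–0.95" (hyphen or en dash).
RANGE_RE = re.compile(r"\b(0\.\d+)\s*[-\u2013]\s*(0\.\d+)\b")

# AMBIGUOUS edges legitimately keep the continuous 0.1-0.3 range,
# so ranges that end at or below this value are not reported.
AMBIGUOUS_MAX = 0.3


def find_stale_ranges(prompt_text):
    """Return (line_number, range_text) for each range above the AMBIGUOUS band."""
    hits = []
    for number, line in enumerate(prompt_text.splitlines(), start=1):
        for match in RANGE_RE.finditer(line):
            upper = float(match.group(2))
            if upper > AMBIGUOUS_MAX:
                hits.append((number, match.group(0)))
    return hits
```

This is a heuristic: it would also report a sentence that merely quotes a range in order to forbid it ("do not use continuous ranges such as 0.6-0.95"), so hits need a human look rather than an automatic failure.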
| 0.85 strong inference (clear functional alignment, no direct symbol link). | ||
| 0.75 reasonable inference (shared problem domain + similar shape, requires interpretation). | ||
| 0.65 weak inference (thematically related, no shape evidence). | ||
| 0.55 speculative but plausible (surface-level co-occurrence only). |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| 0.55 speculative but plausible (surface-level co-occurrence only). | |
| 0.55 speculative but plausible (surface-level co-occurrence only). | |
| This discrete set applies to ALL INFERRED edge types, including | |
| semantically_similar_to. Do not emit continuous confidence ranges (for | |
| example, 0.6-0.95); map them to the single closest rubric value above. |
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: | ||
| 0.95 direct structural evidence (shared data structure, named cross-file reference). | ||
| 0.85 strong inference (clear functional alignment, no direct symbol link). | ||
| 0.75 reasonable inference (shared problem domain + similar shape, requires interpretation). | ||
| 0.65 weak inference (thematically related, no shape evidence). | ||
| 0.55 speculative but plausible (surface-level co-occurrence only). |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: | |
| 0.95 direct structural evidence (shared data structure, named cross-file reference). | |
| 0.85 strong inference (clear functional alignment, no direct symbol link). | |
| 0.75 reasonable inference (shared problem domain + similar shape, requires interpretation). | |
| 0.65 weak inference (thematically related, no shape evidence). | |
| 0.55 speculative but plausible (surface-level co-occurrence only). | |
| - INFERRED edges: pick exactly ONE value from this set — never 0.5. This forced-rank | |
| discrete rubric applies to ALL inferred edge types, including semantically_similar_to: | |
| 0.95 direct structural evidence (shared data structure, named cross-file reference). | |
| 0.85 strong inference (clear functional alignment, no direct symbol link). | |
| 0.75 reasonable inference (shared problem domain + similar shape, requires interpretation). | |
| 0.65 weak inference (thematically related, no shape evidence). | |
| 0.55 speculative but plausible (surface-level co-occurrence only). | |
| Do not use continuous ranges such as 0.6-0.95 for any INFERRED edge; if earlier | |
| edge-specific guidance suggests a range, map it to the nearest value in this set. |
| - DEEP_MODE (if --mode deep): be aggressive with INFERRED edges | ||
| - Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only. | ||
| - Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file. | ||
| - confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3 | ||
| - confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3 |
The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.
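The midpoint ambiguity is easy to demonstrate. The sketch below is illustrative only (the function name is hypothetical): it returns every rubric value tied for closest to a raw score, working in integer hundredths because subtracting binary floats directly may hide an exact tie.

```python
# The forced-rank set from this PR, expressed in hundredths so distances are exact.
RUBRIC_HUNDREDTHS = (55, 65, 75, 85, 95)


def nearest_rubric_values(score):
    """Return all rubric values tied for closest to `score`.

    A result with more than one element means "pick the closest discrete
    value" does not determine a unique answer for that score.
    """
    target = round(score * 100)
    best = min(abs(target - value) for value in RUBRIC_HUNDREDTHS)
    return [value / 100 for value in RUBRIC_HUNDREDTHS if abs(target - value) == best]


print(nearest_rubric_values(0.60))  # two candidates: 0.55 and 0.65
print(nearest_rubric_values(0.62))  # a single candidate: 0.65
```

Every multiple of 0.1 between 0.6 and 0.9 lands on such a tie, which supports selecting by the rubric definitions rather than by numeric proximity.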
| Models follow discrete rubrics better than continuous ranges; the bimodal | ||
| distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the | ||
| range guidance is being collapsed to a binary. If no value above fits, mark | ||
| the edge AMBIGUOUS rather than picking 0.4 or below. |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| Models follow discrete rubrics better than continuous ranges; the bimodal | |
| distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the | |
| range guidance is being collapsed to a binary. If no value above fits, mark | |
| the edge AMBIGUOUS rather than picking 0.4 or below. | |
| This discrete rubric also applies to semantically_similar_to edges: treat them | |
| as INFERRED and choose exactly one value from the set above, not a continuous | |
| range such as 0.6-0.95. Models follow discrete rubrics better than continuous | |
| ranges; the bimodal distribution observed in production (>50% at 0.5, >40% | |
| at 0.85+) shows the range guidance is being collapsed to a binary. If no | |
| value above fits, mark the edge AMBIGUOUS rather than picking 0.4 or below. |
| Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. | ||
| Reasonable inference with some uncertainty: 0.6-0.7. | ||
| Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5. | ||
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: | |
| - INFERRED edges: pick exactly ONE value from this set — never 0.5. This | |
| discrete rubric applies to every inferred relation, including | |
| `semantically_similar_to`; do not use continuous ranges such as 0.6-0.95: |
| confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default: | ||
| - EXTRACTED edges: confidence_score = 1.0 always | ||
| - INFERRED edges: reason about each edge individually. | ||
| Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. | ||
| Reasonable inference with some uncertainty: 0.6-0.7. | ||
| Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5. | ||
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: | ||
| 0.95 direct structural evidence (shared data structure, named cross-file reference). | ||
| 0.85 strong inference (clear functional alignment, no direct symbol link). |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. | ||
| Reasonable inference with some uncertainty: 0.6-0.7. | ||
| Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5. | ||
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: |
This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.
| - INFERRED edges: pick exactly ONE value from this set — never 0.5: | |
| - INFERRED edges: pick exactly ONE value from this set — never 0.5. This | |
| discrete rubric applies to ALL inferred edge types, including | |
| semantically_similar_to; ignore any earlier range-based guidance such as | |
| 0.6-0.95 for semantic similarity. |
| - DEEP_MODE (if --mode deep): be aggressive with INFERRED edges | ||
| - Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only. | ||
| - Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file. | ||
| - confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3 | ||
| - confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3 |
The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.
| - DEEP_MODE (if --mode deep): be aggressive with INFERRED edges | ||
| - Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only. | ||
| - Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file. | ||
| - confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3 | ||
| - confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3 |
The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.
Closes #540.
Summary
Replaces the continuous confidence-range guidance (`0.4-0.5`, `0.6-0.7`, `0.8-0.9`) with a forced-rank discrete set ({0.55, 0.65, 0.75, 0.85, 0.95}) across all 10 skill-*.md files.
Why
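Production audit on a 10,129-INFERRED-edge graph showed the confidence_score distribution is bimodal, not graded:

| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4 | 0 | 0% |
| 0.4-0.6 | 5,807 | 57% |
| 0.6-0.8 | 14 | 0.1% |
| 0.8+ | 4,308 | 42% |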
Subagents collapse the continuous `0.4-0.9` guidance to a binary: 0.5 for "uncertain", 0.85+ for "confident", almost nothing in between. The intermediate range that would let downstream filtering be a continuum is essentially empty — the calibration the prompt promises does not materialize.
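The "on/off switch" effect can be made concrete with the audit's own bucket counts. The sketch below is only an approximation at bucket boundaries, since per-edge scores are not part of this PR; the names are illustrative.

```python
# (inclusive lower bound of bucket, INFERRED edge count) from the production audit.
AUDIT_BUCKETS = [(0.0, 0), (0.4, 5807), (0.6, 14), (0.8, 4308)]


def edges_kept(threshold):
    """Count edges in buckets whose lower bound is at or above `threshold`."""
    return sum(count for lower, count in AUDIT_BUCKETS if lower >= threshold)


for threshold in (0.4, 0.6, 0.8):
    print(threshold, edges_kept(threshold))
# 0.4 10129
# 0.6 4322
# 0.8 4308
```

Raising the cutoff from 0.6 to 0.8 removes only 14 edges, while moving it from 0.4 to 0.6 removes 5,807: a filter on this data has effectively two settings.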
Approach
Models follow discrete rubrics far better than continuous ranges (well-documented in calibration literature; same reason MCQ rubrics outperform 0-100 numeric scales). The new set has 5 anchor points at non-round midpoints (0.55, 0.65, 0.75, 0.85, 0.95) — non-round to discourage 0.5 as a fallback default, 5 levels to give meaningful gradation without being too granular.
New rubric:
```
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).
Models follow discrete rubrics better than continuous ranges; the bimodal
distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
range guidance is being collapsed to a binary. If no value above fits, mark
the edge AMBIGUOUS rather than picking 0.4 or below.
```
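Because the change is prompt-only, conformance can only be checked on extraction output. A minimal validator sketch follows, assuming the label and score are available as plain values; the function name is hypothetical, and the EXTRACTED and AMBIGUOUS rules are taken from the short-form prompt line (`EXTRACTED=1.0`, `AMBIGUOUS=0.1-0.3`).

```python
import math

# The forced-rank set for INFERRED edges introduced by this PR.
INFERRED_SCORES = (0.55, 0.65, 0.75, 0.85, 0.95)


def score_problem(confidence, score):
    """Return None if `score` is valid for the label, else a short message."""
    if confidence == "EXTRACTED":
        ok = math.isclose(score, 1.0)
    elif confidence == "INFERRED":
        # isclose tolerates float noise from JSON round-trips.
        ok = any(math.isclose(score, allowed) for allowed in INFERRED_SCORES)
    elif confidence == "AMBIGUOUS":
        ok = 0.1 <= score <= 0.3
    else:
        return f"unknown confidence label: {confidence!r}"
    return None if ok else f"{confidence} edge has out-of-rubric score {score}"
```

Running this over a re-extracted graph and counting non-`None` results would give a quick measure of how closely subagents follow the new rubric.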
Files changed
Test plan