Closes #540: forced-rank INFERRED confidence scores #546

Open
saxster wants to merge 1 commit into safishamsi:v5 from saxster:fix/540-confidence-calibration

@saxster saxster commented Apr 25, 2026

Closes #540.

Summary

Replaces the continuous confidence-range guidance (`0.4-0.5`, `0.6-0.7`, `0.8-0.9`) with a forced-rank discrete set ({0.55, 0.65, 0.75, 0.85, 0.95}) across all 10 skill files (`skill.md` plus the nine `skill-*.md` variants).

Why

Production audit on a 10,129-INFERRED-edge graph showed the confidence_score distribution is bimodal, not graded:

| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4         |     0 | 0%            |
| 0.4-0.6      | 5,807 | 57%           |
| 0.6-0.8      |    14 | 0.1%          |
| 0.8+         | 4,308 | 42%           |

Subagents collapse the continuous `0.4-0.9` guidance to a binary: 0.5 for "uncertain", 0.85+ for "confident", almost nothing in between. The intermediate range that would let downstream filtering be a continuum is essentially empty — the calibration the prompt promises does not materialize.
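
A minimal sketch of how the bucket counts in the table could be reproduced from a list of INFERRED scores. The function name `bucket_histogram`, the half-open bucket boundaries, and the input format (a plain list of floats) are assumptions for illustration; the audit tooling itself is not shown in this PR.

```python
from collections import Counter

# Buckets mirror the audit table: <0.4, 0.4-0.6, 0.6-0.8, 0.8+.
# Treating each bucket as half-open [lo, hi) is an assumption.
BUCKETS = [
    ("<0.4", 0.0, 0.4),
    ("0.4-0.6", 0.4, 0.6),
    ("0.6-0.8", 0.6, 0.8),
    ("0.8+", 0.8, float("inf")),
]


def bucket_histogram(scores):
    """Return (label, count, percent) rows for a list of confidence scores."""
    counts = Counter()
    for s in scores:
        for label, lo, hi in BUCKETS:
            if lo <= s < hi:
                counts[label] += 1
                break
    total = len(scores)
    return [
        (label, counts[label], 100.0 * counts[label] / total if total else 0.0)
        for label, _, _ in BUCKETS
    ]


# Toy data shaped like the bimodal result: mass at 0.5 and at 0.85+.
for row in bucket_histogram([0.5] * 4 + [0.7] + [0.85] * 3 + [0.9] * 2):
    print(row)
```

A near-empty middle bucket in this output is the signature the audit describes.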

Approach

Models follow discrete rubrics far better than continuous ranges (well-documented in calibration literature; same reason MCQ rubrics outperform 0-100 numeric scales). The new set has 5 anchor points at non-round midpoints (0.55, 0.65, 0.75, 0.85, 0.95) — non-round to discourage 0.5 as a fallback default, 5 levels to give meaningful gradation without being too granular.

New rubric:

```
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
  0.95 direct structural evidence (shared data structure, named cross-file reference).
  0.85 strong inference (clear functional alignment, no direct symbol link).
  0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
  0.65 weak inference (thematically related, no shape evidence).
  0.55 speculative but plausible (surface-level co-occurrence only).
  Models follow discrete rubrics better than continuous ranges; the bimodal
  distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
  range guidance is being collapsed to a binary. If no value above fits, mark
  the edge AMBIGUOUS rather than picking 0.4 or below.
```
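
For illustration only, the rubric could also be checked mechanically on extracted edges. The sketch below assumes a hypothetical edge shape with a `confidence` label and a `confidence_score` number; only `confidence_score` and the three labels appear in this PR, and the AMBIGUOUS range of 0.1-0.3 is taken from the short-form skill files. This PR does not add such a check.

```python
# Allowed INFERRED values, copied from the rubric above.
INFERRED_LEVELS = (0.55, 0.65, 0.75, 0.85, 0.95)


def check_edge(edge):
    """Return a list of problems for one edge dict; an empty list means it passes.

    Assumed (hypothetical) shape: {"confidence": "INFERRED", "confidence_score": 0.75}.
    """
    label = edge.get("confidence")
    score = edge.get("confidence_score")
    if score is None:
        return ["confidence_score is missing"]
    if label == "EXTRACTED":
        return [] if score == 1.0 else [f"EXTRACTED must be 1.0, got {score}"]
    if label == "INFERRED":
        # A small tolerance keeps the check robust to float formatting.
        if any(abs(score - level) < 1e-9 for level in INFERRED_LEVELS):
            return []
        return [f"INFERRED must be one of {INFERRED_LEVELS}, got {score}"]
    if label == "AMBIGUOUS":
        return [] if 0.1 <= score <= 0.3 else [f"AMBIGUOUS must be in 0.1-0.3, got {score}"]
    return [f"unknown confidence label: {label!r}"]


print(check_edge({"confidence": "INFERRED", "confidence_score": 0.5}))
print(check_edge({"confidence": "INFERRED", "confidence_score": 0.75}))
```

The first call reports a problem because 0.5 is not in the set; the second returns an empty list.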

Files changed

  • 7 long-form skill files (skill.md, skill-codex.md, skill-copilot.md, skill-droid.md, skill-opencode.md, skill-windows.md, skill-trae.md): full forced-rank table.
  • 3 short-form skill files (skill-claw.md, skill-aider.md, skill-kiro.md): inline set notation `INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95}`.

Test plan

  • Re-run extraction on the same 10,129-edge corpus that produced the bimodal distribution; expect the histogram to fill the 0.55-0.85 buckets with the previous 0.5 mass redistributed across them
  • Pure prompt edit — no Python code changes, no test suite impact
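
The re-extraction check in the test plan could be scripted roughly as follows. This is a sketch under stated assumptions: `rubric_report` is a hypothetical helper, and the INFERRED scores are assumed to be already loaded into a list of floats.

```python
from collections import Counter

RUBRIC = (0.55, 0.65, 0.75, 0.85, 0.95)


def rubric_report(scores):
    """Return (per-level counts, off-rubric scores) for a list of INFERRED scores."""
    per_level = Counter()
    off_rubric = []
    for s in scores:
        # Exact membership: a score parsed from "0.55" equals the literal 0.55.
        if s in RUBRIC:
            per_level[s] += 1
        else:
            off_rubric.append(s)
    return {level: per_level[level] for level in RUBRIC}, off_rubric


levels, off = rubric_report([0.55, 0.65, 0.65, 0.5, 0.85, 0.9])
print(levels)  # counts per rubric level, including zeros
print(off)     # scores that ignored the rubric
```

If the prompt change works as intended, the off-rubric list should be close to empty and the counts should no longer pile up at two levels.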

Closes safishamsi#540.

Production audit on a 10,129-edge graph showed the INFERRED
confidence_score distribution is bimodal, not graded:

  | Score bucket | Count | % of INFERRED |
  |--------------|-------|---------------|
  | <0.4         |     0 | 0%            |
  | 0.4-0.6      | 5,807 | 57%           |
  | 0.6-0.8      |    14 | 0.1%          |
  | 0.8+         | 4,308 | 42%           |

Subagents collapse the continuous "0.4-0.9" guidance to a binary:
0.5 for "uncertain", 0.85+ for "confident", almost nothing in between.
Downstream filtering by confidence is therefore an on/off switch, not
the gradient the prompt promises.

Replace continuous ranges with a forced-rank discrete set:

  0.95  direct structural evidence
  0.85  strong inference
  0.75  reasonable inference
  0.65  weak inference
  0.55  speculative but plausible

Models follow discrete rubrics far better than continuous ranges
(documented in calibration literature; same reason MCQ rubrics
outperform 0-100 scales). The set is anchored at non-round midpoints
to discourage 0.5 as a default.

Applied uniformly across all 10 skill-*.md files:
- 7 long-form (skill.md, skill-codex.md, skill-copilot.md,
  skill-droid.md, skill-opencode.md, skill-windows.md, skill-trae.md):
  full forced-rank table.
- 3 short-form (skill-claw.md, skill-aider.md, skill-kiro.md):
  inline set notation INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95}.

Pure prompt edit — no code changes, no test impact. Effect is
observable only via re-extraction and inspection of the new
confidence_score distribution.
Copilot AI review requested due to automatic review settings April 25, 2026 05:03

Copilot AI left a comment

Pull request overview

Updates Graphify’s extraction skill prompts to improve calibration of INFERRED edge confidence_score by replacing continuous score ranges with a forced-rank discrete rubric, aligning with observed bimodal production behavior and aiming to produce a more useful distribution for downstream filtering.

Changes:

  • Replaced INFERRED confidence_score continuous guidance with a discrete forced-rank set {0.55, 0.65, 0.75, 0.85, 0.95} (and “never 0.5”) in all skill prompt variants.
  • Added a short rationale in long-form skill files explaining why discrete rubrics are preferred and when to mark edges AMBIGUOUS instead.
  • Updated short-form skill files to express the discrete set inline.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| graphify/skill.md | Replaces INFERRED confidence scoring ranges with discrete forced-rank rubric + rationale. |
| graphify/skill-codex.md | Same forced-rank INFERRED confidence rubric for Codex skill prompt. |
| graphify/skill-copilot.md | Same forced-rank INFERRED confidence rubric for Copilot skill prompt. |
| graphify/skill-droid.md | Same forced-rank INFERRED confidence rubric for Droid skill prompt. |
| graphify/skill-opencode.md | Same forced-rank INFERRED confidence rubric for OpenCode skill prompt. |
| graphify/skill-windows.md | Same forced-rank INFERRED confidence rubric for Windows skill prompt. |
| graphify/skill-trae.md | Same forced-rank INFERRED confidence rubric for Trae skill prompt. |
| graphify/skill-claw.md | Inline discrete-set notation for INFERRED confidence scoring. |
| graphify/skill-aider.md | Inline discrete-set notation for INFERRED confidence scoring. |
| graphify/skill-kiro.md | Inline discrete-set notation for INFERRED confidence scoring. |


Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
Reasonable inference with some uncertainty: 0.6-0.7.
Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
- INFERRED edges: pick exactly ONE value from this set — never 0.5:

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
- INFERRED edges: pick exactly ONE value from this set — never 0.5. This
discrete set applies to ALL inferred edge types, including semantically_similar_to;
do not use continuous ranges such as 0.6-0.95 for any inferred edge:
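
The kind of leftover range guidance this comment describes could be caught with a simple text scan. A minimal sketch, assuming the prompt text is already loaded into a string: `find_inferred_ranges` is a hypothetical helper, and the `min_upper=0.4` cutoff is an assumption chosen so that the AMBIGUOUS range `0.1-0.3` quoted from the short-form files is not flagged.

```python
import re

# Matches decimal ranges such as "0.6-0.95" or "0.6–0.95" (hyphen or en dash).
RANGE_RE = re.compile(r"(?<![\d.])(0\.\d+)\s*[-\u2013]\s*(0\.\d+)")


def find_inferred_ranges(prompt_text, min_upper=0.4):
    """Return (line_number, matched_text) for ranges whose upper bound is >= min_upper."""
    hits = []
    for lineno, line in enumerate(prompt_text.splitlines(), start=1):
        for m in RANGE_RE.finditer(line):
            if float(m.group(2)) >= min_upper:
                hits.append((lineno, m.group(0)))
    return hits


sample = (
    "- Semantic similarity: add semantically_similar_to INFERRED edge (confidence 0.6-0.95).\n"
    "- confidence_score REQUIRED: EXTRACTED=1.0; INFERRED in {0.55, 0.65, 0.75}; AMBIGUOUS=0.1-0.3\n"
)
print(find_inferred_ranges(sample))
```

On the sample, only the `0.6-0.95` range on the first line is reported; the discrete set and the AMBIGUOUS range pass.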

Copilot uses AI. Check for mistakes.
Comment thread graphify/skill-droid.md
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change
0.55 speculative but plausible (surface-level co-occurrence only).
0.55 speculative but plausible (surface-level co-occurrence only).
This discrete set applies to ALL INFERRED edge types, including
semantically_similar_to. Do not emit continuous confidence ranges (for
example, 0.6-0.95); map them to the single closest rubric value above.

Comment thread graphify/skill-codex.md
Comment on lines +295 to +300
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).
- INFERRED edges: pick exactly ONE value from this set — never 0.5. This forced-rank
discrete rubric applies to ALL inferred edge types, including semantically_similar_to:
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).
Do not use continuous ranges such as 0.6-0.95 for any INFERRED edge; if earlier
edge-specific guidance suggests a range, map it to the nearest value in this set.

Comment thread graphify/skill-aider.md
Comment on lines 240 to +243
- DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
- Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
- Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
- confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
- confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3

Copilot AI Apr 25, 2026

The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.
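
The midpoint ambiguity is easy to demonstrate. The sketch below uses `decimal.Decimal` so that distances are compared exactly rather than through binary floating point; `nearest_levels` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal

LEVELS = [Decimal(v) for v in ("0.55", "0.65", "0.75", "0.85", "0.95")]


def nearest_levels(score):
    """Return every rubric level at minimal distance from score (a decimal string)."""
    x = Decimal(score)
    best = min(abs(x - level) for level in LEVELS)
    return [level for level in LEVELS if abs(x - level) == best]


print(nearest_levels("0.60"))  # two candidates: 0.55 and 0.65
print(nearest_levels("0.62"))  # one candidate: 0.65
```

Because 0.60 has two equally close levels, "closest value" alone does not determine the score, which is why selecting by rubric definition is the clearer instruction.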

Comment thread graphify/skill.md
Comment on lines +338 to +341
Models follow discrete rubrics better than continuous ranges; the bimodal
distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
range guidance is being collapsed to a binary. If no value above fits, mark
the edge AMBIGUOUS rather than picking 0.4 or below.

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change
Models follow discrete rubrics better than continuous ranges; the bimodal
distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
range guidance is being collapsed to a binary. If no value above fits, mark
the edge AMBIGUOUS rather than picking 0.4 or below.
This discrete rubric also applies to semantically_similar_to edges: treat them
as INFERRED and choose exactly one value from the set above, not a continuous
range such as 0.6-0.95. Models follow discrete rubrics better than continuous
ranges; the bimodal distribution observed in production (>50% at 0.5, >40%
at 0.85+) shows the range guidance is being collapsed to a binary. If no
value above fits, mark the edge AMBIGUOUS rather than picking 0.4 or below.

Comment thread graphify/skill-windows.md
Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
Reasonable inference with some uncertainty: 0.6-0.7.
Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
- INFERRED edges: pick exactly ONE value from this set — never 0.5:

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
- INFERRED edges: pick exactly ONE value from this set — never 0.5. This
discrete rubric applies to every inferred relation, including
`semantically_similar_to`; do not use continuous ranges such as 0.6-0.95:

Comment thread graphify/skill-trae.md
Comment on lines 280 to +284
confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
- EXTRACTED edges: confidence_score = 1.0 always
- INFERRED edges: reason about each edge individually.
Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
Reasonable inference with some uncertainty: 0.6-0.7.
Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Comment thread graphify/skill-copilot.md
Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
Reasonable inference with some uncertainty: 0.6-0.7.
Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
- INFERRED edges: pick exactly ONE value from this set — never 0.5:

Copilot AI Apr 25, 2026

This file now enforces a forced-rank discrete set for all INFERRED confidence_score values, but earlier in the same prompt the semantically_similar_to guidance still instructs a continuous range (0.6–0.95). That contradiction can cause agents to ignore the forced-rank rule for semantic-similarity edges. Update that earlier section to use the same discrete set (or explicitly carve out an exception) so the prompt is internally consistent.

Suggested change
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
- INFERRED edges: pick exactly ONE value from this set — never 0.5. This
discrete rubric applies to ALL inferred edge types, including
semantically_similar_to; ignore any earlier range-based guidance such as
0.6-0.95 for semantic similarity.

Comment thread graphify/skill-claw.md
Comment on lines 240 to +243
- DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
- Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
- Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
- confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
- confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3

Copilot AI Apr 25, 2026

The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.

Comment thread graphify/skill-kiro.md
Comment on lines 239 to +242
- DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
- Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
- Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
- confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
- confidence_score REQUIRED on every edge: EXTRACTED=1.0; INFERRED ∈ {0.55, 0.65, 0.75, 0.85, 0.95} forced-rank (NEVER 0.5 — pick the closest discrete value or mark AMBIGUOUS); AMBIGUOUS=0.1-0.3

Copilot AI Apr 25, 2026

The prompt still says semantic-similarity INFERRED edges use a continuous confidence range (0.6–0.95), which conflicts with the new forced-rank discrete set for INFERRED confidence_score. Also, “pick the closest discrete value” is ambiguous for midpoints (e.g., 0.60 is equally close to 0.55 and 0.65); it would be clearer to instruct selecting based on the rubric definitions rather than numeric proximity.

Development

Successfully merging this pull request may close these issues.

Bimodal INFERRED confidence-score distribution — calibration needed

2 participants