Commit 05fd423

Lum1104 and claude committed
fix(merge): recover imports edges file-analyzer batches drop
A controlled-experiment audit on a 1240-file Python project (opensre) showed that 27.2% of resolved-internal imports never made it from project-scanner's `importMap` into the final knowledge graph. Of the 404 source files with internal imports, 91 ended up with ZERO imports edges in the graph despite their `file:` node being present (consistent with main-session orchestrator dropping the entry from `batchImportData` during batch construction), and 104 had partial coverage (consistent with file-analyzer agent dropping rows during edge enumeration). GitHub issue #128 reported the same failure mode at 16-21% on a Go monorepo.

The fix has two layers:

1. `merge-batch-graphs.py` now runs a deterministic recovery pass after merge: for every `(source, target)` in scan-result.json's `importMap` whose source `file:` node exists in the assembled graph and whose target `file:` node also exists, emit an `imports` edge if the batches didn't already. Recovered edges are tagged `recoveredFromImportMap: true` so downstream consumers can audit which edges came from the deterministic source vs. agent emission. The merge report logs the recovered count plus how many importMap entries were skipped because their source/target had no graph node.

2. `file-analyzer.md` rewrites the imports edge rule to demand 1:1 emission with a self-check: "the number of `imports` edges in your output MUST equal `sum(batchImportData[file].length)` across the batch's code files". This drives the agent to enumerate every row instead of summarizing; recovery should report 0 when this works.

Tests: +6 cases covering the recovery path: drops, no-double-emit, missing source/target nodes, missing scan-result.json (incremental update), and self-import suppression. 770 passing (was 764).

Bumps version to 2.6.3 across the five tracked manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
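The coverage figures in the commit message (share of `importMap` rows missing from the graph, files with zero versus partial coverage) could be reproduced with a check along these lines. This is a minimal sketch, not the audit script behind the commit: the function name `import_edge_coverage` and the demo data are hypothetical, and it assumes `importMap` maps a source path to a list of resolved target paths and that graph edges use `file:<path>` ids.

```python
from typing import Any


def import_edge_coverage(
    import_map: dict[str, list[str]],
    graph: dict[str, Any],
) -> dict[str, int]:
    """Count how many importMap rows appear as `imports` edges in the graph.

    Hypothetical helper for illustration; not part of the plugin.
    """
    # Every (source, target) pair the graph already has as an imports edge.
    present = {
        (edge.get("source"), edge.get("target"))
        for edge in graph.get("edges", [])
        if edge.get("type") == "imports"
    }
    stats = {"expected": 0, "found": 0, "zero_coverage": 0, "partial_coverage": 0}
    for src, targets in import_map.items():
        if not targets:
            continue  # files without internal imports are not part of the audit
        hits = sum((f"file:{src}", f"file:{tgt}") in present for tgt in targets)
        stats["expected"] += len(targets)
        stats["found"] += hits
        if hits == 0:
            stats["zero_coverage"] += 1
        elif hits < len(targets):
            stats["partial_coverage"] += 1
    return stats


if __name__ == "__main__":
    # Made-up data: a.py keeps one of two imports, b.py loses its only import.
    demo_map = {"a.py": ["b.py", "c.py"], "b.py": ["c.py"], "c.py": []}
    demo_graph = {
        "edges": [{"source": "file:a.py", "target": "file:b.py", "type": "imports"}]
    }
    print(import_edge_coverage(demo_map, demo_graph))
```

In the demo data, `a.py` counts as partial coverage and `b.py` as zero coverage; the drop rate is `1 - found / expected`.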
1 parent c49c46d commit 05fd423

8 files changed

Lines changed: 303 additions & 6 deletions

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "understand-anything",
   "description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-  "version": "2.6.2",
+  "version": "2.6.3",
   "author": {
     "name": "Lum1104"
   },

.copilot-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "understand-anything",
   "description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-  "version": "2.6.2",
+  "version": "2.6.3",
   "author": {
     "name": "Lum1104"
   },

.cursor-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
   "name": "understand-anything",
   "displayName": "Understand Anything",
   "description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-  "version": "2.6.2",
+  "version": "2.6.3",
   "author": {
     "name": "Lum1104"
   },

understand-anything-plugin/.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "understand-anything",
   "description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-  "version": "2.6.2",
+  "version": "2.6.3",
   "author": {
     "name": "Lum1104"
   },

understand-anything-plugin/agents/file-analyzer.md

Lines changed: 11 additions & 1 deletion
@@ -260,7 +260,17 @@ Using the script's structural data and file categories, create edges:
 | `related` | Non-code file is topically related to another file without a specific structural relationship | `0.5` | `forward` |
 | `depends_on` | Non-code file depends on another file (e.g., docker-compose depends on Dockerfile, CI workflow depends on Makefile targets) | `0.6` | `forward` |
 
-**Import edge creation rule for code files:** For each resolved path in `batchImportData[filePath]` (provided in the input JSON), create an `imports` edge from the current file node to `file:<resolvedPath>`. The `batchImportData` values contain only resolved project-internal paths — external packages have already been filtered out. Do NOT attempt to re-resolve imports from source.
+**Import edge creation rule for code files (1:1 emission, NO aggregation):**
+
+For every code file in this batch:
+
+1. Read its `batchImportData[filePath]` array (provided in the input JSON).
+2. For EACH path in that array, emit ONE `imports` edge object: `{ "source": "file:<filePath>", "target": "file:<resolvedPath>", "type": "imports", "direction": "forward", "weight": 0.7 }`.
+3. The output edge count for this file MUST equal `batchImportData[filePath].length`. Not 90% of it. Not "the meaningful ones". All of them.
+
+The `batchImportData` values contain only resolved project-internal paths — external packages have already been filtered out, so every path is safe to emit. Do NOT attempt to re-resolve imports from source. Do NOT skip imports because the target lives in another batch (cross-batch references are explicitly allowed for `imports` edges, since the project-scanner already verified the path exists).
+
+**Self-check before writing the batch JSON:** sum `batchImportData[file].length` across every code file in your batch. The number of `imports` edges in your output MUST equal that sum. If it doesn't, you dropped some during enumeration — go back and add them. (A deterministic post-processing pass in `merge-batch-graphs.py` will recover anything you still miss, but it is your job to get this right at emission time so the recovery report stays empty.)
 
 **Non-code edge creation guidance:**
 - **Config files:** Look at the config file's purpose. `tsconfig.json` configures all `.ts` files; `package.json` configures the build. Create `configures` edges to the most relevant entry points or directories.
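
The self-check in the rewritten rule amounts to comparing the emitted `imports` edges against `batchImportData`. A minimal sketch of that comparison follows. The helper name `missing_import_edges` is hypothetical and not part of the plugin, and it compares `(source, target)` pairs rather than only counts, which is slightly stricter than what the rule asks for.

```python
def missing_import_edges(
    batch_import_data: dict[str, list[str]],
    edges: list[dict],
) -> list[tuple[str, str]]:
    """Return the (source, target) pairs listed in batchImportData but absent
    from the batch's emitted `imports` edges. An empty list means 1:1 emission.

    Hypothetical helper for illustration only.
    """
    emitted = {
        (edge.get("source"), edge.get("target"))
        for edge in edges
        if edge.get("type") == "imports"
    }
    expected = [
        (f"file:{file_path}", f"file:{resolved}")
        for file_path, resolved_paths in batch_import_data.items()
        for resolved in resolved_paths
    ]
    return [pair for pair in expected if pair not in emitted]


if __name__ == "__main__":
    data = {"src/a.py": ["src/b.py", "src/c.py"]}
    out = [{"source": "file:src/a.py", "target": "file:src/b.py", "type": "imports"}]
    print(missing_import_edges(data, out))
```

Returning the missing pairs instead of a boolean makes it clear which rows were dropped during enumeration.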

understand-anything-plugin/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@understand-anything/skill",
-  "version": "2.6.2",
+  "version": "2.6.3",
   "type": "module",
   "main": "dist/index.js",
   "types": "dist/index.d.ts",

understand-anything-plugin/skills/understand/merge-batch-graphs.py

Lines changed: 100 additions & 0 deletions
@@ -367,6 +367,96 @@ def merge_and_normalize(batches: list[dict[str, Any]]) -> tuple[dict[str, Any],
     return assembled, report
 
 
+# ── Imports-edge recovery from importMap ──────────────────────────────────
+
+def recover_imports_from_scan(
+    assembled: dict[str, Any],
+    scan_result_path: Path,
+) -> tuple[int, list[str]]:
+    """Re-emit any `imports` edges that exist in `scan-result.json#importMap`
+    but never made it into a batch's output. The project-scanner's importMap
+    is the deterministic source of truth for resolved internal imports;
+    file-analyzer agents are expected to transcribe those into edges 1:1
+    but in practice drop ~25% of them on real projects (orchestrator-side
+    batch construction loses entries, agent-side enumeration drops more).
+
+    Returns (recovered_count, report_lines).
+    """
+    if not scan_result_path.is_file():
+        return 0, [f" importMap recovery skipped — {scan_result_path.name} not found"]
+
+    try:
+        scan = json.loads(scan_result_path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError) as e:
+        return 0, [f" importMap recovery skipped — could not parse {scan_result_path.name}: {e}"]
+
+    import_map = scan.get("importMap")
+    if not isinstance(import_map, dict):
+        return 0, [f" importMap recovery skipped — no importMap field in {scan_result_path.name}"]
+
+    # Build the set of file: node ids actually present in the assembled graph.
+    file_node_ids: set[str] = set()
+    for node in assembled["nodes"]:
+        if node.get("type") == "file":
+            file_node_ids.add(node.get("id", ""))
+
+    # Build the set of (source, target) imports edges already present.
+    existing: set[tuple[str, str]] = set()
+    for edge in assembled["edges"]:
+        if edge.get("type") == "imports":
+            existing.add((edge.get("source", ""), edge.get("target", "")))
+
+    recovered = 0
+    skipped_no_src_node = 0
+    skipped_no_tgt_node = 0
+    for src_path, targets in import_map.items():
+        if not isinstance(targets, list):
+            continue
+        src_id = f"file:{src_path}"
+        if src_id not in file_node_ids:
+            if targets:
+                skipped_no_src_node += 1
+            continue
+        for tgt_path in targets:
+            if not isinstance(tgt_path, str) or not tgt_path:
+                continue
+            tgt_id = f"file:{tgt_path}"
+            if tgt_id not in file_node_ids:
+                skipped_no_tgt_node += 1
+                continue
+            if src_id == tgt_id:
+                continue
+            if (src_id, tgt_id) in existing:
+                continue
+            assembled["edges"].append({
+                "source": src_id,
+                "target": tgt_id,
+                "type": "imports",
+                "direction": "forward",
+                "weight": 0.7,
+                "recoveredFromImportMap": True,
+            })
+            existing.add((src_id, tgt_id))
+            recovered += 1
+
+    lines: list[str] = []
+    lines.append(
+        f" Recovered {recovered} `imports` edges from importMap "
+        f"({len(import_map)} entries scanned)"
+    )
+    if skipped_no_src_node:
+        lines.append(
+            f" Skipped {skipped_no_src_node} importMap source files "
+            f"with no `file:` node in graph"
+        )
+    if skipped_no_tgt_node:
+        lines.append(
+            f" Skipped {skipped_no_tgt_node} importMap target paths "
+            f"with no `file:` node in graph"
+        )
+    return recovered, lines
+
+
 # ── Main ──────────────────────────────────────────────────────────────────
 
 def main() -> None:
@@ -411,6 +501,16 @@ def main() -> None:
     # Merge and normalize
     assembled, report = merge_and_normalize(batches)
 
+    # Recover any imports edges file-analyzer batches dropped despite
+    # `batchImportData` containing them. The project-scanner's importMap
+    # is the deterministic source of truth.
+    scan_result_path = intermediate_dir / "scan-result.json"
+    recovered, recovery_report = recover_imports_from_scan(assembled, scan_result_path)
+    if recovery_report:
+        report.append("")
+        report.append("Imports edge recovery:")
+        report.extend(recovery_report)
+
     # Print report
     print("", file=sys.stderr)
     for line in report:
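
Because recovered edges carry `recoveredFromImportMap: True`, a downstream consumer can tell them apart from agent-emitted edges. The following is a minimal sketch of such an audit, assuming the assembled graph is a dict with an `edges` list shaped like the objects appended in the recovery pass; the function name `recovered_edges_by_source` is hypothetical and not part of the plugin.

```python
from collections import Counter
from typing import Any


def recovered_edges_by_source(graph: dict[str, Any]) -> Counter:
    """Count recovered `imports` edges per source node id.

    Hypothetical audit helper: a source with a high count is a file whose
    imports the batches mostly dropped.
    """
    counts: Counter = Counter()
    for edge in graph.get("edges", []):
        if edge.get("type") == "imports" and edge.get("recoveredFromImportMap") is True:
            counts[edge.get("source", "")] += 1
    return counts


if __name__ == "__main__":
    # Made-up graph: one agent-emitted edge, two recovered edges.
    demo = {
        "edges": [
            {"source": "file:src/a.py", "target": "file:src/b.py", "type": "imports"},
            {"source": "file:src/a.py", "target": "file:src/c.py", "type": "imports",
             "recoveredFromImportMap": True},
            {"source": "file:src/a.py", "target": "file:src/d.py", "type": "imports",
             "recoveredFromImportMap": True},
        ]
    }
    for source, n in recovered_edges_by_source(demo).most_common():
        print(source, n)
```

An empty counter corresponds to the "recovery should report 0" outcome the commit message describes.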
Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
+import { describe, it, expect, beforeEach, afterEach } from "vitest";
+import { spawnSync } from "node:child_process";
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync, readFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { fileURLToPath } from "node:url";
+import { dirname, resolve } from "node:path";
+
+const __dirname = dirname(fileURLToPath(import.meta.url));
+const MERGE_SCRIPT = resolve(__dirname, "../../skills/understand/merge-batch-graphs.py");
+
+let projectRoot;
+let intermediateDir;
+
+function runMerge() {
+  const result = spawnSync("python3", [MERGE_SCRIPT, projectRoot], {
+    encoding: "utf-8",
+  });
+  if (result.status !== 0) {
+    throw new Error(`merge script failed: status=${result.status}\nstderr:\n${result.stderr}`);
+  }
+  const assembled = JSON.parse(
+    readFileSync(join(intermediateDir, "assembled-graph.json"), "utf-8"),
+  );
+  return { assembled, stderr: result.stderr };
+}
+
+function fileNode(path) {
+  return {
+    id: `file:${path}`,
+    type: "file",
+    name: path.split("/").pop(),
+    filePath: path,
+    summary: "",
+    tags: [],
+    complexity: "simple",
+  };
+}
+
+function importsEdge(src, tgt) {
+  return {
+    source: `file:${src}`,
+    target: `file:${tgt}`,
+    type: "imports",
+    direction: "forward",
+    weight: 0.7,
+  };
+}
+
+beforeEach(() => {
+  projectRoot = mkdtempSync(join(tmpdir(), "ua-merge-test-"));
+  intermediateDir = join(projectRoot, ".understand-anything", "intermediate");
+  mkdirSync(intermediateDir, { recursive: true });
+});
+
+afterEach(() => {
+  rmSync(projectRoot, { recursive: true, force: true });
+});
+
+describe("merge-batch-graphs.py imports recovery", () => {
+  it("recovers imports edges that batches dropped despite importMap having them", () => {
+    // Batch contains all the file nodes but only emits ONE of three imports edges.
+    writeFileSync(
+      join(intermediateDir, "batch-0.json"),
+      JSON.stringify({
+        nodes: [fileNode("src/a.py"), fileNode("src/b.py"), fileNode("src/c.py"), fileNode("src/d.py")],
+        edges: [importsEdge("src/a.py", "src/b.py")],
+      }),
+    );
+    // scan-result.json has the full importMap — agent dropped 2/3 of these.
+    writeFileSync(
+      join(intermediateDir, "scan-result.json"),
+      JSON.stringify({
+        importMap: {
+          "src/a.py": ["src/b.py", "src/c.py", "src/d.py"],
+          "src/b.py": [],
+        },
+      }),
+    );
+
+    const { assembled, stderr } = runMerge();
+    const importsEdges = assembled.edges.filter((e) => e.type === "imports");
+    expect(importsEdges).toHaveLength(3);
+    const targets = new Set(importsEdges.map((e) => e.target));
+    expect(targets).toEqual(new Set(["file:src/b.py", "file:src/c.py", "file:src/d.py"]));
+    // Recovered edges are tagged so downstream consumers can audit.
+    const recovered = importsEdges.filter((e) => e.recoveredFromImportMap);
+    expect(recovered).toHaveLength(2);
+    expect(stderr).toContain("Recovered 2 `imports` edges");
+  });
+
+  it("does not duplicate edges the batch already emitted", () => {
+    writeFileSync(
+      join(intermediateDir, "batch-0.json"),
+      JSON.stringify({
+        nodes: [fileNode("src/a.py"), fileNode("src/b.py")],
+        edges: [importsEdge("src/a.py", "src/b.py")],
+      }),
+    );
+    writeFileSync(
+      join(intermediateDir, "scan-result.json"),
+      JSON.stringify({
+        importMap: { "src/a.py": ["src/b.py"], "src/b.py": [] },
+      }),
+    );
+
+    const { assembled, stderr } = runMerge();
+    const importsEdges = assembled.edges.filter((e) => e.type === "imports");
+    expect(importsEdges).toHaveLength(1);
+    expect(stderr).toContain("Recovered 0 `imports` edges");
+  });
+
+  it("skips importMap entries whose source file is missing from the graph", () => {
+    // src/missing.py is in importMap but has no file: node — must not produce a dangling edge.
+    writeFileSync(
+      join(intermediateDir, "batch-0.json"),
+      JSON.stringify({
+        nodes: [fileNode("src/b.py")],
+        edges: [],
+      }),
+    );
+    writeFileSync(
+      join(intermediateDir, "scan-result.json"),
+      JSON.stringify({
+        importMap: { "src/missing.py": ["src/b.py"] },
+      }),
+    );
+
+    const { assembled, stderr } = runMerge();
+    expect(assembled.edges.filter((e) => e.type === "imports")).toHaveLength(0);
+    expect(stderr).toContain("Skipped 1 importMap source files with no `file:` node");
+  });
+
+  it("skips importMap targets that don't have a file: node", () => {
+    writeFileSync(
+      join(intermediateDir, "batch-0.json"),
+      JSON.stringify({
+        nodes: [fileNode("src/a.py")],
+        edges: [],
+      }),
+    );
+    writeFileSync(
+      join(intermediateDir, "scan-result.json"),
+      JSON.stringify({
+        importMap: { "src/a.py": ["src/dropped.py", "src/also-missing.py"] },
+      }),
+    );
+
+    const { assembled, stderr } = runMerge();
+    expect(assembled.edges.filter((e) => e.type === "imports")).toHaveLength(0);
+    expect(stderr).toContain("Skipped 2 importMap target paths with no `file:` node");
+  });
+
+  it("works when scan-result.json is missing (incremental update path)", () => {
+    writeFileSync(
+      join(intermediateDir, "batch-0.json"),
+      JSON.stringify({
+        nodes: [fileNode("src/a.py"), fileNode("src/b.py")],
+        edges: [importsEdge("src/a.py", "src/b.py")],
+      }),
+    );
+    // No scan-result.json written.
+
+    const { assembled, stderr } = runMerge();
+    expect(assembled.edges.filter((e) => e.type === "imports")).toHaveLength(1);
+    expect(stderr).toContain("importMap recovery skipped — scan-result.json not found");
+  });
+
+  it("never produces self-import edges", () => {
+    writeFileSync(
+      join(intermediateDir, "batch-0.json"),
+      JSON.stringify({
+        nodes: [fileNode("src/a.py")],
+        edges: [],
+      }),
+    );
+    writeFileSync(
+      join(intermediateDir, "scan-result.json"),
+      JSON.stringify({
+        importMap: { "src/a.py": ["src/a.py"] }, // pathological self-reference
+      }),
+    );
+
+    const { assembled } = runMerge();
+    expect(assembled.edges.filter((e) => e.type === "imports")).toHaveLength(0);
+  });
+});
