Skip to content

Latest commit

 

History

History
176 lines (130 loc) · 7.88 KB

File metadata and controls

176 lines (130 loc) · 7.88 KB

Graphify Evaluation - Mixed Corpus (2026-04-04)

Evaluator: Claude Sonnet 4.6 (live execution) Corpus: 3 Python files + 1 markdown paper + 1 Arabic PNG image Pipeline: detect → extract (AST) → build → cluster → analyze → query → feedback loop


1. Corpus Detection

code:  [analyze.py, build.py, cluster.py]          3 files
paper: [attention_notes.md]                         1 file (arxiv signals detected)
image: [attention_arabic.png]                       1 file
total: 5 files · ~4,020 words
warning: fits in a single context window (correct - corpus is small)

Finding: attention_notes.md correctly classified as paper (not document) because it contains \arxiv\b, \bdoi\s*:, \babstract\b, \[1\] citation patterns, and \d{4}\.\d{5} (1706.03762). The paper signal heuristic works correctly.


2. AST Extraction (3 Python files)

analyze.py:  9 nodes, 9 edges
build.py:    3 nodes, 3 edges
cluster.py:  6 nodes, 7 edges
─────────────────────────────
Total:       18 nodes, 19 edges  →  graph: 20 nodes, 19 edges (2 external deps added)

3. Community Detection

Community Label Cohesion Nodes
0 Graph Analysis 0.22 analyze.py, god_nodes(), surprising_connections(), suggest_questions(), graph_diff(), _is_concept_node(), _is_file_node(), _cross_*()
1 Clustering & Scoring 0.29 cluster.py, cluster(), score_all(), cohesion_score(), build_graph(), _split_community(), graspologic
2 Graph Building 0.50 build.py, build(), build_from_json(), networkx

Finding: Communities are semantically correct - the three graphify modules map cleanly to their functional roles. build.py has the highest cohesion (0.50) because it's a tight, self-contained module. analyze.py is lowest (0.22) because its functions don't call each other - each is a standalone analysis pass, making the subgraph sparse.

Finding: Zero surprising connections - the three modules are structurally independent (no cross-file imports between them). Expected for a cleanly layered codebase.


4. Query Tests (live BFS traversal)

All three queries ran against the real graph.json, returned relevant subgraphs, and were saved to graphify-out/memory/.

Q1: "what does cluster do and how does it connect to build?"

  • BFS from cluster() reached 20 nodes (full graph - small corpus)
  • cluster.py and build.py are linked via the graspologic_partition external dep node
  • Saved: query_..._what_does_cluster_do_and_how_does_it_connect_to_bu.md

Q2: "what is graph_diff and what does it analyze?"

  • BFS from analyze.py reached 12 nodes
  • graph_diff() lives in analyze.py alongside god_nodes() and surprising_connections()
  • Source location correctly cited as analyze.py:L1
  • Saved: query_..._what_is_graph_diff_and_what_does_it_analyze.md

Q3: "how does score_all work with community detection?"

  • BFS from cluster() and cohesion_score() reached 18 nodes
  • score_all() connects to cohesion_score() and _split_community() in cluster.py
  • Saved: query_..._how_does_score_all_work_with_community_detection.md

5. Feedback Loop Test (answers filed back into library)

Memory files created: 3
  query_..._what_is_graph_diff...md           1,528 bytes
  query_..._how_does_score_all...md           1,763 bytes
  query_..._what_does_cluster...md            1,838 bytes

detect() on eval root with graphify-out/memory/ present:
  Memory files found by next scan: 3 / 3  ✓

Result: PASS. All 3 query results appear in the next detect() scan. On the next --update, these files will be extracted as nodes in the graph - closing the feedback loop. The graph grows from what you ask, not just what you add.


6. Arabic Image OCR (via Claude vision)

Image: attention_arabic.png - Arabic notes on the Transformer paper

What graphify extracts (Claude vision reads directly, no reshaper/bidi needed):

Arabic English
آلية الانتباه في نماذج اللغة الكبيرة Attention mechanism in large language models
الانتباه متعدد الرؤوس Multi-head attention
يستخدم النموذج h=8 رؤوس انتباه متوازية The model uses h=8 parallel attention heads
d_model = 512 ، d_k = d_v = 64 (hyperparameters, bilingual)
المحول: مكدس من 6 طبقات ترميز و6 طبقات فك ترميز Transformer: 6 encoder + 6 decoder layers
الترميز الموضعي Positional encoding
التطبيع الطبقي Layer normalization
المصدر: Vaswani et al., 2017 - arXiv: 1706.03762 Source citation

Nodes graphify would extract:

  • MultiHeadAttention (آلية الانتباه) - hyperparameters: h=8, d_model=512, d_k=64
  • PositionalEncoding (الترميز الموضعي) - feeds into transformer input
  • LayerNorm (التطبيع الطبقي) - applied per sublayer
  • Transformer - 6 encoder + 6 decoder stack

Key finding: Arabic text OCR works natively via Claude vision. No preprocessing, no reshaper libraries, no bidi algorithms. The model reads Arabic, Persian, Hebrew, Chinese etc. identically to English. The image node in graphify is just a path - the vision subagent does the rest.


7. Issues Found

Issue 1: Suggested questions returns empty (MINOR)

suggest_questions() requires a community_labels dict. When called with auto-generated labels on a small corpus with no AMBIGUOUS edges and no isolated nodes, it returns an empty list. The function requires more signal (AMBIGUOUS edges, bridge nodes, underexplored god nodes) to generate questions - correct behavior, but the skill should handle the empty case gracefully.

Issue 2: God nodes empty when all nodes are file-level (MINOR)

god_nodes() correctly excludes file hub nodes. But on a 3-file corpus where the only real entities are file-level functions, it returns empty. The evaluation fell back to showing degree-ranked nodes manually. Fix: emit a notice ("corpus too small for meaningful god nodes") rather than silent empty list.

Issue 3: 0 surprising connections on cleanly-layered code (NOT a bug)

The three modules don't import from each other - they're connected only through external deps (networkx, graspologic). No cross-community edges means no surprises to surface. This is correct. Surprising connections require a less-cleanly-separated codebase.


8. Scores

Dimension Score Notes
Detection accuracy 10/10 paper/code/image classified correctly, arxiv heuristic works
AST extraction 7/10 functions and file nodes correct; no cross-file edges (expected)
Community quality 9/10 3 communities map perfectly to 3 functional modules
Query traversal 8/10 BFS finds relevant nodes, source locations cited correctly
Feedback loop 10/10 query results appear in next detect() scan, 3/3
Arabic OCR 10/10 Claude vision reads RTL Arabic natively, no libraries needed

Overall: 9.0/10 - strong pass on all dimensions with a small corpus. Primary gaps are edge-level semantics (no INFERRED edges from AST-only) and god_nodes/ suggest_questions behavior on tiny corpora.


Conclusion

The core pipeline is solid. The three most important findings:

  1. The feedback loop works end-to-end. Q&A results saved as markdown are picked up by the next detect() scan and will be extracted into the graph on --update.

  2. Arabic OCR requires zero special handling. PIL creates the image, Claude reads it. The same applies to any language - no language-specific preprocessing needed.

  3. The corpus-size warning is working correctly. At 4,020 words the warning fires: "fits in a single context window - you may not need a graph." This is honest. The graph adds value at scale, not on 5-file repos.