feat: extract docstrings into node metadata and embed them for semantic search by SHudici · Pull Request #602 · tirth8205/code-review-graph

SHudici · 2026-07-03T21:58:35Z

Problem

Semantic search embeds only identifier-derived text — name, parent, params, return type, directory (_node_to_text). The one thing written in the same natural language as user queries — the docstring — is never captured anywhere in the graph. Queries that describe behavior ("parse uploaded rate sheets", "retry on transient errors") only match nodes whose names happen to share words with the query.

Fix

Parser: extract a documentation summary for every function/class into extra["docstring"] — no schema change, nodes.extra already exists and round-trips through the store:

- Python: the real docstring, matching CPython semantics: plain/r/u string first statements, including parenthesized and implicitly concatenated literals; bytes and f-strings are not docstrings (__doc__ stays None, so no key is stored). Handles both grammar shapes (bare string in the block, and expression_statement-wrapped).
- JSDoc / Javadoc / Doxygen: the /** ... */ (or /*! ... */) block directly above the definition; for exported JS/TS declarations the block above the export statement is found too.
- C#-style doc lines: /// and //!; attributes/decorators sitting between the comment and the definition are skipped.
- Rust: /// only. Inner doc comments (//!, /*! ... */) document the enclosing module/crate, never the following item, so they are dropped (leading //! lines above a /// block are trimmed).
- Go: contiguous plain // block, matching godoc.
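
As an illustration of the Rust rule, here is a minimal sketch that collects the `///` block above a definition. It is not the parser code from this PR: `rust_doc_comment` is a hypothetical helper that works on raw source lines instead of a syntax tree, and it assumes single-line attributes only.

```python
from __future__ import annotations


def rust_doc_comment(lines: list[str], def_index: int) -> str | None:
    """Return the `///` block directly above lines[def_index], or None.

    Hypothetical helper for illustration; operates on plain text lines.
    """
    i = def_index - 1
    # Skip single-line attributes such as #[inline] that sit between
    # the doc comment and the item.
    while i >= 0 and lines[i].strip().startswith("#["):
        i -= 1

    collected: list[str] = []
    # Walk upward while the lines are outer doc comments. A blank line,
    # a plain // comment, or an inner //! doc line ends the block.
    while i >= 0:
        stripped = lines[i].strip()
        # Exactly three slashes: //// is an ordinary comment in Rust.
        if stripped.startswith("///") and not stripped.startswith("////"):
            collected.append(stripped[3:].strip())
            i -= 1
        else:
            break

    if not collected:
        return None
    collected.reverse()  # lines were gathered bottom-up
    return "\n".join(collected)
```

Because the upward walk stops at the first line that is not `///`, leading `//!` lines above the block never make it into the result, and a blank line between the comment and the definition yields no docstring at all.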

Deliberately conservative: plain // comments in non-Go languages are ignored as noise, and a blank line detaches a comment from the definition. Stored value is the first paragraph (the summary by every doc convention — PEP 257, JSDoc, godoc, rustdoc), whitespace-collapsed, capped at 400 chars so multi-page docstrings don't bloat the DB or drown the name/signature terms.
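
The normalization of the stored value can be sketched roughly as follows. The function name `summarize` and the constant name are assumptions for illustration, and so is the hard truncation at the cap: the description states the 400-char limit but not how the cut is made.

```python
import re

MAX_SUMMARY_CHARS = 400  # cap stated in the PR description


def summarize(doc: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Reduce a raw docstring to a one-paragraph, single-line summary."""
    # The first paragraph is everything up to the first blank line.
    first_paragraph = re.split(r"\n\s*\n", doc.strip(), maxsplit=1)[0]
    # Collapse newlines and indentation into single spaces.
    collapsed = " ".join(first_paragraph.split())
    # Assumed behavior: plain truncation at the limit.
    return collapsed[:limit]
```

For example, `summarize("Parse uploaded rate sheets.\n\n    Args:\n        path: file")` keeps only the summary line and drops the `Args:` section.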

Embeddings: _node_to_text appends the docstring summary. Existing graphs pick this up automatically — the embedding text hash changes, so the next embed/refresh re-embeds documented nodes.
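
A rough sketch of why existing graphs re-embed on their own, under stated assumptions: the node is modeled as a plain dict, `node_to_text` only stands in for the real `_node_to_text`, and SHA-256 is an assumed choice of text hash, not necessarily the one the project uses.

```python
import hashlib


def node_to_text(node: dict) -> str:
    """Build embedding text; hypothetical stand-in for _node_to_text."""
    parts = [node["name"], node.get("params", ""), node.get("return_type", "")]
    text = " ".join(p for p in parts if p)
    # Append the docstring summary only when one was extracted, so
    # undocumented nodes produce exactly the same text as before.
    doc = node.get("extra", {}).get("docstring")
    if doc:
        text = f"{text} {doc}"
    return text


def text_hash(text: str) -> str:
    """Fingerprint of the embedding text (assumed: SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


bare = {"name": "require_auth", "params": "(request)"}
documented = dict(
    bare,
    extra={"docstring": "FastAPI dependency that validates the Bearer token against atlas_api_key."},
)

# The hash differs only for the documented node, so only that node
# would be picked up for re-embedding on the next run.
changed = text_hash(node_to_text(bare)) != text_hash(node_to_text(documented))
```

Nodes without a docstring keep their old text and hash, which is what keeps the refresh limited to documented nodes.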

Measured effect (mid-size production repo)

Parsing a 352-file Python service (4,134 non-File nodes): 812 nodes (19.6%) gain a docstring summary that was previously invisible to search — 32.3% of Functions (641/1,983), 31.6% of Classes (125/396), 2.6% of Tests. Example of what enters the embedding text that no identifier carries:

require_auth → "FastAPI dependency that validates the Bearer token against atlas_api_key."

A query like "where is the bearer token validated" previously had zero lexical overlap with this node.

Testing

New tests (17): Python function/class/method docstrings, first-paragraph selection, 400-char cap, no-docstring leaves no key, bytes/f-string rejection, parenthesized implicit concatenation; JSDoc block; exported JS declarations; plain // ignored on JS; godoc plain-// block accepted on Go; Rust /// across #[inline] plus //! and /*! inner-doc rejection; Javadoc first paragraph (param tags excluded); blank-line detachment. Plus _node_to_text includes the docstring and is byte-identical when absent. Full suite passes.
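
The Python rejection cases can be cross-checked against CPython's own behavior using the standard library's `ast.get_docstring`. This is only a reference check, not the test code from this PR; `first_docstring` is a hypothetical helper name.

```python
import ast


def first_docstring(source: str):
    """Return the docstring CPython would assign to the first definition."""
    definition = ast.parse(source).body[0]
    return ast.get_docstring(definition)


# Plain string literal: a real docstring.
plain = first_docstring('def f():\n    "Retry on transient errors."\n')

# Parenthesized, implicitly concatenated literals still count.
joined = first_docstring('def f():\n    ("Retry on " "transient errors.")\n')

# Bytes and f-strings are not docstrings, so nothing is returned.
from_bytes = first_docstring('def f():\n    b"not a docstring"\n')
from_fstring = first_docstring('def f():\n    f"not a docstring"\n')
```

Here `plain` and `joined` both hold the text `Retry on transient errors.`, while `from_bytes` and `from_fstring` are `None`.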

Follow-ups deliberately out of scope: docstrings for the JS const f = () => {} extraction site and language-specific creation sites (Elixir/Julia/Nix/R); adding docstring to the FTS index (needs an FTS5 table rebuild migration).

🤖 Generated with Claude Code

…ic search

Semantic search embeds only identifier-derived text (name, parent, params, return type, directory). The one thing written in the same natural language as user queries, the docstring, was never captured. Queries that describe behavior ("parse uploaded rate sheets") only matched nodes whose names shared words with the query.

Parser: extract a documentation summary for every function/class into extra["docstring"]. No schema change: nodes.extra already round-trips JSON through the store.

- Python: the real docstring, matching CPython semantics: plain/r/u string first statements, including parenthesized and implicitly concatenated literals; bytes and f-strings are NOT docstrings. Handles both grammar shapes (bare string in the block, and expression_statement-wrapped).
- JSDoc / Javadoc / Doxygen: the /** ... */ (or /*! ... */) block directly above the definition; for exported JS/TS declarations the block above the export statement is found too.
- C#-style doc lines: /// and //!; attributes/decorators between the comment and the definition are skipped.
- Rust: /// only. Inner doc comments (//! and /*! ... */) document the enclosing module/crate, never the following item, so they are dropped (leading //! lines above a /// block are trimmed).
- Go: contiguous plain // block, matching godoc.

Deliberately conservative: plain // comments in non-Go languages are ignored as noise, and a blank line detaches a comment from the definition. Stored value is the first paragraph, whitespace-collapsed, capped at 400 chars.

Embeddings: _node_to_text appends the docstring summary. Existing graphs pick this up automatically: the embedding text hash changes, so the next embed run re-embeds documented nodes.

Measured on a 352-file production Python service: 812 of 4,134 non-File nodes (19.6%) gain searchable documentation text (32.3% of Functions, 31.6% of Classes).

17 new tests: Python docstrings (first paragraph, cap, class+method, bytes/f-string rejection, parenthesized concatenation), JSDoc, exported JS declarations, godoc, Rust ///-across-attributes plus //! and /*! inner-doc rejection, Javadoc, blank-line detachment, and _node_to_text inclusion/absence.

SHudici wants to merge 1 commit into tirth8205:main from SHudici:feat/docstring-embeddings
