
feat: extract docstrings into node metadata and embed them for semantic search #602

Open
SHudici wants to merge 1 commit into tirth8205:main from SHudici:feat/docstring-embeddings

SHudici commented Jul 3, 2026


Problem

Semantic search embeds only identifier-derived text — name, parent, params, return type, directory (_node_to_text). The one thing written in the same natural language as user queries — the docstring — is never captured anywhere in the graph. Queries that describe behavior ("parse uploaded rate sheets", "retry on transient errors") only match nodes whose names happen to share words with the query.

Fix

Parser: extract a documentation summary for every function/class into extra["docstring"] — no schema change, nodes.extra already exists and round-trips through the store:

  • Python: the real docstring, matching CPython semantics — plain/r/u string first statements including parenthesized and implicitly concatenated literals; bytes and f-strings are not docstrings (__doc__ stays None, so no key is stored). Handles both grammar shapes (bare string in the block, and expression_statement-wrapped).
  • JSDoc / Javadoc / Doxygen: the /** ... */ (or /*! ... */) block directly above the definition; for exported JS/TS declarations the block above the export statement is found too.
  • C#-style doc lines: /// and //!; attributes/decorators sitting between the comment and the definition are skipped.
  • Rust: /// only — inner doc comments (//!, /*! ... */) document the enclosing module/crate, never the following item, so they are dropped (leading //! lines above a /// block are trimmed).
  • Go: contiguous plain // block, matching godoc.
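The Rust rule in the list above, together with the blank-line detachment this PR also applies, can be sketched on raw source lines. This is only an illustration: `rust_doc_above` and `ATTRIBUTE` are hypothetical names, and the actual parser works on syntax-tree nodes rather than text lines.

```python
import re

# Rust attribute such as #[inline]; other languages would need their own pattern.
ATTRIBUTE = re.compile(r"^\s*#\[.*\]\s*$")


def rust_doc_above(lines, def_index):
    """Collect the `///` block directly above lines[def_index], or None.

    Hypothetical helper for illustration; not the parser's real code.
    """
    collected = []
    i = def_index - 1
    # Attributes sitting between the comment and the definition are skipped.
    while i >= 0 and ATTRIBUTE.match(lines[i]):
        i -= 1
    # Walk upward while lines are outer doc comments. A blank line, a plain
    # `//` comment, or an inner `//!` comment ends the block.
    while i >= 0:
        stripped = lines[i].strip()
        is_outer_doc = stripped.startswith("///") and not stripped.startswith("////")
        if not is_outer_doc:
            break
        collected.append(stripped[3:].strip())
        i -= 1
    collected.reverse()
    return "\n".join(collected) or None


source = [
    "//! Module-level docs.",
    "/// Parse a rate sheet.",
    "/// Returns rows.",
    "#[inline]",
    "fn parse() {}",
]
print(rust_doc_above(source, 4))
```

Stopping at the first non-`///` line covers three of the stated rules at once: blank-line detachment, ignoring plain `//`, and trimming leading `//!` lines.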

Deliberately conservative: plain // comments in non-Go languages are ignored as noise, and a blank line detaches a comment from the definition. Stored value is the first paragraph (the summary by every doc convention — PEP 257, JSDoc, godoc, rustdoc), whitespace-collapsed, capped at 400 chars so multi-page docstrings don't bloat the DB or drown the name/signature terms.
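A minimal sketch of the summary normalization described above (first paragraph, whitespace-collapsed, capped at 400 chars). The function name `summarize_doc` is hypothetical, and plain truncation at the cap is an assumption; the PR does not say whether the real code truncates on a word boundary or adds an ellipsis.

```python
import re

MAX_SUMMARY_CHARS = 400  # cap stated in the PR description


def summarize_doc(raw, limit=MAX_SUMMARY_CHARS):
    """Return the first paragraph of `raw`, collapsed and capped; None if empty."""
    # A paragraph ends at the first blank line (which may contain spaces).
    first = re.split(r"\n\s*\n", raw.strip(), maxsplit=1)[0]
    # Collapse every run of whitespace, including newlines, to a single space.
    collapsed = " ".join(first.split())
    if not collapsed:
        return None
    # Assumption: hard truncation at the cap, then drop any trailing space.
    return collapsed[:limit].rstrip()


doc = """
    Parse uploaded rate sheets.
    Accepts CSV   or XLSX.

    Args:
        path: file to read.
"""
print(summarize_doc(doc))
```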

Embeddings: _node_to_text appends the docstring summary. Existing graphs pick this up automatically — the embedding text hash changes, so the next embed/refresh re-embeds documented nodes.
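To illustrate why existing graphs re-embed only documented nodes, here is a hedged sketch. `node_to_text` and `text_hash` are hypothetical stand-ins: the real `_node_to_text` format and the hash function used by the store are not shown in this PR description, and the field values below are made up.

```python
import hashlib


def node_to_text(node):
    """Hypothetical stand-in for _node_to_text: join the non-empty fields."""
    parts = [
        node.get("name"),
        node.get("parent"),
        node.get("params"),
        node.get("return_type"),
        node.get("directory"),
    ]
    # The docstring summary is appended only when present, so undocumented
    # nodes produce exactly the same text as before.
    docstring = (node.get("extra") or {}).get("docstring")
    if docstring:
        parts.append(docstring)
    return " ".join(p for p in parts if p)


def text_hash(text):
    """Assumed change detector: SHA-256 of the embedding text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


bare = {"name": "require_auth", "params": "(request)", "directory": "api"}
documented = dict(
    bare,
    extra={"docstring": "FastAPI dependency that validates the Bearer token against atlas_api_key."},
)

# Different text means a different hash, which marks the node for re-embedding.
print(text_hash(node_to_text(bare)) != text_hash(node_to_text(documented)))
```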

Measured effect (mid-size production repo)

Parsing a 352-file Python service (4,134 non-File nodes): 812 nodes (19.6%) gain a docstring summary that was previously invisible to search — 32.3% of Functions (641/1,983), 31.6% of Classes (125/396), 2.6% of Tests. Example of what enters the embedding text that no identifier carries:

require_auth → "FastAPI dependency that validates the Bearer token against atlas_api_key."

A query like "where is the bearer token validated" previously had zero lexical overlap with this node.
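The overlap claim can be checked with a crude token-set comparison. This is only a lexical proxy, since the search itself uses embeddings, and the `tokens` helper is a hypothetical tokenizer rather than anything from the codebase.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens; snake_case names split on underscores."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "where is the bearer token validated"
name_only = "require_auth"
with_doc = (
    name_only
    + " FastAPI dependency that validates the Bearer token against atlas_api_key."
)

overlap_before = tokens(query) & tokens(name_only)
overlap_after = tokens(query) & tokens(with_doc)

print(sorted(overlap_before))  # []
print(sorted(overlap_after))   # ['bearer', 'the', 'token']
```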

Testing

New tests (17): Python function/class/method docstrings, first-paragraph selection, 400-char cap, no-docstring leaves no key, bytes/f-string rejection, parenthesized implicit concatenation; JSDoc block; exported JS declarations; plain // ignored on JS; godoc plain-// block accepted on Go; Rust /// across #[inline] plus //! and /*! inner-doc rejection; Javadoc first paragraph (param tags excluded); blank-line detachment. In addition, _node_to_text includes the docstring when present and its output is byte-identical when absent. Full suite passes.
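The CPython behavior that the Python tests pin down can be reproduced with the standard `ast` module. This is a reference sketch only: `python_docstring` is a hypothetical helper, and the parser in this PR works on its own grammar nodes rather than on `ast`.

```python
import ast


def python_docstring(source):
    """Docstring of the first definition in `source`, per ast.get_docstring."""
    node = ast.parse(source).body[0]
    return ast.get_docstring(node)


cases = {
    "plain": 'def f():\n    "Plain."\n',
    "parenthesized concat": 'def f():\n    ("Split " "literal.")\n',
    "bytes": 'def f():\n    b"bytes"\n',
    "f-string": 'def f():\n    f"{1}"\n',
}

for label, source in cases.items():
    # Plain and concatenated literals yield text; bytes and f-strings yield None.
    print(label, "->", python_docstring(source))
```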

Follow-ups deliberately out of scope: docstrings for the JS const f = () => {} extraction site and language-specific creation sites (Elixir/Julia/Nix/R); adding docstring to the FTS index (needs an FTS5 table rebuild migration).

🤖 Generated with Claude Code

…ic search

Semantic search embeds only identifier-derived text (name, parent,
params, return type, directory). The one thing written in the same
natural language as user queries — the docstring — was never captured.
Queries that describe behavior ("parse uploaded rate sheets") only
matched nodes whose names shared words with the query.

Parser: extract a documentation summary for every function/class into
extra["docstring"] — no schema change, nodes.extra already round-trips
JSON through the store.

- Python: the real docstring, matching CPython semantics — plain/r/u
  string first statements including parenthesized and implicitly
  concatenated literals; bytes and f-strings are NOT docstrings.
  Handles both grammar shapes (bare string in the block, and
  expression_statement-wrapped).
- JSDoc / Javadoc / Doxygen: the /** ... */ (or /*! ... */) block
  directly above the definition; for exported JS/TS declarations the
  block above the export statement is found too.
- C#-style doc lines: /// and //!; attributes/decorators between the
  comment and the definition are skipped.
- Rust: /// only. Inner doc comments (//! and /*! ... */) document
  the enclosing module/crate, never the following item, so they are
  dropped (leading //! lines above a /// block are trimmed).
- Go: contiguous plain // block, matching godoc.

Deliberately conservative: plain // comments in non-Go languages are
ignored as noise, and a blank line detaches a comment from the
definition. Stored value is the first paragraph, whitespace-collapsed,
capped at 400 chars.

Embeddings: _node_to_text appends the docstring summary. Existing
graphs pick this up automatically — the embedding text hash changes,
so the next embed run re-embeds documented nodes.

Measured on a 352-file production Python service: 812 of 4,134
non-File nodes (19.6%) gain searchable documentation text — 32.3% of
Functions, 31.6% of Classes.

17 new tests: Python docstrings (first paragraph, cap, class+method,
bytes/f-string rejection, parenthesized concatenation), JSDoc,
exported JS declarations, godoc, Rust ///-across-attributes plus
//! and /*! inner-doc rejection, Javadoc, blank-line detachment, and
_node_to_text inclusion/absence.