feat: extract docstrings into node metadata and embed them for semantic search #602
Open
SHudici wants to merge 1 commit into
Semantic search embeds only identifier-derived text (name, parent,
params, return type, directory). The one thing written in the same
natural language as user queries — the docstring — was never captured.
Queries that describe behavior ("parse uploaded rate sheets") only
matched nodes whose names shared words with the query.
Parser: extract a documentation summary for every function/class into
extra["docstring"] — no schema change, nodes.extra already round-trips
JSON through the store.
- Python: the real docstring, matching CPython semantics — plain/r/u
string first statements including parenthesized and implicitly
concatenated literals; bytes and f-strings are NOT docstrings.
Handles both grammar shapes (bare string in the block, and
expression_statement-wrapped).
- JSDoc / Javadoc / Doxygen: the /** ... */ (or /*! ... */) block
directly above the definition; for exported JS/TS declarations the
block above the export statement is found too.
- C#-style doc lines: /// and //!; attributes/decorators between the
comment and the definition are skipped.
- Rust: /// only. Inner doc comments (//! and /*! ... */) document
the enclosing module/crate, never the following item, so they are
dropped (leading //! lines above a /// block are trimmed).
- Go: contiguous plain // block, matching godoc.
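The CPython rule in the Python bullet can be checked with the standard library alone. The sketch below is not the PR's parser: it uses `ast.get_docstring` to show which first statements CPython itself treats as docstrings, and the sample functions are hypothetical.

```python
import ast

# Hypothetical sample module: one function per string form named in the list above.
SAMPLE = '''
def plain():
    "Plain docstring."

def raw():
    r"Raw docstring."

def unicode_prefixed():
    u"Unicode-prefixed docstring."

def concatenated():
    ("Implicitly " "concatenated.")

def bytes_literal():
    b"Bytes are not a docstring."

def f_string():
    f"Nor is an f-string: {1 + 1}."
'''


def docstrings_by_name(source: str) -> dict:
    """Map each top-level function or class name to its docstring (or None)."""
    tree = ast.parse(source)
    definitions = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {
        node.name: ast.get_docstring(node)
        for node in tree.body
        if isinstance(node, definitions)
    }


found = docstrings_by_name(SAMPLE)
```

Here `found` holds a string for the plain, raw, unicode-prefixed, and concatenated cases, and `None` for the bytes and f-string cases, which is the behavior the extractor is described as matching.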
Deliberately conservative: plain // comments in non-Go languages are
ignored as noise, and a blank line detaches a comment from the
definition. Stored value is the first paragraph, whitespace-collapsed,
capped at 400 chars.
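A minimal sketch of the stored-value rule, assuming the raw documentation text has already been stripped of comment markers. The function name `summarize_doc` and the choice to truncate with a plain slice are assumptions; the PR may cut at a word boundary or differ in other details.

```python
import re

MAX_SUMMARY_CHARS = 400  # cap stated in the commit message


def summarize_doc(raw: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Return the first paragraph of raw, whitespace-collapsed and capped."""
    text = raw.strip()
    if not text:
        return ""
    # A paragraph ends at the first blank (or whitespace-only) line.
    first_paragraph = re.split(r"\n\s*\n", text, maxsplit=1)[0]
    # Collapse every run of whitespace, including newlines, to one space.
    collapsed = " ".join(first_paragraph.split())
    # Assumption: hard truncation at the character limit.
    return collapsed[:limit]


example = summarize_doc(
    "Parse uploaded\n    rate sheets.\n\nArgs:\n    path: file to read\n"
)
```

With this input, `example` is `"Parse uploaded rate sheets."`; the `Args:` section is dropped because it sits in the second paragraph.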
Embeddings: _node_to_text appends the docstring summary. Existing
graphs pick this up automatically — the embedding text hash changes,
so the next embed run re-embeds documented nodes.
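The re-embedding behavior follows from hashing the embedding text. The sketch below illustrates the idea only: the field names, the space-joined layout, and SHA-256 are all assumptions, since the real `_node_to_text` and hash function are not shown here.

```python
import hashlib


def node_to_text(node: dict) -> str:
    """Build embedding text from identifier fields, then append the docstring."""
    # Hypothetical field names mirroring the list in the commit message.
    parts = [
        node.get("name", ""),
        node.get("parent", ""),
        node.get("params", ""),
        node.get("return_type", ""),
        node.get("directory", ""),
    ]
    docstring = (node.get("extra") or {}).get("docstring")
    if docstring:
        parts.append(docstring)
    return " ".join(part for part in parts if part)


def text_hash(text: str) -> str:
    """Assumed change detector: a digest of the embedding text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical node, before and after the parser stores a docstring.
bare = {"name": "check_auth", "parent": "AuthMiddleware", "directory": "api"}
documented = {**bare, "extra": {"docstring": "Validate the bearer token."}}

needs_reembed = text_hash(node_to_text(bare)) != text_hash(node_to_text(documented))
```

Nodes without a docstring produce the same text and hash as before, so only documented nodes are re-embedded on the next run.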
Measured on a 352-file production Python service: 812 of 4,134
non-File nodes (19.6%) gain searchable documentation text — 32.3% of
Functions, 31.6% of Classes.
17 new tests: Python docstrings (first paragraph, cap, class+method,
bytes/f-string rejection, parenthesized concatenation), JSDoc,
exported JS declarations, godoc, Rust ///-across-attributes plus
//! and /*! inner-doc rejection, Javadoc, blank-line detachment, and
_node_to_text inclusion/absence.
Problem
Semantic search embeds only identifier-derived text: name, parent, params, return type, directory (_node_to_text). The one thing written in the same natural language as user queries, the docstring, is never captured anywhere in the graph. Queries that describe behavior ("parse uploaded rate sheets", "retry on transient errors") only match nodes whose names happen to share words with the query.
Fix
Parser: extract a documentation summary for every function/class into extra["docstring"]. No schema change: nodes.extra already exists and round-trips through the store.
- Python: the real docstring, matching CPython semantics: plain/r/u string first statements, including parenthesized and implicitly concatenated literals; bytes and f-strings are not docstrings (__doc__ stays None, so no key is stored). Handles both grammar shapes (bare string in the block, and expression_statement-wrapped).
- JSDoc / Javadoc / Doxygen: the /** ... */ (or /*! ... */) block directly above the definition; for exported JS/TS declarations the block above the export statement is found too.
- C#-style doc lines: /// and //!; attributes/decorators sitting between the comment and the definition are skipped.
- Rust: /// only. Inner doc comments (//!, /*! ... */) document the enclosing module/crate, never the following item, so they are dropped (leading //! lines above a /// block are trimmed).
- Go: contiguous plain // block, matching godoc.
Deliberately conservative: plain // comments in non-Go languages are ignored as noise, and a blank line detaches a comment from the definition. Stored value is the first paragraph (the summary by every doc convention: PEP 257, JSDoc, godoc, rustdoc), whitespace-collapsed, capped at 400 chars so multi-page docstrings don't bloat the DB or drown the name/signature terms.
Embeddings: _node_to_text appends the docstring summary. Existing graphs pick this up automatically: the embedding text hash changes, so the next embed/refresh re-embeds documented nodes.
Measured effect (mid-size production repo)
Parsing a 352-file Python service (4,134 non-File nodes): 812 nodes (19.6%) gain a docstring summary that was previously invisible to search — 32.3% of Functions (641/1,983), 31.6% of Classes (125/396), 2.6% of Tests. Example of what enters the embedding text that no identifier carries:
A query like "where is the bearer token validated" previously had zero lexical overlap with this node.
Testing
New tests (17): Python function/class/method docstrings, first-paragraph selection, 400-char cap, no-docstring leaves no key, bytes/f-string rejection, parenthesized implicit concatenation; JSDoc block; exported JS declarations; plain // ignored on JS; godoc plain-// block accepted on Go; Rust /// across #[inline] plus //! and /*! inner-doc rejection; Javadoc first paragraph (param tags excluded); blank-line detachment. Plus _node_to_text includes the docstring and is byte-identical when absent. Full suite passes.
Follow-ups deliberately out of scope: docstrings for the JS const f = () => {} extraction site and language-specific creation sites (Elixir/Julia/Nix/R); adding docstring to the FTS index (needs an FTS5 table rebuild migration).
🤖 Generated with Claude Code