Summary
Add a search(query: str, ...) method to TripleStoreService that performs
semantic search across the knowledge graph by vectorizing each subject's TTL
representation and querying via a vector store. Results return matched
subjects with their types, ranked by similarity, optionally filtered by graph,
type, or schema/instance kind.
Motivation
Today, finding things in the graph requires knowing exactly what you're looking
for and writing SPARQL. We need a way to discover entities, classes, and
properties from natural-language queries. For example, searching "Border Collie"
should surface the Dog class even when no instance literally contains that
string. An LLM-based embedding model makes that kind of match possible.
This will also unlock downstream features (agent retrieval, UI search, ontology
exploration) that don't have a fixed query shape.
Blocked by
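#879 (naas-abi-core: triple_store + ontology versioned-store adapters with RDF
merge resolver).
The vectorizer consumes per-subject TTL documents and reacts to subject-level
change events. #879 is the natural source of both, and waiting for it avoids
duplicating that responsibility inside the vectorizer.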
Design
Granularity: per-subject TTL
Each subject's full CBD/TTL representation is embedded as a single document.
This uniformly covers instances, classes (owl:Class), and properties
(owl:DatatypeProperty, owl:ObjectProperty) — a class is just another
subject whose TTL contains rdfs:label, rdfs:comment, etc. No separate
schema-term pipeline.
The class/instance/property distinction is encoded in metadata, not in
separate code paths.
Rejected alternatives:
- Per-triple: too shallow. Embeddings of single SPO triples lack context, so
recall is poor even though re-indexing would be fine-grained.
- Hybrid per-subject + per-schema-term: redundant once schema terms are
themselves subjects with TTL.
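The per-subject granularity can be illustrated with a minimal sketch. It assumes triples are plain prefixed strings and renders a simplified Turtle-like document per subject; `subject_documents` and the `ex:` data are hypothetical, and a real CBD would also follow blank nodes, which this toy version does not.

```python
from collections import defaultdict

# Hypothetical toy representation: (subject, predicate, object) as prefixed strings.
Triple = tuple[str, str, str]


def subject_documents(triples: list[Triple]) -> dict[str, str]:
    """Group triples by subject and render one Turtle-like document per subject."""
    by_subject: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))

    docs: dict[str, str] = {}
    for s, pairs in by_subject.items():
        # Sort for a stable rendering, so an unchanged subject yields the same text.
        body = " ;\n".join(f"    {p} {o}" for p, o in sorted(pairs))
        docs[s] = f"{s}\n{body} ."
    return docs


triples = [
    ("ex:Dog", "rdf:type", "owl:Class"),
    ("ex:Dog", "rdfs:label", '"Dog"'),
    ("ex:rex", "rdf:type", "ex:Dog"),
]
docs = subject_documents(triples)
print(docs["ex:Dog"])
```

The class `ex:Dog` and the instance `ex:rex` go through the same function, which is the point of the design: only the metadata attached to each document would differ.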
Indexing topology: async worker
A separate vectorizer-worker container, same image as abi, different
command. Subscribes to subject-level change events from the versionstore
(post-#879), debounces per-subject, embeds, upserts to Qdrant.
The triple store path stays untouched — insert() / remove() do not block
on embedding.
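As a rough illustration of per-subject debouncing, the sketch below tracks the last change time per subject and releases a subject only after it has been quiet for the whole window. `SubjectDebouncer` and its methods are hypothetical names, and the clock is passed in explicitly to keep the example deterministic; the actual worker may be structured quite differently.

```python
class SubjectDebouncer:
    """Collapse bursts of change events into one reindex per subject."""

    def __init__(self, window_s: float = 0.5) -> None:
        self.window_s = window_s
        self._last_change: dict[str, float] = {}

    def touch(self, subject_uri: str, now: float) -> None:
        # Each new event restarts the quiet period for that subject only.
        self._last_change[subject_uri] = now

    def pop_due(self, now: float) -> list[str]:
        """Return and forget subjects that have been quiet for the full window."""
        due = [s for s, t in self._last_change.items() if now - t >= self.window_s]
        for s in due:
            del self._last_change[s]
        return sorted(due)


debouncer = SubjectDebouncer(window_s=0.5)
debouncer.touch("ex:rex", now=0.0)
debouncer.touch("ex:Dog", now=0.0)
debouncer.touch("ex:Dog", now=0.25)    # a second edit restarts ex:Dog's window

first = debouncer.pop_due(now=0.5)     # only ex:rex has been quiet for 0.5s
second = debouncer.pop_due(now=0.75)   # ex:Dog is due 0.5s after its last edit
```

Subjects returned by `pop_due` are the ones the worker would embed and upsert; the insert path never waits on this.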
Embedder configuration
Embedder config lives in the TripleStoreService engine configuration. The
collection name (or its stored fingerprint metadata) encodes
{embedder_id, model_version, dim, normalization}. On mismatch at startup,
a new collection is created and a full reindex is triggered as a Dagster job;
the old collection serves until the new one is warm, then the swap happens.
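One way to picture the fingerprint check is sketched below. It assumes the fingerprint is hashed into the collection name; the `subjects_<digest>` naming scheme, `EmbedderFingerprint`, and `needs_reindex` are illustrative assumptions, not the project's actual API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbedderFingerprint:
    embedder_id: str
    model_version: str
    dim: int
    normalization: str

    def digest(self) -> str:
        # Canonical JSON (sorted keys) so the hash does not depend on field order.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def collection_name(fp: EmbedderFingerprint, prefix: str = "subjects") -> str:
    """Hypothetical naming scheme: '<prefix>_<fingerprint digest>'."""
    return f"{prefix}_{fp.digest()}"


def needs_reindex(active_collection: str, configured: EmbedderFingerprint) -> bool:
    """True when the serving collection was built with a different embedder config."""
    return active_collection != collection_name(configured)


old = EmbedderFingerprint("openai", "text-embedding-3-small", 1536, "l2")
new = EmbedderFingerprint("openai", "text-embedding-3-large", 3072, "l2")
active = collection_name(old)

print(needs_reindex(active, old))  # same config: keep serving the collection
print(needs_reindex(active, new))  # config changed: build a new collection
```

Because the new configuration maps to a different name, the old collection can keep serving queries while the reindex fills the new one.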
Vector metadata schema
    {
        "subject_uri": str,
        "graph_name": str,
        "types": list[str],   # rdf:type URIs
        "is_schema": bool,    # derived: types ∩ {owl:Class, owl:*Property} ≠ ∅
                              #          or graph == schema graph
        "namespace": str,
        "lang": str | None,
    }
These filters are hard to add later without re-indexing — included from day one.
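The derivation rule for is_schema could look roughly like the sketch below. It reads "owl:*Property" as "any type in the OWL namespace whose local name ends in Property", which is an interpretation rather than something the schema spells out; `derive_is_schema` and the graph URIs are hypothetical.

```python
OWL = "http://www.w3.org/2002/07/owl#"


def derive_is_schema(types: list[str], graph_name: str, schema_graph: str) -> bool:
    """True if the subject is typed as an OWL class/property or lives in the schema graph."""
    if graph_name == schema_graph:
        return True
    for t in types:
        if t == OWL + "Class":
            return True
        # Assumed reading of owl:*Property: ObjectProperty, DatatypeProperty, etc.
        if t.startswith(OWL) and t.endswith("Property"):
            return True
    return False


SCHEMA_GRAPH = "http://example.org/graph/schema"   # hypothetical graph URI
DATA_GRAPH = "http://example.org/graph/data"       # hypothetical graph URI

class_in_data_graph = derive_is_schema([OWL + "Class"], DATA_GRAPH, SCHEMA_GRAPH)
property_in_data_graph = derive_is_schema([OWL + "ObjectProperty"], DATA_GRAPH, SCHEMA_GRAPH)
instance = derive_is_schema(["http://example.org/Dog"], DATA_GRAPH, SCHEMA_GRAPH)
untyped_in_schema_graph = derive_is_schema([], SCHEMA_GRAPH, SCHEMA_GRAPH)
```

Computing the flag once at indexing time keeps the search-side filter a plain boolean match.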
Search API
    from dataclasses import dataclass

    from rdflib import URIRef  # assuming URIRef is rdflib's


    @dataclass
    class SearchHit:
        subject: URIRef
        types: list[URIRef]
        score: float
        graph: URIRef


    class TripleStoreService:
        def search(
            self,
            query: str,
            *,
            graph: URIRef | None = None,
            types: list[URIRef] | None = None,
            is_schema: bool | None = None,
            k: int = 10,
            score_threshold: float | None = None,
        ) -> list[SearchHit]:
            ...
Implementation: embed query → vector store search with metadata filters →
SPARQL hydration of ?s a ?type for each hit → return.
Phases
Phase 1: ISubjectDocumentSource port + versionstore adapter
- Port: get_subject_document(s) -> str (turtle), subscribe_changes(callback).
- Adapter: VersionStoreSubjectDocumentSource, reading per-subject TTL files
from #879's versionstore.
Phase 2: Vectorizer worker
- New app: naas_abi_core/apps/workers/vectorizer/.
- Subscribes to ISubjectDocumentSource, debounces per-subject changes
(configurable window, default ~500ms), embeds full TTL, upserts to Qdrant
with metadata schema above.
- Handles delete events: removes the subject's vector from the collection.
- New vectorizer-worker service in abi/docker-compose.yml: same image,
different command, no inbound ports, healthcheck overridden to pgrep.
- Crash-only design; relies on restart: unless-stopped + RabbitMQ acks for
redelivery.
Phase 3: Embedder config + collection fingerprinting
- Embedder config added to TripleStoreService engine config.
- Collection fingerprint stored as Qdrant collection metadata (or encoded in
name). Mismatch at worker startup → create new collection, schedule full
reindex via Dagster, swap on completion.
- Dagster job: full reindex (iterates all subjects from the document source).
Phase 4: TripleStoreService.search(...)
- Implement search method as specified above.
- SPARQL hydration step uses existing query_view / query.
- Tests covering: text match, filter by graph, filter by types, is_schema,
threshold, empty results.
Out of scope
- Re-ranking (cross-encoder): can come later if recall@k is fine but ranking
isn't.
- Multi-language query routing.
- Hybrid BM25 + vector: possible follow-up if pure vector search has gaps on
exact identifier matches.
Acceptance criteria
- TripleStoreService.search("Border Collie") returns the Dog class
(assuming standard rdfs:label / rdfs:comment are present) with a
reasonable score.
- TripleStoreService.insert() results in the affected subject's vector being
updated within the debounce window, without blocking the insert call.
- An embedder config change triggers a full reindex into a new collection
without downtime on existing search queries.
- The vectorizer-worker container restarts cleanly and resumes processing
from the bus without duplicating embeddings (idempotent upsert by
subject URI).
- Filtering by graph, types, and is_schema works as documented.
Notes on embedder choice
The "Border Collie → Dog" semantic match relies on the embedder having
sufficient world knowledge from pretraining. API-based models (OpenAI
text-embedding-3-*, Voyage, Cohere) handle this comfortably; smaller local
models (all-MiniLM, embeddinggemma) work but recall/ranking will be
weaker. Configurable per the engine config in Phase 3.