Add semantic search() to TripleStoreService via per-subject vectorization #916

@Dr0p42

Summary

Add a search(query: str, ...) method to TripleStoreService that performs
semantic search across the knowledge graph by vectorizing each subject's TTL
representation and querying via a vector store. Results return matched
subjects with their types, ranked by similarity, optionally filtered by graph,
type, or schema/instance kind.

Motivation

Today, finding things in the graph requires knowing exactly what you're looking
for and writing SPARQL. We need a way to discover entities, classes, and
properties from natural-language queries — e.g. searching "Border Collie"
should surface the Dog class even when no instance literally contains that
string, by leveraging an LLM-based embedding model.

This will also unlock downstream features (agent retrieval, UI search, ontology
exploration) that don't have a fixed query shape.

Blocked by

The vectorizer consumes per-subject TTL documents and reacts to subject-level
change events. #879 is the natural source of both, and waiting avoids
duplicating that responsibility inside the vectorizer.

Design

Granularity: per-subject TTL

Each subject's full CBD/TTL representation is embedded as a single document.
This uniformly covers instances, classes (owl:Class), and properties
(owl:DatatypeProperty, owl:ObjectProperty) — a class is just another
subject whose TTL contains rdfs:label, rdfs:comment, etc. No separate
schema-term pipeline.

The class/instance/property distinction is encoded in metadata, not in
separate code paths.

Rejected alternatives:

  • Per-triple: too shallow. Embeddings of single SPO triples lack context,
    so recall is poor even though re-indexing would be fine-grained.
  • Hybrid per-subject + per-schema-term: redundant once schema terms are
    themselves subjects with TTL.

Indexing topology: async worker

A separate vectorizer-worker container, same image as abi, different
command. Subscribes to subject-level change events from the versionstore
(post-#879), debounces per-subject, embeds, upserts to Qdrant.

The triple store path stays untouched — insert() / remove() do not block
on embedding.
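
The per-subject debounce step could be sketched as below. This is a minimal illustration, not the worker's actual implementation: SubjectDebouncer, touch and drain_ready are hypothetical names, and the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable, Dict, List


class SubjectDebouncer:
    """Releases a subject once it has been quiet for `window_s` seconds.

    Hypothetical helper for illustration; not part of naas_abi_core.
    """

    def __init__(self, window_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def touch(self, subject_uri: str) -> None:
        # Every new change event for a subject restarts its quiet period.
        self._last_seen[subject_uri] = self._clock()

    def drain_ready(self) -> List[str]:
        # Return subjects whose quiet period has elapsed, and forget them.
        now = self._clock()
        ready = [s for s, t in self._last_seen.items()
                 if now - t >= self._window_s]
        for s in ready:
            del self._last_seen[s]
        return ready
```

A worker loop would call touch() for each incoming change event and periodically embed whatever drain_ready() returns, so a burst of inserts on one subject produces a single embedding call.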

Embedder configuration

Embedder config lives in the TripleStoreService engine configuration. The
collection name (or its stored fingerprint metadata) encodes
{embedder_id, model_version, dim, normalization}. On mismatch at startup,
a new collection is created and a full reindex is triggered as a Dagster job;
the old collection serves until the new one is warm, then the swap happens.
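
One way to encode the fingerprint in the collection name is sketched below. EmbedderFingerprint, collection_name, needs_reindex and the "subjects" prefix are assumptions made for illustration; the issue does not fix the naming scheme.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbedderFingerprint:
    embedder_id: str
    model_version: str
    dim: int
    normalization: str


def collection_name(fp: EmbedderFingerprint, prefix: str = "subjects") -> str:
    # Canonical JSON (sorted keys, fixed separators) so that the same
    # configuration always produces the same digest.
    canonical = json.dumps(asdict(fp), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


def needs_reindex(active_collection: str, fp: EmbedderFingerprint) -> bool:
    # A name mismatch means the indexed vectors were produced by a
    # different embedder configuration than the one now configured.
    return active_collection != collection_name(fp)
```

Hashing keeps the name short and free of characters a vector store might reject; storing the full fingerprint as collection metadata alongside it would keep the name human-debuggable.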

Vector metadata schema

{
  "subject_uri":  str,
  "graph_name":   str,
  "types":        list[str],   # rdf:type URIs
  "is_schema":    bool,        # derived: types ∩ {owl:Class, owl:*Property} ≠ ∅
                               # or graph == schema graph
  "namespace":    str,
  "lang":         str | None,
}

These metadata fields are hard to add later without re-indexing, so they are
included from day one.
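
A sketch of how the payload and the derived is_schema flag might be built. The helper names, the schema_graph parameter, and the rule for deriving namespace (everything up to the last '#' or '/') are assumptions, not part of the proposal.

```python
from typing import List, Optional, TypedDict

OWL = "http://www.w3.org/2002/07/owl#"


class VectorMetadata(TypedDict):
    subject_uri: str
    graph_name: str
    types: List[str]
    is_schema: bool
    namespace: str
    lang: Optional[str]


def is_schema_type(type_uri: str) -> bool:
    # Matches owl:Class and any owl:*Property term.
    return type_uri == OWL + "Class" or (
        type_uri.startswith(OWL) and type_uri.endswith("Property")
    )


def namespace_of(uri: str) -> str:
    # Assumption: the namespace ends at the last '#' or '/' in the URI.
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri


def build_metadata(subject_uri: str, graph_name: str, types: List[str],
                   schema_graph: str, lang: Optional[str] = None) -> VectorMetadata:
    # is_schema holds if any type is a schema type or the subject lives
    # in the schema graph.
    is_schema = graph_name == schema_graph or any(is_schema_type(t) for t in types)
    return {
        "subject_uri": subject_uri,
        "graph_name": graph_name,
        "types": list(types),
        "is_schema": is_schema,
        "namespace": namespace_of(subject_uri),
        "lang": lang,
    }
```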

Search API

def search(
    self,
    query: str,
    *,
    graph: URIRef | None = None,
    types: list[URIRef] | None = None,
    is_schema: bool | None = None,
    k: int = 10,
    score_threshold: float | None = None,
) -> list[SearchHit]:
    ...

@dataclass
class SearchHit:
    subject: URIRef
    types: list[URIRef]
    score: float
    graph: URIRef

Implementation: embed query → vector store search with metadata filters →
SPARQL hydration of ?s a ?type for each hit → return.
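
The filter-and-rank part of this pipeline can be illustrated with an in-memory stand-in for the vector store. Everything here is a sketch under assumptions: IndexedSubject, Hit and search_index are hypothetical names, the embedder is any callable returning a vector, the types filter is treated as "any of", and types are read from stored metadata rather than hydrated via SPARQL.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class IndexedSubject:
    subject: str
    graph: str
    types: List[str]
    is_schema: bool
    vector: List[float]


@dataclass
class Hit:
    subject: str
    types: List[str]
    score: float
    graph: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_index(index: Sequence[IndexedSubject],
                 embed: Callable[[str], Sequence[float]],
                 query: str, *,
                 graph: Optional[str] = None,
                 types: Optional[List[str]] = None,
                 is_schema: Optional[bool] = None,
                 k: int = 10,
                 score_threshold: Optional[float] = None) -> List[Hit]:
    qv = embed(query)
    hits: List[Hit] = []
    for doc in index:
        # Metadata filters are applied before scoring.
        if graph is not None and doc.graph != graph:
            continue
        if types is not None and not set(types) & set(doc.types):
            continue
        if is_schema is not None and doc.is_schema != is_schema:
            continue
        score = cosine(qv, doc.vector)
        if score_threshold is not None and score < score_threshold:
            continue
        hits.append(Hit(doc.subject, doc.types, score, doc.graph))
    # Highest similarity first, truncated to the top k.
    hits.sort(key=lambda h: h.score, reverse=True)
    return hits[:k]
```

A real vector store would evaluate the same filters server-side; the linear scan here only shows the intended semantics of each parameter.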

Phases

Phase 1 — ISubjectDocumentSource port + versionstore adapter

Phase 2 — Vectorizer worker

  • New app: naas_abi_core/apps/workers/vectorizer/.
  • Subscribes to ISubjectDocumentSource, debounces per-subject changes
    (configurable window, default ~500ms), embeds full TTL, upserts to Qdrant
    with metadata schema above.
  • Handles delete events: removes the subject's vector from the collection.
  • New vectorizer-worker service in abi/docker-compose.yml — same image,
    different command, no inbound ports, healthcheck overridden to pgrep.
  • Crash-only design; relies on restart: unless-stopped + RabbitMQ acks for
    redelivery.
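
Redelivery is only safe if processing the same event twice has no extra effect. One common approach, sketched here as an assumption rather than the chosen design, is to derive the point ID deterministically from the subject URI. Qdrant accepts UUIDs as point IDs, so a UUIDv5 of the URI would fit. The event shape and the dict-backed store are hypothetical stand-ins for the bus message and the collection.

```python
import uuid
from typing import Any, Dict


def point_id_for(subject_uri: str) -> str:
    # uuid5 is a pure function of (namespace, name), so redelivered events
    # for the same subject always map to the same point ID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, subject_uri))


def apply_event(store: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Apply one change event to an in-memory stand-in for the collection.

    Hypothetical event shape: {"kind": "upsert" | "delete",
    "subject_uri": str, "vector": list[float]}.
    """
    pid = point_id_for(event["subject_uri"])
    if event["kind"] == "delete":
        # Deleting an already-absent point is a no-op.
        store.pop(pid, None)
    else:
        # Upsert overwrites in place, so replays never create duplicates.
        store[pid] = {
            "vector": event["vector"],
            "payload": {"subject_uri": event["subject_uri"]},
        }
```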

Phase 3 — Embedder config + collection fingerprinting

  • Embedder config added to TripleStoreService engine config.
  • Collection fingerprint stored as Qdrant collection metadata (or encoded in
    name). Mismatch at worker startup → create new collection, schedule full
    reindex via Dagster, swap on completion.
  • Dagster job: full reindex (iterates all subjects from the document source).

Phase 4 — TripleStoreService.search(...)

  • Implement search method as specified above.
  • SPARQL hydration step uses existing query_view / query.
  • Tests covering: text match, filter by graph, filter by types, is_schema,
    threshold, empty results.
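
The hydration step might batch all hits into a single query using a SPARQL VALUES clause instead of issuing one query per hit. That batching is an assumption, as is the build_hydration_query helper; the resulting string would be passed to the existing query method.

```python
from typing import Iterable, Optional

# Characters that are not allowed inside a SPARQL IRI reference.
_FORBIDDEN = set('<>"{}|^`\\')


def _iri(uri: str) -> str:
    # Reject anything that could break out of the <...> delimiters.
    if not uri or any(c in _FORBIDDEN or c.isspace() for c in uri):
        raise ValueError(f"not a safe IRI: {uri!r}")
    return f"<{uri}>"


def build_hydration_query(subjects: Iterable[str],
                          graph: Optional[str] = None) -> str:
    values = " ".join(_iri(s) for s in subjects)
    if not values:
        raise ValueError("no subjects to hydrate")
    pattern = "?s a ?type ."
    if graph is not None:
        # Restrict type lookup to the graph the search was filtered on.
        pattern = f"GRAPH {_iri(graph)} {{ {pattern} }}"
    return f"SELECT ?s ?type WHERE {{ VALUES ?s {{ {values} }} {pattern} }}"
```

Callers would skip hydration entirely when the vector search returns no hits, which is why an empty subject list is treated as an error here.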

Out of scope

  • Re-ranking (cross-encoder) — can come later if recall@k is fine but ranking
    isn't.
  • Multi-language query routing.
  • Hybrid BM25 + vector — possible follow-up if pure vector search has gaps on
    exact identifier matches.

Acceptance criteria

  • TripleStoreService.search("Border Collie") returns the Dog class
    (assuming standard rdfs:label / rdfs:comment are present) with a
    reasonable score.
  • Inserting a new triple via TripleStoreService.insert() results in the
    affected subject's vector being updated shortly after the debounce
    window elapses, without blocking the insert call.
  • Changing the embedder config triggers a full reindex into a new
    collection without downtime on existing search queries.
  • vectorizer-worker container restarts cleanly and resumes processing
    from the bus without duplicating embeddings (idempotent upsert by
    subject URI).
  • Filtering by graph, types, and is_schema works as documented.

Notes on embedder choice

The "Border Collie → Dog" semantic match relies on the embedder having
sufficient world knowledge from pretraining. API-based models (OpenAI
text-embedding-3-*, Voyage, Cohere) handle this comfortably; smaller local
models (all-MiniLM, embeddinggemma) work but recall/ranking will be
weaker. Configurable per the engine config in Phase 3.
