Skip to content

research: raw SPARQL → Ad4mModel migration inventory + analysis#605

Draft
HexaField wants to merge 9 commits into
refactor/literal-channel-v-separationfrom
research/sparql-to-ad4m-model-migration
Draft

research: raw SPARQL → Ad4mModel migration inventory + analysis#605
HexaField wants to merge 9 commits into
refactor/literal-channel-v-separationfrom
research/sparql-to-ad4m-model-migration

Conversation

@HexaField

Copy link
Copy Markdown
Contributor

Summary

Inventory + analysis of every perspective.querySparql<T>() call site in packages/api/src/, categorised by whether and how each could be expressed via Ad4mModel.findAll. Identifies 6 prioritised Ad4mModel features that would unlock the conversions, and lists the call sites that should stay as raw SPARQL with reasons.

Full document at docs/sparql-to-ad4m-model-migration.md.

cc @lucksus @jhweir — stacks logically on top of coasys/ad4m#837, #842, #846 and interacts with #812 (named graphs).

What this PR is

Stage 1 of a multi-stage investigation: pure documentation. No code changes.

The user asked for SPARQL→Ad4mModel conversions with performance comparison, but the scope (28 call sites across 6 files, mixing reifier-metadata reads, set-difference, multi-hop joins, and inter-class traversals) is large enough that the responsible first deliverable is a structured catalog with explicit categorization and capability-gap analysis. Going straight to conversion + benchmarking without that grounding would have us picking conversion candidates blindly.

Findings (one-line summary)

  • 5 sites (Category A) are trivially convertible to findAll({ where }) today — likely a slight perf regression because the model-query builder adds conformance joins the raw SPARQL skips. Trade-off.
  • 10 sites (Category B) need one or two new Ad4mModel features but are otherwise straightforward — likely a perf win over the status quo, because most currently follow a raw-SPARQL + N+1 getExpression() pattern that batched include: would collapse to one round-trip.
  • 4 sites (Category C) read reifier metadata (per-link author/timestamp). Better as raw SPARQL unless Ad4mModel grows a per-link meta: projection.
  • 2 sites (Category D) implement set-difference (unprocessedItems). Keep as raw SPARQL — FILTER NOT EXISTS hits the planner cliff ac57680b9 warned about; would degrade to the same plan via where: { NOT: {...} }.
  • 4 sites (Category E) do inter-class joins or bidirectional traversals not modeled by current decorators. Some convertible with the proposed features; others (e.g. `subgroupsData` ancestor walk) cleaner as raw SPARQL.

Prioritised Ad4mModel additions

Rank Feature Sites unlocked
1 `tag` as typed `@HasOne(Embedding Topic)` with discriminator
2 `@BelongsTo()` / first-class reverse relations 8
3 Multi-class polymorphic `findAll` 4
4 Per-link reifier metadata sidecar (`include: { rel: { meta: [...] } }`) 4
5 Nested `where` on relations 1
6 `UNION` across query shapes 1

Planned follow-up commits on this branch

  • Stage 2 (next): upgrade `SemanticRelationship.tag` from `@Property(string)` to `@HasOne(() => Embedding | Topic)`, with type discrimination. Adapt internal callers. Standalone mechanical change — uses existing `@HasOne`.
  • Stage 3: convert Category B sites that benefit from `include: { tag: true }`.
  • Stage 4: add a one-off Node.js benchmark harness that runs each query both ways against a seeded executor (Channel × {10, 100, 1000} items × {2, 20, 200} conversations), measures wall-clock + round-trip count, tabulates. Validate per-site whether to convert or keep as raw SPARQL.
  • Stage 5: report findings + final conversion decisions, then convert the sites where the data supports it.

Stages 3–5 are explicitly deferred — the investigation is meant to ground decisions, not pre-decide them.

Why this is a draft PR

The PR is open as a draft so the analysis can be reviewed and the categorisation challenged before I start writing code. If the conclusions on what should stay as raw SPARQL look wrong, or if a feature gap is mis-prioritised, much cheaper to argue here than after I've converted half the call sites.

The expectation is that this PR will grow with Stage 2+ commits, or it gets closed and replaced with separate per-stage PRs once we agree on direction.

28 querySparql call sites in packages/api/src/, categorised by
convertibility against the current Ad4mModel decorator + findAll API.

Categories:
- A. Trivially convertible (5 sites) — single-class findAll + where
- B. Convertible with deep include + new reverse relations (10 sites)
- C. Reifier-metadata reads where raw SPARQL stays correct (4 sites)
- D. Set-difference (2 sites) — keep SPARQL, planner cliff documented
- E. Inter-class joins not modelled (4 sites) — keep SPARQL or new
       Ad4mModel features

Identifies 6 prioritised Ad4mModel additions, ordered by call sites
unlocked:
1. tag as typed @hasone(Embedding | Topic) — unlocks 10 sites
2. @BelongsTo() / first-class reverse relations — unlocks 8
3. Multi-class polymorphic findAll — unlocks 4
4. Per-link reifier metadata sidecar — unlocks 4
5. Nested where on relations — unlocks 1
6. UNION across query shapes — unlocks 1

References AD4M #837, #842, #846 and the #812 named-graphs interaction
for the FILTER NOT EXISTS / planner cliff context.
@netlify

netlify Bot commented Jun 4, 2026

Copy link
Copy Markdown

Deploy Preview for fluxsocial-dev ready!

Name Link
🔨 Latest commit e748bfc
🔍 Latest deploy log https://app.netlify.com/projects/fluxsocial-dev/deploys/6a214c24b9e22b0008f71690
😎 Deploy Preview https://deploy-preview-605--fluxsocial-dev.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d476fe34-dcec-4de0-90a5-d6b57b2d6aa8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch research/sparql-to-ad4m-model-migration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@HexaField HexaField changed the base branch from dev to refactor/literal-channel-v-separation June 4, 2026 10:16
HexaField added 2 commits June 4, 2026 20:23
After re-checking @coasys/ad4m's decorator API, found that @hasone,
@BelongsToOne, @BelongsToMany, and 'where' on relations are all already
exposed. Three of the six recommendations in the inventory document are
flux-side-only changes rather than needing AD4M SDK PRs.

This commit implements recommendation #1:

- Adds @hasone(() => Embedding) embeddingTag and @hasone(() => Topic)
  topicTag on SemanticRelationship, both routed through the same
  flux://has_tag predicate. The conformance filter on each target
  class's @Flag (entry_type = has_embedding / has_topic) discriminates
  at hydration time.
- Keeps the existing @Property tag: string for back-compat; callers
  that want the raw IRI continue to work.
- Adds itemEmbeddingViaModel() as a demonstrator conversion sitting
  alongside the raw-SPARQL itemEmbedding(). Same return shape;
  behavioural parity caveat documented (both still need
  perspective.getExpression() for the actual vector).

Also checks in scripts/bench-sparql-vs-ad4m.ts as a documented
skeleton — connection + seed are stubs (~200 LOC of additional work to
make runnable, tracked as Stage 5 in the migration doc).

The migration doc gains an 'Implementation log' section recording
what's done, what the AD4M re-check changed about the plan, why
no perf numbers in this commit, and the staged remaining work.
…ults

The earlier `scripts/bench-sparql-vs-ad4m.py` harness was a one-off.
Migrated the empirical comparison to a proper wind tunnel scenario
(ad4m-wind-tunnel `feat/s16-sparql-vs-model`) so it's reproducible,
version-controlled, and rerunnable as a regression gate on future
`model_query` work.

Updated the "Empirical bench results" section with the s16 numbers
from `dev`. Five cases × two tiers (100 items / 1000 items):

  small  — raw SPARQL 0.21-0.67 ms, modelQuery 3.4-14.4 ms (14-31x)
  medium — raw SPARQL 0.47-7.57 ms, modelQuery 28-155 ms  (18-150x)

The three structural findings stand and now have proper data behind
them:

  1. `@HasOne` polymorphic-on-same-predicate doesn't fire — s16's
     `includeWorks` check flags it (within-5% noise tolerance vs the
     no-include baseline).
  2. `findAll` overhead scales linearly in N regardless of LIMIT 1.
  3. `model_query` is order-of-magnitude slower than raw SPARQL even
     when doing strictly less work.

Bottom-line recommendation unchanged: the right work isn't migrating
flux call sites — it's bringing down `model_query` per-instance
overhead in coasys/ad4m. S16 is now the regression gate.
HexaField added 2 commits June 4, 2026 22:27
The v1 s16 ratios (14-150x) were inflated by malformed SHACL — the
scenario emitted relations in a separate `relations: []` array which
the Rust executor's `SHACLShape` deserializer silently drops. Once
relations are emitted as `properties:` entries with `relation_kind`
set (ad4m-wind-tunnel commit 1f11143), `include actually fires: yes`
and the ratios collapse to 2.4-10.6x on medium tier (worst case
`topics_all` at 25x because the raw baseline is sub-millisecond).

Re-ran S16 against an executor patched with MODEL_QUERY_PROFILE=1
instrumentation that emits per-phase wall-clock timings. The patch
times each step in execute_model_query_inner: SPARQL build, two-phase
pagination subquery exec, properties subquery exec, count exec,
hydration, language transforms, getters, recursive includes.

For sr_by_expression_with_include (model 2.99ms; raw 0.55ms):
  twophase-pagination-exec       0.164 ms
  twophase-properties-exec       0.101 ms
  count-exec                     0.118 ms
  include sub-query              0.741 ms
  Rust orchestration            ~0.01  ms
  --- model_query work ---      ~1.13  ms
  RPC roundtrip + JSON marshal  ~1.86  ms

For sr_all (model 68ms; raw 7.3ms):
  single-instance-exec          ~65    ms  (3090 rows × reifier join)
  Rust orchestration            ~1.5   ms

The reifier-metadata join (`?_reifier reifies <<(?s ?p ?t)>> .
?_reifier author ?author . ?_reifier timestamp ?timestamp .`) is
unconditional in `build_instance_sparql`, even when the caller never
reads author/createdAt/updatedAt. It adds 3 triple-pattern matches
per result row — ~3.4x the per-row SPARQL cost vs raw.

What can land alongside #842/#846:

1. Make the reifier-metadata join opt-in via a `with_metadata: bool`
   on ModelQueryInput. Expected impact: ~10x → ~2x on scan-all.
2. Skip the COUNT query unless caller reads `total_count`.
3. Collapse two-phase pagination into Single plan when WHERE is
   already selective enough that no timestamp probe is needed.

None of these are SPARQL-language extensions — they're orchestrator
changes that reduce SPARQL work. Natural follow-up PR to #846.

Pre-fix numbers retained alongside post-fix in the doc for the
review trail.
Full inventory of every store.query fan-out point in model_query
(14 SPARQL sites + 1 Holochain-RPC site) plus the structural
analysis of why each one is in Rust today and what would let it
collapse into a single SPARQL.

Headline finding: the remaining 5-25x model_query gap is a fan-out
problem, not a per-query problem. A non-trivial findAll fires
1 + N + M + K SPARQL queries through the same evaluator, with
tree-shaping in Rust between them. SPARQL itself is fast; the
orchestration around it isn't.

11 numbered PRs (A-K) cover the path from "opt-in reifier metadata"
(50 LOC, biggest immediate win) to "CONSTRUCT-based subgraph
hydration with a tree walker" (500 LOC, removes recursion entirely
so deep includes are constant-roundtrip). Each PR adds or extends
one S16 case so the regression gate sees the ratio collapse cleanly.

Audit also cross-references the 6 original prioritised additions
against current state — #1 (polymorphic @hasone) works at the
executor level; #2 (@BelongsTo) needs benching; #3-#6 still open as
originally described. Adds 11 new items the inventory surfaces:
opt-in metadata join, opt-in total_count, single-plan WHERE
heuristic, reverse-relation UNION fusion, multi-key sort pushdown,
relation where_filter pushdown, getter inlining via BIND(EXISTS),
projection subquery inlining, CONSTRUCT-based hydration, parallel
resolveLanguage batching, streaming Solutions->Value.

Predicted post-state in S16: 5-25x ratios collapse to 1.5-3x with
all fixes; the residual is RPC + JSON-marshal floor.
HexaField added 4 commits June 4, 2026 23:54
Append "Realised wins" section with per-tier S16 ratio tables comparing
`dev` against `refactor/sparql-pushdown-last-write-wins` (coasys/ad4m#846).
Includes the same data the PR description carries, plus the per-category
verdict refresh: most flux call sites that don't need link-level
metadata or unpaginated counts are now convertible at 1.5-3x cost
rather than 5-10x.

Headline opt-in ratios from #846:
- sr_by_id_single_plan medium: 4.2x -> 1.2x (3.45x)
- sr_all_no_metadata_no_count medium: 8.7x -> 3.3x (2.63x)
- embeddings_all_no_metadata medium: 10.0x -> 3.7x (2.71x)
- sr_by_expression_limit1_no_count medium: 4.9x -> 2.8x (1.76x)

Back-compat cases stay within ±10% noise.  The residual 3-5x on
scan-all is what audit items H and I would close; both deferred for
a follow-up PR (H is low-priority, I is the 500-LOC CONSTRUCT rewrite).
The yesterday's "Realised wins" table used a stale dev binary
(~/workspaces/coasys/ad4m/target/release/ad4m-executor from May 22,
test-2 build). When swapped for a fresh dev binary built from HEAD of
`dev` (1f29d0b1, test-8) into the same shared CARGO_TARGET_DIR as
#846's binary, the opt-in wins land slightly smaller in absolute
terms but the structural conclusion is the same:

- sr_by_id_single_plan medium: 4.1× → 1.2× (3.36× improvement)
- sr_all_no_metadata_no_count medium: 9.3× → 3.3× (2.80× improvement)
- embeddings_all_no_metadata medium: 9.0× → 3.7× (2.44× improvement)

Default-behaviour cases stay within ±15% noise as before.

Adds cross-scenario verification:

- S5 (queryLinks scaling, 100/500/1000): 0.87-0.98×, modest wins
- S8 (raw SPARQL Flux graph, 1865 + 58460 links): small tier wins
  0.71-0.96×, medium tier parity ±8%

S5/S8 results confirm K (Solutions → Vec<Value>) doesn't regress the
shared SparqlStore path used by pre-existing query routes and yields
incidental 5-30% wins at small data sizes where the serialise+parse
round trip was a meaningful fraction of total latency.
Audit-driven migration of the convert-viable raw `querySparql` call
sites in `packages/api/src/{channel,conversation,conversation-subgroup,
semantic-relationship}/index.ts` to their `Ad4mModel.findAll{,AndCount}`
equivalents.  Lands on top of coasys/ad4m#846 — the orchestrator
overhaul that makes these conversions viable at 1.5-3x cost (rather
than the 5-10x of pre-#846).

Migrated sites (5 production methods):

1. `Channel.pinnedConversations()`  (Cat A site #7)
   - `findAll(Channel, { where: { isPinned: true }, include: { conversations: { limit: 1, withMetadata: false } }, withMetadata: false, count: false })`
   - Single-plan path engaged: the executor sees `isPinned == true` as
     uniquely-selective via the `@Property` flag form.

2. `Conversation.stats()`  (Cat A site #10 — partial)
   - Replaced the subgroup count SPARQL with
     `ConversationSubgroup.findAllAndCount(..., { limit: 0, count: true, withMetadata: false })` — count-only fast path.
   - Replaced the participant SPARQL with native `LinkQuery` via
     `perspective.get(...)` — no SPARQL roundtrip needed for this
     simple link enumeration.

3. `ConversationSubgroup.stats()`  (Cat A site #15)
   - Both queries (item count + participant list) replaced with
     parallel `LinkQuery`s via `perspective.get(...)`.  Avoids the
     model_query orchestrator entirely because there's no
     multi-property hydration needed.

4. `SemanticRelationship.itemEmbedding()`  (Cat B site #19)
   - Promoted the `itemEmbeddingViaModel()` demonstrator (added during
     the audit) to be the primary implementation.  Uses
     `include: { embeddingTag: { withMetadata: false } }` with the
     polymorphic-on-same-predicate `@HasOne` confirmed working in
     wind tunnel S16 (`include actually fires: yes`).
   - Removes the `itemEmbeddingViaModel` method — superseded.

5. `findEmbeddingSRId(itemId)` in `conversation/util.ts`  (Cat E site #26)
   - Promoted from "needs nested-where-on-relation" to convertible
     because the executor's polymorphic discrimination on the include
     resolves to the Embedding tag only (the Topic variant doesn't
     fire when the conformance doesn't match).  Looks at `embeddingTag`
     on the result instances.

Kept as raw SPARQL with audit-grounded rationale (16 production sites):

- Cat C reifier-metadata reads (4):  Channel.allItems, Channel.unprocessedItems
  data fetch, Conversation.subgroupsData, ConversationSubgroup.itemsData —
  all want per-link timestamp (the reifier on `<chan> has_child <msg>`,
  not on the message entity).  Ad4mModel's `createdAt` synthesis refers
  to the entity hydrate, not to the link.
- Cat D set-difference (2):  Channel.unprocessedItems set-difference
  pair — `FILTER NOT EXISTS` planner cliff still present.
- Cat E inter-class joins (4):  Topic.linkedConversations,
  Topic.linkedSubgroups, Conversation.subgroupsData batch timestamp,
  Conversation.topics UNION — all need multi-hop joins through
  predicates that aren't declared as relations on the source model.
- Cat A site #5  Channel.totalItemCount — multi-type FILTER would split
  into 3 parallel `findAllAndCount` calls and a sum.  Worse than 1 SPARQL.
- Cat A site #6  Channel.recentConversations — already hand-optimised to
  avoid reifier joins (uses native link API for timestamps).
- ConversationSubgroup.topics, .topicsWithRelevance, Topic.* — Cat E.
- SemanticRelationship.allConversationEmbeddings, .allSubgroupEmbeddings,
  .allItemEmbeddings, .allItemEmbeddingsByType — needs reverse
  `@HasMany`/`@BelongsTo` from Conversation/Subgroup/Message back to SR,
  which isn't declared today.  Convertible in a follow-up PR.

Tests
- Updated `channel.test.ts` and `conversation.test.ts` mocks to include
  `perspective.modelQuery` (for the migrated paths) and seed
  Ad4mModel static-method spies where the @coasys/ad4m mock can't
  exercise the real findAll/findAllAndCount pipeline.
- 119 tests total: 115 pass / 4 pre-existing failures unrelated to this
  migration (`Channel.unprocessedItems` × 3 + parseLit JSON-stringify ×
  1, all caused by the test file's `@coasys/ad4m` vi.mock missing
  `fileToDataUri` — present before this change).

Stacked dependency
- This PR's runtime correctness depends on coasys/ad4m#846 (and the
  upstream stack #837 / #842).  The `withMetadata: false`,
  `count: false` and selective-WHERE single-plan paths the migrations
  rely on are only honoured by the post-#846 executor.  Running
  against `dev`'s executor falls back to the back-compat default
  (slower but still functional).
Append a final section to the migration doc that mirrors what was
landed in commit dd9de23 — a per-site table of every querySparql
call site in packages/api/src, classified into migrated / kept-as-
SPARQL with the audit-grounded rationale for each.

5 production methods migrated:
- Channel.pinnedConversations()
- Conversation.stats() (partial: count via findAllAndCount, participants via LinkQuery)
- ConversationSubgroup.stats() (both via LinkQuery)
- SemanticRelationship.itemEmbedding() (promoted from itemEmbeddingViaModel)
- findEmbeddingSRId() in conversation/util.ts

17 kept as raw SPARQL, each with the category (C/D/E or special case)
that makes raw SPARQL the structurally correct choice.

Test coverage notes + stacked-dependency notice for reviewers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant