research: raw SPARQL → Ad4mModel migration inventory + analysis by HexaField · Pull Request #605 · coasys/flux

HexaField · 2026-06-04T09:57:54Z

Summary

Inventory + analysis of every perspective.querySparql<T>() call site in packages/api/src/, categorised by whether and how each could be expressed via Ad4mModel.findAll. Identifies 6 prioritised Ad4mModel features that would unlock the conversions, and lists the call sites that should stay as raw SPARQL with reasons.

Full document at docs/sparql-to-ad4m-model-migration.md.

cc @lucksus @jhweir — stacks logically on top of coasys/ad4m#837, #842, #846 and interacts with #812 (named graphs).

What this PR is

Stage 1 of a multi-stage investigation: pure documentation. No code changes.

The user asked for SPARQL→Ad4mModel conversions with performance comparison, but the scope (28 call sites across 6 files, mixing reifier-metadata reads, set-difference, multi-hop joins, and inter-class traversals) is large enough that the responsible first deliverable is a structured catalog with explicit categorization and capability-gap analysis. Going straight to conversion + benchmarking without that grounding would have us picking conversion candidates blindly.

Findings (one-line summary)

5 sites (Category A) are trivially convertible to findAll({ where }) today — likely a slight perf regression because the model-query builder adds conformance joins the raw SPARQL skips. Trade-off.
10 sites (Category B) need one or two new Ad4mModel features but are otherwise straightforward — likely a perf win over the status quo, because most currently follow a raw-SPARQL + N+1 getExpression() pattern that batched include: would collapse to one round-trip.
4 sites (Category C) read reifier metadata (per-link author/timestamp). Better as raw SPARQL unless Ad4mModel grows a per-link meta: projection.
2 sites (Category D) implement set-difference (unprocessedItems). Keep as raw SPARQL — FILTER NOT EXISTS hits the planner cliff ac57680b9 warned about; would degrade to the same plan via where: { NOT: {...} }.
4 sites (Category E) do inter-class joins or bidirectional traversals not modeled by current decorators. Some convertible with the proposed features; others (e.g. `subgroupsData` ancestor walk) cleaner as raw SPARQL.

Prioritised Ad4mModel additions

Rank	Feature	Sites unlocked
1	`tag` as typed `@HasOne(Embedding	Topic)` with discriminator
2	`@BelongsTo()` / first-class reverse relations	8
3	Multi-class polymorphic `findAll`	4
4	Per-link reifier metadata sidecar (`include: { rel: { meta: [...] } }`)	4
5	Nested `where` on relations	1
6	`UNION` across query shapes	1

Planned follow-up commits on this branch

Stage 2 (next): upgrade `SemanticRelationship.tag` from `@Property(string)` to `@HasOne(() => Embedding | Topic)`, with type discrimination. Adapt internal callers. Standalone mechanical change — uses existing `@HasOne`.
Stage 3: convert Category B sites that benefit from `include: { tag: true }`.
Stage 4: add a one-off Node.js benchmark harness that runs each query both ways against a seeded executor (Channel × {10, 100, 1000} items × {2, 20, 200} conversations), measures wall-clock + round-trip count, tabulates. Validate per-site whether to convert or keep as raw SPARQL.
Stage 5: report findings + final conversion decisions, then convert the sites where the data supports it.

Stages 3–5 are explicitly deferred — the investigation is meant to ground decisions, not pre-decide them.

Why this is a draft PR

The PR is open as a draft so the analysis can be reviewed and the categorisation challenged before I start writing code. If the conclusions on what should stay as raw SPARQL look wrong, or if a feature gap is mis-prioritised, much cheaper to argue here than after I've converted half the call sites.

The expectation is that this PR will grow with Stage 2+ commits, or it gets closed and replaced with separate per-stage PRs once we agree on direction.

@hasone

28 querySparql call sites in packages/api/src/, categorised by convertibility against the current Ad4mModel decorator + findAll API. Categories: - A. Trivially convertible (5 sites) — single-class findAll + where - B. Convertible with deep include + new reverse relations (10 sites) - C. Reifier-metadata reads where raw SPARQL stays correct (4 sites) - D. Set-difference (2 sites) — keep SPARQL, planner cliff documented - E. Inter-class joins not modelled (4 sites) — keep SPARQL or new Ad4mModel features Identifies 6 prioritised Ad4mModel additions, ordered by call sites unlocked: 1. tag as typed @hasone(Embedding | Topic) — unlocks 10 sites 2. @BelongsTo() / first-class reverse relations — unlocks 8 3. Multi-class polymorphic findAll — unlocks 4 4. Per-link reifier metadata sidecar — unlocks 4 5. Nested where on relations — unlocks 1 6. UNION across query shapes — unlocks 1 References AD4M #837, #842, #846 and the #812 named-graphs interaction for the FILTER NOT EXISTS / planner cliff context.

netlify · 2026-06-04T09:58:00Z

✅ Deploy Preview for fluxsocial-dev ready!

Name	Link
🔨 Latest commit	`e748bfc`
🔍 Latest deploy log	https://app.netlify.com/projects/fluxsocial-dev/deploys/6a214c24b9e22b0008f71690
😎 Deploy Preview	https://deploy-preview-605--fluxsocial-dev.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

coderabbitai · 2026-06-04T09:58:01Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d476fe34-dcec-4de0-90a5-d6b57b2d6aa8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch research/sparql-to-ad4m-model-migration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

@hasone

After re-checking @coasys/ad4m's decorator API, found that @hasone, @BelongsToOne, @BelongsToMany, and 'where' on relations are all already exposed. Three of the six recommendations in the inventory document are flux-side-only changes rather than needing AD4M SDK PRs. This commit implements recommendation #1: - Adds @hasone(() => Embedding) embeddingTag and @hasone(() => Topic) topicTag on SemanticRelationship, both routed through the same flux://has_tag predicate. The conformance filter on each target class's @Flag (entry_type = has_embedding / has_topic) discriminates at hydration time. - Keeps the existing @Property tag: string for back-compat; callers that want the raw IRI continue to work. - Adds itemEmbeddingViaModel() as a demonstrator conversion sitting alongside the raw-SPARQL itemEmbedding(). Same return shape; behavioural parity caveat documented (both still need perspective.getExpression() for the actual vector). Also checks in scripts/bench-sparql-vs-ad4m.ts as a documented skeleton — connection + seed are stubs (~200 LOC of additional work to make runnable, tracked as Stage 5 in the migration doc). The migration doc gains an 'Implementation log' section recording what's done, what the AD4M re-check changed about the plan, why no perf numbers in this commit, and the staged remaining work.

…ults The earlier `scripts/bench-sparql-vs-ad4m.py` harness was a one-off. Migrated the empirical comparison to a proper wind tunnel scenario (ad4m-wind-tunnel `feat/s16-sparql-vs-model`) so it's reproducible, version-controlled, and rerunnable as a regression gate on future `model_query` work. Updated the "Empirical bench results" section with the s16 numbers from `dev`. Five cases × two tiers (100 items / 1000 items): small — raw SPARQL 0.21-0.67 ms, modelQuery 3.4-14.4 ms (14-31x) medium — raw SPARQL 0.47-7.57 ms, modelQuery 28-155 ms (18-150x) The three structural findings stand and now have proper data behind them: 1. `@HasOne` polymorphic-on-same-predicate doesn't fire — s16's `includeWorks` check flags it (within-5% noise tolerance vs the no-include baseline). 2. `findAll` overhead scales linearly in N regardless of LIMIT 1. 3. `model_query` is order-of-magnitude slower than raw SPARQL even when doing strictly less work. Bottom-line recommendation unchanged: the right work isn't migrating flux call sites — it's bringing down `model_query` per-instance overhead in coasys/ad4m. S16 is now the regression gate.

The v1 s16 ratios (14-150x) were inflated by malformed SHACL — the scenario emitted relations in a separate `relations: []` array which the Rust executor's `SHACLShape` deserializer silently drops. Once relations are emitted as `properties:` entries with `relation_kind` set (ad4m-wind-tunnel commit 1f11143), `include actually fires: yes` and the ratios collapse to 2.4-10.6x on medium tier (worst case `topics_all` at 25x because the raw baseline is sub-millisecond). Re-ran S16 against an executor patched with MODEL_QUERY_PROFILE=1 instrumentation that emits per-phase wall-clock timings. The patch times each step in execute_model_query_inner: SPARQL build, two-phase pagination subquery exec, properties subquery exec, count exec, hydration, language transforms, getters, recursive includes. For sr_by_expression_with_include (model 2.99ms; raw 0.55ms): twophase-pagination-exec 0.164 ms twophase-properties-exec 0.101 ms count-exec 0.118 ms include sub-query 0.741 ms Rust orchestration ~0.01 ms --- model_query work --- ~1.13 ms RPC roundtrip + JSON marshal ~1.86 ms For sr_all (model 68ms; raw 7.3ms): single-instance-exec ~65 ms (3090 rows × reifier join) Rust orchestration ~1.5 ms The reifier-metadata join (`?_reifier reifies <<(?s ?p ?t)>> . ?_reifier author ?author . ?_reifier timestamp ?timestamp .`) is unconditional in `build_instance_sparql`, even when the caller never reads author/createdAt/updatedAt. It adds 3 triple-pattern matches per result row — ~3.4x the per-row SPARQL cost vs raw. What can land alongside #842/#846: 1. Make the reifier-metadata join opt-in via a `with_metadata: bool` on ModelQueryInput. Expected impact: ~10x → ~2x on scan-all. 2. Skip the COUNT query unless caller reads `total_count`. 3. Collapse two-phase pagination into Single plan when WHERE is already selective enough that no timestamp probe is needed. None of these are SPARQL-language extensions — they're orchestrator changes that reduce SPARQL work. Natural follow-up PR to #846. Pre-fix numbers retained alongside post-fix in the doc for the review trail.

@hasone

Full inventory of every store.query fan-out point in model_query (14 SPARQL sites + 1 Holochain-RPC site) plus the structural analysis of why each one is in Rust today and what would let it collapse into a single SPARQL. Headline finding: the remaining 5-25x model_query gap is a fan-out problem, not a per-query problem. A non-trivial findAll fires 1 + N + M + K SPARQL queries through the same evaluator, with tree-shaping in Rust between them. SPARQL itself is fast; the orchestration around it isn't. 11 numbered PRs (A-K) cover the path from "opt-in reifier metadata" (50 LOC, biggest immediate win) to "CONSTRUCT-based subgraph hydration with a tree walker" (500 LOC, removes recursion entirely so deep includes are constant-roundtrip). Each PR adds or extends one S16 case so the regression gate sees the ratio collapse cleanly. Audit also cross-references the 6 original prioritised additions against current state — #1 (polymorphic @hasone) works at the executor level; #2 (@BelongsTo) needs benching; #3-#6 still open as originally described. Adds 11 new items the inventory surfaces: opt-in metadata join, opt-in total_count, single-plan WHERE heuristic, reverse-relation UNION fusion, multi-key sort pushdown, relation where_filter pushdown, getter inlining via BIND(EXISTS), projection subquery inlining, CONSTRUCT-based hydration, parallel resolveLanguage batching, streaming Solutions->Value. Predicted post-state in S16: 5-25x ratios collapse to 1.5-3x with all fixes; the residual is RPC + JSON-marshal floor.

Append "Realised wins" section with per-tier S16 ratio tables comparing `dev` against `refactor/sparql-pushdown-last-write-wins` (coasys/ad4m#846). Includes the same data the PR description carries, plus the per-category verdict refresh: most flux call sites that don't need link-level metadata or unpaginated counts are now convertible at 1.5-3x cost rather than 5-10x. Headline opt-in ratios from #846: - sr_by_id_single_plan medium: 4.2x -> 1.2x (3.45x) - sr_all_no_metadata_no_count medium: 8.7x -> 3.3x (2.63x) - embeddings_all_no_metadata medium: 10.0x -> 3.7x (2.71x) - sr_by_expression_limit1_no_count medium: 4.9x -> 2.8x (1.76x) Back-compat cases stay within ±10% noise. The residual 3-5x on scan-all is what audit items H and I would close; both deferred for a follow-up PR (H is low-priority, I is the 500-LOC CONSTRUCT rewrite).

The yesterday's "Realised wins" table used a stale dev binary (~/workspaces/coasys/ad4m/target/release/ad4m-executor from May 22, test-2 build). When swapped for a fresh dev binary built from HEAD of `dev` (1f29d0b1, test-8) into the same shared CARGO_TARGET_DIR as #846's binary, the opt-in wins land slightly smaller in absolute terms but the structural conclusion is the same: - sr_by_id_single_plan medium: 4.1× → 1.2× (3.36× improvement) - sr_all_no_metadata_no_count medium: 9.3× → 3.3× (2.80× improvement) - embeddings_all_no_metadata medium: 9.0× → 3.7× (2.44× improvement) Default-behaviour cases stay within ±15% noise as before. Adds cross-scenario verification: - S5 (queryLinks scaling, 100/500/1000): 0.87-0.98×, modest wins - S8 (raw SPARQL Flux graph, 1865 + 58460 links): small tier wins 0.71-0.96×, medium tier parity ±8% S5/S8 results confirm K (Solutions → Vec<Value>) doesn't regress the shared SparqlStore path used by pre-existing query routes and yields incidental 5-30% wins at small data sizes where the serialise+parse round trip was a meaningful fraction of total latency.

Audit-driven migration of the convert-viable raw `querySparql` call sites in `packages/api/src/{channel,conversation,conversation-subgroup, semantic-relationship}/index.ts` to their `Ad4mModel.findAll{,AndCount}` equivalents. Lands on top of coasys/ad4m#846 — the orchestrator overhaul that makes these conversions viable at 1.5-3x cost (rather than the 5-10x of pre-#846). Migrated sites (5 production methods): 1. `Channel.pinnedConversations()` (Cat A site #7) - `findAll(Channel, { where: { isPinned: true }, include: { conversations: { limit: 1, withMetadata: false } }, withMetadata: false, count: false })` - Single-plan path engaged: the executor sees `isPinned == true` as uniquely-selective via the `@Property` flag form. 2. `Conversation.stats()` (Cat A site #10 — partial) - Replaced the subgroup count SPARQL with `ConversationSubgroup.findAllAndCount(..., { limit: 0, count: true, withMetadata: false })` — count-only fast path. - Replaced the participant SPARQL with native `LinkQuery` via `perspective.get(...)` — no SPARQL roundtrip needed for this simple link enumeration. 3. `ConversationSubgroup.stats()` (Cat A site #15) - Both queries (item count + participant list) replaced with parallel `LinkQuery`s via `perspective.get(...)`. Avoids the model_query orchestrator entirely because there's no multi-property hydration needed. 4. `SemanticRelationship.itemEmbedding()` (Cat B site #19) - Promoted the `itemEmbeddingViaModel()` demonstrator (added during the audit) to be the primary implementation. Uses `include: { embeddingTag: { withMetadata: false } }` with the polymorphic-on-same-predicate `@HasOne` confirmed working in wind tunnel S16 (`include actually fires: yes`). - Removes the `itemEmbeddingViaModel` method — superseded. 5. `findEmbeddingSRId(itemId)` in `conversation/util.ts` (Cat E site #26) - Promoted from "needs nested-where-on-relation" to convertible because the executor's polymorphic discrimination on the include resolves to the Embedding tag only (the Topic variant doesn't fire when the conformance doesn't match). Looks at `embeddingTag` on the result instances. Kept as raw SPARQL with audit-grounded rationale (16 production sites): - Cat C reifier-metadata reads (4): Channel.allItems, Channel.unprocessedItems data fetch, Conversation.subgroupsData, ConversationSubgroup.itemsData — all want per-link timestamp (the reifier on `<chan> has_child <msg>`, not on the message entity). Ad4mModel's `createdAt` synthesis refers to the entity hydrate, not to the link. - Cat D set-difference (2): Channel.unprocessedItems set-difference pair — `FILTER NOT EXISTS` planner cliff still present. - Cat E inter-class joins (4): Topic.linkedConversations, Topic.linkedSubgroups, Conversation.subgroupsData batch timestamp, Conversation.topics UNION — all need multi-hop joins through predicates that aren't declared as relations on the source model. - Cat A site #5 Channel.totalItemCount — multi-type FILTER would split into 3 parallel `findAllAndCount` calls and a sum. Worse than 1 SPARQL. - Cat A site #6 Channel.recentConversations — already hand-optimised to avoid reifier joins (uses native link API for timestamps). - ConversationSubgroup.topics, .topicsWithRelevance, Topic.* — Cat E. - SemanticRelationship.allConversationEmbeddings, .allSubgroupEmbeddings, .allItemEmbeddings, .allItemEmbeddingsByType — needs reverse `@HasMany`/`@BelongsTo` from Conversation/Subgroup/Message back to SR, which isn't declared today. Convertible in a follow-up PR. Tests - Updated `channel.test.ts` and `conversation.test.ts` mocks to include `perspective.modelQuery` (for the migrated paths) and seed Ad4mModel static-method spies where the @coasys/ad4m mock can't exercise the real findAll/findAllAndCount pipeline. - 119 tests total: 115 pass / 4 pre-existing failures unrelated to this migration (`Channel.unprocessedItems` × 3 + parseLit JSON-stringify × 1, all caused by the test file's `@coasys/ad4m` vi.mock missing `fileToDataUri` — present before this change). Stacked dependency - This PR's runtime correctness depends on coasys/ad4m#846 (and the upstream stack #837 / #842). The `withMetadata: false`, `count: false` and selective-WHERE single-plan paths the migrations rely on are only honoured by the post-#846 executor. Running against `dev`'s executor falls back to the back-compat default (slower but still functional).

Append a final section to the migration doc that mirrors what was landed in commit dd9de23 — a per-site table of every querySparql call site in packages/api/src, classified into migrated / kept-as- SPARQL with the audit-grounded rationale for each. 5 production methods migrated: - Channel.pinnedConversations() - Conversation.stats() (partial: count via findAllAndCount, participants via LinkQuery) - ConversationSubgroup.stats() (both via LinkQuery) - SemanticRelationship.itemEmbedding() (promoted from itemEmbeddingViaModel) - findEmbeddingSRId() in conversation/util.ts 17 kept as raw SPARQL, each with the category (C/D/E or special case) that makes raw SPARQL the structurally correct choice. Test coverage notes + stacked-dependency notice for reviewers.

HexaField changed the base branch from dev to refactor/literal-channel-v-separation June 4, 2026 10:16

HexaField added 2 commits June 4, 2026 20:23

HexaField mentioned this pull request Jun 4, 2026

feat(s16): SPARQL vs Ad4mModel side-by-side scenario coasys/ad4m-wind-tunnel#6

Open

5 tasks

HexaField added 2 commits June 4, 2026 22:27

HexaField mentioned this pull request Jun 4, 2026

perf(executor): model_query orchestrator overhaul — eight audit-driven optimisations coasys/ad4m#846

Draft

6 tasks

HexaField added 4 commits June 4, 2026 23:54

HexaField mentioned this pull request Jun 8, 2026

perf(executor): fused projection counts + CONSTRUCT subgraph hydration for findAll coasys/ad4m#853

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

research: raw SPARQL → Ad4mModel migration inventory + analysis#605

research: raw SPARQL → Ad4mModel migration inventory + analysis#605
HexaField wants to merge 9 commits into
refactor/literal-channel-v-separationfrom
research/sparql-to-ad4m-model-migration

HexaField commented Jun 4, 2026

Uh oh!

netlify Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

HexaField commented Jun 4, 2026

Summary

What this PR is

Findings (one-line summary)

Prioritised Ad4mModel additions

Planned follow-up commits on this branch

Why this is a draft PR

Uh oh!

netlify Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for fluxsocial-dev ready!

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

netlify Bot commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading