research: raw SPARQL → Ad4mModel migration inventory + analysis#605
Draft
HexaField wants to merge 9 commits into
Draft
research: raw SPARQL → Ad4mModel migration inventory + analysis#605HexaField wants to merge 9 commits into
HexaField wants to merge 9 commits into
Conversation
28 querySparql call sites in packages/api/src/, categorised by
convertibility against the current Ad4mModel decorator + findAll API.
Categories:
- A. Trivially convertible (5 sites) — single-class findAll + where
- B. Convertible with deep include + new reverse relations (10 sites)
- C. Reifier-metadata reads where raw SPARQL stays correct (4 sites)
- D. Set-difference (2 sites) — keep SPARQL, planner cliff documented
- E. Inter-class joins not modelled (4 sites) — keep SPARQL or new
Ad4mModel features
Identifies 6 prioritised Ad4mModel additions, ordered by call sites
unlocked:
1. tag as typed @hasone(Embedding | Topic) — unlocks 10 sites
2. @BelongsTo() / first-class reverse relations — unlocks 8
3. Multi-class polymorphic findAll — unlocks 4
4. Per-link reifier metadata sidecar — unlocks 4
5. Nested where on relations — unlocks 1
6. UNION across query shapes — unlocks 1
References AD4M #837, #842, #846 and the #812 named-graphs interaction
for the FILTER NOT EXISTS / planner cliff context.
✅ Deploy Preview for fluxsocial-dev ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
After re-checking @coasys/ad4m's decorator API, found that @hasone, @BelongsToOne, @BelongsToMany, and 'where' on relations are all already exposed. Three of the six recommendations in the inventory document are flux-side-only changes rather than needing AD4M SDK PRs. This commit implements recommendation #1: - Adds @hasone(() => Embedding) embeddingTag and @hasone(() => Topic) topicTag on SemanticRelationship, both routed through the same flux://has_tag predicate. The conformance filter on each target class's @Flag (entry_type = has_embedding / has_topic) discriminates at hydration time. - Keeps the existing @Property tag: string for back-compat; callers that want the raw IRI continue to work. - Adds itemEmbeddingViaModel() as a demonstrator conversion sitting alongside the raw-SPARQL itemEmbedding(). Same return shape; behavioural parity caveat documented (both still need perspective.getExpression() for the actual vector). Also checks in scripts/bench-sparql-vs-ad4m.ts as a documented skeleton — connection + seed are stubs (~200 LOC of additional work to make runnable, tracked as Stage 5 in the migration doc). The migration doc gains an 'Implementation log' section recording what's done, what the AD4M re-check changed about the plan, why no perf numbers in this commit, and the staged remaining work.
…ults
The earlier `scripts/bench-sparql-vs-ad4m.py` harness was a one-off.
Migrated the empirical comparison to a proper wind tunnel scenario
(ad4m-wind-tunnel `feat/s16-sparql-vs-model`) so it's reproducible,
version-controlled, and rerunnable as a regression gate on future
`model_query` work.
Updated the "Empirical bench results" section with the s16 numbers
from `dev`. Five cases × two tiers (100 items / 1000 items):
small — raw SPARQL 0.21-0.67 ms, modelQuery 3.4-14.4 ms (14-31x)
medium — raw SPARQL 0.47-7.57 ms, modelQuery 28-155 ms (18-150x)
The three structural findings stand and now have proper data behind
them:
1. `@HasOne` polymorphic-on-same-predicate doesn't fire — s16's
`includeWorks` check flags it (within-5% noise tolerance vs the
no-include baseline).
2. `findAll` overhead scales linearly in N regardless of LIMIT 1.
3. `model_query` is order-of-magnitude slower than raw SPARQL even
when doing strictly less work.
Bottom-line recommendation unchanged: the right work isn't migrating
flux call sites — it's bringing down `model_query` per-instance
overhead in coasys/ad4m. S16 is now the regression gate.
5 tasks
The v1 s16 ratios (14-150x) were inflated by malformed SHACL — the scenario emitted relations in a separate `relations: []` array which the Rust executor's `SHACLShape` deserializer silently drops. Once relations are emitted as `properties:` entries with `relation_kind` set (ad4m-wind-tunnel commit 1f11143), `include actually fires: yes` and the ratios collapse to 2.4-10.6x on medium tier (worst case `topics_all` at 25x because the raw baseline is sub-millisecond). Re-ran S16 against an executor patched with MODEL_QUERY_PROFILE=1 instrumentation that emits per-phase wall-clock timings. The patch times each step in execute_model_query_inner: SPARQL build, two-phase pagination subquery exec, properties subquery exec, count exec, hydration, language transforms, getters, recursive includes. For sr_by_expression_with_include (model 2.99ms; raw 0.55ms): twophase-pagination-exec 0.164 ms twophase-properties-exec 0.101 ms count-exec 0.118 ms include sub-query 0.741 ms Rust orchestration ~0.01 ms --- model_query work --- ~1.13 ms RPC roundtrip + JSON marshal ~1.86 ms For sr_all (model 68ms; raw 7.3ms): single-instance-exec ~65 ms (3090 rows × reifier join) Rust orchestration ~1.5 ms The reifier-metadata join (`?_reifier reifies <<(?s ?p ?t)>> . ?_reifier author ?author . ?_reifier timestamp ?timestamp .`) is unconditional in `build_instance_sparql`, even when the caller never reads author/createdAt/updatedAt. It adds 3 triple-pattern matches per result row — ~3.4x the per-row SPARQL cost vs raw. What can land alongside #842/#846: 1. Make the reifier-metadata join opt-in via a `with_metadata: bool` on ModelQueryInput. Expected impact: ~10x → ~2x on scan-all. 2. Skip the COUNT query unless caller reads `total_count`. 3. Collapse two-phase pagination into Single plan when WHERE is already selective enough that no timestamp probe is needed. None of these are SPARQL-language extensions — they're orchestrator changes that reduce SPARQL work. Natural follow-up PR to #846. Pre-fix numbers retained alongside post-fix in the doc for the review trail.
Full inventory of every store.query fan-out point in model_query (14 SPARQL sites + 1 Holochain-RPC site) plus the structural analysis of why each one is in Rust today and what would let it collapse into a single SPARQL. Headline finding: the remaining 5-25x model_query gap is a fan-out problem, not a per-query problem. A non-trivial findAll fires 1 + N + M + K SPARQL queries through the same evaluator, with tree-shaping in Rust between them. SPARQL itself is fast; the orchestration around it isn't. 11 numbered PRs (A-K) cover the path from "opt-in reifier metadata" (50 LOC, biggest immediate win) to "CONSTRUCT-based subgraph hydration with a tree walker" (500 LOC, removes recursion entirely so deep includes are constant-roundtrip). Each PR adds or extends one S16 case so the regression gate sees the ratio collapse cleanly. Audit also cross-references the 6 original prioritised additions against current state — #1 (polymorphic @hasone) works at the executor level; #2 (@BelongsTo) needs benching; #3-#6 still open as originally described. Adds 11 new items the inventory surfaces: opt-in metadata join, opt-in total_count, single-plan WHERE heuristic, reverse-relation UNION fusion, multi-key sort pushdown, relation where_filter pushdown, getter inlining via BIND(EXISTS), projection subquery inlining, CONSTRUCT-based hydration, parallel resolveLanguage batching, streaming Solutions->Value. Predicted post-state in S16: 5-25x ratios collapse to 1.5-3x with all fixes; the residual is RPC + JSON-marshal floor.
Append "Realised wins" section with per-tier S16 ratio tables comparing `dev` against `refactor/sparql-pushdown-last-write-wins` (coasys/ad4m#846). Includes the same data the PR description carries, plus the per-category verdict refresh: most flux call sites that don't need link-level metadata or unpaginated counts are now convertible at 1.5-3x cost rather than 5-10x. Headline opt-in ratios from #846: - sr_by_id_single_plan medium: 4.2x -> 1.2x (3.45x) - sr_all_no_metadata_no_count medium: 8.7x -> 3.3x (2.63x) - embeddings_all_no_metadata medium: 10.0x -> 3.7x (2.71x) - sr_by_expression_limit1_no_count medium: 4.9x -> 2.8x (1.76x) Back-compat cases stay within ±10% noise. The residual 3-5x on scan-all is what audit items H and I would close; both deferred for a follow-up PR (H is low-priority, I is the 500-LOC CONSTRUCT rewrite).
The yesterday's "Realised wins" table used a stale dev binary (~/workspaces/coasys/ad4m/target/release/ad4m-executor from May 22, test-2 build). When swapped for a fresh dev binary built from HEAD of `dev` (1f29d0b1, test-8) into the same shared CARGO_TARGET_DIR as #846's binary, the opt-in wins land slightly smaller in absolute terms but the structural conclusion is the same: - sr_by_id_single_plan medium: 4.1× → 1.2× (3.36× improvement) - sr_all_no_metadata_no_count medium: 9.3× → 3.3× (2.80× improvement) - embeddings_all_no_metadata medium: 9.0× → 3.7× (2.44× improvement) Default-behaviour cases stay within ±15% noise as before. Adds cross-scenario verification: - S5 (queryLinks scaling, 100/500/1000): 0.87-0.98×, modest wins - S8 (raw SPARQL Flux graph, 1865 + 58460 links): small tier wins 0.71-0.96×, medium tier parity ±8% S5/S8 results confirm K (Solutions → Vec<Value>) doesn't regress the shared SparqlStore path used by pre-existing query routes and yields incidental 5-30% wins at small data sizes where the serialise+parse round trip was a meaningful fraction of total latency.
Audit-driven migration of the convert-viable raw `querySparql` call
sites in `packages/api/src/{channel,conversation,conversation-subgroup,
semantic-relationship}/index.ts` to their `Ad4mModel.findAll{,AndCount}`
equivalents. Lands on top of coasys/ad4m#846 — the orchestrator
overhaul that makes these conversions viable at 1.5-3x cost (rather
than the 5-10x of pre-#846).
Migrated sites (5 production methods):
1. `Channel.pinnedConversations()` (Cat A site #7)
- `findAll(Channel, { where: { isPinned: true }, include: { conversations: { limit: 1, withMetadata: false } }, withMetadata: false, count: false })`
- Single-plan path engaged: the executor sees `isPinned == true` as
uniquely-selective via the `@Property` flag form.
2. `Conversation.stats()` (Cat A site #10 — partial)
- Replaced the subgroup count SPARQL with
`ConversationSubgroup.findAllAndCount(..., { limit: 0, count: true, withMetadata: false })` — count-only fast path.
- Replaced the participant SPARQL with native `LinkQuery` via
`perspective.get(...)` — no SPARQL roundtrip needed for this
simple link enumeration.
3. `ConversationSubgroup.stats()` (Cat A site #15)
- Both queries (item count + participant list) replaced with
parallel `LinkQuery`s via `perspective.get(...)`. Avoids the
model_query orchestrator entirely because there's no
multi-property hydration needed.
4. `SemanticRelationship.itemEmbedding()` (Cat B site #19)
- Promoted the `itemEmbeddingViaModel()` demonstrator (added during
the audit) to be the primary implementation. Uses
`include: { embeddingTag: { withMetadata: false } }` with the
polymorphic-on-same-predicate `@HasOne` confirmed working in
wind tunnel S16 (`include actually fires: yes`).
- Removes the `itemEmbeddingViaModel` method — superseded.
5. `findEmbeddingSRId(itemId)` in `conversation/util.ts` (Cat E site #26)
- Promoted from "needs nested-where-on-relation" to convertible
because the executor's polymorphic discrimination on the include
resolves to the Embedding tag only (the Topic variant doesn't
fire when the conformance doesn't match). Looks at `embeddingTag`
on the result instances.
Kept as raw SPARQL with audit-grounded rationale (16 production sites):
- Cat C reifier-metadata reads (4): Channel.allItems, Channel.unprocessedItems
data fetch, Conversation.subgroupsData, ConversationSubgroup.itemsData —
all want per-link timestamp (the reifier on `<chan> has_child <msg>`,
not on the message entity). Ad4mModel's `createdAt` synthesis refers
to the entity hydrate, not to the link.
- Cat D set-difference (2): Channel.unprocessedItems set-difference
pair — `FILTER NOT EXISTS` planner cliff still present.
- Cat E inter-class joins (4): Topic.linkedConversations,
Topic.linkedSubgroups, Conversation.subgroupsData batch timestamp,
Conversation.topics UNION — all need multi-hop joins through
predicates that aren't declared as relations on the source model.
- Cat A site #5 Channel.totalItemCount — multi-type FILTER would split
into 3 parallel `findAllAndCount` calls and a sum. Worse than 1 SPARQL.
- Cat A site #6 Channel.recentConversations — already hand-optimised to
avoid reifier joins (uses native link API for timestamps).
- ConversationSubgroup.topics, .topicsWithRelevance, Topic.* — Cat E.
- SemanticRelationship.allConversationEmbeddings, .allSubgroupEmbeddings,
.allItemEmbeddings, .allItemEmbeddingsByType — needs reverse
`@HasMany`/`@BelongsTo` from Conversation/Subgroup/Message back to SR,
which isn't declared today. Convertible in a follow-up PR.
Tests
- Updated `channel.test.ts` and `conversation.test.ts` mocks to include
`perspective.modelQuery` (for the migrated paths) and seed
Ad4mModel static-method spies where the @coasys/ad4m mock can't
exercise the real findAll/findAllAndCount pipeline.
- 119 tests total: 115 pass / 4 pre-existing failures unrelated to this
migration (`Channel.unprocessedItems` × 3 + parseLit JSON-stringify ×
1, all caused by the test file's `@coasys/ad4m` vi.mock missing
`fileToDataUri` — present before this change).
Stacked dependency
- This PR's runtime correctness depends on coasys/ad4m#846 (and the
upstream stack #837 / #842). The `withMetadata: false`,
`count: false` and selective-WHERE single-plan paths the migrations
rely on are only honoured by the post-#846 executor. Running
against `dev`'s executor falls back to the back-compat default
(slower but still functional).
Append a final section to the migration doc that mirrors what was landed in commit dd9de23 — a per-site table of every querySparql call site in packages/api/src, classified into migrated / kept-as- SPARQL with the audit-grounded rationale for each. 5 production methods migrated: - Channel.pinnedConversations() - Conversation.stats() (partial: count via findAllAndCount, participants via LinkQuery) - ConversationSubgroup.stats() (both via LinkQuery) - SemanticRelationship.itemEmbedding() (promoted from itemEmbeddingViaModel) - findEmbeddingSRId() in conversation/util.ts 17 kept as raw SPARQL, each with the category (C/D/E or special case) that makes raw SPARQL the structurally correct choice. Test coverage notes + stacked-dependency notice for reviewers.
Draft
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Inventory + analysis of every
perspective.querySparql<T>()call site inpackages/api/src/, categorised by whether and how each could be expressed viaAd4mModel.findAll. Identifies 6 prioritisedAd4mModelfeatures that would unlock the conversions, and lists the call sites that should stay as raw SPARQL with reasons.Full document at
docs/sparql-to-ad4m-model-migration.md.cc @lucksus @jhweir — stacks logically on top of coasys/ad4m#837, #842, #846 and interacts with #812 (named graphs).
What this PR is
Stage 1 of a multi-stage investigation: pure documentation. No code changes.
The user asked for SPARQL→Ad4mModel conversions with performance comparison, but the scope (28 call sites across 6 files, mixing reifier-metadata reads, set-difference, multi-hop joins, and inter-class traversals) is large enough that the responsible first deliverable is a structured catalog with explicit categorization and capability-gap analysis. Going straight to conversion + benchmarking without that grounding would have us picking conversion candidates blindly.
Findings (one-line summary)
findAll({ where })today — likely a slight perf regression because the model-query builder adds conformance joins the raw SPARQL skips. Trade-off.getExpression()pattern that batchedinclude:would collapse to one round-trip.meta:projection.unprocessedItems). Keep as raw SPARQL —FILTER NOT EXISTShits the planner cliffac57680b9warned about; would degrade to the same plan viawhere: { NOT: {...} }.Prioritised Ad4mModel additions
Planned follow-up commits on this branch
Stages 3–5 are explicitly deferred — the investigation is meant to ground decisions, not pre-decide them.
Why this is a draft PR
The PR is open as a draft so the analysis can be reviewed and the categorisation challenged before I start writing code. If the conclusions on what should stay as raw SPARQL look wrong, or if a feature gap is mis-prioritised, much cheaper to argue here than after I've converted half the call sites.
The expectation is that this PR will grow with Stage 2+ commits, or it gets closed and replaced with separate per-stage PRs once we agree on direction.