Spike: evaluate RDF/SPARQL as a unifying query layer over the dependency, domain-semantics, and value-binding graphs #60

@jwulf

⚠️ Read this canonical brief first.
It supersedes the original framing below and the discussion comments. Use it as the input to any agent executing this spike. The remaining content on this issue is preserved as decision history.


Original issue framing (historical — see canonical brief above)

Spike: evaluate RDF/SPARQL as a unifying query layer

Motivation

The repo currently maintains three graph-shaped data structures that are traversed independently in TypeScript with no shared query layer:

  1. Operation dependency graph (semantic-graph-extractor/operation-dependency-graph.json) — ~144 operations, ~6,291 edges across 18 semantic types.
  2. Domain-semantics graph (path-analyser/domain-semantics.json) — hand-curated identifiers, capabilities, runtime states, artifact kinds.
  3. Value-binding graph (embedded in domain-semantics.json under operationRequirements[].valueBindings) — string field paths like response.deployments[].processDefinition.processDefinitionId mapped to semantic-type identity fields.
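
As a rough illustration of what checking one of these value-binding paths structurally could involve, the sketch below splits a dotted field path into segments and walks it over a simplified schema tree. The `PathSegment` and `SchemaNode` shapes and both function names are hypothetical, invented for this example; the real bundled spec and the tools' internal representations may look quite different.

```typescript
// Hypothetical segment shape: "deployments[]" becomes { name: "deployments", isArray: true }.
interface PathSegment {
  name: string;
  isArray: boolean;
}

// Hypothetical, heavily simplified stand-in for a bundled spec schema.
interface SchemaNode {
  kind: "object" | "array" | "leaf";
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
}

// Split a dotted path into segments, recording the "[]" array marker.
function parseFieldPath(path: string): PathSegment[] {
  return path.split(".").map((part) => {
    const isArray = part.endsWith("[]");
    return { name: isArray ? part.slice(0, -2) : part, isArray };
  });
}

// Return true only if every segment exists in the schema tree.
function resolves(root: SchemaNode, segments: readonly PathSegment[]): boolean {
  let node: SchemaNode | undefined = root;
  for (const seg of segments) {
    node = node?.properties?.[seg.name];
    if (node === undefined) return false;
    if (seg.isArray) {
      // An "[]" segment must point at an array with a known item schema.
      if (node.kind !== "array" || node.items === undefined) return false;
      node = node.items;
    }
  }
  return true;
}
```

With a check like this, a typo'd path is reported as unresolved at analysis time instead of being discovered at runtime.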

Each tool (path-analyser, request-validation, optional-responses) reloads and re-traverses these structures with bespoke procedural code. Cross-tool questions ("which operations have ≥1 positive scenario AND coverage in every applicable negative kind?") are currently impossible without ad-hoc joining of disparate JSON outputs.

Hypothesis

A small in-process triple store (Oxigraph WASM, or n3 + quadstore) loaded with these three graphs as RDF could:

  • Collapse the three graphs into one queryable surface.
  • Replace some procedural BFS heuristics with declarative SPARQL property paths.
  • Make value-binding paths structurally validatable (typo'd paths return zero results instead of silently no-oping at runtime).
  • Enable cross-tool coverage correlation queries.
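
To make the first two bullets concrete, here is a minimal sketch of how dependency edges might be flattened into N-Triples and queried with a SPARQL 1.1 property path. Everything named here is an assumption for illustration: the `DependencyEdge` shape, the `https://example.org/api-graph/` namespace, and the `produces` / `requires` predicates are not taken from the repo, and the query is only built as a string, not executed against any store.

```typescript
// Hypothetical edge shape; the real operation-dependency-graph.json schema may differ.
interface DependencyEdge {
  from: string; // operation that produces the semantic type
  to: string; // operation that requires it
  semanticType: string;
}

// Assumed namespace, chosen for this example only.
const BASE = "https://example.org/api-graph/";

// Mint a stable IRI from a kind and a local name so repeated runs emit identical output.
function iri(kind: string, name: string): string {
  return `<${BASE}${kind}/${encodeURIComponent(name)}>`;
}

// Serialize edges as sorted, de-duplicated N-Triples lines.
function toNTriples(edges: readonly DependencyEdge[]): string[] {
  const lines = new Set<string>();
  for (const e of edges) {
    lines.add(`${iri("op", e.from)} ${iri("rel", "produces")} ${iri("type", e.semanticType)} .`);
    lines.add(`${iri("op", e.to)} ${iri("rel", "requires")} ${iri("type", e.semanticType)} .`);
  }
  return [...lines].sort();
}

// Build a query for every operation transitively downstream of one operation.
// "produces/^requires" hops from a producer, via a semantic type, to a consumer.
function reachableFromQuery(operationId: string): string {
  return [
    `PREFIX rel: <${BASE}rel/>`,
    "SELECT DISTINCT ?downstream WHERE {",
    `  ${iri("op", operationId)} (rel:produces/^rel:requires)+ ?downstream .`,
    "}",
    "ORDER BY ?downstream",
  ].join("\n");
}
```

The `+` path operator stands in for what would otherwise be a hand-written BFS loop; whether that reads as clearer in practice is exactly what the spike is meant to measure.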

Likely non-wins (to be confirmed)

  • Scenario generation is combinatorial enumeration, not query — SPARQL is the wrong tool. Mini-Datalog / ASP would fit better, but the cost-benefit is much weaker.
  • Performance is not currently a pain point; round-tripping through SPARQL adds latency at this scale.
  • Determinism — the current pipeline is byte-reproducible via TEST_SEED. Triple stores have non-deterministic iteration; would require ORDER BY discipline and stable URI minting. Real cost.
  • Tooling friction — strict TS, Biome, GritQL-banned as T casts, sync code style. JS RDF libraries have typing gaps and async query APIs.
  • Authoring burden: domain-semantics.json is hand-maintained; serialization format (JSON vs Turtle) doesn't reduce that burden.
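
The determinism concern can be partly addressed on the TypeScript side of the boundary. The sketch below assumes query results arrive as plain variable-to-value records (the `Row` type is hypothetical and not any library's binding type) and sorts them by a canonical key, so output order does not depend on store iteration order even if a query omits ORDER BY.

```typescript
// Hypothetical result row: SPARQL variable name mapped to its value as a string.
type Row = Record<string, string>;

// Build a key that is independent of property insertion order.
function canonicalKey(row: Row): string {
  return Object.keys(row)
    .sort()
    .map((k) => `${k}=${row[k]}`)
    .join("\u0000");
}

// Return a new array sorted by canonical key; the input is left untouched.
function canonicalize(rows: readonly Row[]): Row[] {
  return [...rows].sort((a, b) => {
    const ka = canonicalKey(a);
    const kb = canonicalKey(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}
```

This is a safety net, not a substitute for ORDER BY in the queries themselves, and it adds a sorting pass to every result set.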

Spike scope (timeboxed)

Pick one concrete query that is currently awkward in TS and implement it twice:

  1. Baseline (TS): e.g. "find all operations whose required semantic types have no producer in the dependency graph" (correctness check) or "for each operation, list which value-binding field paths no longer resolve against the current bundled spec" (drift detector).
  2. SPARQL: load the same inputs into an embedded triple store, express the query in SPARQL, compare:
    • Lines of code / clarity
    • Cold-start + query latency
    • Reproducibility under fixed seed
    • Type safety at the JS/RDF boundary
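
For the first candidate query, the two sides of the comparison might look roughly like this. The `OperationNode` shape is a hypothetical simplification of the dependency graph, and the SPARQL text assumes `rel:requires` / `rel:produces` predicates under an invented `https://example.org/api-graph/` namespace; neither comes from the repo, and the query is shown as a string only.

```typescript
// Hypothetical per-operation view of the dependency graph.
interface OperationNode {
  operationId: string;
  requires: string[]; // semantic types this operation needs
  produces: string[]; // semantic types this operation emits
}

// Baseline (TS): map each operation to the required types that nothing produces.
function findUnproducedRequirements(ops: readonly OperationNode[]): Map<string, string[]> {
  const produced = new Set<string>();
  for (const op of ops) {
    for (const t of op.produces) produced.add(t);
  }
  // Sort by operationId so Map insertion order is stable across runs.
  const sorted = [...ops].sort((a, b) =>
    a.operationId < b.operationId ? -1 : a.operationId > b.operationId ? 1 : 0,
  );
  const result = new Map<string, string[]>();
  for (const op of sorted) {
    const missing = op.requires.filter((t) => !produced.has(t)).sort();
    if (missing.length > 0) result.set(op.operationId, missing);
  }
  return result;
}

// SPARQL: the same question, assuming the predicates named in the lead-in.
const UNPRODUCED_SPARQL = `
PREFIX rel: <https://example.org/api-graph/rel/>
SELECT ?op ?type WHERE {
  ?op rel:requires ?type .
  FILTER NOT EXISTS { ?producer rel:produces ?type }
}
ORDER BY ?op ?type`;
```

The SPARQL form is shorter, but the TS baseline needs no loading step, no IRI scheme, and no typing of result bindings, which is the trade-off the comparison criteria above are meant to quantify.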

Decision criteria

  • Adopt if the SPARQL version is materially clearer AND the ergonomics overhead (load, serialize, type the boundary) is acceptable for the value-binding drift use case at minimum.
  • Adopt narrowly (value-binding graph only, leave dependency graph as-is) if the win is concentrated there.
  • Reject if it's a wash or worse — conclusion is "graph traversal scale is appropriate to in-memory TS; revisit if scale grows ≥10×".

Out of scope for the spike

  • Rewriting the BFS scenario planner.
  • Replacing domain-semantics.json authoring format.
  • Introducing a persistent triple store (in-process only).
  • SHACL validation of the spec (separate question).

Deliverables

  • A throwaway branch with both implementations of the chosen query.
  • Short writeup (in the issue or a docs/spikes/ note) recording: query chosen, LOC delta, latency numbers, friction points, recommendation.

Labels

enhancement, question
