Use artifacts for data flow between plugins#7

Merged
jcushman merged 8 commits into main from artifacts
Apr 8, 2026


@jcushman jcushman commented Mar 23, 2026

Collaborator

This refactor is inspired by @parkan's comments in #5:

my concerns with this as implemented are (1) how much re-parsing will need to be done in practice, for my example plugin it's fine (metadata reads are cheap) but larger block-level diffs can hurt, though need to think this through a bit more (2) store()/load() offers less type safety
the other regression, aside from giving up type safety/cache key collisions and re-parsing, is that skips can now be quite expensive

Basically if we enforce a serialization boundary between plugins and require them to pass around paths or opaque cache blobs, how do they efficiently communicate with each other?

To make that a bit more workable, this PR abandons the idea of an opaque cache to which plugins can store() and load(), and instead introduces the idea of typed artifacts that plugins can attach to diffnodes. An artifact is defined as:

{
    "package": <python package that owns the definition for this serialization format and exports reference read/write methods>,
    "name": <format name within that package>,
    "version": <integer format version, starting at 1>
}

So for example the stdlib includes a definition for

{
      "package": "binoc",
      "name": "tabular",
      "version": 1
}

which serializes tabular data produced by the csv comparator and consumed by binoc-row-reorder, binoc-column-reorder etc.
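
As a rough sketch of how an artifact def could be modeled on the Python side: the `ArtifactDef` class and the `TABULAR_V1` constant below are hypothetical names for illustration, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactDef:
    """Hypothetical identifier for a typed artifact format (not the SDK's type)."""

    package: str  # package that owns the format definition
    name: str     # format name within that package
    version: int  # integer format version, starting at 1

    def __post_init__(self):
        if self.version < 1:
            raise ValueError("format versions start at 1")

    def to_dict(self):
        # Same three keys as the artifact definition shown above.
        return {"package": self.package, "name": self.name, "version": self.version}


# The stdlib tabular format mentioned above.
TABULAR_V1 = ArtifactDef("binoc", "tabular", 1)
```

Freezing the dataclass makes the def hashable, so it can serve as a dictionary key when looking up artifacts on a node.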

The routing rules are:

  • comparators can attach as many artifacts as they want to the "left", "right", or "pair" sides of a diffnode ("pair" in case they generate an expensive diff or something that applies equally to both sides)
  • each side of a diffnode can hold at most one artifact per artifact def, e.g. only one (binoc, tabular, 1) for the left file. If you want a list of tabular files, create a container for it like (binoc-sql, tables, 1).
  • transformers can register to be called if a node contains a particular artifact
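
A minimal sketch of the attachment rules above, assuming artifact defs are plain `(package, name, version)` tuples and handles are opaque strings. `DiffNode`, `attach`, and `has_artifact` are illustrative names here, not the actual SDK surface.

```python
class DiffNode:
    """Toy diffnode that only tracks attached artifacts (hypothetical)."""

    SIDES = ("left", "right", "pair")

    def __init__(self):
        # One mapping per side: artifact def -> opaque handle.
        self.artifacts = {side: {} for side in self.SIDES}

    def attach(self, side, artifact_def, handle):
        if side not in self.SIDES:
            raise ValueError(f"unknown side: {side!r}")
        if artifact_def in self.artifacts[side]:
            # At most one artifact per def on a given side.
            raise ValueError(f"{artifact_def} already attached to {side}")
        self.artifacts[side][artifact_def] = handle

    def has_artifact(self, artifact_def):
        # A transformer registered for this def would be called if this is True.
        return any(artifact_def in by_def for by_def in self.artifacts.values())


node = DiffNode()
node.attach("left", ("binoc", "tabular", 1), "handle-left")
node.attach("right", ("binoc", "tabular", 1), "handle-right")
```

Attaching a second `("binoc", "tabular", 1)` to `"left"` raises `ValueError` in this sketch, which is where a container format like (binoc-sql, tables, 1) would come in.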

So for example the csv comparator currently attaches (binoc, tabular, 1) to both left and right. It could later also attach (binoc, columnular, 1) if an arrow column-based format was helpful, or it could attach (binoc, tabular, 1) and (binoc, tabular, 2) if transitioning to a new format and deprecating the old one. Transformers would bind to whichever they understand.

An Excel or Parquet comparator could emit (binoc, tabular, 1) and automatically use the same transformers.

Serialized data is not actually passed back through the controller, just a handle to it, as an sdk implementation detail. Currently the way that works is they write the serialized data to the scratch folder and pass back a path; in a future sdk it could be some other more efficient transport.
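
To make the handle idea concrete, here is a sketch of the write-to-scratch-and-return-a-path transport. The function names and the file naming scheme are assumptions for illustration; the real SDK may lay files out differently.

```python
import tempfile
from pathlib import Path


def publish_artifact(scratch_dir, artifact_def, side, payload):
    """Write serialized bytes to the scratch folder and return a handle.

    Hypothetical helper: the handle is just the file path as a string.
    """
    package, name, version = artifact_def
    # Assumed naming scheme, chosen only so each (def, side) gets its own file.
    path = Path(scratch_dir) / f"{package}.{name}.v{version}.{side}.bin"
    path.write_bytes(payload)
    return str(path)


def read_artifact(handle):
    """Resolve a handle back to the serialized bytes."""
    return Path(handle).read_bytes()


# Only the handle crosses the plugin boundary; the bytes stay on disk.
with tempfile.TemporaryDirectory() as scratch:
    handle = publish_artifact(scratch, ("binoc", "tabular", 1), "left", b"{}")
    payload = read_artifact(handle)
```

Because consumers only ever see `handle`, swapping the file for another transport later would not change their code.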

Typical plugins will directly import and use the reference read/write functions, if they're written in the same language; but if you wanted to use a different language or whatever, you would just need something that followed the same format. (binoc, tabular, 1) refers to a serialization contract committed to by the binoc package, not just a function. (I don't think that can be formally specified -- the contract could be "this is a valid sqlite file" or whatever. It's a social contract.)

In sorting out routing I also updated the transformer routing rules a bit -- item_type is just a display string now; transformers can specify lists of match_artifacts, node_shape (container vs leaf), tags, and actions, and will match if at least one element from each non-empty list matches. Empty lists mean no filter on that field. For example: detect-file-move matches on (node_shape = container) and looks at all files; row-reorder matches on (node_shape = leaf, match_artifacts = (binoc, tabular, 1)) and looks at all files with tabular data.
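
The matching rule can be sketched roughly like this. `TransformerFilter`, `NodeInfo`, and `matches` are hypothetical names, and treating node_shape as a list is an assumption based on the "each list" wording above.

```python
from dataclasses import dataclass, field


@dataclass
class TransformerFilter:
    """What a transformer declares it can handle (hypothetical shape)."""
    match_artifacts: list = field(default_factory=list)
    node_shape: list = field(default_factory=list)  # "container" and/or "leaf"
    tags: list = field(default_factory=list)
    actions: list = field(default_factory=list)


@dataclass
class NodeInfo:
    """The parts of a diffnode that routing looks at (hypothetical shape)."""
    shape: str
    artifacts: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    action: str = ""


def matches(flt, node):
    def field_ok(wanted, present):
        # Empty list: no filter. Otherwise at least one element must match.
        return not wanted or any(item in present for item in wanted)

    return (
        field_ok(flt.match_artifacts, node.artifacts)
        and field_ok(flt.node_shape, {node.shape})
        and field_ok(flt.tags, node.tags)
        and field_ok(flt.actions, {node.action})
    )


# The two examples from the paragraph above.
detect_file_move = TransformerFilter(node_shape=["container"])
row_reorder = TransformerFilter(
    node_shape=["leaf"], match_artifacts=[("binoc", "tabular", 1)]
)
```

Fields are ANDed together while elements within a field are ORed, so a transformer listing both (binoc, tabular, 1) and (binoc, tabular, 2) would match a node carrying either.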

Upshot of this approach:

  • Data flow is typed -- we use the plugin name resolution rules to declare who owns a format definition.
  • Comparators and transformers compose. There's a clean way to declare interchange formats and write against them.
  • Implementation details are hidden -- we can move to more efficient tabular data representations or whatever without changing consumer code.
  • Routing gets more precise, which helps a bit with skips -- there's an intuitive way to say "this is what I know how to transform."

Open questions / future work:

  • I don't have a good idea yet about how transformers can compose. Say a spreadsheet has a column renamed and then 4 rows added. Ideally it would first be processed by binoc-column-rename to back out the column rename, then by binoc-row-added to detect that, once the rename is backed out, only a few rows actually changed. This would need binoc-column-rename to replace or add some sort of intermediate artifact and binoc-row-added to see it. That's technically possible now (transformers can add artifacts), but I don't know if it can be done in a composable or general way. In the meantime you'll get precise changelogs for files with just one relevant transformer, but a less specific changelog line for multiple changes to the same file.
  • At this stage we don't care much how (binoc, tabular, 1) works under the hood. At the moment it's a JSON serialization of pub struct TabularData { pub headers: Vec<String>, pub rows: Vec<Vec<String>> } written to a file in data_root/.artifacts/, which isn't particularly efficient. The idea with types and versions is to give us a path to make it a shared-memory Arrow buffer, or whatever it needs to be in the future, without needing the right answer right now.
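
For a consumer that can't import the reference read/write functions, the current format could be handled along these lines. This is a sketch that assumes the JSON object uses the struct's field names (`headers`, `rows`) as keys; the Python `TabularData` class and both helpers are illustrative, not the reference implementation.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TabularData:
    """Python mirror of the TabularData struct quoted above (assumed layout)."""
    headers: list  # list of column names
    rows: list     # list of rows, each a list of string cells


def dumps_tabular(data):
    # Assumes the on-disk JSON keys are the field names.
    return json.dumps(asdict(data))


def loads_tabular(text):
    raw = json.loads(text)
    return TabularData(headers=raw["headers"], rows=raw["rows"])


table = TabularData(headers=["id", "name"], rows=[["1", "ada"], ["2", "grace"]])
round_tripped = loads_tabular(dumps_tabular(table))
```

If the format later moves to something other than JSON, that would be a new version number, and a reader like this would keep working against version 1.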

ADRs

Related updates: Cross-phase data cache (superseded note), Terminology (item_type vs dispatch)

Add ArtifactFormat, publish/get on CompareDataAccess, and related plugin
types so comparators can expose typed blobs and transformers can consume them.

ADR: published artifacts supersede the earlier cross-phase cache note; index
lists the new decision.

Made-with: Cursor
Run comparators and transformers with shared artifact storage; refine when
transformers run and how they match nodes.

ADRs document dispatch refinement, composition order with artifacts, and
terminology (item_type vs dispatch keys).

Made-with: Cursor
Wire SDK data-access surface through PyO3; adjust CLI test for pipeline output.

Made-with: Cursor
CSV comparator publishes typed tabular artifacts; tabular_analyzer consumes
them for row/column/cell semantics. Column reorder and move/copy detectors
align with artifact-based dispatch.

Made-with: Cursor
SQLite publishes relational-schema artifacts; row-reorder reads tabular
artifacts. Sync Cargo.lock (serde for sqlite, drop stale csv from lock).

Made-with: Cursor
Update expected changesets and abi-log snapshots; ignore Insta *.snap.new.

Made-with: Cursor
Align user-facing docs with published artifacts, transformer dispatch, and
authoring patterns.

Made-with: Cursor
@jcushman jcushman requested a review from cmsetzer March 23, 2026 20:57
@parkan

parkan commented Mar 26, 2026


I think this is a much cleaner approach and provides a nice path out of JSON serialization once needed; overall 👍

transformer composition is something I'd need to wrap my head around further to comment on effectively but I don't see anything obviously blocking future work on that in the current design

@cmsetzer cmsetzer left a comment

Contributor


Looks good. I'm still wrapping my head around some pieces but the artifact approach strikes me as an improvement and better sets us up for the future. I say let's merge this in and continue to iterate.

The composition of transformers question is a good one for us to get right — I feel like most reasonably complex real-world cases will demand a whole bunch of them.

@jcushman jcushman merged commit 6d88557 into main Apr 8, 2026
2 checks passed
@jcushman jcushman deleted the artifacts branch April 8, 2026 16:02
3 participants