Use artifacts for data flow between plugins by jcushman · Pull Request #7 · harvard-lil/binoc

jcushman · 2026-03-23T20:57:18Z

This refactor is inspired by @parkan's comments in #5:

my concerns with this as implemented are (1) how much re-parsing will need to be done in practice, for my example plugin it's fine (metadata reads are cheap) but larger block-level diffs can hurt, though need to think this through a bit more (2) store()/load() offers less type safety
the other regression, aside from giving up type safety/cache key collisions and re-parsing, is that skips can now be quite expensive

Basically if we enforce a serialization boundary between plugins and require them to pass around paths or opaque cache blobs, how do they efficiently communicate with each other?

To make that a bit more workable, this PR abandons the idea of an opaque cache to which plugins can store() and load(), and instead introduces the idea of typed artifacts that plugins can attach to diffnodes. An artifact is defined as:

{
    "package": <python package that owns the definition for this serialization format and exports reference read/write methods>,
    "name": <format name within that package>,
    "version": <integer format version, starting at 1>
}

So for example the stdlib includes a definition for

{
      "package": "binoc",
      "name": "tabular",
      "version": 1
}

which serializes tabular data produced by the csv comparator and consumed by binoc-row-reorder, binoc-column-reorder etc.
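A minimal sketch of how such a definition could be modelled in Python. The name ArtifactFormat appears in this PR's commit notes, but the dataclass shape and the as_dict helper below are illustrative assumptions, not the SDK's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactFormat:
    """Identifies a serialization contract as (package, name, version)."""

    package: str  # Python package that owns the format definition
    name: str     # format name within that package
    version: int  # integer format version, starting at 1

    def __post_init__(self) -> None:
        if self.version < 1:
            raise ValueError("format versions start at 1")

    def as_dict(self) -> dict:
        # Hypothetical helper: render the definition in the shape shown above.
        return {"package": self.package, "name": self.name, "version": self.version}


# The stdlib tabular format described above.
TABULAR_V1 = ArtifactFormat("binoc", "tabular", 1)
```

Because the dataclass is frozen, instances are hashable and compare by value, so a definition can serve directly as a lookup key.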

The routing rules are:

Comparators can attach as many artifacts as they want to the "left", "right", or "pair" side of a diffnode ("pair" covers cases such as an expensive generated diff, or anything else that applies equally to both sides).
There can be no more than one artifact per artifact definition on a given side, e.g. only one (binoc, tabular, 1) for the left file. If you want a list of tabular files, create a container format for it, like (binoc-sql, tables, 1).
Transformers can register to be called if a node contains a particular artifact.
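The first two rules can be sketched as a small per-node table. NodeArtifacts and its methods are hypothetical names chosen for illustration; the real storage lives in the controller and SDK:

```python
from typing import Dict, Optional, Set, Tuple

FormatKey = Tuple[str, str, int]  # (package, name, version)
SIDES = ("left", "right", "pair")


class NodeArtifacts:
    """Hypothetical per-diffnode artifact table keyed by (side, format)."""

    def __init__(self) -> None:
        self._handles: Dict[Tuple[str, FormatKey], str] = {}

    def attach(self, side: str, fmt: FormatKey, handle: str) -> None:
        if side not in SIDES:
            raise ValueError(f"unknown side: {side!r}")
        if (side, fmt) in self._handles:
            # At most one artifact per format definition per side.
            raise ValueError(f"{fmt} is already attached to {side}")
        self._handles[(side, fmt)] = handle

    def get(self, side: str, fmt: FormatKey) -> Optional[str]:
        return self._handles.get((side, fmt))

    def formats(self) -> Set[FormatKey]:
        # The set of formats present on any side; usable for transformer routing.
        return {fmt for _, fmt in self._handles}


node = NodeArtifacts()
node.attach("left", ("binoc", "tabular", 1), "handle-left")
node.attach("right", ("binoc", "tabular", 1), "handle-right")
```

Attaching a second (binoc, tabular, 1) to "left" raises in this sketch, which is the point at which a comparator would reach for a container format instead.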

So, for example, the csv comparator currently attaches (binoc, tabular, 1) to both left and right. It could later also attach (binoc, columnular, 1) if an Arrow column-based format were helpful, or it could attach both (binoc, tabular, 1) and (binoc, tabular, 2) while transitioning to a new format and deprecating the old one. Transformers would bind to whichever they understand.

An excel or parquet comparator could emit (binoc, tabular, 1) and automatically use the same transformers.
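One way a transformer might "bind to whichever it understands" during a version transition is to walk its own preference list and take the first format present on the node. This is a sketch only; pick_format is a hypothetical helper, not part of the SDK:

```python
from typing import Iterable, Optional, Sequence, Tuple

FormatKey = Tuple[str, str, int]  # (package, name, version)


def pick_format(
    available: Iterable[FormatKey], supported: Sequence[FormatKey]
) -> Optional[FormatKey]:
    """Return the first entry of `supported` (ordered by preference)
    that is also present in `available`, or None if there is no overlap."""
    present = set(available)
    for fmt in supported:
        if fmt in present:
            return fmt
    return None


# A comparator mid-transition publishes both versions.
published = [("binoc", "tabular", 1), ("binoc", "tabular", 2)]

# An older transformer only knows v1; a newer one prefers v2 but accepts v1.
old_choice = pick_format(published, [("binoc", "tabular", 1)])
new_choice = pick_format(published, [("binoc", "tabular", 2), ("binoc", "tabular", 1)])
```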

Serialized data is not actually passed back through the controller; only a handle to it is, as an SDK implementation detail. Currently plugins write the serialized data to the scratch folder and pass back a path; a future SDK could use some other, more efficient transport.
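A rough sketch of the handle idea, assuming the handle is simply a file path string. The publish and fetch functions and the file name are hypothetical stand-ins, not the SDK's real functions:

```python
import tempfile
from pathlib import Path


def publish(scratch_dir: Path, file_name: str, payload: bytes) -> str:
    """Write the serialized payload into the scratch folder and return a handle.
    In this sketch the handle is just the path; it stays opaque to the controller."""
    scratch_dir.mkdir(parents=True, exist_ok=True)
    path = scratch_dir / file_name
    path.write_bytes(payload)
    return str(path)


def fetch(handle: str) -> bytes:
    """Resolve a handle back into the serialized bytes."""
    return Path(handle).read_bytes()


# Round trip: only the short handle string would cross the plugin boundary.
with tempfile.TemporaryDirectory() as scratch:
    handle = publish(Path(scratch) / ".artifacts", "left-tabular-v1.json", b'{"headers": [], "rows": []}')
    round_tripped = fetch(handle)
```

Keeping the handle opaque is what would let a later SDK swap the file for another transport without touching consumer code.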

Typical plugins will directly import and use the reference read/write functions if they're written in the same language; a plugin in a different language would just need something that follows the same format. (binoc, tabular, 1) refers to a serialization contract committed to by the binoc package, not just a function. (I don't think that can be formally specified -- the contract could be "this is a valid sqlite file" or whatever. It's a social contract.)

In sorting out routing I also updated the transformer routing rules a bit: item_type is just a display string now. Transformers can specify lists of match_artifacts, node_shape (container vs. leaf), tags, or actions, and will match a node if at least one element from each non-empty list matches; an empty list means no filter on that field. For example, detect-file-move matches on (node_shape = container) and looks at all files; row-reorder matches on (node_shape = leaf, match_artifacts = (binoc, tabular, 1)) and looks at all files with tabular data.
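The matching rule reads as "AND across fields, OR within a field, empty means wildcard". A minimal sketch under that reading follows. The Node and TransformerFilter classes, the single action field on a node, and the "modify" value are all assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

FormatKey = Tuple[str, str, int]  # (package, name, version)
TABULAR_V1: FormatKey = ("binoc", "tabular", 1)


@dataclass(frozen=True)
class Node:
    shape: str  # "container" or "leaf"
    artifacts: FrozenSet[FormatKey] = frozenset()
    tags: FrozenSet[str] = frozenset()
    action: str = "modify"  # hypothetical action name


@dataclass(frozen=True)
class TransformerFilter:
    match_artifacts: FrozenSet[FormatKey] = frozenset()
    node_shape: FrozenSet[str] = frozenset()
    tags: FrozenSet[str] = frozenset()
    actions: FrozenSet[str] = frozenset()

    def matches(self, node: Node) -> bool:
        checks = [
            (self.match_artifacts, node.artifacts),
            (self.node_shape, frozenset({node.shape})),
            (self.tags, node.tags),
            (self.actions, frozenset({node.action})),
        ]
        # An empty filter list passes; otherwise at least one element must overlap.
        return all((not wanted) or bool(wanted & have) for wanted, have in checks)


detect_file_move = TransformerFilter(node_shape=frozenset({"container"}))
row_reorder = TransformerFilter(
    node_shape=frozenset({"leaf"}), match_artifacts=frozenset({TABULAR_V1})
)

csv_leaf = Node("leaf", artifacts=frozenset({TABULAR_V1}))
plain_leaf = Node("leaf")
folder = Node("container")
```

Here row_reorder matches csv_leaf but not plain_leaf, and detect_file_move matches only folder.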

Upshot of this approach:

Data flow is typed -- we use the plugin name resolution rules to declare who owns a format definition.
Comparators and transformers compose. There's a clean way to declare interchange formats and write against them.
Implementation details are hidden -- we can move to more efficient tabular data representations or whatever without changing consumer code.
Routing gets more precise, which helps a bit with skips -- there's an intuitive way to say "this is what I know how to transform."

Open questions / future work:

I don't have a good idea yet about how transformers can compose. Say you had a spreadsheet with a column renamed and then 4 rows added, and wanted it to be processed first by binoc-column-rename to back out the column rename, then by binoc-row-added to detect that, after the column rename, only a few rows actually changed. This would need binoc-column-rename to replace or add some sort of intermediate artifact, and binoc-row-added to see it. That's technically possible now (transformers can add artifacts), but I don't know if it can be done in a composable or general way. In the meantime you'll get precise changelogs for files with just one relevant transformer, but a less specific changelog line for multiple changes to the same file.
At this stage we're not caring very much how (binoc, tabular, 1) works under the hood. At the moment it's a JSON serialization of pub struct TabularData { pub headers: Vec<String>, pub rows: Vec<Vec<String>> } written to a file in data_root/.artifacts/, which isn't particularly efficient. The idea with types and versions is to give us a path to making it a shared-memory Arrow buffer, or whatever it needs to be in the future, without needing the right answer right now.
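For a plugin author, the current (binoc, tabular, 1) contract amounts to a JSON object with headers and rows. A Python mirror of that shape could look like the sketch below. The function names write_tabular and read_tabular and the file name are assumptions; the reference read/write functions exported by the binoc package may differ:

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class TabularData:
    """Python mirror of the Rust struct: string headers plus rows of string cells."""

    headers: List[str]
    rows: List[List[str]]


def write_tabular(data: TabularData, path: Path) -> None:
    # Serialize as {"headers": [...], "rows": [[...], ...]}.
    path.write_text(json.dumps(asdict(data)), encoding="utf-8")


def read_tabular(path: Path) -> TabularData:
    raw = json.loads(path.read_text(encoding="utf-8"))
    return TabularData(headers=raw["headers"], rows=raw["rows"])


# Round trip through a temporary stand-in for data_root/.artifacts/.
original = TabularData(headers=["id", "name"], rows=[["1", "ada"], ["2", "grace"]])
with tempfile.TemporaryDirectory() as root:
    artifact_path = Path(root) / "left.tabular.v1.json"
    write_tabular(original, artifact_path)
    restored = read_tabular(artifact_path)
```

A consumer written against read_tabular would not need to change if version 1 kept its contract while a later version moved to a different encoding.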

ADRs

Related updates: Cross-phase data cache (superseded note), Terminology (item_type vs dispatch)

Add ArtifactFormat, publish/get on CompareDataAccess, and related plugin types so comparators can expose typed blobs and transformers can consume them. ADR: published artifacts supersede the earlier cross-phase cache note; index lists the new decision. Made-with: Cursor

Run comparators and transformers with shared artifact storage; refine when transformers run and how they match nodes. ADRs document dispatch refinement, composition order with artifacts, and terminology (item_type vs dispatch keys). Made-with: Cursor

Wire SDK data-access surface through PyO3; adjust CLI test for pipeline output. Made-with: Cursor

CSV comparator publishes typed tabular artifacts; tabular_analyzer consumes them for row/column/cell semantics. Column reorder and move/copy detectors align with artifact-based dispatch. Made-with: Cursor

SQLite publishes relational-schema artifacts; row-reorder reads tabular artifacts. Sync Cargo.lock (serde for sqlite, drop stale csv from lock). Made-with: Cursor

Update expected changesets and abi-log snapshots; ignore Insta *.snap.new. Made-with: Cursor

Align user-facing docs with published artifacts, transformer dispatch, and authoring patterns. Made-with: Cursor

parkan · 2026-03-26T15:12:31Z

I think this is a much cleaner approach and provides a nice path out of JSON serialization once needed; overall 👍

transformer composition is something I'd need to wrap my head around further to comment on effectively but I don't see anything obviously blocking future work on that in the current design

cmsetzer

Looks good. I'm still wrapping my head around some pieces but the artifact approach strikes me as an improvement and better sets us up for the future. I say let's merge this in and continue to iterate.

The composition of transformers question is a good one for us to get right — I feel like most reasonably complex real-world cases will demand a whole bunch of them.

jcushman added 8 commits March 23, 2026 16:11

Remove DESIGN.md (scope for initial implementation)

b512a42

feat(python): expose artifact helpers on compare data access

b14cced

feat(stdlib): tabular artifacts and tabular analyzer transformer

ac4a19f

feat(plugins): adopt artifact APIs in sqlite and row-reorder

7915fde

test: refresh goldens for artifact pipeline and ABI logs

a347a12

docs: tutorial and plugin guide for artifacts and transformers

dabb45e

jcushman requested a review from cmsetzer March 23, 2026 20:57

cmsetzer approved these changes Mar 26, 2026

jcushman merged commit 6d88557 into main Apr 8, 2026
2 checks passed

jcushman deleted the artifacts branch April 8, 2026 16:02

jcushman mentioned this pull request Apr 8, 2026

add Custom variant to ReopenedData for plugin-defined data types #4

Closed
