Conversation
Add ArtifactFormat, publish/get on CompareDataAccess, and related plugin types so comparators can expose typed blobs and transformers can consume them. ADR: published artifacts supersede the earlier cross-phase cache note; index lists the new decision. Made-with: Cursor
Run comparators and transformers with shared artifact storage; refine when transformers run and how they match nodes. ADRs document dispatch refinement, composition order with artifacts, and terminology (item_type vs dispatch keys). Made-with: Cursor
Wire SDK data-access surface through PyO3; adjust CLI test for pipeline output. Made-with: Cursor
CSV comparator publishes typed tabular artifacts; tabular_analyzer consumes them for row/column/cell semantics. Column reorder and move/copy detectors align with artifact-based dispatch. Made-with: Cursor
SQLite publishes relational-schema artifacts; row-reorder reads tabular artifacts. Sync Cargo.lock (serde for sqlite, drop stale csv from lock). Made-with: Cursor
Update expected changesets and abi-log snapshots; ignore Insta *.snap.new. Made-with: Cursor
Align user-facing docs with published artifacts, transformer dispatch, and authoring patterns. Made-with: Cursor
|
I think this is a much cleaner approach and provides a nice path out of JSON serialization once needed; overall 👍 transformer composition is something I'd need to wrap my head around further to comment on effectively but I don't see anything obviously blocking future work on that in the current design |
cmsetzer
approved these changes
Mar 26, 2026
cmsetzer
left a comment
Contributor
There was a problem hiding this comment.
Looks good. I'm still wrapping my head around some pieces but the artifact approach strikes me as an improvement and better sets us up for the future. I say let's merge this in and continue to iterate.
The composition of transformers question is a good one for us to get right — I feel like most reasonably complex real-world cases will demand a whole bunch of them.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This refactor is inspired by @parkan's comments in #5 :
Basically if we enforce a serialization boundary between plugins and require them to pass around paths or opaque cache blobs, how do they efficiently communicate with each other?
To make that a bit more workable, this PR abandons the idea of an opaque cache to which plugins can store() and load(), and instead introduces the idea of typed artifacts that plugins can attach to diffnodes. An artifact is defined as:
So for example the stdlib includes a definition for
which serializes tabular data produced by the csv comparator and consumed by binoc-row-reorder, binoc-column-reorder etc.
The routing rules are:
So for example the csv comparator currently attaches (binoc, tabular, 1) to both left and right. It could later also attach (binoc, columnular, 1) if an arrow column-based format was helpful, or it could attach (binoc, tabular, 1) and (binoc, tabular, 2) if transitioning to a new format and deprecating the old one. Transformers would bind to whichever they understand.
An excel or parquet comparator could emit (binoc, tabular, 1) and automatically use the same transformers.
Serialized data is not actually passed back through the controller, just a handle to it, as an sdk implementation detail. Currently the way that works is they write the serialized data to the scratch folder and pass back a path; in a future sdk it could be some other more efficient transport.
Typical plugins will directly import and use the reference read/write functions, if they're written in the same language; but if you wanted to use a different language or whatever, you would just need something that followed the same format.
(binoc, tabular, 1)refers to a serialization contract committed to by thebinocpackage, not just a function. (I don't think that can be formally specified -- the contract could be "this is a valid sqlite file" or whatever. It's a social contract.)In sorting out routing I also updated the transformer routing rules a bit --
item_typeis just a display string now; transformers can specify a list of match_artifacts, node_shape (container vs leaf), tags, or actions, and will match if at least one element from each list matches. Empty lists mean no filter on that field. For example: detect-file-move matches on(node_shape = container)and looks at all files; row-reorder matches on(node_shape = leaf, match_artifacts = (binoc, tabular, 1) )and looks at all files with tabular data.Upshot of this approach:
Open questions / future work:
pub struct TabularData { pub headers: Vec<String>, pub rows: Vec<Vec<String>> }written to a file indata_root/.artifacts/, which isn't particularly efficient; the idea with types and versions is to give us a path to have it be a shared-memory arrow buffer or whatever it needs to be in the future without needing the right answer right nowADRs
Related updates: Cross-phase data cache (superseded note), Terminology (
item_typevs dispatch)