Skip to content

Conversation

@kervel
Copy link
Collaborator

@kervel kervel commented Sep 10, 2025

  • patch now does no more roundtrip to json.
  • we introduce node IDs and we keep track of added/removed/updated nodes with ID. this will allow tracking provenance of data in linkml values by keeping track of which node IDs were affected by which patch.
  • add more testcases for diff and patch and fix the bugs found by them
  • patch: take into account linkml instance equality rules when patching

…patch(); expose node_id and trace in Python; adjust CLI and tests
…l PatchTrace; remove JSON roundtrip and ID reconciliation; preserve scalar node IDs on in-place updates
…tore network-based CLI tests; remove dead code
…ead of expect/panic; update tests, CLI, and Python bindings
@kervel kervel changed the title Add NodeIds to LinkMLValue and return PatchTrace from patch() Refactor "patch" to avoid roundtrip to json. we patch the linkml values directly. Sep 10, 2025
…kMLValue::Null; add node_id to Null; handle Null in to_json, traversal, and parsing; resolve conflicts
…cting Null directly; handle Null in trace traversal
let new_obj = serde_json::json!({
"id": "P:999",
"name": "Added Person",
"objecttype": "https://w3id.org/linkml/examples/personinfo/Person"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't seen this as part of a linkml schema, is this new in this package?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, the personinfo toy schema is already there (and i think i took it from the python linkml). here i test patch tracing by adding another person to a list of persons and then seeing if making a diff then applying it gives me the node IDs of the nodes that changed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i meant the objecttype key being resolved by a url - presumably you are matching the uri to the class definition somewhere? catching up

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, that's a slot that is designates_type: true, with range=uri. that was already possible in the python / pydantic classes too ...

Comment on lines 94 to 101
// Stable node identifiers assigned to every LinkMLValue node
pub type NodeId = u64;

static NEXT_NODE_ID: AtomicU64 = AtomicU64::new(1);

fn new_node_id() -> NodeId {
NEXT_NODE_ID.fetch_add(1, Ordering::Relaxed)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this stable? it seems like it would be order dependent. how do these work? Is each node a whole object definition, or is a node each item/field/etc. within a definition?

there is some notion of identity in the linkml spec - https://linkml.io/linkml-model/latest/docs/specification/02instances/#identity-conditions - so it would probably be good to document how some parallel notion of identity relates to the notion of identity in linkml.

i can't quite tell how they are used here from the description in your original comment, could you help me understand how these are used when merging? are we talking about linkml schemas, or the data described by a schema?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey,

this has not really something to do with the identities in linkml. it is an internal node id for tracing mutations: when i mutate a linkmlvalue (patch) i want to get feedback on which "nodes" in my value tree changed. this is done by, as result of the patch algo, give me a list of updated/deleted/changed node IDs.

which also means it doesn't have to be stable.

the end goal is to do provenance tracking: if i apply a patch, and another patch, and another patch, i can tell exactly what attribute came from what patch, just like "git blame" would do for code.

i can also put the git blame functionality in this package but i tought it belonged in a separate package. however, i can't get around adding some patch tracking here or i have to duplicate the whole diff/patch logic.

greetings,
Frank

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will add this as rustdoc, because indeed its confusing!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, thanks for pointing out the document to me, i can do a better job of aligning with it, for instance by not assigning new node IDs when replacing an object/value/... by a value that is identical according to the docu. will try that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and another question (sorry for the spam) should i rename my linkmlvalue to linkmlinstance so that it aligns with the docu on instances ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am still catching up on what this is doing - if this is an instance of a linkml class, as in "a person named frank" rather than a class definition, "the definition of a person" then yes i think instance would be the canonical term.

so this is a diff/patch for instances, not class definitions? that's pretty cool! so the ID here is monotonically increasing per instance? or is that global? i don't quite understand the algorithm of the diff here but would like to. the thing i am not quite understanding is that say we have version 1 of the instance, and then each node in its tree is given an integer ID here. say that's in the range 1-10, and initially have some lastName property "a". then we have some patch with a small number of updated nodes, say that's in the range 11-15, and that sets the lastName property to "b". how are they used to match updated nodes? like how is the integer node id used to determine updates to the lastName property?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok will rename (i guess in a separate PR, so we have to get this one thru first)
the goal is indeed to have a diff/patch (but also a "git blame" like functionality) for instances. the node ID has no meaning. the only thing you can do is see if it remained the same after applying a patch then you know the patch didn't impact that node. and the "patch" function will give you a list of node IDs it added and deleted.

the diff algorithm on linkml instances is "almost" but not quite a simple yaml/json diff. there are some differences, for instance if the identifier of an inlined object changed we consider this as a new object not just as a changed identifier, also we treat missing differently from explicit null so we can apply "sparse" patches without loosing data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i now made the "patch" function compliant with the instances doc you sent me: it will not carry out no-op patches, and it will determine whether a patch is a no-op based on the equality as laid out in the spec by default.

i still kept a flag for my use case where i want to treat explicit null differently from a missing value, but PatchOptions::default is now compliant (save for bugs)

kervel and others added 6 commits September 11, 2025 07:17
…ity coverage; revert CLI offline toggle and run tests with network
…h API takes treat_missing_as_null; CLI flag; Python equals/patch args; tests for equality and no-op semantics; preserve scalar node_ids
…} with defaults; make no-op skipping configurable; wire CLI --ignore-noop/--treat-missing-as-null; update Python patch signature; update tests
@kervel kervel changed the title Refactor "patch" to avoid roundtrip to json. we patch the linkml values directly. diff/patch: refactor Sep 11, 2025
@kervel kervel merged commit 4b34691 into p2p-ld:main Sep 12, 2025
16 checks passed
@kervel
Copy link
Collaborator Author

kervel commented Sep 12, 2025

argh sorry i ment to merge into my repo main not this one
will try to revert..

@kervel
Copy link
Collaborator Author

kervel commented Sep 12, 2025

i rolled back the push to main ... now i need to get this PR to reopen.

@kervel kervel mentioned this pull request Sep 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants