Description
PR #828 made the built-in chat-model tool-call context checkpoint-safe by normalizing non-primitive Python values (UUID, OutputSchema, List[ChatMessage]) to a primitive-only form before they reach sensory memory, and reconstructing the rich types on read. Without this, Pemja wraps those objects as PyObject holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed context SIGSEGVs in JcpPyObject_FromJObject.
The unit tests in #828 assert that the stored form is recursively primitive as a checkpoint-safety proxy, because the bug cannot be reproduced in local/MiniCluster mode: that path never crosses Pemja, and triggering an in-place recovery by throwing an exception does not recreate the JVM, so the failing code path is never exercised.
After #708 lands, we should add an end-to-end test that triggers a real recovery in a standalone cluster — e.g. by killing the TaskManager process so the JVM (and the embedded Python interpreter) is recreated — and verify that the built-in tool-context flow recovers correctly. This would give true before/after verification of the #828 fix rather than relying on the primitive-form proxy.
Depends on #708.
Description
PR #828 made the built-in chat-model tool-call context checkpoint-safe by normalizing non-primitive Python values (
UUID,OutputSchema,List[ChatMessage]) to a primitive-only form before they reach sensory memory, and reconstructing the rich types on read. Without this, Pemja wraps those objects asPyObjectholders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed context SIGSEGVs inJcpPyObject_FromJObject.The unit tests in #828 assert that the stored form is recursively primitive as a checkpoint-safety proxy, because the bug cannot be reproduced in local/MiniCluster mode: that path never crosses Pemja, and triggering an in-place recovery by throwing an exception does not recreate the JVM, so the failing code path is never exercised.
After #708 lands, we should add an end-to-end test that triggers a real recovery in a standalone cluster — e.g. by killing the TaskManager process so the JVM (and the embedded Python interpreter) is recreated — and verify that the built-in tool-context flow recovers correctly. This would give true before/after verification of the #828 fix rather than relying on the primitive-form proxy.
Depends on #708.