Skip to content

[Tech Debt] Add end-to-end checkpoint-recovery test for built-in tool-context (true verification of #828) #836

@weiqingy

Description

@weiqingy

Description

PR #828 made the built-in chat-model tool-call context checkpoint-safe by normalizing non-primitive Python values (UUID, OutputSchema, List[ChatMessage]) to a primitive-only form before they reach sensory memory, and reconstructing the rich types on read. Without this, Pemja wraps those objects as PyObject holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed context SIGSEGVs in JcpPyObject_FromJObject.

The unit tests in #828 assert that the stored form is recursively primitive as a checkpoint-safety proxy, because the bug cannot be reproduced in local/MiniCluster mode: that path never crosses Pemja, and triggering an in-place recovery by throwing an exception does not recreate the JVM, so the failing code path is never exercised.

After #708 lands, we should add an end-to-end test that triggers a real recovery in a standalone cluster — e.g. by killing the TaskManager process so the JVM (and the embedded Python interpreter) is recreated — and verify that the built-in tool-context flow recovers correctly. This would give true before/after verification of the #828 fix rather than relying on the primitive-form proxy.

Depends on #708.

Metadata

Metadata

Assignees

Labels

tech debt[Issue Type] User-unaware issues, such as code refactor and infrastructure maintenance.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions