Trick Checkpoint Restart
This describes how we make GUNNS compatible with Trick's checkpoint/restart capability.
- The checkpointed-ness of a variable is controlled with the trick_chkpnt_io() field in the Trick comment in the variable declaration. For example:
double mVelocity; /**< (m/s) trick_chkpnt_io(**) This is not checkpointed, by overriding the I/O setting.
This is how we usually NOT checkpoint. */
- The trick_chkpnt_io() field is optional, and should be put after the units field and before the comment field, when used.
- If the trick_chkpnt_io() field is not specified, then the checkpointed-ness of a variable matches the Trick I/O field. For example:
double mPosition; /**< (m) This is checkpointed, because of the I/O setting.
This is how we usually checkpoint. */
double mVelocity; /**< (**) (m/s) This is not checkpointed, because of the I/O setting. */
- The trick_chkpnt_io() field overrides the Trick I/O field for determining checkpointed-ness. For example:
double mVelocity; /**< (**) (m/s) trick_chkpnt_io(*io) This is checkpointed, by overriding the I/O setting. */
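The four combinations of I/O field and trick_chkpnt_io() field described above can be collected into one declaration block. This is only a sketch: the class ExampleModel and its members are hypothetical, and the `*io` value assumes Trick's usual I/O specifiers (`**`, `*i`, `*o`, `*io`) also apply to trick_chkpnt_io(). The fields are ordinary comments to the compiler, so the class builds as plain C++ without Trick.

```cpp
// Hypothetical class, for illustration only. Trick's code generator reads the
// comment fields; the C++ compiler ignores them.
class ExampleModel {
    public:
        double mPosition;    /**< (m)         I/O enabled by default, so checkpointed. */
        double mVelocity;    /**< (m/s)       trick_chkpnt_io(**) I/O enabled, checkpointing overridden off. */
        double mAccel;       /**< (**) (m/s2) I/O disabled, so not checkpointed. */
        double mTemperature; /**< (**) (K)    trick_chkpnt_io(*io) I/O disabled, checkpointing overridden on (assumed value). */

        /// Zero-initializes every term so the object starts in a known state.
        ExampleModel() : mPosition(0.0), mVelocity(0.0), mAccel(0.0), mTemperature(0.0) {}
};
```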
- For any pointer, the trick_chkpnt_io() field only controls the checkpointed-ness of the pointer itself (the address value), not the thing that is pointed to.
  - There is usually no good reason to checkpoint a pointer, so we usually mark them trick_chkpnt_io(**).
  - An example of where checkpointing a pointer would be appropriate is when the pointer changes to different objects during runtime as part of your model state (for example, switching between state objects in a state machine).
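A sketch of the two pointer cases, assuming a hypothetical ValveController with two state objects. The pointer that is re-pointed at runtime is model state and keeps the default checkpointing; the pointer that is set once at initialization is marked trick_chkpnt_io(**). All names here are made up for illustration.

```cpp
// Hypothetical types, for illustration only.
struct ValveConfig { double mMaxFlow; };        // constant configuration data
struct ValveState  { double mFlowFraction; };   // one state of the state machine

class ValveController {
    public:
        ValveState         mOpen;         /**< (--) State object for the open state. */
        ValveState         mClosed;       /**< (--) State object for the closed state. */
        ValveState*        mCurrentState; /**< (--) Re-pointed at runtime as model state, so checkpointed. */
        const ValveConfig* mConfig;       /**< (--) trick_chkpnt_io(**) Set once at initialization, so not checkpointed. */

        explicit ValveController(const ValveConfig* config)
            : mOpen{1.0}, mClosed{0.0}, mCurrentState(&mClosed), mConfig(config) {}

        /// Switches the active state object; this is the state a restart must recover.
        void command(bool open) { mCurrentState = open ? &mOpen : &mClosed; }

        /// Flow is recomputed from the config data and the active state.
        double flow() const { return mConfig->mMaxFlow * mCurrentState->mFlowFraction; }
};
```

If mCurrentState were not checkpointed here, a restart would leave it pointing at whichever state was active before the restart, not the one active when the checkpoint was cut.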
- For pointers to dynamic memory allocations, the pointer's trick_chkpnt_io() field does not control the checkpointed-ness of the dynamic memory itself.
  - If Trick knows about the dynamic memory (allocated with the TMM instead of new/delete, etc.) then Trick always checkpoints and restarts the dynamic array, and there's no way to control it.
    - The only way to keep a dynamic array out of checkpoint/restart is to hide it from Trick, i.e. use new/delete instead of the TMM.
    - But then it's not visible in Trick View (TV), so this is a trade-off.
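A sketch of the "hide it from Trick" option, using a hypothetical TableModel class. The array comes from plain new[]/delete[], so the TMM never learns about it. The TMM-allocated alternative is only described in a comment, since it needs Trick's memory manager interface to build.

```cpp
#include <cstddef>

// Hypothetical class, for illustration only.
class TableModel {
    public:
        double*     mTable; /**< (--) trick_chkpnt_io(**) Pointer not checkpointed; array is allocated with new[], so it is hidden from Trick. */
        std::size_t mSize;  /**< (--) trick_chkpnt_io(**) Array length, fixed at construction. */

        /// Allocates a zero-filled array outside the TMM. An array allocated
        /// through the TMM instead would always be checkpointed and restarted,
        /// regardless of the trick_chkpnt_io() field on mTable.
        explicit TableModel(std::size_t size) : mTable(new double[size]()), mSize(size) {}

        ~TableModel() { delete[] mTable; }

        // Copying is disabled so the array is never freed twice.
        TableModel(const TableModel&) = delete;
        TableModel& operator=(const TableModel&) = delete;
};
```

The cost of this choice is the one noted above: the contents of mTable cannot be inspected in Trick View.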
- Because the Trick I/O field defaults to I/O enabled, and checkpointed-ness defaults to match the Trick I/O field, by default Trick checkpoints everything.
  - This is actually bad: it adds unneeded bloat to the checkpoint files, and more things that can break in the restart.
- What to checkpoint? Basically, inputs and state:
  - All inputs to your model
  - All changing, persistent state: terms that can change but also need to persist from one pass to the next
    - This generally includes any output of a numerical integration, when the output is an input to the next pass's integration
    - Counters (elapsedTime += dt; FrameCounter++) are also numerical integrations
  - Any other thing calculated by your model that needs to persist for output to others, or as an input next pass
    - This can also be thought of as an input to your model, even if it is an attribute of the same class: it's an input from last pass
  - NOT constants or things that never change, e.g. config data
  - NOT continuously re-calculated output that doesn't depend on its own value from last pass
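As a sketch of how these categories map onto declarations, consider a hypothetical MassModel with one input, two integrated states, one config term and one recomputed output. The names and physics are made up for illustration; only the choice of annotation per category follows the list above.

```cpp
// Hypothetical model, for illustration only.
class MassModel {
    public:
        double mForce;       /**< (N)      Input from another model: checkpointed. */
        double mVelocity;    /**< (m/s)    Integrated state, fed back next pass: checkpointed. */
        double mElapsedTime; /**< (s)      Counter, which is also an integration: checkpointed. */
        double mMass;        /**< (kg)     trick_chkpnt_io(**) Config data that never changes: not checkpointed. */
        double mMomentum;    /**< (kg*m/s) trick_chkpnt_io(**) Recomputed every pass from current terms: not checkpointed. */

        explicit MassModel(double mass)
            : mForce(0.0), mVelocity(0.0), mElapsedTime(0.0), mMass(mass), mMomentum(0.0) {}

        /// One model pass. mVelocity and mElapsedTime depend on their own
        /// values from last pass; mMomentum does not.
        void update(double dt) {
            mVelocity    += mForce / mMass * dt;
            mElapsedTime += dt;
            mMomentum     = mMass * mVelocity;
        }
};
```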
- The Sim Bus is NOT checkpointed.
  - Therefore you should checkpoint your intended inputs from the Sim Bus.
- The goal of checkpoint/restart is repeatability. I call this the A-B-C-B-C test:
  - Run from time A to time B, cut a checkpoint at B, then continue on, recording the trajectory to time C.
  - Restart back to the checkpoint at B, and run to C again.
  - The trajectory from checkpoint B to C should exactly match the original run from B to C.
  - We should checkpoint the bare minimum needed to achieve this match.
  - If the run from the checkpoint diverges, going from B to some other end state D instead of C, then we didn't checkpoint something we needed.
  - Since our models and modeled physics are so highly connected, a divergence in your model could be caused by a missing term anywhere in the chain/loop of inputs to your model.
  - Therefore, everything has to be checkpoint-correct for repeatability to work.
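The A-B-C-B-C test can be sketched without Trick at all. The toy below uses a hypothetical Model whose frame counter feeds back into its acceleration, and a hypothetical Checkpoint struct standing in for the checkpoint file. The restoreCounter flag simulates forgetting to checkpoint the counter.

```cpp
#include <vector>

// Hypothetical model: both terms are state that must persist between passes.
struct Model {
    double mVelocity   = 0.0;
    int    mFrameCount = 0;
    void update(double dt) {
        const double accel = (mFrameCount % 2 == 0) ? 1.0 : -0.5;
        mVelocity += accel * dt;
        ++mFrameCount;
    }
};

// Stand-in for the checkpoint file contents.
struct Checkpoint { double mVelocity; int mFrameCount; };

Checkpoint cut(const Model& model) { return Checkpoint{model.mVelocity, model.mFrameCount}; }

// Restores the model; restoreCounter = false mimics a term left out of the checkpoint.
void restart(Model& model, const Checkpoint& cp, bool restoreCounter) {
    model.mVelocity = cp.mVelocity;
    if (restoreCounter) model.mFrameCount = cp.mFrameCount;
}

// Runs the model and records the velocity after every pass.
std::vector<double> run(Model& model, int steps, double dt) {
    std::vector<double> trajectory;
    for (int i = 0; i < steps; ++i) {
        model.update(dt);
        trajectory.push_back(model.mVelocity);
    }
    return trajectory;
}

// Returns true when the second B-to-C trajectory exactly matches the first.
bool abcbcTest(bool restoreCounter) {
    const double dt = 0.1;
    Model model;
    run(model, 5, dt);                                      // A to B
    const Checkpoint b = cut(model);                        // checkpoint at B
    const std::vector<double> first  = run(model, 3, dt);   // B to C
    restart(model, b, restoreCounter);                      // back to B
    const std::vector<double> second = run(model, 3, dt);   // B to C again
    return first == second;
}
```

Calling abcbcTest(true) passes because every term the next pass depends on was restored. Calling abcbcTest(false) fails: the counter keeps its value from time C, so the rerun takes a different trajectory from B.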

- More reading about the TMM and checkpointing from Trick:
  - Ignore the material on DMTCP checkpointing; it is obsolete and we don't use it. We only use the ASCII checkpointing.
  - trick/share/doc/trick/advanced/Trick_Memory_Manager_Overview.ppt
  - trick/share/doc/trick/advanced/Trick_Checkpointing.pptx
- TBD: go into more detail about the checkpoint file itself, with examples of what things look like in the file.