Trick Checkpoint Restart

Jason Harvey edited this page Jul 7, 2022 · 11 revisions

This describes how we make GUNNS compatible with Trick's checkpoint/restart capability.

  • The checkpointed-ness of a variable is controlled with the trick_chkpnt_io() field in the Trick comment in the variable declaration. For example:
double mVelocity; /**< (m/s) trick_chkpnt_io(**) This is not checkpointed, by overriding the I/O setting.
                                                 This is how we usually NOT checkpoint. */
  • The trick_chkpnt_io() field is optional, and should be put after the units field and before the comment field, when used.
  • If the trick_chkpnt_io() field is not specified, then the checkpointed-ness of a variable matches the Trick I/O field. For example:
double mPosition; /**<      (m)   This is checkpointed, because of the I/O setting.
                                  This is how we usually checkpoint. */
double mVelocity; /**< (**) (m/s) This is not checkpointed, because of the I/O setting. */
  • The trick_chkpnt_io() field overrides the Trick I/O field for determining checkpointed-ness. For example:
double mVelocity; /**< (**) (m/s) trick_chkpnt_io(*io) This is checkpointed, by overriding the I/O setting. */
  • For any pointers, the trick_chkpnt_io() field only controls the checkpointed-ness of the pointer itself (the address value), not the thing that is pointed to.

    • In general, there is usually no good reason to checkpoint a pointer, so we usually mark them trick_chkpnt_io(**).
    • An example of where it would be appropriate is if the pointer changes to different objects during runtime as part of your model state (for example, switching between state objects in a state machine).
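The pointer guidance above can be sketched as follows. This is a minimal illustration, not GUNNS code: ValveState, ValveController and all member names are hypothetical, and the units fields are placeholders. The pointer that switches between state objects is left checkpointed because its value is model state, while the pointer that is wired once at construction is marked trick_chkpnt_io(**).

```cpp
#include <cassert>

// Hypothetical state object for a two-state valve state machine.
struct ValveState {
    const char* mName; /**< (--) trick_chkpnt_io(**) Label set at construction, never changes. */
};

// Hypothetical controller; the names are illustrative, not GUNNS classes.
class ValveController {
public:
    ValveState    mClosed;         /**< (--)  State object for the closed state. */
    ValveState    mOpen;           /**< (--)  State object for the open state. */
    ValveState*   mCurrentState;   /**< (--)  Checkpointed: which object this points to is model state. */
    const double* mSupplyPressure; /**< (kPa) trick_chkpnt_io(**) Wired once at construction, so not checkpointed. */

    explicit ValveController(const double& supplyPressure)
        : mClosed{"closed"}, mOpen{"open"},
          mCurrentState(&mClosed), mSupplyPressure(&supplyPressure) {}

    // Switch between state objects based on the command and the pointed-to pressure.
    void update(const bool commandOpen) {
        assert(mSupplyPressure != nullptr);
        mCurrentState = (commandOpen && *mSupplyPressure > 0.0) ? &mOpen : &mClosed;
    }
};
```

Only the address held in mCurrentState is covered by its checkpoint setting; whether the ValveState objects themselves are checkpointed is decided by their own declarations.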
  • For pointers to dynamic memory allocations, the pointer's trick_chkpnt_io() field does not control the checkpointed-ness of the dynamic memory itself.

    • If Trick knows about the dynamic memory (allocated with the TMM instead of new/delete, etc.) then Trick always checkpoints and restarts the dynamic array, and there's no way to control it.
    • The only way to not checkpoint and restart a dynamic array is to hide it from Trick, i.e. allocate it with new/delete instead of the TMM.
      • But then it's not visible in Trick View (TV), so this is a trade-off.
  • Because the Trick I/O field defaults to I/O enabled, and checkpointed-ness defaults to match the Trick I/O field, Trick checkpoints everything by default.

    • This is actually bad - it adds unneeded bloat to the checkpoint files, and more things that can break in the restart.
  • What to checkpoint? Basically, inputs and state:

    • All inputs to your model
    • All changing, persistent state - terms that can change but also need to persist from one pass to the next
      • This generally includes any output of a numerical integration, when that output is an input to the next pass's integration
      • Counters (elapsedTime += dt; FrameCounter++) are also numerical integrations
      • Any other thing calculated by your model and needed to persist for output to others, or as an input next pass
      • This can also be thought of as an input to your model, even if it is an attribute of the same class - it's an input from last pass
    • NOT constants or things that never change, e.g. config data
    • NOT continuously re-calculated output that doesn't depend on its own value from last pass
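As a rough illustration of these rules, here is a sketch of a hypothetical tank model. TankModel and its members are made up for this example and the units are placeholders. Config data and the continuously re-calculated output are marked trick_chkpnt_io(**), while the input, the integrated state and the counter are left at the default so they are checkpointed.

```cpp
#include <cassert>

// Hypothetical model used only to illustrate what to checkpoint.
class TankModel {
public:
    double mVolume;      /**< (m3)    trick_chkpnt_io(**) Config data, never changes after construction. */
    double mFlowIn;      /**< (kg/s)  Input from another model, so checkpointed. */
    double mMass;        /**< (kg)    Integrated state that feeds the next pass, so checkpointed. */
    double mElapsedTime; /**< (s)     Counter, which is also an integration, so checkpointed. */
    double mDensity;     /**< (kg/m3) trick_chkpnt_io(**) Re-calculated every pass from mMass, not from its own last value. */

    TankModel(const double volume, const double initialMass)
        : mVolume(volume), mFlowIn(0.0), mMass(initialMass),
          mElapsedTime(0.0), mDensity(initialMass / volume) {}

    // One pass of the model.
    void update(const double dt) {
        assert(dt > 0.0);
        mMass        += mFlowIn * dt;    // depends on last pass's mMass: persistent state
        mElapsedTime += dt;              // depends on last pass's mElapsedTime: persistent state
        mDensity      = mMass / mVolume; // depends only on this pass's values: plain output
    }
};
```

After a restart, mDensity is simply re-calculated on the first pass from the restored mMass and the unchanged mVolume, which is why it does not need to be in the checkpoint.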
  • Sim Bus is NOT checkpointed

    • Therefore you should checkpoint your intended inputs from Sim Bus
  • Goal of checkpoint/restart is repeatability. I call this the A-B-C-B-C test:

    • Run from time A to time B, cut a checkpoint at B, then continue on, recording the trajectory to time C
    • Restart back to the checkpoint at B, and run to C again.
    • The trajectory from checkpoint B to C should exactly match the original run from B to C.
    • We should checkpoint the bare minimum needed to achieve this match.
    • If running from the checkpoint diverges from B to D instead, then we didn't checkpoint something we needed.
    • Since our models and modeled physics are so highly connected, a divergence in your model could be caused by a missing term anywhere in the chain/loop of inputs to your model.
    • Therefore, everything has to be checkpoint-correct for repeatability to work.
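The A-B-C-B-C test can be sketched with a toy model outside of Trick. Everything here is an assumption for illustration: LagModel, Checkpoint, cutCheckpoint and restart are hypothetical stand-ins for a real model and for Trick's checkpoint and restart, and the restoreState flag exists only to mimic forgetting to checkpoint a needed term.

```cpp
#include <cassert>
#include <vector>

// Toy first-order lag: mInput is an input, mGain is config, mOutput is persistent state.
struct LagModel {
    double mInput;
    double mGain;
    double mOutput;
    void update(const double dt) { mOutput += mGain * (mInput - mOutput) * dt; }
};

// The bare minimum to checkpoint: the input and the state, but not the config.
struct Checkpoint {
    double mInput;
    double mOutput;
};

Checkpoint cutCheckpoint(const LagModel& model) { return {model.mInput, model.mOutput}; }

// Restore the model; restoreState = false mimics a state term missing from the checkpoint.
void restart(LagModel& model, const Checkpoint& chk, const bool restoreState) {
    model.mInput = chk.mInput;
    if (restoreState) {
        model.mOutput = chk.mOutput;
    }
}

// Step the model and record the trajectory of its output.
std::vector<double> run(LagModel& model, const int steps, const double dt) {
    assert(steps >= 0 && dt > 0.0);
    std::vector<double> trajectory;
    for (int i = 0; i < steps; ++i) {
        model.update(dt);
        trajectory.push_back(model.mOutput);
    }
    return trajectory;
}

// Returns true when the restarted B-to-C trajectory exactly matches the original.
bool abcbcTestPasses(const bool restoreState) {
    LagModel model{1.0, 0.5, 0.0};
    const double dt = 0.1;
    run(model, 10, dt);                                     // A to B
    const Checkpoint atB = cutCheckpoint(model);            // cut a checkpoint at B
    const std::vector<double> first = run(model, 10, dt);   // B to C, recorded
    restart(model, atB, restoreState);                      // restart back to B
    const std::vector<double> second = run(model, 10, dt);  // B to C again
    return first == second;                                 // exact match required
}
```

In this sketch, abcbcTestPasses(true) succeeds, while abcbcTestPasses(false) fails because mOutput keeps its value from time C after the restart, so the second run diverges from the first.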

[Figure: Checkpoint_Repeatability]

  • More reading about the TMM and checkpointing from Trick:

    • Ignore the stuff about the DMTCP checkpointing, as that stuff is obsolete and we don't use it. We only use the ASCII checkpointing.
    • trick/share/doc/trick/advanced/Trick_Memory_Manager_Overview.ppt
    • trick/share/doc/trick/advanced/Trick_Checkpointing.pptx
  • TBD go into more detail about the checkpoint file itself, give examples of what stuff looks like in the file, etc.
