Commit 219007d
fix(core/saver): create checkpoint_dir in each rank when initialize_checkpoint (#63)
Fix the error:
```
[MLF 2026-02-26 19:09:24,334 ERROR Step=110 Rank=5 ml_flashpoint.adapter.megatron.save_utils:78] Failed to save ML Flashpoint checkpoint. Skipping saving and continuing.
Traceback (most recent call last):
  File "/tmp/ml-flashpoint/src/ml_flashpoint/adapter/megatron/save_utils.py", line 67, in save_local_aware_megatron_checkpoint
    return save_strategy.async_save(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/ml-flashpoint/src/ml_flashpoint/adapter/megatron/save_strategies.py", line 220, in async_save
    with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/logs/ml-flashpoint/job-plan/step-110_ckpt/metadata.json'
```
1 parent e6c2a59 commit 219007d
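
The traceback shows `open(...)` failing because the step directory did not exist on the rank that tried to write `metadata.json`. A minimal sketch of the kind of fix the commit title describes, creating the directory on every rank before anything is written into it, might look like the following. The helper names `ensure_checkpoint_dir` and `write_metadata` are hypothetical and are not taken from the ml_flashpoint source; only `os.makedirs` and the `metadata.json` file name come from the standard library and the log above.

```python
import json
import os
import tempfile


def ensure_checkpoint_dir(checkpoint_dir: str) -> str:
    """Hypothetical helper: create the checkpoint directory on the calling rank.

    exist_ok=True makes the call idempotent, so every rank can run it
    without failing if the directory is already there.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    return checkpoint_dir


def write_metadata(checkpoint_dir: str, metadata: dict) -> str:
    """Hypothetical helper: write metadata.json inside an existing directory."""
    path = os.path.join(checkpoint_dir, "metadata.json")
    with open(path, "w") as f:
        json.dump(metadata, f)
    return path


if __name__ == "__main__":
    # Illustrative layout only; the real paths are chosen by the job.
    with tempfile.TemporaryDirectory() as root:
        ckpt = os.path.join(root, "job-plan", "step-110_ckpt")
        ensure_checkpoint_dir(ckpt)  # without this, open() raises FileNotFoundError
        print(write_metadata(ckpt, {"step": 110}))
```

Calling the directory-creation step unconditionally on each rank, rather than on one designated rank, avoids depending on ranks sharing a filesystem, which appears to be the situation the error above ran into.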
2 files changed, +8 -15 lines changed
- src/ml_flashpoint/core
- tests/core
Changed line ranges, in page order (file paths and diff line content were not captured):

| File | Hunk | Original lines removed | New lines added |
|---|---|---|---|
| 1 | 1 | 342-344 | 342-343 |
| 2 | 1 | 409 | 409 |
| 2 | 2 | 430-435 | 430-432 |
| 2 | 3 | 497-501 | 494-495 |