Fuji with Orbax Emergency Checkpointer #1210

samos123 · 2025-05-23T07:00:23Z

Do not merge. This PR exists only to align on what to test.

Command to test on 2 x v6e-16:

axlearn gcp launch run --cluster=$CLUSTER \
        --runner_name gke_tpu_single \
        --name=$USER-2 \
        --instance_type=tpu-v6e-16 \
        --host_mount_spec=name=tmp,host_path=/tmp,mount_path=/host-tmp \
        --num_replicas=2 \
        --bundler_spec=allow_dirty=True \
        --bundler_type=artifactregistry --bundler_spec=image=tpu \
        --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
        -- python3 -m axlearn.common.launch_trainer_main \
          --init_module=axlearn.common.checkpointer_orbax_emergency:local_ckpt_dir=/host-tmp/checkpoints \
          --module=text.gpt.c4_trainer --config=fuji-7B-v2-flash-orbaxem \
          --trainer_dir=gs://$PROJECT_ID-axlearn/$USER-v6e-7b-orbax-2/ \
          --data_dir=gs://axlearn-public/tensorflow_datasets  \
          --jax_backend=tpu \
          --mesh_selector=tpu-v6e-16 \
          --trace_at_steps=3
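
The launch command above expands `$CLUSTER`, `$USER`, and `$PROJECT_ID` from the shell environment, so an unset variable would silently yield an empty cluster name or a malformed `gs://` path. A minimal preflight sketch, assuming a POSIX shell; the function name `check_launch_env` is hypothetical and not part of axlearn:

```shell
# Hypothetical helper (not part of axlearn): report any variable that the
# launch command expands but that is unset or empty in the current shell.
check_launch_env() {
  missing=0
  for var in CLUSTER PROJECT_ID USER; do
    # Indirect lookup of the variable named by $var; empty if unset.
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is set, 1 otherwise.
  return "$missing"
}
```

Calling `check_launch_env` before the launch command and stopping on a nonzero status avoids submitting a job with an empty cluster name or bucket path.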

Command to test 70B on 4 x v6e-256:

axlearn gcp launch run --cluster=$CLUSTER \
        --runner_name gke_tpu_single \
        --name=$USER \
        --instance_type=tpu-v6e-256 \
        --priority_class=very-high \
        --host_mount_spec=name=tmp,host_path=/tmp,mount_path=/host-tmp \
        --num_replicas=4 \
        --bundler_spec=allow_dirty=True \
        --bundler_type=artifactregistry --bundler_spec=image=tpu \
        --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
        -- python3 -m axlearn.common.launch_trainer_main \
          --init_module=axlearn.common.checkpointer_orbax_emergency:local_ckpt_dir=/host-tmp/checkpoints \
          --module=text.gpt.c4_trainer \
          --config=fuji-70B-v2-flash-orbaxem \
          --trainer_dir=gs://tess-checkpoints-us-west1/stoelinga-axlearn-v6e-4k-orbax-1/ \
          --data_dir=gs://axlearn-public/tensorflow_datasets  \
          --jax_backend=tpu \
          --mesh_selector=tpu-v6e-256-4 \
          --trace_at_steps=3

matthew-e-hopkins and others added 6 commits May 20, 2025 22:27

update to jax 0.5.3

2c60c02

update _paritition_spec usage

eeefdc8

Private _normalized_spec was changed to _normalized_spec_for_aval.

update tests for jax 0.5.3

0a17498

Merge branch 'jax-0.5.3' into orbax-fuji

c8452b2

Orbax trainer config for Fuji

b0fb79e

too late to write

5ad37e6

samos123 changed the title ~~Fuji Config with Orbax Emergency Checkpointer~~ Fuji with Orbax Emergency Checkpointer May 23, 2025

samos123 added 23 commits May 23, 2025 14:18

pdbs=1 hostNetwork true

a4b223c

use kueue and add prints

6f62e6d

update

5261304

update

6622a5c

bump orbax version

5be3b44

change to 128GB buffer

18d3ebb

unlimited grpc messages

fdfeace

unlimited grpc take 2

51cc8ab

add test script

7452761

revert unlimited grpc limits

e8f86f4

add bundle step

19dab4b

custom libtpu

d55f279

comment out custom libtpu

74f4b0d

use custom orbax

5aae6dd

use new cluster

a40d6e4

set bastion tier disabled explicitely

52d8ebf

update for small scale tests

6a5a530

git mv test-orbax-4k-70b.sh test-orbax.sh

86d6690

loss print every step

71699e7

add support for no orbax on small scale

f94c115

save every 500 steps without orbax

32d7c2f

update pathways image to 0.5.3

6028eb6

pathways allow insecure grpc

ea0f3e8
