
Conversation


@SamuelMarks commented Sep 17, 2025

Description

Third phase of RESTRUCTURE.md. Follows #2144 & #2304.

Note: merge this after #2541, which roughly halves this PR.

mkdir -p dependencies/{dockerfiles,requirements,scripts} tools/{data_generation,dev,gcs_benchmarks,orchestration,setup,weight_inspector}
while read -r f; do mv "$f" dependencies/dockerfiles/; done < <(gfind . \( -type d -name .git -prune -not -name .git \) -o -type f -name '*Dockerfile')
for name in 'jetstream_pathways' 'maxengine_server'; do
   git checkout main ./src/MaxText/inference/"$name"/Dockerfile
   mv ./src/MaxText/inference/"$name"/Dockerfile ./dependencies/dockerfiles/"$name".Dockerfile
done
rm dependencies/dockerfiles/Dockerfile
mv *.txt dependencies/requirements/
mv {docker_build_dependency_image,docker_upload_runner}.sh dependencies/scripts/
mv {download_dataset.sh,src/MaxText/generate_distillation_data.py} tools/data_generation/
mv {code_style.sh,unit_test_and_lint.sh} tools/dev/
mv src/MaxText/standalone_{checkpointer,dataloader}.py tools/gcs_benchmarks/
mv {gpu_multi_process_run.sh,multihost_job.py,multihost_runner.py} tools/orchestration/
mv setup*sh tools/setup/
mv src/MaxText/weight_inspector.py tools/weight_inspector/
for d in data_generation gcs_benchmarks orchestration weight_inspector; do cp src/MaxText/experimental/__init__.py tools/"$d"; done
cp src/MaxText/experimental/__init__.py tools
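
The `while read -r f; do … done < <(gfind …)` move idiom above can be rehearsed on a throwaway tree before touching the real repo. The sketch below uses POSIX `find` and made-up file names, not the actual MaxText layout:

```shell
# Rehearse the Dockerfile move pattern in a scratch directory (hypothetical
# names; the PR runs the equivalent loop over the repository root with gfind).
tmp="$(mktemp -d)"
mkdir -p "$tmp"/src/a "$tmp"/src/b "$tmp"/dependencies/dockerfiles
touch "$tmp"/src/a/a.Dockerfile "$tmp"/src/b/b.Dockerfile
# Same process-substitution idiom as above: find streams paths into the loop,
# so the loop body runs in the current shell (a plain pipe would fork it).
while read -r f; do
  mv "$f" "$tmp"/dependencies/dockerfiles/
done < <(find "$tmp"/src -type f -name '*Dockerfile')
ls "$tmp"/dependencies/dockerfiles
```

Feeding the loop from `< <(…)` rather than `find … | while …` matters when the loop sets variables you need afterwards, since a piped `while` runs in a subshell.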
 44 files changed, 91 insertions(+), 70 deletions(-)
 rename clean_py_env.Dockerfile => dependencies/dockerfiles/clean_py_env.Dockerfile (100%)
 rename src/MaxText/inference/jetstream_pathways/Dockerfile => dependencies/dockerfiles/jetstream_pathways.Dockerfile (100%)
 rename src/MaxText/inference/maxengine_server/Dockerfile => dependencies/dockerfiles/maxengine_server.Dockerfile (100%)
 rename maxtext_custom_wheels.Dockerfile => dependencies/dockerfiles/maxtext_custom_wheels.Dockerfile (100%)
 rename maxtext_db_dependencies.Dockerfile => dependencies/dockerfiles/maxtext_db_dependencies.Dockerfile (96%)
 rename maxtext_dependencies.Dockerfile => dependencies/dockerfiles/maxtext_dependencies.Dockerfile (96%)
 rename maxtext_gpu_dependencies.Dockerfile => dependencies/dockerfiles/maxtext_gpu_dependencies.Dockerfile (96%)
 rename maxtext_jax_ai_image.Dockerfile => dependencies/dockerfiles/maxtext_jax_ai_image.Dockerfile (90%)
 rename maxtext_libtpu_path.Dockerfile => dependencies/dockerfiles/maxtext_libtpu_path.Dockerfile (100%)
 rename maxtext_runner.Dockerfile => dependencies/dockerfiles/maxtext_runner.Dockerfile (100%)
 rename {base_requirements => dependencies/requirements}/requirements.txt (100%)
 rename requirements_docs.txt => dependencies/requirements/requirements_docs.txt (100%)
 rename requirements_with_jax_ai_image.txt => dependencies/requirements/requirements_with_jax_ai_image.txt (100%)
 rename requirements_with_jax_stable_stack_0_6_1_pipreqs.txt => dependencies/requirements/requirements_with_jax_stable_stack_0_6_1_pipreqs.txt (100%)
 rename {base_requirements => dependencies/requirements}/tpu-base-requirements.txt (100%)
 rename {generated_requirements => dependencies/requirements}/tpu-requirements.txt (100%)
 rename docker_build_dependency_image.sh => dependencies/scripts/docker_build_dependency_image.sh (100%)
 rename docker_upload_runner.sh => dependencies/scripts/docker_upload_runner.sh (100%)
 delete mode 100644 requirements.txt
 create mode 100644 tools/__init__.py
 create mode 100644 tools/data_generation/__init__.py
 rename download_dataset.sh => tools/data_generation/download_dataset.sh (100%)
 rename {src/MaxText => tools/data_generation}/generate_distillation_data.py (100%)
 rename code_style.sh => tools/dev/code_style.sh (100%)
 rename unit_test_and_lint.sh => tools/dev/unit_test_and_lint.sh (100%)
 create mode 100644 tools/gcs_benchmarks/__init__.py
 rename {src/MaxText => tools/gcs_benchmarks}/standalone_checkpointer.py (100%)
 rename {src/MaxText => tools/gcs_benchmarks}/standalone_dataloader.py (100%)
 create mode 100644 tools/orchestration/__init__.py
 rename gpu_multi_process_run.sh => tools/orchestration/gpu_multi_process_run.sh (100%)
 rename multihost_job.py => tools/orchestration/multihost_job.py (100%)
 rename multihost_runner.py => tools/orchestration/multihost_runner.py (100%)
 rename setup.sh => tools/setup/setup.sh (100%)
 rename setup_gcsfuse.sh => tools/setup/setup_gcsfuse.sh (100%)
 rename setup_with_retries.sh => tools/setup/setup_with_retries.sh (100%)
 create mode 100644 tools/weight_inspector/__init__.py
 rename {src/MaxText => tools/weight_inspector}/weight_inspector.py (100%)

Tests

Once CI passes, manual testing can happen; then this PR will be ready to merge.

This worked on a TPU VM:

bash ./dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable
bash ./dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=nightly
printf -v command 'python3 -m MaxText.train MaxText/configs/base.yml base_output_directory='"'"'%s'"'"' dataset_path='"'"'%s'"'"' steps='"'"'%d'"'"' per_device_batch_size='"'"'%d'"'"   "${BASE_OUTPUT_DIR?}" "${DATASET_PATH?}" '100' '1'
xpk workload create \
  --base-docker-image 'maxtext_base_image' \
  --zone "${ZONE?}" \
  --cluster "${CLUSTER_NAME?}" \
  --workload "${WORKLOAD?}" \
  --tpu-type='v6e-256' \
  --num-slices='1' \
  --command "${command?}"
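
The `printf -v` line above packs literal single quotes into the stored command via the `'"'"'` close-quote, quoted-quote, reopen-quote idiom. A smaller self-contained sketch (variable names invented here, not from the PR) shows the technique and what it produces:

```shell
# printf -v writes the formatted result into a shell variable instead of
# printing it. Each '"'"' sequence ends the single-quoted string, emits a
# literal ' from inside double quotes, then reopens single quotes.
demo_output='/tmp/out'
demo_steps=100
printf -v demo_cmd 'python3 -m train output='"'"'%s'"'"' steps='"'"'%d'"'" \
  "$demo_output" "$demo_steps"
echo "$demo_cmd"
```

The stored string is `python3 -m train output='/tmp/out' steps='100'`, i.e. the substituted values end up single-quoted, which is why the idiom is used when the command is later handed to a remote runner such as `xpk --command`.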

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@bvandermoon left a comment

Thanks @SamuelMarks. Generally LGTM, just a few comments

@bvandermoon left a comment


Can you locally test building some of the Docker images, running setup.sh, other relevant tests, etc.? I can test the benchmark_runner tomorrow with a Docker image. Maybe pip installing from source also (although I don't think this should be impacted)

@bvandermoon left a comment

Is it possible to split this PR into two PRs, one for dependencies and one for tools?

Also, we will want to run manual tests for this one.

@bvandermoon left a comment

LGTM, just one comment

@bvandermoon left a comment

LGTM, please put the benchmark runner test result in the PR description as well before merging

@copybara-service copybara-service bot merged commit 8d9588a into AI-Hypercomputer:main Nov 4, 2025
43 checks passed