Commit 83f52ce

Reorganize pre-training doc

* Moves the pre-training documentation to be a sub-section of the Run MaxText section;
* Adds more information beyond just dataset configuration to the pre-training guide;
* Adds some extra content to the individual data pipeline guides.

1 parent a4bd69d, commit 83f52ce

13 files changed: 492 additions & 232 deletions

docs/development.md

Lines changed: 1 addition & 0 deletions
@@ -7,4 +7,5 @@ hidden:
 ---
 development/update_dependencies.md
 development/contribute_docs.md
+development/hlo_diff_testing.md
 ```

docs/development/hlo_diff_testing.md

Lines changed: 14 additions & 10 deletions
@@ -44,9 +44,12 @@ ______________________________________________________________________
 
 When intended architecture transformations alter graph lowering, the reference file baselines require updates.
 
-> [!IMPORTANT]\
-> While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
-> The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
+```{important}
+
+While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
+
+The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
+```
 
 ### Method 1: Run the manual GitHub Action Workflow (Highly Recommended)
 
@@ -66,13 +69,14 @@ Alternatively, you can trigger the remote workflow via terminal CLI execution:
 gh workflow run update_reference_hlo.yml --ref <branch>
 ```
 
-> [!NOTE]
-> A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must:
->
-> 1. Pull the new commit from remote.
-> 2. Squash the commits in your branch once again to keep your PR history clean.
-> 3. Push the squashed commit to remote.
-> 4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
+```{note}
+A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must:
+
+1. Pull the new commit from remote.
+2. Squash the commits in your branch once again to keep your PR history clean.
+3. Push the squashed commit to remote.
+4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
+```
 
 ### Method 2: Local Execution
 

docs/guides.md

Lines changed: 19 additions & 18 deletions
@@ -18,58 +18,59 @@
 
 Explore our how-to guides for optimizing, debugging, and managing your MaxText workloads.
 
-::::{grid} 1 2 2 2
-:gutter: 2
-
-:::{grid-item-card} ⚡ Optimization
+````{grid} 1 2 2 2
+---
+gutter: 2
+---
+```{grid-item-card} ⚡ Optimization
 :link: guides/optimization
 :link-type: doc
 
 Techniques for maximizing performance, including sharding strategies, Pallas kernels, and benchmarking.
-:::
+```
 
-:::{grid-item-card} 💾 Data Pipelines
+```{grid-item-card} 💾 Data Pipelines
 :link: guides/data_input_pipeline
 :link-type: doc
 
 Configure input pipelines using **Grain** (recommended for determinism), **HuggingFace**, or **TFDS**.
-:::
+```
 
-:::{grid-item-card} 🔄 Checkpointing
+```{grid-item-card} 🔄 Checkpointing
 :link: guides/checkpointing_solutions
 :link-type: doc
 
 Manage GCS checkpoints, handle preemption with emergency checkpointing, and configure multi-tier storage.
-:::
+```
 
-:::{grid-item-card} 🔍 Monitoring & Debugging
+```{grid-item-card} 🔍 Monitoring & Debugging
 :link: guides/monitoring_and_debugging
 :link-type: doc
 
 Tools for observability: goodput monitoring, hung job debugging, and Vertex AI TensorBoard integration.
-:::
+```
 
-:::{grid-item-card} 🐍 Python Notebooks
+```{grid-item-card} 🐍 Python Notebooks
 :link: guides/run_python_notebook
 :link-type: doc
 
 Interactive development guides for running MaxText on Google Colab or local JupyterLab environments.
-:::
+```
 
-:::{grid-item-card} 🌱 Model Bringup
+```{grid-item-card} 🌱 Model Bringup
 :link: guides/model_bringup
 :link-type: doc
 
 A step-by-step guide for the community to help expand MaxText's model library.
-:::
+```
 
-:::{grid-item-card} 🎓 Distillation
+```{grid-item-card} 🎓 Distillation
 :link: guides/distillation
 :link-type: doc
 
 How online distillation works in MaxText: loss anatomy, α / β / temperature schedule tuning, layer indices, monitoring metrics, and troubleshooting.
-:::
-::::
+```
+````
 
 ```{toctree}
 ---

docs/guides/data_input_pipeline.md

Lines changed: 2 additions & 1 deletion
@@ -37,7 +37,8 @@ Training in a multi-host environment presents unique challenges for data input p
 
 ### Random access dataset (Recommended)
 
-Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.<br>
+Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.
+
 In MaxText, this is best supported by the ArrayRecord format using the Grain input pipeline. This approach gracefully handles the key challenges:
 
 - **Concurrent access and uniqueness**: Grain assigns a unique set of indices to each host. ArrayRecord allows different hosts to read from different indices in the same file.

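In practice this means selecting the Grain pipeline with ArrayRecord files; a minimal flag combination, with a placeholder file pattern, looks like the following (full details in the Grain pipeline guide):

```bash
dataset_type=grain \
grain_file_type=arrayrecord \
grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record*
```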
docs/guides/data_input_pipeline/data_input_grain.md

Lines changed: 29 additions & 7 deletions
@@ -32,9 +32,14 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
 
 ## Using Grain
 
-1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)(sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
-   - **Community Resource**: The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet.
-2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve the perf for the ArrayRecord format as it allows meta data caching to speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.
+Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random-access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
+
+```{admonition} Community Resource
+
+The MaxText community has created an [ArrayRecord Documentation](https://array-record.readthedocs.io/) site. We appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers.
+```
+
+If the dataset is hosted on a Cloud Storage bucket, the `gs://` path can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This significantly improves performance for the ArrayRecord format, because metadata caching speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path on each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.
 
 ```sh
 bash src/dependencies/scripts/setup_gcsfuse.sh \
@@ -45,11 +50,13 @@ MOUNT_PATH=${MOUNT_PATH?} \
 
 Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pre-filling the metadata cache (see ["Performance tuning best practices" on the Google Cloud documentation](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/performance)).
 
+### Configuration
+
 1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, `grain_train_files` in `src/maxtext/configs/base.yml` or through command line arguments to match the file pattern on the mounted local path.
 
 2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh) to avoid gcsfuse throttling.
 
-3. ArrayRecord Only: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example:
+3. *ArrayRecord Only*: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example:
 
 ```
 # Blend two data sources with 30% from first source and 70% from second source
@@ -120,17 +127,32 @@ grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \
 grain_worker_count=2
 ```
 
-1. Using validation set for evaluation
+### Using validation set for evaluation
 
-When setting eval_interval > 0, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line):
+When setting `eval_interval > 0`, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line):
 
 ```yaml
 eval_interval: 10000
 eval_steps: 50
 grain_eval_files: '/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record*'
 ```
 
-1. Experimental: resuming training with a different chip count
+### Tokenizer support
+
+The Grain pipeline supports three tokenizer types:
+
+- `sentencepiece`: for SentencePiece tokenizers;
+- `huggingface`: for HuggingFace tokenizers (requires `hf_access_token` for gated models);
+- `tiktoken`: for OpenAI's tiktoken tokenizers.
+
+Example with SentencePiece:
+
+```bash
+tokenizer_type=sentencepiece \
+tokenizer_path=gs://<your-bucket>/tokenizers/c4_en_301_5Mexp2_spm.model
+```
+
+### Experimental: resuming training with a different chip count
 
 In Grain checkpoints, each data-loading host has a corresponding JSON file. For cases where a user wants to resume training with a different number of data-loading hosts, MaxText provides an experimental feature:

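Putting the flags from this guide together, a full pre-training invocation with the Grain pipeline might look roughly like this; the bucket, run name, and step count are placeholders, and the module path follows the Hugging Face example in data_input_hf.md:

```bash
python3 -m maxtext.trainers.pre_train.train \
base_output_directory=gs://<your-bucket> \
run_name=grain_demo \
dataset_type=grain \
grain_file_type=arrayrecord \
grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \
grain_worker_count=2 \
tokenizer_type=sentencepiece \
tokenizer_path=gs://<your-bucket>/tokenizers/c4_en_301_5Mexp2_spm.model \
steps=1000 \
per_device_batch_size=8
```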
docs/guides/data_input_pipeline/data_input_hf.md

Lines changed: 34 additions & 0 deletions
@@ -39,6 +39,40 @@ hf_eval_files: 'gs://<bucket>/<folder>/*-validation-*.parquet' # match the val
 tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large
 ```
 
+## Tokenizer configuration
+
+The Hugging Face pipeline only supports Hugging Face tokenizers and will ignore the `tokenizer_type` flag.
+
+## Using gated datasets
+
+For [gated datasets](https://huggingface.co/docs/hub/en/datasets-gated) or tokenizers from [gated models](https://huggingface.co/docs/hub/en/models-gated), you need to:
+
+1. Request access on HuggingFace.
+2. Generate an access token from your [HuggingFace settings](https://huggingface.co/settings/tokens).
+3. Provide the token in your command:
+
+```bash
+hf_access_token=<YOUR_TOKEN>
+```
+
+Example with a gated model:
+
+```bash
+python3 -m maxtext.trainers.pre_train.train \
+base_output_directory=gs://<your-bucket> \
+run_name=llama2_demo \
+model_name=llama2-7b \
+dataset_type=hf \
+hf_path=allenai/c4 \
+hf_data_dir=en \
+train_split=train \
+tokenizer_type=huggingface \
+tokenizer_path=meta-llama/Llama-2-7b \
+hf_access_token=hf_xxxxxxxxxxxxx \
+steps=1000 \
+per_device_batch_size=8
+```
+
 ## Limitations and Recommendations
 
 1. Streaming data directly from Hugging Face Hub may be impacted by the traffic of the server. During peak hours you may encounter "504 Server Error: Gateway Time-out". It's recommended to download the Hugging Face dataset to a Cloud Storage bucket or disk for the most stable experience.

docs/guides/data_input_pipeline/data_input_tfds.md

Lines changed: 12 additions & 0 deletions
@@ -1,5 +1,9 @@
 # TFDS pipeline
 
+The TensorFlow Datasets (TFDS) pipeline uses datasets in TFRecord format, which is performant and widely supported in the TensorFlow ecosystem.
+
+## Example config for streaming from a TFDS dataset in a Cloud Storage bucket
+
 1. Download the Allenai C4 dataset in TFRecord format to a Cloud Storage bucket. For information about cost, see [this discussion](https://github.com/allenai/allennlp/discussions/5056)
 
 ```shell
@@ -18,3 +22,11 @@ eval_split: 'validation'
 # TFDS input pipeline only supports tokenizer in spm format
 tokenizer_path: 'src/maxtext/assets/tokenizers/tokenizer.llama2'
 ```
+
+### Tokenizer support
+
+The TFDS pipeline supports three tokenizer types:
+
+- `sentencepiece`: for SentencePiece tokenizers
+- `huggingface`: for HuggingFace tokenizers (requires `hf_access_token` for gated models)
+- `tiktoken`: for OpenAI's tiktoken tokenizers

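Mirroring the Grain example, a SentencePiece tokenizer for the TFDS pipeline can be selected with the same pair of flags, here using the tokenizer path already shown in the config above:

```bash
tokenizer_type=sentencepiece \
tokenizer_path=src/maxtext/assets/tokenizers/tokenizer.llama2
```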
docs/run_maxtext.md

Lines changed: 22 additions & 13 deletions
@@ -2,50 +2,59 @@
 
 Choose your environment and orchestration method to run MaxText.
 
-::::{grid} 1 2 2 2
-:gutter: 2
+````{grid} 1 2 2 2
+---
+gutter: 2
+---
+```{grid-item-card} 🚀 Pre-training
+:link: run_maxtext/run_maxtext_pretraining
+:link-type: doc
+
+Complete guide to pre-training language models from scratch. Covers model selection, hyperparameters, dataset configuration, deployment options, and monitoring.
+```
 
-:::{grid-item-card} 💻 Localhost / Single VM
+```{grid-item-card} 💻 Localhost / Single VM
 :link: run_maxtext/run_maxtext_localhost
 :link-type: doc
 
 Get started quickly on a single machine. Clone the repo, install dependencies, and run your first training job on a single TPU or GPU VM.
-:::
+```
 
-:::{grid-item-card} 🎮 Single-host GPU
+```{grid-item-card} 🎮 Single-host GPU
 :link: run_maxtext/run_maxtext_single_host_gpu
 :link-type: doc
 
 Run MaxText on single-host NVIDIA GPUs (e.g., A3 High/Mega). Includes Docker setup, NVIDIA Container Toolkit installation, and 1B/7B model training examples.
-:::
+```
 
-:::{grid-item-card} 🏗️ At scale with XPK (GKE)
+```{grid-item-card} 🏗️ At scale with XPK (GKE)
 :link: run_maxtext/run_maxtext_via_xpk
 :link-type: doc
 
 Deploy to Google Kubernetes Engine (GKE) using XPK. Orchestrate large-scale training jobs on TPU or GPU clusters with simple CLI commands.
-:::
+```
 
-:::{grid-item-card} 🌐 Multi-host via Pathways
+```{grid-item-card} 🌐 Multi-host via Pathways
 :link: run_maxtext/run_maxtext_via_pathways
 :link-type: doc
 
 Run large-scale JAX jobs on TPUs using Pathways. Supports batch and headless (interactive) workloads on GKE.
-:::
+```
 
-:::{grid-item-card} 🔌 Decoupled Mode
+```{grid-item-card} 🔌 Decoupled Mode
 :link: run_maxtext/decoupled_mode
 :link-type: doc
 
 Run tests and local development without Google Cloud dependencies (no `gcloud`, GCS, or Vertex AI required).
-:::
-::::
+```
+````
 
 ```{toctree}
 ---
 hidden:
 maxdepth: 1
 ---
+run_maxtext/run_maxtext_pretraining.md
 run_maxtext/run_maxtext_localhost.md
 run_maxtext/run_maxtext_single_host_gpu.md
 run_maxtext/run_maxtext_via_xpk.md
