* Moves pre-training documentation to be a sub-section of the Run MaxText section;
* Adds more information beyond just dataset configuration for the pre-training guide;
* Adds some extra content to the individual data pipeline guides.
When intended architecture transformations alter graph lowering, the reference file baselines require updates.
- > [!IMPORTANT]\
- > While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
- > The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
+ ```{important}
+
+ While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
+
+ The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
+ ```
### Method 1: Run the manual GitHub Action Workflow (Highly Recommended)
@@ -66,13 +69,14 @@ Alternatively, you can trigger the remote workflow via terminal CLI execution:
gh workflow run update_reference_hlo.yml --ref <branch>
```
- > [!NOTE]
- > A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must:
- >
- > 1. Pull the new commit from remote.
- > 2. Squash the commits in your branch once again to keep your PR history clean.
- > 3. Push the squashed commit to remote.
- > 4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
+ ```{note}
+ A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must:
+
+ 1. Pull the new commit from remote.
+ 2. Squash the commits in your branch once again to keep your PR history clean.
+ 3. Push the squashed commit to remote.
+ 4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
+ ```
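For illustration, the four steps above might look roughly like the following. This is a minimal sketch assuming a single feature branch rebased onto `origin/main`; adapt the branch names and rebase base to your own history.

```bash
# Sketch of the post-workflow clean-up; <branch> and <run-id> are placeholders.
git pull origin <branch>                      # 1. pull the baseline commit added by the workflow
git rebase -i origin/main                     # 2. squash your commits (mark all but the first as "squash")
git push --force-with-lease origin <branch>   # 3. push the squashed history
gh run rerun <run-id> --failed                # 4. re-run the failed tpu-integration jobs (or use the PR "Checks" tab)
```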
docs/guides/data_input_pipeline.md (2 additions, 1 deletion)
@@ -37,7 +37,8 @@ Training in a multi-host environment presents unique challenges for data input p
### Random access dataset (Recommended)
- Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.<br>
+ Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.
+
In MaxText, this is best supported by the ArrayRecord format using the Grain input pipeline. This approach gracefully handles the key challenges:
- **Concurrent access and uniqueness**: Grain assigns a unique set of indices to each host. ArrayRecord allows different hosts to read from different indices in the same file.
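As a rough illustration of how this pipeline is selected, a training invocation might look like the sketch below. The flag names mirror the Grain guide later in this PR; the entry point, bucket, and mount path are placeholders rather than prescribed values.

```bash
# Minimal sketch: select the Grain pipeline with an ArrayRecord dataset mounted via gcsfuse.
python3 -m maxtext.trainers.pre_train.train \
  base_output_directory=gs://<your-bucket> \
  run_name=grain_arrayrecord_demo \
  dataset_type=grain \
  grain_file_type=arrayrecord \
  grain_train_files='/tmp/gcsfuse/array-record/c4/en/*.array_record' \
  grain_worker_count=4
```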
docs/guides/data_input_pipeline/data_input_grain.md (29 additions, 7 deletions)
@@ -32,9 +32,14 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
## Using Grain
- 1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
- **Community Resource**: The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet.
- 2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve the perf for the ArrayRecord format as it allows meta data caching to speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.
+ Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random-access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
+
+ ```{admonition} Community Resource
+
+ The MaxText community has created [ArrayRecord documentation](https://array-record.readthedocs.io/). Note: we appreciate this community contribution, but it has not yet been verified by the MaxText or ArrayRecord developers.
+ ```
+
+ If the dataset is hosted on a Cloud Storage bucket, the `gs://` path can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This significantly improves performance for the ArrayRecord format because it enables metadata caching, which speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path on each worker using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh), which configures some parameters for the mount.
```sh
bash src/dependencies/scripts/setup_gcsfuse.sh \
@@ -45,11 +50,13 @@ MOUNT_PATH=${MOUNT_PATH?} \
Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` to pre-fill the metadata cache (see ["Performance tuning best practices" in the Google Cloud documentation](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/performance)).
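For concreteness, a filled-in invocation might look like the sketch below. `DATASET_GCS_BUCKET` is an assumed argument name and the paths are placeholders; check the script itself for the exact variables it expects.

```bash
# Hypothetical invocation; only MOUNT_PATH and FILE_PATH are named in the text above.
bash src/dependencies/scripts/setup_gcsfuse.sh \
  DATASET_GCS_BUCKET=<your-bucket> \
  MOUNT_PATH=/tmp/gcsfuse \
  FILE_PATH=/tmp/gcsfuse/array-record/c4
```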
+ ### Configuration
+
1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, `grain_train_files` in `src/maxtext/configs/base.yml` or through command line arguments to match the file pattern on the mounted local path.
2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh) to avoid gcsfuse throttling.
- 3. ArrayRecord Only: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example:
+ 3. *ArrayRecord Only*: For multi-source blending, you can specify multiple data sources with their respective weights, using a semicolon (;) as the separator between sources and a comma (,) between weights. The weights will be automatically normalized to sum to 1.0. For example:
```
# Blend two data sources with 30% from first source and 70% from second source
- When setting eval_interval > 0, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line):
+ When setting `eval_interval > 0`, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line):
### Experimental: resuming training with a different chip count
In Grain checkpoints, each data-loading host has a corresponding JSON file. For cases where a user wants to resume training with a different number of data-loading hosts, MaxText provides an experimental feature:
docs/guides/data_input_pipeline/data_input_hf.md (34 additions, 0 deletions)
@@ -39,6 +39,40 @@ hf_eval_files: 'gs://<bucket>/<folder>/*-validation-*.parquet' # match the val
tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large
```
+ ## Tokenizer configuration
+
+ The Hugging Face pipeline only supports Hugging Face tokenizers and will ignore the `tokenizer_type` flag.
+
+ ## Using gated datasets
+
+ For [gated datasets](https://huggingface.co/docs/hub/en/datasets-gated) or tokenizers from [gated models](https://huggingface.co/docs/hub/en/models-gated), you need to:
+
+ 1. Request access on HuggingFace
+ 2. Generate an access token from your [HuggingFace settings](https://huggingface.co/settings/tokens)
+ 3. Provide the token in your command:
+
+ ```bash
+ hf_access_token=<YOUR_TOKEN>
+ ```
+
+ Example with a gated model:
+
+ ```bash
+ python3 -m maxtext.trainers.pre_train.train \
+   base_output_directory=gs://<your-bucket> \
+   run_name=llama2_demo \
+   model_name=llama2-7b \
+   dataset_type=hf \
+   hf_path=allenai/c4 \
+   hf_data_dir=en \
+   train_split=train \
+   tokenizer_type=huggingface \
+   tokenizer_path=meta-llama/Llama-2-7b \
+   hf_access_token=hf_xxxxxxxxxxxxx \
+   steps=1000 \
+   per_device_batch_size=8
+ ```
+
## Limitations and Recommendations
1. Streaming data directly from Hugging Face Hub may be impacted by the traffic of the server. During peak hours you may encounter "504 Server Error: Gateway Time-out". It's recommended to download the Hugging Face dataset to a Cloud Storage bucket or disk for the most stable experience.
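To illustrate the recommendation in item 1, one possible way to stage a Hugging Face dataset in a Cloud Storage bucket first is sketched below; the dataset, paths, and the use of `huggingface-cli` and `gcloud storage` are assumptions for illustration, not MaxText tooling.

```bash
# Sketch: download a dataset from the Hub locally, then copy it to Cloud Storage.
huggingface-cli download allenai/c4 --repo-type dataset --include "en/*" --local-dir /tmp/c4
gcloud storage cp -r /tmp/c4/en gs://<your-bucket>/hf/c4/en
```

The `gs://` globs shown at the top of this guide can then point at the copied Parquet files instead of streaming from the Hub.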
docs/guides/data_input_pipeline/data_input_tfds.md (12 additions, 0 deletions)
@@ -1,5 +1,9 @@
# TFDS pipeline
+ The TensorFlow Datasets (TFDS) pipeline uses datasets in TFRecord format, which is performant and widely supported in the TensorFlow ecosystem.
+
+ ## Example config for streaming from a TFDS dataset in a Cloud Storage bucket
+
1. Download the Allenai C4 dataset in TFRecord format to a Cloud Storage bucket. For information about cost, see [this discussion](https://github.com/allenai/allennlp/discussions/5056)
```shell
@@ -18,3 +22,11 @@ eval_split: 'validation'
# TFDS input pipeline only supports tokenizer in spm format
docs/run_maxtext.md (22 additions, 13 deletions)
@@ -2,50 +2,59 @@
Choose your environment and orchestration method to run MaxText.
- ::::{grid} 1 2 2 2
- :gutter: 2
+ ````{grid} 1 2 2 2
+ ---
+ gutter: 2
+ ---
+ ```{grid-item-card} 🚀 Pre-training
+ :link: run_maxtext/run_maxtext_pretraining
+ :link-type: doc
+
+ Complete guide to pre-training language models from scratch. Covers model selection, hyperparameters, dataset configuration, deployment options, and monitoring.
+ ```
- :::{grid-item-card} 💻 Localhost / Single VM
+ ```{grid-item-card} 💻 Localhost / Single VM
:link: run_maxtext/run_maxtext_localhost
:link-type: doc

Get started quickly on a single machine. Clone the repo, install dependencies, and run your first training job on a single TPU or GPU VM.
- :::
+ ```

- :::{grid-item-card} 🎮 Single-host GPU
+ ```{grid-item-card} 🎮 Single-host GPU
:link: run_maxtext/run_maxtext_single_host_gpu
:link-type: doc

Run MaxText on single-host NVIDIA GPUs (e.g., A3 High/Mega). Includes Docker setup, NVIDIA Container Toolkit installation, and 1B/7B model training examples.
- :::
+ ```

- :::{grid-item-card} 🏗️ At scale with XPK (GKE)
+ ```{grid-item-card} 🏗️ At scale with XPK (GKE)
:link: run_maxtext/run_maxtext_via_xpk
:link-type: doc

Deploy to Google Kubernetes Engine (GKE) using XPK. Orchestrate large-scale training jobs on TPU or GPU clusters with simple CLI commands.
- :::
+ ```

- :::{grid-item-card} 🌐 Multi-host via Pathways
+ ```{grid-item-card} 🌐 Multi-host via Pathways
:link: run_maxtext/run_maxtext_via_pathways
:link-type: doc

Run large-scale JAX jobs on TPUs using Pathways. Supports batch and headless (interactive) workloads on GKE.
- :::
+ ```

- :::{grid-item-card} 🔌 Decoupled Mode
+ ```{grid-item-card} 🔌 Decoupled Mode
:link: run_maxtext/decoupled_mode
:link-type: doc

Run tests and local development without Google Cloud dependencies (no `gcloud`, GCS, or Vertex AI required).