
Commit d9b8746

Update dataset and readme

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>

1 parent 43e29dc commit d9b8746

File tree: 2 files changed (+34, -16 lines)

bionemo-recipes/recipes/opengenome2_llama_native_te/DATASET.md

Lines changed: 25 additions & 11 deletions
@@ -64,20 +64,20 @@ batch diversity and we hypothesized that it may slow convergence compared to the
 
 ### Tuning parameters
 
-| Parameter         | Effect                                       | Tradeoff                                                                                   |
-| ----------------- | -------------------------------------------- | ------------------------------------------------------------------------------------------ |
-| `buffer_size`     | Controls local shuffle quality               | Larger = better randomization, more CPU memory                                             |
+| Parameter         | Effect                                       | Tradeoff                                                                                     |
+| ----------------- | -------------------------------------------- | -------------------------------------------------------------------------------------------- |
+| `buffer_size`     | Controls local shuffle quality               | Larger = better randomization, more CPU memory                                               |
 | `num_workers`     | Controls data loading throughput             | More workers = better I/O overlap and shuffling, but more memory (each has its own buffer)   |
-| `prefetch_factor` | Batches queued ahead per worker (default: 4) | Higher = absorbs shard-transition stalls, more memory                                       |
+| `prefetch_factor` | Batches queued ahead per worker (default: 4) | Higher = absorbs shard-transition stalls, more memory                                        |
 
 **Step-time spikes:** When a worker finishes its shard and opens a new one, the GPU may stall
 waiting for the buffer to refill. This causes occasional step-time spikes visible in WandB.
 Increasing `prefetch_factor` or `buffer_size` can help absorb these stalls.
 
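The `buffer_size` row in the tuning table above corresponds to the streaming shuffle buffer used by Hugging Face `datasets`: each worker keeps `buffer_size` examples in memory and yields a random one as new examples stream in. A dependency-free sketch of that mechanism (the function name is ours, not the library's):

```python
# Minimal sketch of a streaming shuffle buffer, the mechanism behind the
# `buffer_size` parameter: keep up to `buffer_size` items in memory, emit a
# random buffered item each time a new one arrives, then drain the remainder.
# Larger buffers approximate a global shuffle better but use more CPU memory.
import random

def shuffle_stream(stream, buffer_size, seed=0):
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]            # emit a random buffered item...
            buffer[idx] = item           # ...and replace it with the new one
    rng.shuffle(buffer)                  # drain what's left, shuffled
    yield from buffer

# With buffer_size=1 the output is simply the input order; larger buffers
# randomize more at the cost of memory.
print(list(shuffle_stream(range(10), buffer_size=4)))
```

This also makes the `num_workers` tradeoff in the table concrete: every worker holds its own such buffer, so memory scales with `num_workers * buffer_size`.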
-## Reshuffling and resharding with DuckDB
+## Reshuffling and resharding with [DuckDB](https://duckdb.org/)
 
 To address the limited batch diversity, we globally shuffle all sequences and reshard into many
-more Parquet files using DuckDB:
+more Parquet files using [DuckDB](https://duckdb.org/):
 
 <p align="center">
   <img src="assets/resharding_duckdb.png" alt="Resharding pipeline with DuckDB" width="60%" />
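The DuckDB command referenced here is not shown in this diff, so as an illustration only, here is a dependency-free Python sketch of what the resharding step does conceptually: globally shuffle all records, then split them across many small shards. (DuckDB performs the same operation with a random ordering plus a partitioned Parquet COPY; this sketch writes JSONL shards so it runs with the standard library alone, and the function name is ours.)

```python
# Conceptual stand-in for the DuckDB resharding step: globally shuffle all
# sequences, then write them round-robin into many small shards so that
# streaming workers reading different shards see a diverse mix of the data.
import json
import random
from pathlib import Path

def reshard(records, out_dir, num_shards, seed=0):
    """Globally shuffle `records` and write them round-robin into JSONL shards."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # global shuffle
    shards = [[] for _ in range(num_shards)]
    for i, rec in enumerate(shuffled):         # round-robin shard assignment
        shards[i % num_shards].append(rec)
    for idx, shard in enumerate(shards):
        with open(out / f"data_{idx}.jsonl", "w") as f:
            for rec in shard:
                f.write(json.dumps(rec) + "\n")

# Example: 10 toy sequences spread across 4 shards.
reshard([{"text": f"seq{i}"} for i in range(10)], "resharded", num_shards=4)
```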
@@ -113,7 +113,14 @@ reading from different shards, the effective shuffle pool becomes
   <img src="assets/hf_streaming_buffer_resharded.png" alt="HF streaming buffer with resharded dataset" width="80%" />
 </p>
 
-In order to create such a dataset, use the duck db command above and create ad irectory for your sharded parquet files. Then point the `og2_7b_thd_gqa_global_shuffle` config at the output directory like so:
+### Creating your own resharded dataset
+
+Use the [DuckDB](https://duckdb.org/) command above to globally shuffle and reshard your data:
+
+1. Install DuckDB: `pip install duckdb` (or download from [duckdb.org](https://duckdb.org/))
+2. Run the DuckDB command above from the directory containing your JSONL training files
+3. The output directory will contain Parquet shards (e.g. `output/data_0.parquet`, ...)
+4. Update your Hydra config or override on the command line:
 
 ```yaml
 dataset:
@@ -123,6 +130,13 @@ dataset:
   streaming: true
 ```
 
+Or via command line:
+
+```bash
+torchrun --nproc_per_node=8 train_fsdp2.py --config-name og2_7b_thd_gqa_global_shuffle \
+  dataset.load_dataset_kwargs.path=/path/to/your/resharded_parquet_dir
+```
+
 ## Summary of approaches
 
 <p align="center">
@@ -131,10 +145,10 @@ dataset:
 
 ## Config mapping
 
-| Config | Data source | Tokenization | stride | buffer_size | Notes |
-| ------------------------------- | ------------------- | --------------------- | ------ | ----------- | ----------------------- |
-| `og2_7b_thd_gqa` | Streaming JSONL (original) | Windowed (on-the-fly) | 200 | 50,000 | Original 80 shards |
-| `og2_7b_thd_gqa_global_shuffle` | Streaming Sharded Parquet | Windowed (on-the-fly) | 200 | 10,000 | Reshuffled 1,733 shards |
+| Config                          | Data source                | Tokenization          | stride | buffer_size | Notes                   |
+| ------------------------------- | -------------------------- | --------------------- | ------ | ----------- | ----------------------- |
+| `og2_7b_thd_gqa`                | Streaming JSONL (original) | Windowed (on-the-fly) | 200    | 50,000      | Original 80 shards      |
+| `og2_7b_thd_gqa_global_shuffle` | Streaming Sharded Parquet  | Windowed (on-the-fly) | 200    | 10,000      | Reshuffled 1,733 shards |
 
 Implementation: [dataset.py](dataset.py) (`create_tokenized_dataset`, `create_thd_dataloader`,
 `create_bshd_dataloader`).
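The "Windowed (on-the-fly)" tokenization with a `stride` in the config table above splits long sequences into fixed-length windows; if `stride` is the overlap between consecutive windows (as in the Hugging Face tokenizer convention, an assumption on our part), a minimal sketch looks like this (the `max_length=8` value is a toy choice for illustration):

```python
# Sketch of windowed chunking with overlap: a long token sequence is split
# into windows of at most `max_length` tokens, where consecutive windows
# overlap by `stride` tokens so no context is lost at window boundaries.
# NOTE: interpreting `stride` as the overlap size is our assumption.
def window(tokens, max_length, stride):
    step = max_length - stride
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
    return windows

# 20 tokens, windows of 8 with a 2-token overlap -> 3 windows.
print(window(list(range(20)), max_length=8, stride=2))
```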

bionemo-recipes/recipes/opengenome2_llama_native_te/README.md

Lines changed: 9 additions & 5 deletions
@@ -50,8 +50,9 @@ Spike-No-More embeddings, scaled initialization of output projections (proj/fc2)
 with FP32 master weights.
 
 However, this recipe uses THD sequence packing for training, whereas the Megatron baseline uses a standard BSHD dataloader.
-On the metagenome dataset, the median sequence length is ~2.2k and the average is ~4k, so with THD we
-process roughly 2–3× more tokens per training step (less padding waste). As a result, this recipe
+In the metagenome dataset, the median sequence length is ~2.2k and the average is ~4k, so with THD we
+process roughly 2–3× more tokens per training step (less padding waste). See
+[Dataset and tokenization](DATASET.md) for more details on the data pipeline. As a result, this recipe
 achieves significantly better convergence [TODO: add %] than the Megatron baseline at a matched global batch size.
 Both runs use FP32 master weights; the Megatron baseline uses FP8 training and we use BF16. Reported
 results use GBS 384 on 6× H100 nodes (48 GPUs). Note that we also use bf16/fp32 training while the Megatron baseline uses fp8/fp32 training
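The "roughly 2–3× more tokens per training step" figure above follows from simple padding arithmetic. A sketch under one stated assumption: the `max_seq_len=8192` context length is illustrative and not taken from this diff; only the ~4k average sequence length comes from the README text.

```python
# Rough padding arithmetic behind the "2-3x more tokens per step" claim.
# In a BSHD layout each sequence occupies a full max-length row, so the rest
# of the row is padding; with THD packing, sequences are concatenated and
# nearly every token slot holds a real token.
# NOTE: max_seq_len=8192 is an illustrative assumption, not from the diff.
max_seq_len = 8192
mean_seq_len = 4000  # ~4k average sequence length, per the README

bshd_utilization = mean_seq_len / max_seq_len   # fraction of non-pad tokens
thd_utilization = 1.0                           # packing leaves ~no padding
speedup = thd_utilization / bshd_utilization

print(f"BSHD utilization: {bshd_utilization:.0%}")
print(f"Tokens-per-step ratio (THD/BSHD): {speedup:.1f}x")
```

With a shorter assumed context the ratio shrinks, and with a longer one it grows, which is consistent with the README's hedged "roughly 2–3×".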
@@ -230,9 +231,12 @@ torchrun --nproc_per_node=4 train_fsdp2_cp.py --config-name L0_sanity_cp cp_size
 ## Downloading Pre-Training Data
 
 The default configs expect OpenGenome2-style data: either JSONL (e.g.
-`data_metagenomics_train_*.jsonl.gz`) for streaming, or a directory of pre-chunked Parquet shards for
-the globally shuffled path. Point `dataset.load_dataset_kwargs.path` to your data directory (or
-use the appropriate config). Example for pre-chunked Parquet:
+`data_metagenomics_train_*.jsonl.gz`) for streaming, or a directory of globally shuffled Parquet
+shards. For details on the data pipeline, how to reshard your data with DuckDB, and the tradeoffs
+between streaming approaches, see [Dataset and tokenization](DATASET.md).
+
+Point `dataset.load_dataset_kwargs.path` to your data directory (or use the appropriate config).
+Example for pre-chunked Parquet:
 
 ```bash
 torchrun --nproc_per_node=8 train_fsdp2.py --config-name og2_7b_thd_gqa_global_shuffle \
