|`buffer_size`| Controls local shuffle quality | Larger = better randomization, more CPU memory |
|`num_workers`| Controls data loading throughput | More workers = better I/O overlap and shuffling, but more memory (each has its own buffer) |
|`prefetch_factor`| Batches queued ahead per worker (default: 4) | Higher = absorbs shard-transition stalls, more memory |
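The `buffer_size` knob is easiest to understand from how streaming shuffles work: a fixed-size buffer fills from the stream, and each emitted example is drawn at random from it. A minimal pure-Python sketch of this reservoir-style scheme (illustrative only, not the actual HF implementation):

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Emit items from `stream` in locally shuffled order using a
    fixed-size buffer (conceptually what streaming .shuffle() does)."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # emit a random buffered element, freeing a slot for the next item
            yield buf.pop(rng.randrange(len(buf)))
    # stream exhausted: flush the remaining buffer in random order
    rng.shuffle(buf)
    yield from buf

out = list(buffered_shuffle(range(100), buffer_size=10))
# the first emitted item can only come from the first 10 stream elements
```

Because each output position can only mix items within a sliding window of `buffer_size`, a small buffer over a few large shards yields limited batch diversity — the motivation for the resharding below.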
**Step-time spikes:** When a worker finishes its shard and opens a new one, the GPU may stall
waiting for the buffer to refill. This causes occasional step-time spikes visible in WandB.
Increasing `prefetch_factor` or `buffer_size` can help absorb these stalls.
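The interplay of these knobs can be sketched with a toy `IterableDataset` (the dataset class, sizes, and values here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardStream(IterableDataset):
    """Toy stand-in for a streaming shard reader."""
    def __init__(self, n=64):
        self.n = n

    def __iter__(self):
        # split items across workers so each streams a disjoint slice,
        # mimicking workers reading from different shards
        info = get_worker_info()
        wid = info.id if info else 0
        nw = info.num_workers if info else 1
        for i in range(wid, self.n, nw):
            yield torch.tensor(i)

loader = DataLoader(
    ShardStream(),
    batch_size=8,
    num_workers=2,      # each worker keeps its own shuffle buffer (more memory)
    prefetch_factor=4,  # batches queued ahead per worker to hide shard stalls
)
batches = list(loader)
```

Note that `prefetch_factor` is per worker: this configuration keeps up to `2 × 4 = 8` batches in flight, which is what smooths over a worker pausing to open a new shard.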
## Reshuffling and resharding with [DuckDB](https://duckdb.org/)
To address the limited batch diversity, we globally shuffle all sequences and reshard into many
more Parquet files using [DuckDB](https://duckdb.org/):
<p align="center">
  <img src="assets/resharding_duckdb.png" alt="Resharding pipeline with DuckDB" width="60%" />
</p>
With multiple workers reading from different shards, the effective shuffle pool becomes much larger:
<img src="assets/hf_streaming_buffer_resharded.png" alt="HF streaming buffer with resharded dataset" width="80%" />
</p>
### Creating your own resharded dataset
Use the [DuckDB](https://duckdb.org/) command above to globally shuffle and reshard your data:
1. Install DuckDB: `pip install duckdb` (or download from [duckdb.org](https://duckdb.org/))
2. Run the DuckDB command above from the directory containing your JSONL training files
3. The output directory will contain Parquet shards (e.g. `output/data_0.parquet`, ...)
4. Update your Hydra config or override on the command line:
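For example, an override might look like the following (the `og2_7b_thd_gqa_global_shuffle` config name comes from this repo, but the entry point and data-path key below are hypothetical placeholders — check your actual config for the real key):

```shell
# hypothetical invocation; entry point and key names may differ in your setup
python train.py --config-name og2_7b_thd_gqa_global_shuffle \
    dataset.data_dir=/path/to/resharded_parquet
```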