Welcome to the NextStep dataset system! This guide will help you understand how the system unifies different sample types into training-ready batches.
- Introduction
- System Architecture
- Core Components
- Dataset Types
- Batch Output Format
- Checkpoint Resumption
- Usage Examples
- Troubleshooting
- Related Documentation
## Introduction

The `nextstep/datasets/` directory is responsible for unifying different sample types into training-ready batches. This layer handles:
- ✅ Multi-source sampling: Mix multiple data sources according to specified sampling ratios
- ✅ Batch construction: Organize samples into batches with proper padding and alignment
- ✅ Aspect ratio grouping: Group images by aspect ratio for efficient batch processing
- ✅ State management: Track and restore data sampling state for checkpoint resumption
- ✅ Timeout handling: Monitor data loading and alert on timeouts
## System Architecture

### Data Flow

```
Multiple Data Sources (ImageTextWDS, VideoInterleave, NLPITD, etc.)
        ↓
MixedDataset (weighted sampling)
        ↓
Buffers (NLPBuffer, ImageBuffer)
        ↓
BatchBucket (aspect ratio grouping)
        ↓
MixedDataloader (state tracking, timeout monitoring)
        ↓
Training-ready Batches
```
### Component Hierarchy

```
MixedDataloader
├── MixedDataset
│   ├── Multiple IndexedTarDataset instances
│   │   ├── ImageTextWDS
│   │   ├── VideoInterleave
│   │   ├── ImageEditingInterleave
│   │   └── NLPITD
│   ├── NLPBuffer (for text-only datasets)
│   ├── ImageBuffer (for image datasets)
│   └── BatchBucket (batch organization)
└── MixingStatus (state management)
```
## Core Components

### mixed_dataset.py

This file contains the core components for mixing multiple data sources and constructing batches.

#### MixedDataset

Purpose: Combines multiple data sources with weighted sampling and constructs unified batches.

Key Features:
| Feature | Description |
|---|---|
| Weighted sampling | Samples from multiple datasets according to specified ratios |
| Buffer management | Uses NLPBuffer for text data and ImageBuffer for image data |
| Batch construction | Organizes samples into batches via BatchBucket |
| Aspect ratio grouping | Groups images by aspect ratio for efficient processing |
| State tracking | Maintains state for checkpoint resumption |
Key Parameters:

| Parameter | Description |
|---|---|
| `data_info_list` | List of dataset configurations with sampling ratios |
| `batch_size` | Target batch size |
| `tokenizer` | Tokenizer for text processing |
| `hw_aspect_ratios_ids` | Mapping of aspect ratio strings to token IDs |
| `max_len` | Maximum sequence length |
| `drop_text_prob` | Probability of dropping text tokens when the sequence is too long |
Sampling Mechanism:
- Sampling ratios are normalized from the `samples` field in `data_info_list`
- Each dataset is sampled proportionally to its ratio
- Samples are buffered and then organized into batches
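The normalization step above can be sketched as follows. This is a minimal illustration that assumes each entry in `data_info_list` carries a `samples` count; the helper names are hypothetical, not the library's API:

```python
import random

def normalize_ratios(data_info_list):
    """Turn raw `samples` counts into sampling probabilities that sum to 1."""
    total = sum(info["samples"] for info in data_info_list)
    return [info["samples"] / total for info in data_info_list]

def pick_dataset(data_info_list, rng=random):
    """Choose a dataset index proportionally to its normalized ratio."""
    ratios = normalize_ratios(data_info_list)
    return rng.choices(range(len(data_info_list)), weights=ratios, k=1)[0]

data_info_list = [{"name": "a", "samples": 58_000}, {"name": "b", "samples": 100_000}]
print(normalize_ratios(data_info_list))  # ratios sum to 1.0
```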
#### MixedDataloader

Purpose: Wraps `MixedDataset` with state tracking, timeout monitoring, and checkpoint resumption support.
Key Features:
| Feature | Description |
|---|---|
| State tracking | Tracks mixing_status, dataset_status, and hw_aspect_ratio_status |
| Timeout monitoring | Monitors data loading and alerts on timeouts |
| Checkpoint resumption | Saves and restores data sampling state |
| Memory monitoring | Optional memory usage monitoring per worker |
| Data monitoring | Periodic data statistics reporting |
Persisted State:

| State | Description |
|---|---|
| `mixing_status` | Mixing ratios and dataset states per worker |
| `dataset_status` | Progress tracking for each dataset |
| `hw_aspect_ratio_status` | Statistics for different aspect ratios |
Usage:

```python
from nextstep.datasets.mixed_dataset import MixedDataset, MixedDataloader

# Create mixed dataset
mixed_dataset = MixedDataset(
    data_info_list=data_info_list,
    batch_size=4,
    tokenizer=tokenizer,
    # ... other parameters
)

# Create dataloader
dataloader = MixedDataloader(
    dataset=mixed_dataset,
    num_workers=8,
    timeout=30.0,
    data_monitor_interval=30.0,
)
```

#### NLPBuffer

Purpose: Buffers text-only (NLP) samples before batch construction.
Key Features:
- Stores `input_ids` for text samples
- Provides `enqueue()` to add samples and `dequeue()` to retrieve batches
- Maintains state for checkpoint resumption
Usage: Automatically used by MixedDataset for datasets with data_type="nlp".
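The buffer's contract can be approximated with a queue. The sketch below is illustrative only; `SimpleTextBuffer` is a hypothetical stand-in, not the actual `NLPBuffer` implementation:

```python
from collections import deque

class SimpleTextBuffer:
    """Toy stand-in for an NLP sample buffer: enqueue token sequences,
    dequeue a batch once enough samples have accumulated."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, input_ids):
        self.queue.append(input_ids)

    def dequeue(self, batch_size):
        if len(self.queue) < batch_size:
            return None  # not enough samples buffered yet
        return [self.queue.popleft() for _ in range(batch_size)]

    def state_dict(self):
        # Persist buffered samples for checkpoint resumption
        return {"queue": list(self.queue)}

    def load_state_dict(self, state):
        self.queue = deque(state["queue"])

buf = SimpleTextBuffer()
buf.enqueue([1, 2, 3])
buf.enqueue([4, 5])
print(buf.dequeue(2))  # [[1, 2, 3], [4, 5]]
```

The `state_dict()` / `load_state_dict()` pair mirrors the checkpoint-resumption pattern the real buffers follow.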
#### ImageBuffer

Purpose: Buffers image samples with aspect ratio handling before batch construction.
Key Features:
- Stores `pixel_values` and `input_ids` for image samples
- Groups images by aspect ratio
- Handles image placeholder tokens (`<image_0>`, `<image_1>`, etc.)
- Maintains state for checkpoint resumption
Usage: Automatically used by MixedDataset for datasets with image data types.
#### BatchBucket

Purpose: Organizes samples into batches grouped by aspect ratio.
Key Features:
| Feature | Description |
|---|---|
| Aspect ratio grouping | Groups samples by image aspect ratio for efficient batching |
| Batch accumulation | Accumulates samples until batch size is reached |
| Safety threshold | Prevents data accumulation beyond `batch_size * 2` |
Aspect Ratio Handling:
- Samples with the same aspect ratio are grouped together
- Mixed aspect ratios are placed in the `ANY_ASPECT_RATIO` bucket
- NLP samples can be added to any bucket with available space
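The grouping behavior can be sketched roughly as follows. `SimpleBatchBucket` is a hypothetical illustration of the bucketing idea, not the actual `BatchBucket` implementation:

```python
from collections import defaultdict

class SimpleBatchBucket:
    """Toy aspect-ratio bucketer: accumulate samples per ratio and
    emit a batch as soon as one bucket reaches batch_size."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buckets = defaultdict(list)

    def add(self, sample, aspect_ratio):
        self.buckets[aspect_ratio].append(sample)
        if len(self.buckets[aspect_ratio]) >= self.batch_size:
            # Bucket is full: emit it as a batch and reset
            batch, self.buckets[aspect_ratio] = self.buckets[aspect_ratio], []
            return batch
        return None  # keep accumulating

bucket = SimpleBatchBucket(batch_size=2)
bucket.add("img-a", "1:1")            # returns None, still accumulating
print(bucket.add("img-b", "1:1"))     # ['img-a', 'img-b']
```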
## Dataset Types

### ImageTextWDS

Purpose: Handles single image-caption pairs (text-to-image generation data).

Data Format:
- A single image file (e.g., `sample1.jpg`)
- Caption text (e.g., `sample1.txt`, or in `sample1.json`)
Key Features:
| Feature | Description |
|---|---|
| Caption handling | Supports multiple caption keys with sampling ratios |
| Image filtering | Filters by area, aspect ratio, height, width, keywords, etc. |
| Post-processing | Optional image post-processing (cropping, resizing, etc.) |
Configuration Example:

```python
{
    "cls": "nextstep.datasets.image_text_wds.ImageTextWDS",
    "data_type": "image_text_pair",
    "name": "text2image/BLIP3o-60k",
    "caption_keys": ["caption"],
    "caption_ratio": [1],
    "filter": {
        "area": [256 * 256, 1024 * 1024],
        "aspect_ratio": 6,
    },
    "samples": LargeInt("58K"),
}
```

### VideoInterleave

Purpose: Handles multiple images interleaved with text (video understanding, story data).
Data Format:
- Multiple image files (e.g., `sample1-0.jpg`, `sample1-1.jpg`, `sample1-2.jpg`)
- Caption text with image placeholders (e.g., `"<image_0>Text<image_1>More text<image_2>"`)
Key Features:
| Feature | Description |
|---|---|
| Multi-image support | Handles sequences of images with interleaved text |
| Placeholder handling | Processes <image_n> placeholders in captions |
| Story caption support | Special handling for story-style captions (title, summary, content) |
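The placeholder format shown in the data format above can be parsed with a small regex helper. `split_interleaved` is a hypothetical illustration of the idea, not the library's API:

```python
import re

# Matches <image_0>, <image_1>, ... and captures the image index
PLACEHOLDER = re.compile(r"<image_(\d+)>")

def split_interleaved(caption):
    """Split an interleaved caption into text segments and the
    image indices referenced by <image_n> placeholders."""
    segments = PLACEHOLDER.split(caption)
    # re.split with one capture group alternates text and captured indices:
    # [text, idx, text, idx, ..., text]
    texts = segments[0::2]
    image_indices = [int(i) for i in segments[1::2]]
    return texts, image_indices

texts, idxs = split_interleaved("<image_0>Text<image_1>More text<image_2>")
print(texts)  # ['', 'Text', 'More text', '']
print(idxs)   # [0, 1, 2]
```

Empty text segments (before a leading or after a trailing placeholder) are preserved, which keeps text and image positions aligned.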
Configuration Example:

```python
{
    "cls": "nextstep.datasets.video_interleave.VideoInterleave",
    "data_type": "interleave",
    "name": "video/StoryDataset",
    "caption_keys": ["caption"],
    "caption_ratio": [1],
    "samples": LargeInt("100K"),
}
```

### ImageEditingInterleave

Purpose: Handles image editing tasks with an input image, an editing instruction, and an output image.
Data Format:
- Input image (e.g., `sample1-0.jpg`)
- Instruction text (e.g., in `sample1.json`)
- Output image (e.g., `sample1-1.jpg`)
Key Features:
| Feature | Description |
|---|---|
| Input-output pairs | Handles image editing with before/after images |
| Instruction processing | Processes editing instructions in captions |
| Image filtering | Similar filtering capabilities as ImageTextWDS |
Configuration Example:

```python
{
    "cls": "nextstep.datasets.image_editing_interleave.ImageEditingInterleave",
    "data_type": "image_editing",
    "name": "editing/InstructPix2Pix",
    "caption_keys": ["caption"],
    "caption_ratio": [1],
    "samples": LargeInt("50K"),
}
```

### NLPITD

Purpose: Handles pure text data for language model pretraining.
Data Format:
- Text file (e.g., `sample1.txt`) or text in JSON (e.g., `sample1.json`)
Key Features:
| Feature | Description |
|---|---|
| Text-only processing | No image handling, pure text tokenization |
| Single caption key | Only supports one caption key (unlike image datasets) |
Configuration Example:

```python
{
    "cls": "nextstep.datasets.nlp_itd.NLPITD",
    "data_type": "nlp",
    "name": "nlp/CommonCrawl",
    "caption_keys": ["text"],
    "samples": LargeInt("1M"),
}
```

## Batch Output Format

All dataset classes output batches in a unified format for the training loop.
| Field | Type | Description |
|---|---|---|
| `input_ids` | `torch.LongTensor` | Token IDs for the input sequence |
| `attention_mask` | `torch.LongTensor` | Attention mask (1 for valid tokens, 0 for padding) |
| `labels` | `torch.LongTensor` | Labels for loss computation (same as `input_ids`, with `IGNORE_INDEX` for padding) |
| `pixel_values` | `torch.Tensor` | Image pixel values (shape: `[batch, channels, height, width]`) |
| `image_filtered_idx` | `list[int]` | Indices of filtered images (for statistics) |
| `waste_token_num` | `list[int]` | Number of wasted tokens due to sequence length limits |
```python
# Shapes shown for illustration
batch = {
    "input_ids": torch.LongTensor([batch_size, seq_len]),
    "attention_mask": torch.LongTensor([batch_size, seq_len]),
    "labels": torch.LongTensor([batch_size, seq_len]),
    "pixel_values": torch.Tensor([batch_size, num_images, channels, height, width]),
    "image_filtered_idx": [int, ...],  # List of filtered image indices
    "waste_token_num": [int, ...],     # List of wasted token counts per sample
}
```

| Token | Description |
|---|---|
| `<image_0>`, `<image_1>`, ... | Image placeholder tokens (replaced with image embeddings during training) |
| `<BOI>`, `<EOI>` | Beginning/End of Image tokens |
| `<EOL>` | End of Line token (optional) |
> 💡 Note: The `labels` field uses `IGNORE_INDEX` for padding tokens, which are ignored during loss computation.
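The masking rule behind this note can be shown with a small list-based sketch. `IGNORE_INDEX` is assumed here to be -100 (the value PyTorch's `CrossEntropyLoss` ignores by default); the library's actual constant may differ:

```python
IGNORE_INDEX = -100  # assumed value; CrossEntropyLoss's default ignore_index

def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padding positions with
    IGNORE_INDEX so they contribute nothing to the loss."""
    return [tok if mask == 1 else IGNORE_INDEX
            for tok, mask in zip(input_ids, attention_mask)]

print(make_labels([5, 6, 7, 0], [1, 1, 1, 0]))  # [5, 6, 7, -100]
```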
## Checkpoint Resumption

The MixedDataloader supports full checkpoint resumption, restoring not only model state but also data sampling state.

MixedDataloader persists three types of state:

### mixing_status

Tracks the mixing state across all workers:
```python
{
    "num_workers": int,
    "data_status": {
        "dataset_name": {
            "worker_id": int,  # Number of samples processed
        }
    },
    "dataset_state_dict": {
        "worker_id": {
            "rng_state": ...,
            "buffer_state": ...,
            # ... other dataset state
        }
    },
    "last_worker_id": int,
}
```

### dataset_status

Tracks progress for each dataset:
```python
{
    "dataset_name": {
        "hit": int,    # Successfully processed samples
        "miss": int,   # Skipped samples
        "total": int,  # Total samples in the dataset
    }
}
```

### hw_aspect_ratio_status

Tracks statistics for different aspect ratios:
```python
{
    "1:1": int,  # Number of 1:1 aspect ratio images
    "4:3": int,  # Number of 4:3 aspect ratio images
    # ... other aspect ratios
}
```

### Resumption Flow

1. Save State: `MixedDataloader` saves state periodically or at checkpoint save
2. Load State: On resumption, state is loaded from the checkpoint
3. Restore Dataset State: Each `IndexedTarDataset` restores its `DataStatus`
4. Restore Buffer State: `NLPBuffer` and `ImageBuffer` restore their internal state
5. Continue Sampling: Sampling continues from the restored state
> ⚠️ Important: When resuming training, ensure that:
> - The same number of workers is used (otherwise the state will be invalid)
> - Dataset configurations haven't changed
> - The checkpoint contains the data state (saved by `MixedDataloader`)
## Usage Examples

### Basic Usage

```python
from nextstep.datasets.mixed_dataset import MixedDataset, MixedDataloader
from transformers import AutoTokenizer

# Prepare data configuration
data_info_list = [
    {
        "cls": "nextstep.datasets.image_text_wds.ImageTextWDS",
        "data_type": "image_text_pair",
        "name": "text2image/BLIP3o-60k",
        "caption_keys": ["caption"],
        "caption_ratio": [1],
        "samples": LargeInt("58K"),
    },
    {
        "cls": "nextstep.datasets.nlp_itd.NLPITD",
        "data_type": "nlp",
        "name": "nlp/CommonCrawl",
        "caption_keys": ["text"],
        "samples": LargeInt("100K"),
    },
]

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer")

# Create mixed dataset
mixed_dataset = MixedDataset(
    data_info_list=data_info_list,
    batch_size=4,
    tokenizer=tokenizer,
    down_factor=8,  # VAE downsampling factor
    hw_aspect_ratios_ids={
        "1:1": [100, 101],
        "4:3": [102, 103],
        # ... more aspect ratios
    },
    max_len=1280,
)

# Create dataloader
dataloader = MixedDataloader(
    dataset=mixed_dataset,
    num_workers=8,
    timeout=30.0,
)

# Use in training loop
for batch in dataloader:
    # batch contains: input_ids, attention_mask, labels, pixel_values, etc.
    loss = model(**batch)
    # ... training code
```

### Checkpoint Save and Resume

```python
# Save checkpoint (including data state)
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "data_state": dataloader.mixing_status.state_dict(),
}

# Load checkpoint
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])

# Restore data state
dataloader.mixing_status.load_state_dict(checkpoint["data_state"])

# Continue training
for batch in dataloader:
    # ... training continues from the restored state
    ...
```

## Troubleshooting

### Data Loading Timeout

Problem: Data loading is slow or batches are delayed.
Solutions:
- Increase `num_workers` (but do not exceed CPU cores / number of GPUs)
- Check whether data is on slow storage (consider using a local cache)
- Reduce `batch_size` if memory is constrained
- Check the `timeout` parameter; increase it if data loading takes longer

### Out-of-Memory Errors

Problem: Out-of-memory errors occur during batch construction.

Solutions:
- Reduce `batch_size`
- Reduce `max_len` to decrease sequence length
- Reduce `lru_size` in dataset initialization
- Check whether the `BatchBucket.max_accumulated_samples` threshold is too high

### Mixed Aspect Ratios in a Batch

Problem: Images with different aspect ratios appear in the same batch.
Solutions:
- This is expected behavior: mixed aspect ratios go to the `ANY_ASPECT_RATIO` bucket
- Ensure `hw_aspect_ratios_ids` includes all needed aspect ratios
- Check that `BatchBucket` is grouping by aspect ratio correctly

### Data State Not Restored

Problem: Data state is not restored correctly after resumption.
Solutions:
- Ensure the same `num_workers` is used
- Check that the checkpoint contains `data_state`
- Verify that dataset configurations haven't changed
- Check that `mixing_status.state_dict()` is being saved correctly

### Skipped Samples

Problem: Some samples are being skipped.
Solutions:
- Check `dataset_status` for "miss" counts
- Review filtering conditions in the dataset configuration
- Check logs for specific error messages
- Verify data format matches expected schema
## Related Documentation

- Data Loading System: `nextstep/data/README.md` - How tar files are indexed and loaded
- Configuration System: `configs/README.md` - How to configure datasets for training
- Training Engine: `nextstep/engine/train_nextstep_ds.py` - Training script that uses the datasets
- Base Dataset Class: `nextstep/data/indexed_tar_dataset.py` - Base class for all datasets
## Summary

Core concepts of the dataset system:

- MixedDataset: Combines multiple data sources with weighted sampling
- Buffers: `NLPBuffer` and `ImageBuffer` temporarily store samples before batching
- BatchBucket: Organizes samples into batches grouped by aspect ratio
- MixedDataloader: Wraps the dataset with state tracking and checkpoint resumption
- Unified Output: All datasets output batches in the same format for training
The system is designed for efficiency and flexibility, supporting multiple data types, aspect ratio grouping, and full state restoration for seamless checkpoint resumption.