Welcome to the NextStep data loading system! This guide will help you understand how the system efficiently reads tar-format data and builds indexes.
- Introduction
- Core Components
- Key Files
- Common Workflows
- Extension Guide
- Data Flow Overview
- Troubleshooting
- Related Documentation
The nextstep/data/ directory is the core foundation layer of the data loading system, responsible for efficiently reading tar-format data and building indexes. This layer addresses the following key challenges:
- ✅ Fast sample location: Quickly locate specific samples in tar files without decompressing the entire archive
- ✅ Efficient index caching: Cache index information to avoid repeated parsing
- ✅ Distributed training support: Handle data sharding and worker management for multi-GPU training
- ✅ Unified data handling: Process different data field formats (images, text, JSON, etc.) uniformly
The data loading system consists of three main layers:
┌─────────────────────────────────────┐
│ IndexedTarDataset (Base Class) │ ← Distributed training support
│ - Worker management │
│ - Data sharding │
│ - State recovery │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ IndexedTarSamples │ ← Sample organization
│ - Group files by key │
│ - Return sample dictionaries │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ IndexedTar │ ← Core indexing
│ - Parse tar headers │
│ - Build and cache indexes │
│ - Support mmap/memory backends │
└─────────────────────────────────────┘
Purpose: The foundation of the entire data loading system, responsible for direct tar file operations.
Key Features:
| Feature | Description |
|---|---|
| Parse tar headers | Read tar file headers to get file positions and sizes |
| Build indexes | Create mappings from filenames to positions (by_name) and from indices to positions (by_index) |
| Cache indexes | Save index information as .index files for fast loading on subsequent use |
| Dual backends | Support both mmap (memory-mapped, low memory) and memory (full file in memory) backends |
Key Classes:
| Class | Description |
|---|---|
| `IndexedTar` | Basic tar file indexing class; provides access by index or filename |
| `IndexedTarSamples` | Extends `IndexedTar` to organize multiple files sharing the same key (e.g., `sample1.jpg` and `sample1.txt`) into a sample dictionary |
Backend Selection:
- `mmap`: Uses memory mapping; suitable for local files, with low memory usage
- `memory`: Loads the entire file into memory; suitable for remote files (e.g., S3) or small files
Usage Scenario: Use this module when you need to quickly read specific samples from tar files.
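The core indexing idea can be sketched with nothing but the standard library: scan the tar headers once, record each member's offset and size, then read members directly by offset (a conceptual sketch, not the actual `IndexedTar` implementation):

```python
import io
import tarfile


def build_index(tar_bytes: bytes) -> dict:
    """One pass over the tar headers: map member name -> (offset, size)."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        for member in tf:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_bytes: bytes, index: dict, name: str) -> bytes:
    """Random access: slice the member out directly, no re-scan needed."""
    offset, size = index[name]
    return tar_bytes[offset:offset + size]
```

Persisting this mapping to disk is exactly what the `.index` files described above do, so subsequent opens skip the header scan entirely.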
Purpose: Provides an iterable dataset base class for distributed training scenarios, handling complex multi-process and multi-GPU logic.
Key Features:
| Feature | Description |
|---|---|
| Iterable dataset | Inherits from torch.utils.data.IterableDataset, supports streaming reads for large datasets |
| Distributed support | Automatically handles rank (current GPU ID) and world_size (total GPU count) |
| Sharding modes | Supports two sharding modes: by sample or by URL (tar file) |
| Worker management | Manages worker state during multi-process data loading to ensure each worker reads different data |
| State recovery | Supports checkpoint resumption via DataStatus class, tracking samples processed by each worker |
| Unified output | All subclasses return RetSample objects containing sample data, metadata, worker info, etc. |
Key Classes:
| Class | Description |
|---|---|
| `IndexedTarDataset` | Abstract base class; subclasses must implement the `data_generator()` method |
| `RetSample` | Unified data return structure containing sample data, index info, missing-sample statistics, etc. |
| `DataStatus` | Used for recording and recovering training state |
Sharding Modes:
- Sample mode: Split samples across workers (used when there are fewer URLs than `world_size * 10`)
- URL mode: Split tar files across workers (used when there are many URLs; reduces file lock contention)
Usage Scenario: Inherit this base class when implementing a dataset that supports distributed training.
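The mode-selection heuristic and the URL-mode split described above can be sketched as follows (function names are illustrative, not the actual API):

```python
def choose_sharding_mode(num_urls: int, world_size: int) -> str:
    """Sample mode when there are fewer URLs than world_size * 10,
    URL mode otherwise (the heuristic described above)."""
    return "sample" if num_urls < world_size * 10 else "url"


def shard_urls(urls: list, rank: int, world_size: int) -> list:
    """URL mode: each rank takes every world_size-th tar file, so no two
    ranks ever open the same archive (reducing file lock contention)."""
    return urls[rank::world_size]
```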
Purpose: Automatically decodes different data field formats based on file extensions.
Key Features:
- Extension-based decoding: Automatically selects the corresponding decoder function based on file extension (e.g., `.jpg`, `.txt`, `.json`)
Supported Data Formats:
| Format | Extension | Decoded Output |
|---|---|---|
| Text | `.txt` | UTF-8 string |
| JSON | `.json` | Python dictionary |
| Images | `.jpg`, `.jpeg`, `.png`, `.webp` | PIL Image object (converted to RGB) |
| NumPy | `.npy`, `.npz` | NumPy array |
| Pickle | `.pickle` | Python object |
| Video | `.mp4` | Raw bytes |
Extension Method: To support new data formats, add new extensions and corresponding decoder functions to the default_handlers dictionary.
Usage Scenario: IndexedTarSamples automatically uses these decoders when reading samples. You typically don't need to call them directly.
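The dispatch mechanism amounts to a dictionary lookup keyed on the file extension; a minimal sketch (the real `default_handlers` covers all the formats in the table above):

```python
import json

# Two of the handlers from the table above; the real dictionary also
# covers images, NumPy, pickle, and video.
default_handlers = {
    ".txt": lambda data: data.decode("utf-8"),
    ".json": lambda data: json.loads(data),
}


def decode_field(filename: str, data: bytes):
    """Pick the decoder based on the file extension, as described above."""
    ext = "." + filename.rsplit(".", 1)[-1]
    return default_handlers[ext](data)
```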
Purpose: CLI tool to batch scan tar files and generate metadata JSON files.
Key Features:
| Feature | Description |
|---|---|
| Scan tar files | Recursively scan all .tar files in the specified directory |
| Extract metadata | For each tar file, extract: num_samples, size (bytes), checksum |
| Generate metadata file | Aggregate metadata for all tar files into a JSON file |
| Parallel processing | Supports multi-process parallel processing to speed up scanning |
Output Files:
| File | Description |
|---|---|
| `meta.json` | Contains relative paths and metadata for all tar files |
| `meta_stats.json` | Overall dataset statistics (total samples, total files, etc.) |
Usage:
```shell
# Basic usage (meta.json will be saved in the dataset root directory)
gen_meta /path/to/dataset/root_dir

# Specify the save path
gen_meta /path/to/dataset/root_dir --meta_save_path /path/to/meta.json

# Specify the number of parallel jobs
gen_meta /path/to/dataset/root_dir --meta_save_path /path/to/meta.json --n_jobs 32
```
Parameters:
| Parameter | Description |
|---|---|
| `dir` | Root directory where tar files are located (required) |
| `--meta_save_path` | Save path for `meta.json` (default: `{dir}/meta.json`) |
| `--n_jobs` | Number of parallel jobs (default: 64; set to 0 to auto-detect CPU cores) |
⚠️ Important: This tool must be run before using any dataset for the first time. The data loading system depends on the generated metadata file.
Purpose: CLI tool to pre-build indexes (.index files) for tar files, avoiding delays during the first read in training.
Key Features:
| Feature | Description |
|---|---|
| Match datasets | Match datasets to warm up based on name patterns (supports wildcards) |
| Check indexes | Check which tar files don't have .index index files yet |
| Batch index building | Build and cache indexes for all matched tar files in parallel |
| Parallel processing | Supports multi-process parallel processing |
Why Warm Up?:
- First read of a tar file requires parsing the entire file to build an index, which can be time-consuming
- After warming up, indexes are cached as `.index` files, and subsequent reads load the index directly, which is much faster
- Warming up before training starts avoids delays during training
Usage:
```shell
# Warm up matching datasets (supports wildcards; separate multiple datasets with commas)
warmup_data "text2image/*,interleave/*" --n_jobs 32
```
Parameters:
| Parameter | Description |
|---|---|
| `names` | Dataset name patterns (supports wildcards, comma-separated) |
| `--n_jobs` | Number of parallel jobs (default: 0, auto-detect CPU cores) |
💡 Tip: Especially useful when reading data from remote storage like S3, where first-read latency can be significant.
Purpose: Example script demonstrating how to pack raw data into WebDataset format (tar files).
Key Features:
| Feature | Description |
|---|---|
| Data format conversion | Convert JSON-format raw data to WebDataset standard format |
| Image processing | Convert image paths to image byte data |
| Standardized output | Generate tar files conforming to NextStep data format requirements |
| Parallel packing | Supports multi-process parallel processing to speed up packing |
Output Format:
Each tar file contains multiple samples, and each sample contains:
- `key.json`: Contains a `caption` field (using `<image_n>` placeholders to mark image positions in the text) and optional `caption_all` and `metadata` fields
- `key-{i}.jpg`: Image files, corresponding in order to the `<image_0>`, `<image_1>`, etc. placeholders
Usage:
```python
# Modify the configuration in the script
EXAMPLE_PATH = "/path/to/your/data.json"   # Input data path
OUTPUT_DIR = "/path/to/output/tar/files"   # tar file output directory
MAX_SAMPLES_PER_TAR = 100                  # Maximum number of samples per tar file
```
```shell
# Run the script
python nextstep/data/build_wds.py
```
⚠️ Important Notes:
- This is an example script. You must modify the `load_data` and `create_example` functions according to your data source format
- The `<image_n>` placeholders in the `caption` field must correspond to the image file order
- Sample `key` cannot contain periods (`.`) or hyphens (`-`)
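The placeholder-ordering rule can be checked with a small helper before packing (a hypothetical helper, not part of the script):

```python
import re


def placeholders_match(caption: str, num_images: int) -> bool:
    """True iff the caption contains exactly <image_0> .. <image_{n-1}>,
    matching the number and order of image files in the sample."""
    found = sorted(int(n) for n in re.findall(r"<image_(\d+)>", caption))
    return found == list(range(num_images))
```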
Step 1: Pack raw data into tar format
```shell
# Reference build_wds.py and modify it according to your data format
python nextstep/data/build_wds.py
```
Step 2: Generate tar file metadata
```shell
gen_meta /path/to/dataset/root --meta_save_path /path/to/meta.json --n_jobs 32
```
Step 3: (Optional) Warm up indexes to speed up the first read
```shell
warmup_data "your_dataset_name" --n_jobs 32

# If dataset indexes haven't been built yet, warm them up in advance
warmup_data "text2image/*,interleave/*" --n_jobs 32
```
If your dataset contains new file formats (e.g., `.yaml`, `.csv`), register decoder functions in `data_handler.py`:
```python
# Add to the default_handlers dictionary in data_handler.py
import io

import pandas as pd
import yaml

default_handlers = {
    # ... existing handlers ...
    ".yaml": lambda data: yaml.safe_load(data.decode("utf-8")),
    ".csv": lambda data: pd.read_csv(io.BytesIO(data)),
}
```
To add a new dataset type:
- Create a new dataset class in the `datasets/` directory
- Inherit from the `IndexedTarDataset` base class
- Implement the `data_generator()` method, returning `RetSample` objects
- Implement the `__repr__()` method for debugging
Example Structure:
```python
from nextstep.data.indexed_tar_dataset import IndexedTarDataset, RetSample


class MyCustomDataset(IndexedTarDataset):
    def __repr__(self):
        return f"MyCustomDataset(name={self.name}, ...)"

    def data_generator(self):
        # Your data loading logic here
        for index in self.index_generator():
            # Load and process the sample
            sample = self._load_sample(index)
            yield RetSample(
                name=self.name,
                data_type=self.data_type,
                missing_indices=[],
                missing_urls_indices=[],
                missing_samples=0,
                index=index,
                url_index=self.index2indices(index),
                sample=sample,
                worker_id=self.worker_id,
                num_workers=self.num_workers,
            )
```
⚠️ Important: When the data schema changes, adapt the `datasets/` subclasses first. Do not directly modify the base class protocol.
Raw Data (JSON/images/etc.)
↓
build_wds.py (packing)
↓
WebDataset Format (tar files)
↓
gen_meta (generate metadata)
↓
warmup_data (warm up indexes, optional)
↓
IndexedTarDataset subclass (reading)
↓
Training Usage
Detailed Flow:
- Packing: Convert raw data to WebDataset tar format using `build_wds.py`
- Metadata Generation: Run `gen_meta` to generate `meta.json` with sample statistics
- Index Warming (Optional): Run `warmup_data` to pre-build `.index` files
- Data Loading: Dataset classes inherit from `IndexedTarDataset` and use the indexes to efficiently load samples
- Training: Samples are streamed to the training loop
Problem: First read of a tar file is very slow.
Solutions:
- Run `warmup_data` to pre-build indexes before training
- Ensure the `.index` files are in the same directory as the tar files
- Check whether the tar files are on slow storage (consider using a local cache)
Problem: Index cache is out of date or corrupted.
Solutions:
- Delete the `.index` files and let the system rebuild them
- Ensure the tar files haven't been modified after index creation
- Check disk space and permissions for index file creation
Problem: Out of memory when loading data.
Solutions:
- Use the `mmap` backend instead of the `memory` backend (requires local files)
- Reduce the `lru_size` parameter in dataset initialization
- Check whether multiple workers are loading the same large tar files simultaneously
Problem: Multiple workers or GPUs are reading the same samples.
Solutions:
- Ensure you're using the `IndexedTarDataset` base class correctly
- Don't manually shuffle or split data in subclasses
- Check that `rank` and `world_size` are set correctly
Problem: Some samples are skipped during training.
Solutions:
- Check `RetSample.missing_samples` to see how many samples were skipped
- Use `skip_sample()` when a sample fails to load
- Use `success_sample()` to reset the missing-sample counter
- Check the logs for specific error messages
- Dataset Classes: `nextstep/datasets/` - Concrete dataset implementations
- Data Configuration: `configs/README.md` - How to configure datasets for training
- Training Engine: `nextstep/engine/train_nextstep_ds.py` - Training script that uses datasets
- Data Zoos: `nextstep/data_zoos.py` - Dataset registry and metadata management
Core concepts of the data loading system:
- IndexedTar: Core indexing engine that parses tar files and builds position indexes
- IndexedTarSamples: Organizes files into samples based on keys
- IndexedTarDataset: Base class for distributed training datasets with worker management
- Metadata: `meta.json` files contain sample statistics for efficient data access
- Index Caching: `.index` files cache parsed indexes to avoid repeated parsing
The system is designed for efficiency and scalability, supporting both local and remote storage, with automatic handling of distributed training complexities.