A high-performance, Rust-built data processing pipeline for large-scale datasets.
DataMap is a Rust-based toolkit designed for efficient processing, filtering, and transformation of large text datasets, primarily in JSONL format. It provides a flexible, distributed architecture for data operations with high-performance parallel processing.
Key features:
- Multi-threaded processing with Rayon
- Configurable processing pipeline via JSON/YAML configuration
- Comprehensive set of data transformation operations
- High-performance parallel file processing
- Memory-efficient streaming operations
Important: This tool is designed for local file processing. We strongly recommend using i4i/i7i EC2 instances with large AWS Nitro drives for optimal performance. We have found streaming from remote data sources (e.g., S3) to be finicky and unreliable, so a local-only workflow is both more reliable and more efficient, particularly when paired with fast AWS S3 tools such as s5cmd.
DataMap provides the following operations:
**Map**: Passes data through a highly customizable processing pipeline that filters and annotates data. Every processor/filter/annotator operates on each document independently, allowing for embarrassingly parallel processing.
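As an illustration of this pattern (not DataMap's actual API), the sketch below shows how independent per-document stages can be fanned out with Rayon; the `Stage` trait and `MinLengthFilter` are hypothetical names introduced only for this example.

```rust
use rayon::prelude::*;
use serde_json::Value;

/// Hypothetical per-document stage: return None to drop the document,
/// or Some(doc) (possibly annotated) to keep it.
trait Stage: Sync {
    fn apply(&self, doc: Value) -> Option<Value>;
}

/// Example filter: drop documents whose "text" field is shorter than `min_len` characters.
struct MinLengthFilter {
    min_len: usize,
}

impl Stage for MinLengthFilter {
    fn apply(&self, doc: Value) -> Option<Value> {
        let len = doc
            .get("text")
            .and_then(|t| t.as_str())
            .map_or(0, |t| t.chars().count());
        (len >= self.min_len).then_some(doc)
    }
}

/// Each document flows through the stages independently, so the whole
/// collection can be processed in parallel with Rayon.
fn run_pipeline(docs: Vec<Value>, stages: &[Box<dyn Stage>]) -> Vec<Value> {
    docs.into_par_iter()
        .filter_map(|doc| stages.iter().try_fold(doc, |d, stage| stage.apply(d)))
        .collect()
}
```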
**Reshard**: Takes a data pool with files of uneven size and reorganizes them into files of a maximum target size (typically ~256MB before compression, the "sweet spot" for many applications). Can be configured to respect subdirectory structure.
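A minimal sketch of the resharding idea, assuming we only know each document's serialized size; this is illustrative and not DataMap's implementation.

```rust
/// Greedily pack documents (by serialized size in bytes) into shards that stay
/// at or under `max_bytes` (e.g. ~256 * 1024 * 1024), starting a new shard
/// whenever the current one would overflow. Returns document indices per shard.
fn plan_shards(doc_sizes: &[u64], max_bytes: u64) -> Vec<Vec<usize>> {
    let mut shards: Vec<Vec<usize>> = vec![Vec::new()];
    let mut current_bytes = 0u64;
    for (i, &size) in doc_sizes.iter().enumerate() {
        // Open a new shard if this document would push the current one over the cap
        // (unless the shard is empty, in which case an oversized document gets its own shard).
        if current_bytes + size > max_bytes && !shards.last().unwrap().is_empty() {
            shards.push(Vec::new());
            current_bytes = 0;
        }
        shards.last_mut().unwrap().push(i);
        current_bytes += size;
    }
    shards
}
```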
**Reservoir Sample**: Gathers statistics about data through distributed reservoir sampling. Useful for understanding data distributions before partitioning, or for quality analysis.
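For reference, the single-stream form of reservoir sampling (Algorithm R) looks roughly like the following; the distributed variant merges per-file reservoirs. This is a sketch using the `rand` crate, not DataMap's implementation.

```rust
use rand::Rng;

/// Algorithm R: keep a uniform random sample of `k` items from a stream of
/// unknown length in a single pass.
fn reservoir_sample<T>(stream: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    let mut rng = rand::thread_rng();
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, item) in stream.into_iter().enumerate() {
        if reservoir.len() < k {
            reservoir.push(item);
        } else {
            // Replace a random slot with probability k / (i + 1).
            let j = rng.gen_range(0..=i);
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}
```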
**Discrete Partition**: Partitions data into subdirectories based on a key with discrete support (i.e., a small number of categories such as language, domain, or classification labels).
**Range Partition**: Partitions data into subdirectories based on a key with continuous support. Requires either a reservoir sample or predefined range groups to partition the data effectively on a continuous signal.
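To make the continuous case concrete, here is a sketch (not DataMap's code) of routing a score to a range bucket once boundary values are known, whether those boundaries come from reservoir-sample quantiles or from predefined range groups.

```rust
/// Map a continuous score to a bucket index given sorted boundary values.
/// With boundaries [b0, b1, ..., bn], scores fall into n + 1 buckets:
/// (-inf, b0), [b0, b1), ..., [bn, +inf).
fn bucket_index(score: f64, boundaries: &[f64]) -> usize {
    boundaries.partition_point(|&b| b <= score)
}

fn main() {
    // Boundaries could be quantiles estimated from a reservoir sample.
    let boundaries = [0.25, 0.5, 0.75];
    assert_eq!(bucket_index(0.1, &boundaries), 0);
    assert_eq!(bucket_index(0.6, &boundaries), 2);
    assert_eq!(bucket_index(0.9, &boundaries), 3);
}
```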
**Group**: A highly distributed grouping operation that ensures all documents with the same "group ID" live in the same JSONL file (or a collection of JSONL files with easily identifiable names). Essential for deduplication workflows.
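The core routing idea can be sketched as hashing the group ID to an output file index so that all documents sharing an ID land together. This is illustrative only; a real implementation would need a hash that is stable across runs and machines, which `DefaultHasher` does not guarantee.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a group ID to one of `num_files` output files, so every document
/// with the same group ID lands in the same file.
fn file_for_group(group_id: &str, num_files: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    group_id.hash(&mut hasher);
    hasher.finish() % num_files
}

fn main() {
    // Two documents sharing a group ID always map to the same output file.
    assert_eq!(
        file_for_group("example.com/page", 512),
        file_for_group("example.com/page", 512)
    );
}
```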
**GroupFilter**: After data has been grouped, keeps just one document from each group. Can apply logic to select which document to keep (e.g., the first or last according to a sort key).
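A sketch of the keep-one-per-group step, with hypothetical field names, keeping the document with the smallest sort key ("first" wins):

```rust
use std::collections::HashMap;

/// One document with the fields relevant here (hypothetical names).
#[derive(Clone, Debug)]
struct Doc {
    group_id: String,
    sort_key: i64,
    text: String,
}

/// Keep exactly one document per group: the one with the smallest sort key.
/// Reverse the comparison to keep the last document instead.
fn dedup_groups(docs: Vec<Doc>) -> Vec<Doc> {
    let mut keep: HashMap<String, Doc> = HashMap::new();
    for doc in docs {
        let best = keep.entry(doc.group_id.clone()).or_insert_with(|| doc.clone());
        if doc.sort_key < best.sort_key {
            *best = doc;
        }
    }
    keep.into_values().collect()
}
```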
**Shuffle**: Coarsely shuffles data into a large collection of new files. Redistributes data across files but does not shuffle data within each individual file.
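The coarse shuffle can be pictured as routing each document to a uniformly random output file (a sketch using the `rand` crate, not DataMap's code): files end up holding a random mix of documents, but no within-file reordering is performed.

```rust
use rand::Rng;

/// Assign each of `num_docs` documents to one of `num_output_files` files at random.
/// Data is redistributed across files; order within each output file is simply
/// the order in which documents arrive.
fn coarse_shuffle_assignments(num_docs: usize, num_output_files: usize) -> Vec<usize> {
    let mut rng = rand::thread_rng();
    (0..num_docs).map(|_| rng.gen_range(0..num_output_files)).collect()
}
```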
**Count**: Counts the number of documents, file sizes, and optionally the total size of a specified text field across a dataset. Useful for dataset statistics and validation.
- Install Rust (if not already installed):
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
Visit https://www.rust-lang.org/tools/install for more options.
- Clone the repository:
```bash
git clone
cd datamap
```
- Build the project:
```bash
cargo build --release
```
The binary will be available at `target/release/datamap`.
- (Optional) Install Python dependencies for cloud utilities:
```bash
pip install boto3 click tqdm
```
- (Optional) Install s5cmd for cloud storage:
```bash
# For Linux systems
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xvzf s5cmd_2.2.2_Linux-64bit.tar.gz
sudo mv s5cmd /usr/local/bin
```
- (Optional) Run tests:
```bash
mkdir -p ft_classifiers
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O ft_classifiers/lid176.bin
cargo test
```
- Parallelism: DataMap processes files in parallel using all available CPU cores by default
- Thread Control: Use `--threads N` to limit parallelism (useful for memory-constrained environments)
- Memory Usage: Scales with the number of parallel files being processed. Large documents may require additional memory
- Sequential Processing: Documents are processed sequentially through pipeline stages to maintain consistency
- File Size: The 256MB "sweet spot" for file sizes balances parallel processing efficiency with memory usage
- Storage: Local NVMe storage (like AWS Nitro) dramatically outperforms network-attached storage
We strongly recommend using s5cmd for efficient S3 interaction. The workflow of downloading to local storage, processing, and re-uploading is more efficient and stable than operating on S3 directly.
Using s5cmd CLI directly:
```bash
s5cmd sync s3://bucket/path ./local/path
# ... process with datamap ...
s5cmd sync ./local/path s3://bucket/output
```
Using Python wrappers:
```bash
python utils/s5cmd_wrapper.py download --src s3://bucket/path --dst ./local/path [--part 0 --num-parts 4]
python utils/s5cmd_wrapper.py upload --src ./local/path --dst s3://bucket/path
```
We provide example configuration files for common data processing workflows:
- DCLM Pipeline: Configurations for the DCLM data processing flow
- All-Dressed™ Mixture: Our complete data mixture pipeline
These examples are in the configs/ directory and are nearly plug-and-play. Note: Some configurations require downloading the lid176.bin FastText language classification model. Run /configs/all_dressed/download_lid.sh to download it.
Detailed documentation for each command is available in the docs/ directory:
- Map Command - Filtering and transformation pipelines
- Reshard Command - File size normalization
- Reservoir Sample Command - Statistical sampling
- Discrete Partition Command - Categorical partitioning
- Range Partition Command - Continuous value partitioning
- Group Command - Document grouping
- GroupFilter Command - Group-based deduplication
- Shuffle Command - Data shuffling
- Count Command - Dataset statistics
