A high-performance CLI tool to import/export data from Azure Cosmos DB (SQL API) using JSON/JSONL format. Designed for speed, scalability, and extreme memory efficiency, it features real-time Request Units (RU/s) monitoring, multi-core processing, and streaming parsers.
- Multi-Process Parallelism: Leverages multiple CPU cores via `pathos.multiprocessing` to process multiple containers or files simultaneously.
- Hybrid Architecture: Combines multi-processing with `asyncio` for non-blocking I/O, saturating the available network bandwidth and Cosmos DB RU/s.
- Extreme Memory Efficiency: Uses `ijson` for streaming JSON parsing, handling massive datasets with constant, low RAM usage (a few hundred MB).
- Smart Export Splitting: Automatically splits large container exports into multiple files (default 20 GB) to enable high-speed parallel importing later.
- Indexing Policy Optimization: Automatically disables indexing (`indexingMode: none`) during import to maximize throughput and reduce RU consumption, then restores the original policy upon completion.
- Streaming Shuffle: Uses a "sliding window" shuffle buffer during import to distribute the write load across all physical partitions of Cosmos DB, preventing hotspots (see the sketch after this list).
- Real-time Statistics: Monitors total RU consumed and calculates live RU/s rates.
- Format Flexibility: Supports both standard JSON (arrays) and JSON Lines (`.jsonl`) for import and export.
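The sliding-window shuffle can be pictured with the minimal Python sketch below. It is purely illustrative and not the tool's actual code; the function name and the default `buffer_size` are assumptions made for the example.

```python
import random
from typing import Iterable, Iterator


def sliding_window_shuffle(items: Iterable[dict], buffer_size: int = 10_000) -> Iterator[dict]:
    """Yield items in a partially randomized order using a bounded buffer.

    Only `buffer_size` documents are held in memory at a time, so memory stays
    constant while writes no longer arrive in the (often partition-key-sorted)
    order of the export file.
    """
    buffer: list[dict] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and pop it (O(1) removal).
            i = random.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    # Drain whatever is left in random order.
    random.shuffle(buffer)
    yield from buffer
```

Because documents are drawn from a bounded window rather than a fully shuffled dataset, randomization costs O(1) per item and works on exports far larger than RAM.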
- Python 3.14+
- PDM (Package Manager) or PIP 25+
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd cosmos-dumper
  ```

- Install dependencies with PDM:

  ```bash
  pdm install
  ```

- Or install with pip:

  ```bash
  pip install .
  ```
Create a `.env` file in the project root:

```bash
cp .env.example .env
```

Available variables:

- `COSMOS_EXPORT_URL`: Your Cosmos DB endpoint.
- `COSMOS_EXPORT_KEY`: Primary access key.
- `COSMOS_EXPORT_DB_NAME`: Database name.
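For reference, a filled-in `.env` could look like the following; every value here is a placeholder, not a real endpoint or key.

```
COSMOS_EXPORT_URL=https://your-account.documents.azure.com:443/
COSMOS_EXPORT_KEY=<your-primary-key>
COSMOS_EXPORT_DB_NAME=your_db_name
```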
```bash
pdm run cosmos-dumper export [options]
```

- `--url`, `--key`, `--db`: Database credentials (or use `.env`).
- `--container`: Export only this specific container.
- `--jsonl`: Export in JSON Lines format (faster, one object per line).
- `--workers`: Number of parallel processes (default: number of CPUs).
- `--max-file-size`: Max file size in GB before splitting (default: 20).
Example:

```bash
# High-speed export of all containers to JSONL
pdm run cosmos-dumper export --jsonl --workers 8
```

To export from a MongoDB-compatible database (like Cosmos DB for MongoDB):

```bash
pdm run cosmos-dumper export --mongo --url "mongodb://your-connection-string" --db your_db_name --jsonl
```

To import data:

```bash
pdm run cosmos-dumper import --path <file_or_directory> [options]
```

- `--path`: Path to a specific file or a directory containing exported files.
- `--url`, `--key`, `--db`: Database credentials.
- `--container`: Target container name.
- `--from-container`: Filter a specific container from a directory of exports.
- `--workers`: Number of parallel files to process (default: number of CPUs).
- `--concurrency`: Number of concurrent upserts per worker (default: 200).
- `--shuffle`: Enable streaming shuffle to distribute load across partitions.
Example:

```bash
# Massive parallel import with shuffling and high concurrency
pdm run cosmos-dumper import --path ./export/my_dump --workers 4 --concurrency 300 --shuffle
```

To import into a MongoDB-compatible database:

```bash
pdm run cosmos-dumper import --mongo --url "mongodb://your-connection-string" --db your_db_name --path ./export/your_dump
```

To achieve maximum performance (GBs per minute):
- Scale RU/s: Temporarily increase your Cosmos DB container's RU/s (e.g., to 10k-50k) before starting the import; a programmatic sketch follows this list.
- Use Shuffle: Always use `--shuffle` for large datasets to avoid bottlenecking a single physical partition.
- Tune Concurrency: Increase `--concurrency` (e.g., 500+) if you have high RU/s and a fast network.
- JSONL: Prefer `--jsonl` during export for simpler streaming and slightly better performance.
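As a companion to the first tip, the snippet below is a rough sketch of scaling a container's provisioned throughput with the `azure-cosmos` Python SDK. The endpoint, key, and names are placeholders, and it assumes the container uses dedicated manual throughput (shared database throughput and autoscale need their own equivalents).

```python
from azure.cosmos import CosmosClient

# Placeholders: substitute your real endpoint, key, database, and container.
client = CosmosClient("https://your-account.documents.azure.com:443/", credential="<your-primary-key>")
container = client.get_database_client("your_db_name").get_container_client("your_container")

original = container.get_throughput()             # current provisioned throughput
print("Current RU/s:", original.offer_throughput)

container.replace_throughput(40_000)              # scale up before running the import
# ... run `cosmos-dumper import` here ...
container.replace_throughput(original.offer_throughput)  # restore the original RU/s
```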
Data is saved in `export/<database_name>_<timestamp>/`.

Files are named:

- `{container}_export.json` (for single files)
- `{container}_export_{index}.json` (for split files)
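This naming scheme makes dumps easy to enumerate for post-processing. A small illustrative snippet (the database and container names are hypothetical, and it assumes the timestamp suffix sorts lexicographically):

```python
from pathlib import Path

# Most recent export directory for the database ("your_db_name" is a placeholder).
dump_dir = max(Path("export").glob("your_db_name_*"))

# Matches both single-file and split exports of a hypothetical "orders" container.
files = sorted(dump_dir.glob("orders_export*.json"))
print(files)
```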
MIT