Cosmos DB Data Import/Export Tool

A high-performance CLI tool for importing and exporting data from Azure Cosmos DB (SQL API) in JSON or JSON Lines (JSONL) format. Designed for speed, scalability, and extreme memory efficiency, it features real-time Request Unit (RU/s) monitoring, multi-core processing, and streaming parsers.

Features

  • Multi-Process Parallelism: Leverages multiple CPU cores using pathos.multiprocessing to process multiple containers or files simultaneously.
  • Hybrid Architecture: Combines multi-processing with asyncio for non-blocking I/O, saturating available network bandwidth and Cosmos DB RU/s.
  • Extreme Memory Efficiency: Uses ijson for streaming JSON parsing. Can handle massive datasets with constant, low RAM usage (a few hundred MB).
  • Smart Export Splitting: Automatically splits large container exports into multiple files (default 20GB) to enable high-speed parallel importing later.
  • Indexing Policy Optimization: Automatically disables indexing (indexingMode: none) during import to maximize throughput and reduce RU consumption, restoring the original policy upon completion.
  • Streaming Shuffle: Includes a "sliding window" shuffle buffer during import to distribute write load across all physical partitions of Cosmos DB, preventing hotspots (a minimal sketch follows this list).
  • Real-time Statistics: Monitors total RU consumed and calculates live RU/s rates.
  • Format Flexibility: Supports both standard JSON (arrays) and JSON Lines (.jsonl) for both import and export.
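
To make the streaming shuffle concrete, here is a minimal sketch of a sliding-window shuffle buffer in Python. It is illustrative only, not the tool's actual implementation; the function name and the buffer_size default are assumptions.

import random
from typing import Iterable, Iterator

def sliding_window_shuffle(items: Iterable[dict], buffer_size: int = 10_000) -> Iterator[dict]:
    # Keep at most buffer_size items in memory; once the buffer is full,
    # emit one randomly chosen item for every new item read.
    buffer: list[dict] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            i = random.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]  # O(1) random removal
            yield buffer.pop()
    random.shuffle(buffer)  # drain the remainder
    yield from buffer

Because memory is capped at buffer_size documents, the constant-RAM property described above still holds while consecutive writes land on varied partition keys.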

Requirements

  • Python 3.14+
  • PDM (package manager) or pip 25+

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd cosmos-dumper
  2. Install dependencies with PDM:

    pdm install
  3. Or install with pip:

    pip install .

Configuration

Create a .env file in the project root:

cp .env.example .env

Available variables (an example .env follows the list):

  • COSMOS_EXPORT_URL: Your Cosmos DB endpoint.
  • COSMOS_EXPORT_KEY: Primary access key.
  • COSMOS_EXPORT_DB_NAME: Database name.
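
For example (all values are placeholders; the endpoint shown follows the standard Cosmos DB account URI format):

COSMOS_EXPORT_URL=https://<your-account>.documents.azure.com:443/
COSMOS_EXPORT_KEY=<your-primary-key>
COSMOS_EXPORT_DB_NAME=<your-database>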

Usage

1. Exporting Data

pdm run cosmos-dumper export [options]

Export Options:

  • --url, --key, --db: Database credentials (or use .env).
  • --container: Export only this specific container.
  • --jsonl: Export in JSON Lines format (faster, one object per line).
  • --workers: Number of parallel processes (default: number of CPUs).
  • --max-file-size: Max file size in GB before splitting (default: 20).

Example:

# High-speed export of all containers to JSONL
pdm run cosmos-dumper export --jsonl --workers 8

MongoDB Export:

To export from a MongoDB-compatible database (like Cosmos DB for MongoDB):

pdm run cosmos-dumper export --mongo --url "mongodb://your-connection-string" --db your_db_name --jsonl

2. Importing Data

pdm run cosmos-dumper import --path <file_or_directory> [options]

Import Options:

  • --path: Path to a specific file or a directory containing exported files.
  • --url, --key, --db: Database credentials.
  • --container: Target container name.
  • --from-container: Filter a specific container from a directory of exports.
  • --workers: Number of parallel files to process (default: number of CPUs).
  • --concurrency: Number of concurrent upserts per worker (default: 200).
  • --shuffle: Enable streaming shuffle to distribute load across partitions.

Example:

# Massive parallel import with shuffling and high concurrency
pdm run cosmos-dumper import --path ./export/my_dump --workers 4 --concurrency 300 --shuffle

MongoDB Import:

To import into a MongoDB-compatible database:

pdm run cosmos-dumper import --mongo --url "mongodb://your-connection-string" --db your_db_name --path ./export/your_dump

Performance Tips

To achieve maximum performance (GBs per minute), apply the tips below; a combined example follows the list:

  1. Scale RU/s: Temporarily increase your Cosmos DB container RU/s (e.g., to 10k-50k) before starting the import.
  2. Use Shuffle: Always use --shuffle for large datasets to avoid hitting a single physical partition bottleneck.
  3. Tune Concurrency: Increase --concurrency (e.g., 500+) if you have high RU/s and a fast network.
  4. JSONL: Prefer --jsonl during export for simpler streaming and slightly better performance.
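
Putting these tips together, a high-throughput round trip might look like the following (the dump directory name is a placeholder; scale RU/s in the Azure portal beforehand):

# Export every container as JSONL across all cores
pdm run cosmos-dumper export --jsonl --workers 8

# Re-import with shuffling and aggressive per-worker concurrency
pdm run cosmos-dumper import --path ./export/<dump_dir> --workers 8 --concurrency 500 --shuffle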

Output Structure

Data is saved in export/<database_name>_<timestamp>/. Files are named:

  • {container}_export.json (for single files)
  • {container}_export_{index}.json (for split files)

License

MIT
