This file provides guidance to programming agents when working with code in this repository.
Rust implementation of filesystem storage clients for Crawlee crawlers, with Python and Node.js bindings. The project follows the same structure as apify/impit — a Cargo workspace with a core library crate and per-language binding crates.
The Rust library implements three storage clients (FileSystemDatasetClient, FileSystemKeyValueStoreClient, FileSystemRequestQueueClient) that are byte-for-byte filesystem-compatible with the Python implementations in crawlee-py. Requests are treated as opaque JSON blobs (requiring at minimum a uniqueKey field).
# Build the entire workspace
cargo build
# Build just the core library
cargo build -p crawlee-storage
# Run all Rust tests
cargo test
# Run tests for just the core library
cargo test -p crawlee-storage
# Run a single test
cargo test -p crawlee-storage -- test_name
# Run tests with output shown
cargo test -p crawlee-storage -- --nocapture
# Build Python bindings (requires maturin)
cd crawlee-storage-python && maturin develop --release
# Build Python bindings in debug mode (faster compile)
cd crawlee-storage-python && maturin develop
# Build Node.js bindings (requires @napi-rs/cli)
cd crawlee-storage-node && npm install && npm run build
# Run Node.js tests
cd crawlee-storage-node && npm test
# Lint Node.js code (type-aware, via oxlint + tsgolint)
cd crawlee-storage-node && npm run lint
# Format Node.js code (via oxfmt)
cd crawlee-storage-node && npm run fmt
# Check Node.js formatting without writing
cd crawlee-storage-node && npm run fmt:check
- Rust edition: 2021
- Rust formatting: cargo fmt (default rustfmt settings)
- Rust linting: cargo clippy
- Node.js linting: oxlint with type-aware linting via tsgolint (config in .oxlintrc.json)
- Node.js formatting: oxfmt (config in .oxfmtrc.json; 4-space indent, single quotes, trailing commas)
- Commit format: Conventional Commits (feat:, fix:, docs:, refactor:, test:, etc.)
crawlee-storage/ Core Rust library (no FFI dependencies)
├── src/
│ ├── lib.rs Module root
│ ├── models.rs Shared data models (metadata, responses, queue state)
│ ├── utils.rs Utilities (atomic_write, JSON formatting, hashing, encoding)
│ ├── dataset.rs FileSystemDatasetClient
│ ├── key_value_store.rs FileSystemKeyValueStoreClient
│ └── request_queue.rs FileSystemRequestQueueClient
crawlee-storage-python/ PyO3/maturin Python bindings
├── src/lib.rs PyO3 module and wrapper classes
├── python/crawlee_storage/ Pure Python package (re-exports from native module)
└── pyproject.toml maturin build config
crawlee-storage-node/ napi-rs Node.js bindings
├── src/lib.rs napi-rs module (napi v3)
├── build.rs napi-build setup
├── dts-header.d.ts Custom TypeScript interfaces (prepended to auto-generated index.d.ts)
├── index.js Auto-generated native module loader (by napi-rs CLI)
├── index.d.ts Auto-generated TypeScript declarations (by napi-rs CLI)
├── .oxlintrc.json Oxlint config (type-aware linting)
├── .oxfmtrc.json Oxfmt config (formatting)
├── tsconfig.json TypeScript config (for test compilation)
├── test/ Vitest tests (TypeScript)
└── package.json npm package config
There is no StorageClient facade or trait in Rust. The three client structs are independent and self-contained. The Python/JS side provides its own facade that instantiates these clients and handles concerns like purge_on_start and Configuration resolution.
Concurrency model: Each client uses tokio::sync::Mutex internally to protect shared state. All file I/O uses tokio::fs (async). The clients are Send + Sync and safe for concurrent use from multiple async tasks within a single process. They are NOT safe for multi-process concurrent access.
Request model: Requests are serde_json::Value objects. The Rust code only accesses uniqueKey (for dedup and file naming) and handledAt (for marking as handled). Everything else passes through opaquely.
Request queue state persistence: The FileSystemRequestQueueClient uses a private StatePersistence struct that directly opens the default FileSystemKeyValueStoreClient to persist queue state (sequence counters, in-progress/handled sets) under the key __RQ_STATE_{queue_id}. The binding layer is responsible for calling persist_state() periodically (e.g. via the framework's event system). See #12 for discussion about making this injectable.
KVS value model: KVS record values use the KvsValue enum (None, Json(Value), Text(String), Binary(Vec<u8>)) instead of serde_json::Value. This avoids base64-encoding binary data at the core level — each binding layer converts KvsValue variants directly to native types (e.g. Binary → Python bytes, Node.js Buffer).
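A minimal sketch of how such an enum can map to on-disk bytes without any base64 step. The variant names follow the description above; the to_file_bytes method, the NONE_CONTENT_TYPE constant name, and the use of a plain String inside Json (standing in for serde_json::Value to keep the sketch dependency-free) are assumptions for illustration, not the crate's actual API.

```rust
/// MIME type the KVS uses for None/null values (stored as an empty file).
pub const NONE_CONTENT_TYPE: &str = "application/x-none";

/// Stand-in for the core KvsValue enum.
#[derive(Debug, Clone, PartialEq)]
pub enum KvsValue {
    None,
    /// Assumption: the real variant holds a serde_json::Value; a
    /// pre-serialized JSON string is used here instead.
    Json(String),
    Text(String),
    Binary(Vec<u8>),
}

impl KvsValue {
    /// Hypothetical helper: the bytes that would land in the value data file.
    pub fn to_file_bytes(&self) -> Vec<u8> {
        match self {
            // None is represented by an empty file.
            KvsValue::None => Vec::new(),
            KvsValue::Json(json) => json.as_bytes().to_vec(),
            KvsValue::Text(text) => text.as_bytes().to_vec(),
            // Binary data is written as-is, with no base64 round trip.
            KvsValue::Binary(bytes) => bytes.clone(),
        }
    }
}
```

Because Binary carries raw bytes, a binding layer can hand the Vec<u8> straight to its native byte type rather than decoding a string first.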
{storage_dir}/
├── datasets/{name}/
│ ├── __metadata__.json
│ ├── 000000001.json (9-digit zero-padded item files)
│ └── ...
├── key_value_stores/{name}/
│ ├── __metadata__.json
│ ├── {percent_encoded_key} (value data file)
│ ├── {percent_encoded_key}.__metadata__.json (record sidecar)
│ └── ...
└── request_queues/{name}/
├── __metadata__.json
├── {sha256(uniqueKey)[:15]}.json (request files)
└── ...
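The file names in the layout above can be derived roughly as follows. This is a dependency-free sketch: the function names are hypothetical, and the encoder is hand-rolled to mirror Python's urllib.parse.quote(key, safe=''), which leaves ASCII letters, digits and the characters _ . - ~ unescaped. The percent-encoding crate's NON_ALPHANUMERIC set escapes those four characters as well, so the exact set used in utils.rs is worth checking for keys that contain them.

```rust
use std::path::{Path, PathBuf};

/// Percent-encodes every byte except ASCII letters, digits and `_ . - ~`,
/// the set that Python's `urllib.parse.quote(key, safe='')` leaves untouched.
pub fn encode_key(key: &str) -> String {
    let mut out = String::with_capacity(key.len());
    for &byte in key.as_bytes() {
        if byte.is_ascii_alphanumeric() || matches!(byte, b'_' | b'.' | b'-' | b'~') {
            out.push(byte as char);
        } else {
            // Uppercase hex, as Python's quote() produces.
            out.push_str(&format!("%{byte:02X}"));
        }
    }
    out
}

/// Path of a dataset item file: 9-digit zero-padded index plus `.json`.
pub fn dataset_item_path(storage_dir: &Path, name: &str, index: u64) -> PathBuf {
    storage_dir
        .join("datasets")
        .join(name)
        .join(format!("{index:09}.json"))
}

/// Paths of a KVS record: the value data file and its metadata sidecar.
pub fn kvs_record_paths(storage_dir: &Path, store: &str, key: &str) -> (PathBuf, PathBuf) {
    let dir = storage_dir.join("key_value_stores").join(store);
    let encoded = encode_key(key);
    let sidecar = dir.join(format!("{encoded}.__metadata__.json"));
    (dir.join(encoded), sidecar)
}
```

For example, a key such as a/b maps to the file name a%2Fb, so keys can never escape the store directory through path separators.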
These must be preserved for drop-in compatibility with the Python FileSystemStorageClient:
- JSON formatting: Pretty-printed, 2-space indent, non-ASCII preserved (ensure_ascii=False equivalent). Use serde_json::ser::PrettyFormatter::with_indent(b"  ").
- Metadata field names: snake_case in JSON (e.g., item_count, created_at), matching Python's model_dump() output.
- Datetime format: 2024-01-15T10:30:00.123456+00:00 (6 fractional digits, +00:00 suffix for UTC).
- KVS key encoding: percent_encoding::utf8_percent_encode(key, NON_ALPHANUMERIC), equivalent to Python's urllib.parse.quote(key, safe='').
- RQ filenames: sha256(unique_key_bytes).hexdigest()[:15] + ".json".
- Atomic writes: Write to temp file in same directory, then rename().
- application/x-none sentinel: KVS uses this custom MIME type for None/null values (empty file on disk).
- serde_json preserve_order feature: Enabled to maintain JSON key insertion order (matching Python dict ordering).
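The atomic-write rule can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in utils.rs: the core library uses tokio::fs and the tempfile crate, whereas this version uses blocking std::fs and a hypothetical temp-file naming scheme based on the process ID.

```rust
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `contents` to a temp file next to `path`, then renames it over `path`.
pub fn atomic_write(path: &Path, contents: &[u8]) -> io::Result<()> {
    let file_name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;

    // Hypothetical naming scheme: ".<name>.<pid>.tmp" in the same directory,
    // so the rename never crosses a filesystem boundary.
    let mut tmp_name = OsString::from(".");
    tmp_name.push(file_name);
    tmp_name.push(format!(".{}.tmp", std::process::id()));
    let tmp_path = path.with_file_name(tmp_name);

    let result = write_then_rename(&tmp_path, path, contents);
    if result.is_err() {
        // Best-effort cleanup so a failed write leaves no stray temp file.
        let _ = fs::remove_file(&tmp_path);
    }
    result
}

fn write_then_rename(tmp_path: &Path, path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp = File::create(tmp_path)?;
    tmp.write_all(contents)?;
    // Flush file data to disk before it becomes visible under `path`.
    tmp.sync_all()?;
    drop(tmp);
    // Readers see either the old file or the complete new one.
    fs::rename(tmp_path, path)
}
```

Keeping the temp file in the target's directory is what makes the final rename a same-filesystem operation. A PID-based temp name is not unique across concurrent tasks in one process, which is one reason to prefer a crate such as tempfile for real use.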
- Uses PyO3 0.28 with pyo3-async-runtimes (tokio feature) for native Python coroutines.
- Each Rust client is wrapped in Arc so it can be cloned into async blocks (standard pattern for pyo3 async methods).
- JSON data crosses the FFI boundary as Python dicts/lists, converted to/from serde_json::Value via value_to_py / py_to_value helper functions.
- KVS binary values cross the FFI boundary as Python bytes ↔ KvsValue::Binary(Vec<u8>) directly, with no base64 intermediary.
- The compiled native module is crawlee_storage._native, re-exported by crawlee_storage/__init__.py.
- Uses napi-rs v3 (napi = "3", napi-derive = "3") with async, serde-json, and napi4 features.
- build.rs calls napi_build::setup(), the standard napi-rs build script.
- index.js and index.d.ts are auto-generated by napi build (via @napi-rs/cli). Do not edit them manually.
- dts-header.d.ts contains hand-written TypeScript interfaces (DatasetMetadata, KeyValueStoreRecord, etc.) that are prepended to the auto-generated index.d.ts. This is configured via "dtsHeaderFile" in package.json's napi section.
- #[napi(ts_return_type = "...")] and #[napi(ts_args_type = "...")] annotations on Rust methods override auto-generated types to reference the header interfaces instead of any.
- camelCase convention: The core Rust library serializes with snake_case (for Python compatibility). The Node binding layer converts all object keys from snake_case to camelCase via to_camel_case_keys() before returning to JS. The dts-header.d.ts interfaces use camelCase field names accordingly.
- Each Rust client is wrapped in Arc so it can be cloned into async blocks.
- JSON data crosses the FFI boundary as serde_json::Value ↔ JS objects (via napi's serde-json feature).
- KVS binary values are received as napi::bindgen_prelude::Buffer and converted to KvsValue::Binary(Vec<u8>). On read, binary data is returned as a JSON array of byte values with a __binary__: true marker.
- Tests are TypeScript (.test.ts) using Vitest, importing directly from ../index.js.
- Linting uses oxlint with type-aware rules (via tsgolint). Formatting uses oxfmt.
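The snake_case to camelCase key conversion in the Node binding layer can be illustrated with a small standalone sketch. The helper names below are hypothetical, and the sketch works on plain key/value pairs rather than on serde_json::Value as the real to_camel_case_keys() presumably does; it only handles plain snake_case keys and does not try to preserve leading or trailing underscores.

```rust
/// Converts a plain snake_case key to camelCase, e.g. `item_count` to `itemCount`.
pub fn snake_to_camel(key: &str) -> String {
    let mut out = String::with_capacity(key.len());
    let mut upper_next = false;
    for ch in key.chars() {
        if ch == '_' {
            // Drop the underscore and capitalize whatever follows it.
            upper_next = true;
        } else if upper_next {
            out.extend(ch.to_uppercase());
            upper_next = false;
        } else {
            out.push(ch);
        }
    }
    out
}

/// Renames keys only; values pass through untouched.
pub fn camel_case_keys<V>(fields: Vec<(String, V)>) -> Vec<(String, V)> {
    fields
        .into_iter()
        .map(|(key, value)| (snake_to_camel(&key), value))
        .collect()
}
```

Doing the conversion in the binding layer keeps the on-disk JSON in snake_case, so files stay compatible with the Python implementation while JS callers see idiomatic field names.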
- crawlee-storage/src/: All core Rust implementation
- crawlee-storage-python/src/: PyO3 binding code
- crawlee-storage-python/python/: Pure Python package
- crawlee-storage-node/src/: napi-rs binding code
Core library (crawlee-storage):
- tokio: async runtime and filesystem I/O
- serde / serde_json: serialization (with preserve_order)
- chrono: datetime handling
- sha2: SHA-256 for request queue filenames
- percent-encoding: URL-encoding KVS keys
- tempfile: atomic write temp files
- thiserror: error types
- tracing: logging
- rand: random ID generation
Python bindings (crawlee-storage-python):
- pyo3: Python FFI
- pyo3-async-runtimes: native async Python coroutines via tokio
Node.js bindings (crawlee-storage-node):
- napi / napi-derive: Node.js FFI
- napi-build: build script for napi-rs
- oxlint / oxlint-tsgolint: linting (with type-aware rules)
- oxfmt: formatting
- vitest: test framework
- typescript / @types/node: TypeScript support for tests