|
1 | 1 | # atelier-data |
2 | 2 |
|
3 | | -# Overview |
| 3 | +Market data infrastructure for the **atelier-rs** trading engine. |
4 | 4 |
|
5 | | -Foundational Data Types and I/O integrations for the atelier-rs project. |
| 5 | +This crate provides everything needed to connect to cryptocurrency exchanges, |
| 6 | +normalise their heterogeneous WebSocket feeds into a common data model, |
| 7 | +synchronise events onto a uniform time grid, and persist the result to |
| 8 | +Apache Parquet files. |
6 | 9 |
|
7 | | -Core data types are: |
| 10 | +## Core Data Types |
8 | 11 |
|
9 | | -OffChain activity |
10 | | -- OrderBook |
11 | | -- PublicTrades |
12 | | -- Liquidations (When available) |
13 | | -- FundingRates (When available) |
14 | | -- OpenInterests (When available) |
| 12 | +**Off-chain activity** (market microstructure): |
15 | 13 |
|
16 | | -OnChain activity |
17 | | -- Swaps |
18 | | -- LendingRates |
| 14 | +| Type | Description | |
| 15 | +|------|-------------| |
| 16 | +| `Orderbook` | Full-depth limit order book snapshot (bid/ask levels) | |
| 17 | +| `OrderbookDelta` | Incremental order book maintained via `NormalizedDelta` updates | |
| 18 | +| `Trade` | Public trade execution (price, size, side, timestamp) | |
| 19 | +| `Liquidation` | Forced liquidation event | |
| 20 | +| `FundingRate` | Perpetual futures funding rate observation | |
| 21 | +| `OpenInterest` | Aggregate open interest snapshot | |
| 22 | + |
| 23 | +**Composed types:** |
19 | 24 |
|
20 | | -## Orderbook data |
| 25 | +| Type | Description | |
| 26 | +|------|-------------| |
| 27 | +| `MarketSnapshot` | Time-aligned bundle of all market data for one grid period | |
| 28 | +| `MarketAggregate` | 15-scalar feature vector derived from a `MarketSnapshot` | |
21 | 29 |
|
22 | | -- Snapshots and Deltas |
23 | | -- Metrics |
| 30 | +## Exchange Sources |
24 | 31 |
|
25 | | -## Sources |
| 32 | +| Source | Kind | API | Order Books | Public Trades | Liquidations | Funding Rates | Open Interest | |
| 33 | +|--------|------|-----|-------------|---------------|--------------|---------------|---------------| |
| 34 | +| Bybit | CEX | WSS | YES / YES | YES / YES | YES / YES | YES / YES | YES / YES | |
| 35 | +| Coinbase | CEX | WSS | YES / YES | YES / YES | — | — | — | |
| 36 | +| Kraken | CEX | WSS | YES / YES | YES / YES | — | — | — | |
26 | 37 |
|
27 | | -| Source | Kind | Type | API | Data | Implemented | Tests | |
28 | | -| ------------- | ------- | ------- | ------- | ------------- | ------------- | ------- | |
29 | | -| | | | | Order Books | YES | N/A | |
30 | | -| | | | | Public Trades | YES | N/A | |
31 | | -| Bybit | Markets | CEX | WSS | Liquidations | YES | N/A | |
32 | | -| | | | | Funding Rates | YES | N/A | |
33 | | -| | | | | Open Interest | YES | N/A | |
34 | | -|---------------|---------|---------|---------|---------------|---------------|---------| |
35 | | -| | | | | Order Books | N/A | N/A | |
36 | | -| | | | | Public Trades | N/A | N/A | |
37 | | -| Coinbase | Markets | CEX | WSS | Liquidations | N/A | N/A | |
38 | | -| | | | | Funding Rates | N/A | N/A | |
39 | | -| | | | | Open Interest | N/A | N/A | |
40 | | -|---------------|---------|---------|---------|---------------|---------------|---------| |
41 | | -| | | | | Order Books | N/A | N/A | |
42 | | -| | | | | Public Trades | N/A | N/A | |
43 | | -| Kraken | Markets | CEX | WSS | Liquidations | N/A | N/A | |
44 | | -| | | | | Funding Rates | N/A | N/A | |
45 | | -| | | | | Open Interest | N/A | N/A | |
| 38 | +*Format: Implemented / Tested. Dashes indicate the exchange does not expose |
| 39 | +the data type on its spot/linear WebSocket API.* |
46 | 40 |
|
| 41 | +## Workers |
47 | 42 |
|
48 | | -<br> |
| 43 | +Two worker types handle end-to-end data collection: |
49 | 44 |
|
50 | | ---- |
| 45 | +**DataWorker** — raw event ingestion without synchronisation. Connects to a |
| 46 | +live exchange WebSocket feed, decodes events, and delivers them through a |
| 47 | +pluggable `OutputSink` pipeline. Configuration is driven by a TOML manifest |
| 48 | +(`DataWorkerManifest`). Handles reconnection, backoff, health monitoring, |
| 49 | +and gap detection automatically. |
| 50 | + |
| 51 | +**MarketWorker** — synchronised market snapshots. Extends `DataWorker`'s |
| 52 | +ingestion with a `MarketSynchronizer` that bins heterogeneous events onto |
| 53 | +a uniform nanosecond grid, producing `MarketSnapshot` objects at each tick. |
| 54 | +Multiple `ClockMode` strategies are supported: `OrderbookDriven`, |
| 55 | +`TradeDriven`, `LiquidationDriven`, and `ExternalClock`. Snapshots are |
| 56 | +delivered through the same `OutputSink` pipeline and can be flushed to |
| 57 | +Parquet automatically. |
| 58 | + |
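The grid binning described for `MarketWorker` can be illustrated with a small sketch. It is not the crate's `MarketSynchronizer`; the `Event` struct and the `grid_start` and `bin_events` functions are hypothetical names introduced here, and the sketch assumes that an event belongs to the grid period whose start is its nanosecond timestamp rounded down to a multiple of the period.

```rust
use std::collections::BTreeMap;

/// Hypothetical event for illustration: a nanosecond timestamp plus a label.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub ts_ns: u64,
    pub kind: &'static str,
}

/// Start of the grid period containing `ts_ns`
/// (assumption: floor to a multiple of `period_ns`).
pub fn grid_start(ts_ns: u64, period_ns: u64) -> u64 {
    assert!(period_ns > 0, "grid period must be positive");
    ts_ns - ts_ns % period_ns
}

/// Group events by the grid period they fall into, ordered by period start.
pub fn bin_events(events: &[Event], period_ns: u64) -> BTreeMap<u64, Vec<Event>> {
    let mut bins: BTreeMap<u64, Vec<Event>> = BTreeMap::new();
    for ev in events {
        bins.entry(grid_start(ev.ts_ns, period_ns))
            .or_insert_with(Vec::new)
            .push(ev.clone());
    }
    bins
}
```

Each key of the returned map plays the role of one grid tick, and the events under it are what a snapshot for that tick would bundle. How the real synchroniser handles empty periods or late events is not covered by this sketch.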
| 59 | +## Output Sinks |
| 60 | + |
| 61 | +The `OutputSink` trait defines where worker output goes. Multiple sinks |
| 62 | +run simultaneously via `OutputSinkSet` (fan-out): |
| 63 | + |
| 64 | +| Sink | Status | Description | |
| 65 | +|------|--------|-------------| |
| 66 | +| `ChannelSink` | Working | Wraps `TopicRegistry` broadcast channels for pub/sub | |
| 67 | +| `TerminalSink` | Working | Debug/tracing terminal output | |
| 68 | +| `ParquetSink` | Working | Buffers `MarketSnapshot`s, decomposes them, and flushes them to per-data-type Parquet files | |
| 69 | + |
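The fan-out idea behind `OutputSinkSet` can be sketched as follows. This is a synchronous toy under stated assumptions, not the crate's API: `SketchSink`, `MemorySink`, and `SketchSinkSet` are hypothetical names, and the real `OutputSink` trait may well be asynchronous and carry typed payloads rather than strings.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a sink trait: anything that can receive an item.
pub trait SketchSink {
    fn deliver(&mut self, item: &str);
}

/// Toy sink that records every delivered item in shared memory.
pub struct MemorySink {
    pub store: Rc<RefCell<Vec<String>>>,
}

impl SketchSink for MemorySink {
    fn deliver(&mut self, item: &str) {
        self.store.borrow_mut().push(item.to_string());
    }
}

/// Hypothetical fan-out set: forwards each item to every registered sink.
pub struct SketchSinkSet {
    sinks: Vec<Box<dyn SketchSink>>,
}

impl SketchSinkSet {
    pub fn new() -> Self {
        Self { sinks: Vec::new() }
    }

    pub fn add(&mut self, sink: Box<dyn SketchSink>) {
        self.sinks.push(sink);
    }

    /// Deliver one item to all sinks, in registration order.
    pub fn deliver(&mut self, item: &str) {
        for sink in self.sinks.iter_mut() {
            sink.deliver(item);
        }
    }
}
```

Holding the sinks as boxed trait objects lets one set mix different sink kinds, which matches the table above, where channel, terminal, and Parquet sinks coexist.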
| 70 | +## Parquet Persistence |
| 71 | + |
| 72 | +Requires `--features parquet`. All five data types support read and write: |
| 73 | + |
| 74 | +| Data Type | Writer | Reader | |
| 75 | +|-----------|--------|--------| |
| 76 | +| Orderbooks | `write_ob_parquet` | `read_ob_parquet` | |
| 77 | +| Trades | `write_trades_parquet_timestamped` | `read_trades_parquet` | |
| 78 | +| Liquidations | `write_liquidations_parquet_timestamped` | `read_liquidations_parquet` | |
| 79 | +| Funding Rates | `write_funding_parquet_timestamped` | `read_funding_parquet` | |
| 80 | +| Open Interest | `write_oi_parquet_timestamped` | `read_oi_parquet` | |
51 | 81 |
|
52 | | -**`atelier-data`** is a member of the [atelier-rs](https://github.com/iteralabs/atelier-rs) workspace, which has other published crates: |
| 82 | +### Filename Convention |
53 | 83 |
|
54 | | -- [atelier-engine](https://crates.io/crates/atelier-engine): |
55 | | -- [atelier-quant](https://crates.io/crates/atelier-quant): |
56 | | -- [atelier-retro](https://crates.io/crates/atelier-retro): |
57 | | -- [atelier-rs](https://crates.io/crates/atelier-rs): |
| 84 | +All timestamped writers produce files following this pattern: |
58 | 85 |
|
59 | | -there are Github hosted artifacts: |
| 86 | +``` |
| 87 | +{SYMBOL}_{DATATYPE}_{MODE}_{TIMESTAMP}.parquet |
| 88 | +``` |
| 89 | + |
| 90 | +Where `MODE` is `"sync"` for grid-aligned data or `"raw"` for unprocessed |
| 91 | +captures. In symbols containing `/` (e.g. Kraken's `BTC/USDT`), the `/` is |
| 92 | +replaced with `-` in the filename (`BTC-USDT`), while the Parquet data retains the |
| 93 | +original symbol string. Examples: |
| 94 | + |
| 95 | +``` |
| 96 | +BTCUSDT_ob_sync_20260226_153000.123.parquet |
| 97 | +ETHUSDT_trades_raw_20260226_160000.456.parquet |
| 98 | +BTC-USDT_ob_sync_20260226_153000.123.parquet |
| 99 | +``` |
| 100 | + |
| 101 | +Files are organised into subdirectories per data type: `orderbooks/`, |
| 102 | +`trades/`, `liquidations/`, `fundings/`, `open_interests/`. |
| 103 | + |
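The filename convention above can be expressed as a small helper. This is a minimal sketch, not the crate's writer code: `parquet_filename` is a hypothetical function, and it takes the timestamp as an already formatted string because the exact timestamp format used by the writers is not specified here beyond the examples.

```rust
/// Build a filename following `{SYMBOL}_{DATATYPE}_{MODE}_{TIMESTAMP}.parquet`.
/// Hypothetical helper for illustration; `/` in the symbol is replaced with `-`.
pub fn parquet_filename(symbol: &str, datatype: &str, mode: &str, timestamp: &str) -> String {
    // Sanitise the symbol so it is safe as part of a file name.
    let safe_symbol = symbol.replace('/', "-");
    format!("{}_{}_{}_{}.parquet", safe_symbol, datatype, mode, timestamp)
}
```

Calling `parquet_filename("BTC/USDT", "ob", "sync", "20260226_153000.123")` yields `BTC-USDT_ob_sync_20260226_153000.123.parquet`, which matches the third example above.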
| 104 | +## Feature Flags |
| 105 | + |
| 106 | +| Flag | Effect | |
| 107 | +|------|--------| |
| 108 | +| `parquet` | Enables Apache Parquet I/O (adds `arrow` + `parquet` deps) | |
| 109 | +| `torch` | Enables `tch`-based tensor conversion in the `datasets` module | |
| 110 | + |
| 111 | +## Examples |
| 112 | + |
| 113 | +| Example | Description | Command | |
| 114 | +|---------|-------------|---------| |
| 115 | +| `run_data_worker` | Raw event ingestion via DataWorker | `cargo run -p atelier_data --example run_data_worker -- --config <path>` | |
| 116 | +| `run_market_worker` | Synchronised snapshots to Parquet via MarketWorker | `cargo run -p atelier_data --example run_market_worker --features parquet -- --config <path>` | |
| 117 | +| `read_market_worker` | Read Parquet files and print per-symbol stats | `cargo run -p atelier_data --example read_market_worker --features parquet -- --dir <path>` | |
| 118 | +| `bybit_markets` | Bybit market snapshot collection (standalone) | `cargo run -p atelier_data --example bybit_markets --features parquet -- --config <path>` | |
| 119 | +| `coinbase_markets` | Coinbase market snapshot collection | `cargo run -p atelier_data --example coinbase_markets --features parquet -- --config <path>` | |
| 120 | +| `kraken_markets` | Kraken market snapshot collection | `cargo run -p atelier_data --example kraken_markets --features parquet -- --config <path>` | |
| 121 | +| `market_load` | Load and verify most recent Parquet files | `cargo run -p atelier_data --example market_load --features parquet -- --config <path>` | |
| 122 | +| `market_fetch` | Multi-exchange raw stream collector (Bybit/Coinbase/Kraken) | `cargo run -p atelier_data --example market_fetch --features parquet` | |
| 123 | +| `multi_sync_workers` | Multi-worker manifest parser (stub) | `cargo run -p atelier_data --example multi_sync_workers -- --config <path>` | |
| 124 | + |
| 125 | +--- |
60 | 126 |
|
61 | | -- [benches](https://github.com/IteraLabs/atelier-rs/tree/main/benches): |
62 | | -- [datasets](https://github.com/IteraLabs/atelier-rs/tree/main/datasets): |
| 127 | +**`atelier-data`** is a member of the [atelier-rs](https://github.com/iteralabs/atelier-rs) workspace: |
63 | 128 |
|
64 | | -and consider this for the Development cycle: |
| 129 | +- [atelier-engine](https://crates.io/crates/atelier-engine) |
| 130 | +- [atelier-quant](https://crates.io/crates/atelier-quant) |
| 131 | +- [atelier-retro](https://crates.io/crates/atelier-retro) |
| 132 | +- [atelier-rs](https://crates.io/crates/atelier-rs) |
65 | 133 |
|
66 | | -- [examples](https://github.com/IteraLabs/atelier-rs/tree/main/examples): |
67 | | -- [tests](https://github.com/IteraLabs/atelier-rs/tree/main/tests): |
| 134 | +Development resources: |
68 | 135 |
|
| 136 | +- [examples](https://github.com/IteraLabs/atelier-rs/tree/main/atelier-data/examples) |
| 137 | +- [tests](https://github.com/IteraLabs/atelier-rs/tree/main/atelier-data/tests) |
| 138 | +- [benches](https://github.com/IteraLabs/atelier-rs/tree/main/benches) |
| 139 | +- [datasets](https://github.com/IteraLabs/atelier-rs/tree/main/datasets) |