| 1 | +# Olake S3 Source Driver |
| 2 | +Production-ready S3 source connector for Olake that ingests data directly from AWS S3 or S3-compatible storage (MinIO, LocalStack, etc.). |
| 3 | + |
| 4 | +## Highlights |
| 5 | +- **Multi-format**: CSV (plain or `.gz`), JSON (JSONL/array/object), and Parquet with schema inference. |
| 6 | +- **Incremental sync**: Tracks `_last_modified_time` per stream and processes only newer files. |
| 7 | +- **Parallel processing**: Files are downloaded in chunks and parsed by up to `max_threads` concurrent workers (default 10). |
| 8 | +- **Stateful**: Stream-level state keeps cursor information so you can resume syncs reliably. |
| 9 | + |
| 10 | +## Configuration |
| 11 | +### Required fields |
| 12 | +| Field | Type | Description | |
| 13 | +| --- | --- | --- | |
| 14 | +| `bucket_name` | string | Target S3 bucket name | |
| 15 | +| `region` | string | AWS region (e.g., `us-east-1`) | |
| 16 | +| `file_format` | string | `csv`, `json`, or `parquet` | |
| 17 | +| `path_prefix` | string | Prefix used to group files into streams | |
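A pre-flight check on the source config can catch a missing required field before the driver runs. The sketch below is illustrative only and is not the driver's own validation. `missing_required_fields` and `validate_source_config` are hypothetical helper names, and treating an empty string as missing is an assumption.

```python
# Field names and allowed formats come from the "Required fields" table.
REQUIRED_FIELDS = ("bucket_name", "region", "file_format", "path_prefix")
SUPPORTED_FORMATS = ("csv", "json", "parquet")


def missing_required_fields(config: dict) -> list[str]:
    """Return the required field names that are absent or empty (assumption)."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


def validate_source_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems = [f"missing required field: {name}" for name in missing_required_fields(config)]
    file_format = config.get("file_format")
    if file_format and file_format not in SUPPORTED_FORMATS:
        problems.append(f"unsupported file_format: {file_format}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.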
| 18 | + |
| 19 | +### Optional fields |
| 20 | +| Field | Type | Default | Description | |
| 21 | +| --- | --- | --- | --- | |
| 22 | +| `access_key_id` | string | — | Static AWS access key (pair with `secret_access_key`) | |
| 23 | +| `secret_access_key` | string | — | Static AWS secret key (pair with `access_key_id`) | |
| 24 | +| `endpoint` | string | AWS S3 | Override S3 endpoint (MinIO/LocalStack) | |
| 25 | +| `max_threads` | integer | 10 | Concurrent file download/parsing workers | |
| 26 | +| `retry_count` | integer | 3 | Retries for transient failures | |
| 27 | +| `compression` | string | — | Override auto-detected compression (`gzip` or `none`) | |
| 28 | + |
| 29 | +**Authentication note**: Omitting credentials lets the driver fall back to the AWS default credential chain (environment variables, IAM roles, instance profiles, etc.). If you provide one static credential, include the other as well. |
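The pairing rule in the note above can be expressed as a small check. This is a sketch under assumptions, not the driver's logic: `credential_mode` is a hypothetical helper, and the return values `"static"` and `"default_chain"` are labels invented for illustration.

```python
def credential_mode(config: dict) -> str:
    """Classify how a config would authenticate, based on the two static fields."""
    has_key = bool(config.get("access_key_id"))
    has_secret = bool(config.get("secret_access_key"))
    if has_key and has_secret:
        return "static"
    if not has_key and not has_secret:
        # Neither field set: the AWS default credential chain would apply.
        return "default_chain"
    # Exactly one of the pair is set, which the note above says to avoid.
    raise ValueError("access_key_id and secret_access_key must be provided together")
```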
| 30 | + |
| 31 | +### CSV-specific tuning |
| 32 | +Use the nested `csv` block to customize parsing. |
| 33 | +| Field | Type | Default | Description | |
| 34 | +| --- | --- | --- | --- | |
| 35 | +| `delimiter` | string | `","` | Field delimiter | |
| 36 | +| `has_header` | boolean | `true` | Whether the first row is a header | |
| 37 | +| `skip_rows` | integer | 0 | Rows to skip before parsing | |
| 38 | +| `quote_character` | string | `"\""` | Quote character for fields | |
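To see how the four CSV options interact, here is a minimal sketch that uses Python's standard `csv` module. It is not the driver's parser. `parse_csv` is a hypothetical function, the generated `column_N` names for headerless files are an assumption, and `skip_rows` is applied to parsed records rather than physical lines.

```python
import csv
import io


def parse_csv(text, delimiter=",", has_header=True, skip_rows=0, quote_character='"'):
    """Parse CSV text into a list of dicts, mirroring the options in the table above."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote_character)
    rows = list(reader)[skip_rows:]  # drop leading rows before looking for a header
    if not rows:
        return []
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # Assumption: synthesize positional column names when no header exists.
        header = [f"column_{i}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]
```

For example, a semicolon-delimited export with one leading comment row would be read with `parse_csv(text, delimiter=";", skip_rows=1)`.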
| 39 | + |
| 40 | +### JSON-specific tuning |
| 41 | +Use the `json` block to override parsing of JSON files. |
| 42 | +| Field | Type | Default | Description | |
| 43 | +| --- | --- | --- | --- | |
| 44 | +| `line_delimited` | boolean | `true` | When true, treat each line as a separate record; set to false for JSON arrays or single objects | |
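The difference between the two `line_delimited` settings can be sketched as follows. This is an illustration in Python rather than the driver's code. `parse_json_records` is a hypothetical name, and skipping blank lines in line-delimited input is an assumption.

```python
import json


def parse_json_records(text: str, line_delimited: bool = True) -> list:
    """Return a list of records from JSONL text, a JSON array, or a single object."""
    if line_delimited:
        # One JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    parsed = json.loads(text)
    # A top-level array yields many records; a single object yields one.
    return parsed if isinstance(parsed, list) else [parsed]
```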
| 45 | + |
| 46 | +## Commands |
| 47 | +Run the driver binaries through the repository root `build.sh` helper: |
| 48 | +```bash |
| 49 | +./build.sh driver-s3 discover --config /path/to/source.json |
| 50 | +./build.sh driver-s3 sync --config /path/to/source.json --catalog /path/to/catalog.json --destination /path/to/destination.json --state /path/to/state.json |
| 51 | +``` |
| 52 | +- `discover` generates a `streams.json` catalog describing each folder (stream) and its inferred columns; pass that file to `sync` via `--catalog`. |
| 53 | +- `sync` processes files in ~2 GB chunks, injects `_last_modified_time` as the cursor, and pushes records to the destination. |
| 54 | +- The `state.json` file records `_last_modified_time` per stream so subsequent runs only process changed files. |
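The incremental behavior described in the bullets above amounts to comparing each file's last-modified time with the stored cursor. The sketch below shows that idea in Python. It is not the driver's implementation: `select_new_files` is a hypothetical function, the object keys are made up, and the strict "newer than" comparison is an assumption.

```python
from datetime import datetime, timezone


def select_new_files(files, cursor):
    """Pick files modified after the cursor and return (keys, new_cursor).

    files: list of (object_key, last_modified) tuples.
    cursor: datetime from a previous run, or None for a first sync.
    """
    pending = [(key, modified) for key, modified in files if cursor is None or modified > cursor]
    pending.sort(key=lambda item: item[1])  # oldest first
    new_cursor = max((modified for _, modified in pending), default=cursor)
    return [key for key, _ in pending], new_cursor


# Hypothetical listing for one stream.
files = [
    ("data/json/users/a.jsonl", datetime(2025, 1, 1, tzinfo=timezone.utc)),
    ("data/json/users/b.jsonl", datetime(2025, 1, 3, tzinfo=timezone.utc)),
    ("data/json/users/c.jsonl", datetime(2025, 1, 2, tzinfo=timezone.utc)),
]
cursor = datetime(2025, 1, 1, tzinfo=timezone.utc)
keys, new_cursor = select_new_files(files, cursor)
```

When no files are newer, the cursor is returned unchanged, so repeated runs over an unchanged folder select nothing.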
| 55 | + |
| 56 | +## Example configuration snippets |
| 57 | +Use these as a starting point; substitute your bucket, path, and authentication values. |
| 58 | +### JSON stream example |
| 59 | +```json |
| 60 | +{ |
| 61 | + "bucket_name": "your-bucket", |
| 62 | + "region": "us-east-1", |
| 63 | + "path_prefix": "data/json/", |
| 64 | + "file_format": "json", |
| 65 | + "json": { "line_delimited": true }, |
| 66 | + "compression": "gzip", |
| 67 | + "max_threads": 5, |
| 68 | + "retry_count": 3 |
| 69 | +} |
| 70 | +``` |
| 71 | +### CSV stream example |
| 72 | +```json |
| 73 | +{ |
| 74 | + "bucket_name": "your-bucket", |
| 75 | + "region": "us-east-1", |
| 76 | + "path_prefix": "data/csv/", |
| 77 | + "file_format": "csv", |
| 78 | + "csv": { "has_header": true, "delimiter": "," }, |
| 79 | + "max_threads": 5, |
| 80 | + "retry_count": 3 |
| 81 | +} |
| 82 | +``` |
| 83 | +### Parquet stream example |
| 84 | +```json |
| 85 | +{ |
| 86 | + "bucket_name": "your-bucket", |
| 87 | + "region": "us-east-1", |
| 88 | + "path_prefix": "data/parquet/", |
| 89 | + "file_format": "parquet", |
| 90 | + "compression": "none", |
| 91 | + "max_threads": 5, |
| 92 | + "retry_count": 3 |
| 93 | +} |
| 94 | +``` |
| 95 | + |
| 96 | +## Catalog guidance (streams) |
| 97 | +- `selected_streams` selects which folders (streams) to sync; each entry maps to a stream name under your prefix (e.g., `users`). |
| 98 | +- Each `streams[]` entry includes the inferred schema, available sync modes (`full_refresh`, `incremental`), and the cursor field (`_last_modified_time`). |
| 99 | +- `_last_modified_time` is added to every stream so you can configure incremental syncs per folder. |
| 100 | + |
| 101 | +Example catalog structure: |
| 102 | +```json |
| 103 | +{ |
| 104 | + "selected_streams": { |
| 105 | + "data": [ |
| 106 | + { "stream_name": "users", "partition_regex": "" }, |
| 107 | + { "stream_name": "orders", "partition_regex": "" } |
| 108 | + ] |
| 109 | + }, |
| 110 | + "streams": [ |
| 111 | + { |
| 112 | + "stream": { |
| 113 | + "name": "users", |
| 114 | + "namespace": "data", |
| 115 | + "supported_sync_modes": ["full_refresh", "incremental"], |
| 116 | + "cursor_field": "_last_modified_time", |
| 117 | + "sync_mode": "incremental" |
| 118 | + } |
| 119 | + } |
| 120 | + ] |
| 121 | +} |
| 122 | +``` |
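A catalog shaped like the example above can be inspected with a few lines of Python, for instance to confirm which selected streams have a matching `streams[]` entry and which sync mode each uses. This is a sketch that assumes exactly the structure shown. `selected_stream_modes` is a hypothetical helper, not part of Olake.

```python
def selected_stream_modes(catalog: dict) -> dict:
    """Map "namespace.stream" to its sync_mode for every selected stream found in streams[]."""
    selected = {
        (namespace, entry["stream_name"])
        for namespace, entries in catalog.get("selected_streams", {}).items()
        for entry in entries
    }
    modes = {}
    for item in catalog.get("streams", []):
        stream = item["stream"]
        key = (stream.get("namespace"), stream["name"])
        if key in selected:
            modes[f"{key[0]}.{key[1]}"] = stream.get("sync_mode")
    return modes
```

Applied to the example catalog, the result contains `data.users` only, because `orders` is selected but has no entry under `streams`.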
| 123 | + |
| 124 | +## State guidance |
| 125 | +The `state.json` structure mirrors the catalog streams and records the latest `_last_modified_time` per stream. |
| 126 | +```json |
| 127 | +{ |
| 128 | + "type": "STREAM", |
| 129 | + "streams": [ |
| 130 | + { "stream": "users", "state": { "_last_modified_time": "2025-01-01T00:00:00Z", "chunks": [] } } |
| 131 | + ] |
| 132 | +} |
| 133 | +``` |
| 134 | +Pass this file via `--state` when re-running `sync` so each stream resumes from its last recorded `_last_modified_time`. |
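To check where each stream would resume, the cursors can be read from a state file shaped like the example above. This sketch assumes that structure and ISO 8601 timestamps. `stream_cursors` is a hypothetical helper.

```python
from datetime import datetime


def stream_cursors(state: dict) -> dict:
    """Return {stream_name: datetime} for every stream that has a stored cursor."""
    cursors = {}
    for entry in state.get("streams", []):
        raw = entry.get("state", {}).get("_last_modified_time")
        if raw:
            # Older Python versions reject a trailing "Z" in fromisoformat,
            # so normalize it to an explicit UTC offset first.
            cursors[entry["stream"]] = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return cursors
```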
| 135 | + |
| 136 | +Find more in the [S3 Docs](https://olake.io/docs/category/s3). |