Single Bucket Architecture

Overview

The ingestion pipeline uses a single MinIO bucket with folder prefixes to separate raw CSVs, temporary staging files, Parquet tables, and archives. Configuration is reduced to one bucket name plus a handful of prefixes, and the bucket name stays configurable for production deployments.

Bucket Layout

bucket/
├── raw/              → CSV uploads (incoming)
├── staging/          → Temporary CSV processing
├── hive/             → Parquet tables (queryable)
└── processed/        → Archived CSV + README

Configuration

Set these in .env or via docker-compose environment variables:

MINIO_BUCKET: data
RAW_PREFIX: raw/
STAGING_PREFIX: staging/
HIVE_PREFIX: hive/
PROCESSED_PREFIX: processed/
HIVE_DEFAULT_SCHEMA: tables
INGESTION_MODE: merge
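
As an illustration of how a service might read these settings, here is a minimal sketch that loads them from the environment and falls back to the values listed above. The `BucketConfig` class and `load_config` function are hypothetical names, not part of the pipeline, and the trailing-slash normalization is an assumption added for convenience.

```python
import os
from dataclasses import dataclass

# Defaults copied from the Configuration list above.
DEFAULTS = {
    "MINIO_BUCKET": "data",
    "RAW_PREFIX": "raw/",
    "STAGING_PREFIX": "staging/",
    "HIVE_PREFIX": "hive/",
    "PROCESSED_PREFIX": "processed/",
    "HIVE_DEFAULT_SCHEMA": "tables",
    "INGESTION_MODE": "merge",
}


@dataclass(frozen=True)
class BucketConfig:  # hypothetical container, for illustration only
    bucket: str
    raw_prefix: str
    staging_prefix: str
    hive_prefix: str
    processed_prefix: str
    default_schema: str
    ingestion_mode: str


def load_config(env=None) -> BucketConfig:
    """Read settings from a mapping (os.environ by default), using DEFAULTS for missing keys."""
    env = os.environ if env is None else env

    def get(name: str) -> str:
        return env.get(name, DEFAULTS[name])

    def prefix(name: str) -> str:
        # Assumption: prefixes always end with "/", so keys can be built by concatenation.
        value = get(name)
        return value if value.endswith("/") else value + "/"

    return BucketConfig(
        bucket=get("MINIO_BUCKET"),
        raw_prefix=prefix("RAW_PREFIX"),
        staging_prefix=prefix("STAGING_PREFIX"),
        hive_prefix=prefix("HIVE_PREFIX"),
        processed_prefix=prefix("PROCESSED_PREFIX"),
        default_schema=get("HIVE_DEFAULT_SCHEMA"),
        ingestion_mode=get("INGESTION_MODE"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_config({"MINIO_BUCKET": "prod-data"})`.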

Schema Naming

Schema name is derived from the first folder under raw/. If no folder exists, HIVE_DEFAULT_SCHEMA is used.

Examples:

raw/fisheries/catches/data.csv → schema fisheries, table data
raw/eu-catches/data.csv → schema tables, table eu_catches
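
The derivation can be sketched in a few lines. The function below is one possible reading that reproduces the two examples above, not the pipeline's actual code: a key with two or more folders takes its schema from the first folder and its table from the file name, while a key with a single folder uses that folder as the table name and falls back to HIVE_DEFAULT_SCHEMA. That interpretation, the hyphen-to-underscore normalization, and the name `derive_schema_and_table` are all assumptions; docs/SCHEMA_NAMING.md is the place to confirm the real rules.

```python
from pathlib import PurePosixPath

RAW_PREFIX = "raw/"
HIVE_DEFAULT_SCHEMA = "tables"


def normalize(name: str) -> str:
    # Assumption inferred from the example eu-catches -> eu_catches.
    return name.replace("-", "_")


def derive_schema_and_table(key: str) -> tuple[str, str]:
    """Return (schema, table) for an object key under RAW_PREFIX (hypothetical helper)."""
    if not key.startswith(RAW_PREFIX):
        raise ValueError(f"key is not under {RAW_PREFIX!r}: {key!r}")

    parts = PurePosixPath(key[len(RAW_PREFIX):]).parts
    if not parts:
        raise ValueError(f"key has no file name: {key!r}")

    *folders, filename = parts
    stem = PurePosixPath(filename).stem

    if len(folders) >= 2:
        # raw/<schema>/<...>/<file>.csv: schema from the first folder, table from the file.
        return normalize(folders[0]), normalize(stem)
    if len(folders) == 1:
        # raw/<table>/<file>.csv: default schema, table from the folder.
        return HIVE_DEFAULT_SCHEMA, normalize(folders[0])
    # raw/<file>.csv: default schema, table from the file.
    return HIVE_DEFAULT_SCHEMA, normalize(stem)
```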

Why This Design

Production flexibility: the bucket name is configurable through MINIO_BUCKET.
Clear separation: CSV processing (raw/, staging/, processed/) and Parquet storage (hive/) live under distinct prefixes.
Minimal IAM changes: permissions can target prefixes within one bucket.
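
To make the last point concrete, the sketch below builds a prefix-scoped policy document in the AWS-style JSON format that MinIO accepts. The helper name `prefix_policy` and the pairing of roles to prefixes (an uploader writing to raw/, a query engine reading from hive/) are hypothetical and only illustrate the idea; bucket-level listing permissions are left out to keep the example short.

```python
import json


def prefix_policy(bucket: str, prefix: str, actions: list[str]) -> dict:
    """Build a policy that allows `actions` on objects under one prefix (hypothetical helper)."""
    clean_prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Object-level resource: every key under <bucket>/<prefix>/.
                "Resource": [f"arn:aws:s3:::{bucket}/{clean_prefix}/*"],
            }
        ],
    }


# Assumed roles, for illustration only.
uploader_policy = prefix_policy("data", "raw/", ["s3:PutObject"])
reader_policy = prefix_policy("data", "hive/", ["s3:GetObject"])

if __name__ == "__main__":
    print(json.dumps(uploader_policy, indent=2))
```

Because every role points at the same bucket, moving to a differently named production bucket only changes the `bucket` argument; the prefix structure of each policy stays the same.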

Related Docs

docs/INGESTION_MODES.md
docs/SCHEMA_NAMING.md
docs/OPERATIONS.md
