Single Bucket Architecture

Overview

The ingestion pipeline stores everything in a single MinIO bucket and uses folder prefixes to separate raw CSVs, temporary staging files, Parquet tables, and archives. The whole storage configuration is one bucket name plus four prefixes, and the bucket name can be changed per deployment.

Bucket Layout

bucket/
├── raw/              → CSV uploads (incoming)
├── staging/          → Temporary CSV processing
├── hive/             → Parquet tables (queryable)
└── processed/        → Archived CSV + README
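
As a rough illustration of how the layout can be used in code, the sketch below maps an object key to the top-level prefix it falls under. The function name `stage_of` and the `STAGES` table are hypothetical helpers written for this example; they are not part of the pipeline.

```python
from typing import Optional

# Prefix -> role, copied from the layout above.
# STAGES and stage_of are illustrative names, not pipeline code.
STAGES = {
    "raw/": "CSV uploads (incoming)",
    "staging/": "temporary CSV processing",
    "hive/": "Parquet tables (queryable)",
    "processed/": "archived CSV + README",
}


def stage_of(key: str) -> Optional[str]:
    """Return the top-level prefix an object key belongs to, or None."""
    for prefix in STAGES:
        # Match on the full "name/" prefix so "rawdata/x" is not treated as raw/.
        if key.startswith(prefix):
            return prefix
    return None


print(stage_of("raw/fisheries/catches/data.csv"))
print(stage_of("hive/fisheries/part-0.parquet"))
```

Matching on the trailing slash keeps look-alike keys such as `rawdata/file.csv` out of the `raw/` stage.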

Configuration

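Set these in .env or via docker-compose environment variables. The values below use docker-compose YAML syntax (KEY: value); in a .env file, write them as KEY=value: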

MINIO_BUCKET: data
RAW_PREFIX: raw/
STAGING_PREFIX: staging/
HIVE_PREFIX: hive/
PROCESSED_PREFIX: processed/
HIVE_DEFAULT_SCHEMA: tables
INGESTION_MODE: merge
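
One way a service might read these settings is sketched below. The `StorageConfig` class and `load_config` function are hypothetical. Two things in the sketch are assumptions rather than documented behavior: it falls back to the values listed above when a variable is unset, and it normalizes each prefix to end in a slash.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StorageConfig:
    bucket: str
    raw_prefix: str
    staging_prefix: str
    hive_prefix: str
    processed_prefix: str
    default_schema: str
    ingestion_mode: str


def _prefix(value: str) -> str:
    # Assumption: normalize so "raw", "/raw" and "raw/" all become "raw/".
    return value.strip("/") + "/"


def load_config(env: Mapping[str, str] = os.environ) -> StorageConfig:
    """Build the config from environment variables.

    Falling back to the values shown in this document is an assumption
    of this sketch, not confirmed pipeline behavior.
    """
    return StorageConfig(
        bucket=env.get("MINIO_BUCKET", "data"),
        raw_prefix=_prefix(env.get("RAW_PREFIX", "raw/")),
        staging_prefix=_prefix(env.get("STAGING_PREFIX", "staging/")),
        hive_prefix=_prefix(env.get("HIVE_PREFIX", "hive/")),
        processed_prefix=_prefix(env.get("PROCESSED_PREFIX", "processed/")),
        default_schema=env.get("HIVE_DEFAULT_SCHEMA", "tables"),
        ingestion_mode=env.get("INGESTION_MODE", "merge"),
    )


# Passing a plain dict instead of os.environ makes the loader easy to test.
cfg = load_config({"MINIO_BUCKET": "lake", "RAW_PREFIX": "raw"})
print(cfg.bucket, cfg.raw_prefix, cfg.hive_prefix)
```

Keeping all prefixes in one immutable object means the rest of the code never hard-codes a folder name.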

Schema Naming

Schema name is derived from the first folder under raw/. If no folder exists, HIVE_DEFAULT_SCHEMA is used.

Examples:

  • raw/fisheries/catches/data.csv → schema fisheries, table data
  • raw/eu-catches/data.csv → schema tables, table eu_catches

Why This Design

  • Production flexibility: bucket name is configurable.
  • Clear separation: CSV processing (raw/, staging/, processed/) is kept apart from Parquet storage (hive/).
  • Minimal IAM changes: permissions can target prefixes within one bucket.
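
To illustrate the prefix-scoped permissions mentioned in the last point, the sketch below builds a read-only policy document limited to one prefix. MinIO accepts AWS IAM-style policy JSON, but the helper `read_only_policy` is hypothetical, and the exact statements a deployment needs are an assumption here. Check the result against the MinIO policy documentation before applying it.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM-style policy allowing list and read under one prefix.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    prefix = prefix.strip("/") + "/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, narrowed by the s3:prefix condition.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading objects is scoped directly by the resource ARN.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# Example: a query engine that only needs to read the Parquet tables.
print(json.dumps(read_only_policy("data", "hive/"), indent=2))
```

A writer role for raw/ would follow the same pattern with s3:PutObject in place of s3:GetObject.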

Related Docs

  • docs/INGESTION_MODES.md
  • docs/SCHEMA_NAMING.md
  • docs/OPERATIONS.md