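The ingestion pipeline uses a single MinIO bucket with folder prefixes to separate raw CSVs, temporary staging files, Parquet tables, and archives. This keeps configuration simple (one bucket, four prefixes) while remaining suitable for production, since the bucket name is configurable and permissions can be scoped per prefix.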
bucket/
├── raw/ → CSV uploads (incoming)
├── staging/ → Temporary CSV processing
├── hive/ → Parquet tables (queryable)
└── processed/ → Archived CSV + README
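As a rough illustration of how the prefixes partition the bucket, the sketch below maps an object key to the stage it belongs to. The function name `stage_of` and the `PREFIXES` table are hypothetical and not part of the pipeline. The prefix values are the ones shown in the layout.

```python
from typing import Optional

# Prefixes copied from the layout above; a deployment may configure others.
PREFIXES = {
    "raw": "raw/",
    "staging": "staging/",
    "hive": "hive/",
    "processed": "processed/",
}


def stage_of(key: str) -> Optional[str]:
    """Return the stage an object key belongs to, or None if it matches no prefix."""
    for stage, prefix in PREFIXES.items():
        if key.startswith(prefix):
            return stage
    return None
```

For example, `stage_of("raw/fisheries/catches/data.csv")` returns `"raw"`, and a key outside the four prefixes returns `None`.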
Set these in .env or via docker-compose environment variables:
MINIO_BUCKET: data
RAW_PREFIX: raw/
STAGING_PREFIX: staging/
HIVE_PREFIX: hive/
PROCESSED_PREFIX: processed/
HIVE_DEFAULT_SCHEMA: tables
INGESTION_MODE: merge
The schema name is derived from the first folder under raw/. If no folder exists, HIVE_DEFAULT_SCHEMA is used.
Examples:
- raw/fisheries/catches/data.csv → schema fisheries, table data
- raw/eu-catches/data.csv → schema tables, table eu_catches
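The schema rule can be sketched roughly as follows. This is a reading of the two examples, not the pipeline's actual code: it assumes the first folder names the schema only when a second folder level exists, and that a key with a single folder (or none) falls back to HIVE_DEFAULT_SCHEMA. The function name `resolve_schema` is hypothetical, and table naming is left out because the rule for it is not stated here; docs/SCHEMA_NAMING.md is the place to check the full rules.

```python
import os
from typing import Mapping, Optional


def resolve_schema(key: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the schema name for an object key under the raw prefix."""
    env = os.environ if env is None else env
    raw_prefix = env.get("RAW_PREFIX", "raw/")
    default_schema = env.get("HIVE_DEFAULT_SCHEMA", "tables")

    if not key.startswith(raw_prefix):
        raise ValueError(f"{key!r} is not under {raw_prefix!r}")

    # Folder segments between the raw prefix and the file name.
    folders = key[len(raw_prefix):].split("/")[:-1]

    # Assumption inferred from the examples: a lone folder is not a schema,
    # so the default schema applies unless a second folder level exists.
    if len(folders) >= 2:
        return folders[0]
    return default_schema
```

With the default settings, `resolve_schema("raw/fisheries/catches/data.csv")` returns `"fisheries"` and `resolve_schema("raw/eu-catches/data.csv")` returns `"tables"`, matching the examples.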
- Production flexibility: bucket name is configurable.
- Clear separation: CSV processing vs. Parquet storage.
- Minimal IAM changes: permissions can target prefixes within one bucket.
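To illustrate prefix-scoped permissions, the sketch below builds a policy document in the AWS IAM JSON format that MinIO accepts, granting a set of actions on one prefix only. The helper name `prefix_policy` and the choice of an upload-only client restricted to raw/ are assumptions for illustration; the project's actual policies may differ.

```python
import json
from typing import Dict, List


def prefix_policy(bucket: str, prefix: str, actions: List[str]) -> Dict:
    """Build a policy allowing `actions` on objects under one prefix of a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                # Object-level resource: everything below the prefix.
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


# Hypothetical uploader that may only write incoming CSVs.
policy = prefix_policy("data", "raw/", ["s3:PutObject"])
print(json.dumps(policy, indent=2))
```

Because every stage lives in the same bucket, moving a client from one stage to another only changes the prefix in the `Resource` entry.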
See also:
- docs/INGESTION_MODES.md
- docs/SCHEMA_NAMING.md
- docs/OPERATIONS.md