This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
SURIMI DataLab is a comprehensive data infrastructure platform for the SURIMI project (European Commission/EDITO platform). It implements a modern data lakehouse architecture for healthcare data interoperability, caregiver/patient management, and multilingual data handling.
- Storage: MinIO (S3-compatible object storage)
- Query Engine: Trino (distributed SQL)
- Catalog: Hive Metastore (table metadata)
- Metadata Management: DataHub (data discovery, lineage, ownership)
- Analytics: Superset (BI dashboards), Jupyter notebooks
- Orchestration: Airflow (workflow automation)
- Ingestion: Airflow DAG with PostgreSQL tracking
- Data Formats: CSV ingestion → Parquet storage; Shapefile ingestion → Parquet (geometry as WKT)
- Geospatial: geopandas / fiona / shapely / pyproj (installed in Airflow container)
CSV Files → MinIO (data/raw/) → Airflow DAG → Schema Validation & Conversion →
Shapefiles ↗ → Geometry → WKT string column
MinIO (data/hive/ as Parquet) → Hive Metastore → Trino → Superset/Jupyter/DataHub
↓
data/processed/ (archived originals)
# Start all services
docker-compose up -d
# View all logs
docker-compose logs -f
# View ingestion logs
docker-compose logs -f airflow-scheduler
# Stop all services
docker-compose down
# Restart after DAG changes
docker-compose restart airflow-scheduler
# Check service status
docker-compose ps
# Create required folders (run once after first startup)
docker exec -it minio bash -c "mkdir -p /bucket_data/data/raw /bucket_data/data/staging /bucket_data/data/hive /bucket_data/data/processed"
# Access MinIO console: http://localhost:9001
# Credentials: minioadmin/minioadmin
# Access PostgreSQL (for metadata and ingestion tracking)
docker exec -it postgres psql -U postgres -d datahub
# Useful queries:
# SELECT * FROM ingestion_audit ORDER BY processed_at DESC LIMIT 10;
# SELECT table_name, COUNT(*) as files FROM ingestion_audit GROUP BY table_name;
# Access Trino CLI
docker exec -it trino trino
# Or via HTTP: http://localhost:8080
# Useful Trino queries:
# SHOW CATALOGS;
# SHOW SCHEMAS FROM hive;
# SHOW TABLES FROM hive.default;
# DESCRIBE hive.default.table_name;
# SELECT * FROM hive.default.table_name LIMIT 10;
# DAG location: airflow/dags/comprehensive_csv_ingestion_dag.py
# Trigger ingestion manually
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
# View DAG status
docker exec airflow-scheduler airflow dags list
# Test specific task
docker exec airflow-scheduler airflow tasks test comprehensive_csv_ingestion scan_raw_bucket 2025-01-01
# Access Airflow Web UI
# Open http://localhost:8081
The DAG in airflow/dags/comprehensive_csv_ingestion_dag.py is the comprehensive CSV ingestion pipeline, with automatic schema detection, Parquet conversion, and DataHub integration.
Key Functions:
- scan_raw_bucket() - Scans MinIO bucket/raw/ for CSV files and Shapefile bundles (.shp + companions)
- process_csv_files() - Parses README or auto-detects schema from CSV; reads shapefile schema via geopandas
- move_to_staging() - Moves files from raw/ to staging/ folder; moves all shapefile components together
- convert_to_parquet() - Converts CSV to Parquet with MERGE/REPLACE/APPEND modes; converts shapefile geometry to WKT
- create_hive_tables() - Creates external Hive tables pointing to Parquet files
- emit_to_datahub() - Emits metadata, lineage, and statistics to DataHub; adds geospatial/shapefile tags and CRS info
- move_to_processed() - Archives files to processed/ folder; moves all shapefile components together
Ingestion Pipeline Flow:
- DAG runs on schedule (every 5 minutes by default)
- Scans bucket/raw/ for new CSV files
- Parses README.txt or auto-detects schema
- Moves CSV to bucket/staging/
- Converts to Parquet with deduplication (MERGE mode)
- Stores Parquet in bucket/hive/schema/table/
- Creates Hive external table
- Emits metadata to DataHub
- Archives CSV to bucket/processed/
Environment Variables:
- MINIO_BUCKET - Bucket name (default: data)
- RAW_PREFIX - Raw folder prefix (default: raw/)
- STAGING_PREFIX - Staging folder prefix (default: staging/)
- HIVE_PREFIX - Hive storage prefix (default: hive/)
- PROCESSED_PREFIX - Archive folder prefix (default: processed/)
- HIVE_DEFAULT_SCHEMA - Default schema name (default: tables)
- INGESTION_MODE - Ingestion mode: merge, replace, or append (default: merge)
- DATAHUB_ENABLED - Enable DataHub emission (default: true)
- DATAHUB_GMS_URL - DataHub GMS endpoint
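The variables above could be read into one settings object at DAG import time. The following is a minimal sketch, not the DAG's actual code: the IngestionConfig class and load_config() function are hypothetical names, and rejecting unknown modes is an assumption about desirable behavior. DATAHUB_GMS_URL is left out because no default is documented for it.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionConfig:
    """Hypothetical container for the documented ingestion settings."""
    bucket: str
    raw_prefix: str
    staging_prefix: str
    hive_prefix: str
    processed_prefix: str
    default_schema: str
    mode: str
    datahub_enabled: bool


def load_config(env=None):
    """Read settings from a mapping (os.environ by default) using the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("INGESTION_MODE", "merge").lower()
    # Assumption: an unknown mode should fail fast rather than fall back silently.
    if mode not in {"merge", "replace", "append"}:
        raise ValueError(f"Unsupported INGESTION_MODE: {mode!r}")
    return IngestionConfig(
        bucket=env.get("MINIO_BUCKET", "data"),
        raw_prefix=env.get("RAW_PREFIX", "raw/"),
        staging_prefix=env.get("STAGING_PREFIX", "staging/"),
        hive_prefix=env.get("HIVE_PREFIX", "hive/"),
        processed_prefix=env.get("PROCESSED_PREFIX", "processed/"),
        default_schema=env.get("HIVE_DEFAULT_SCHEMA", "tables"),
        mode=mode,
        datahub_enabled=env.get("DATAHUB_ENABLED", "true").lower() == "true",
    )
```

Passing a plain dict instead of os.environ, for example load_config({"INGESTION_MODE": "append"}), makes the defaults easy to check without touching the container environment.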
Service Dependencies:
- postgres → Foundation for all metadata
- hive-metastore → Depends on postgres
- trino → Depends on hive-metastore, minio
- datahub-gms → Depends on elasticsearch, neo4j, kafka, postgres
- datahub-frontend → Depends on datahub-gms
- airflow → Depends on postgres, minio, trino
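When debugging startup problems it can help to see one valid start order implied by these dependencies. The sketch below copies the dependency list into a dict and sorts it with the standard library's graphlib; the startup_order() helper is a hypothetical name, and docker-compose itself resolves the order from depends_on and health checks rather than from code like this.

```python
from graphlib import TopologicalSorter

# Service -> services it depends on, copied from the list above.
DEPENDENCIES = {
    "postgres": set(),
    "hive-metastore": {"postgres"},
    "trino": {"hive-metastore", "minio"},
    "datahub-gms": {"elasticsearch", "neo4j", "kafka", "postgres"},
    "datahub-frontend": {"datahub-gms"},
    "airflow": {"postgres", "minio", "trino"},
}


def startup_order(dependencies):
    """Return one order in which every service comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(" -> ".join(startup_order(DEPENDENCIES)))
```

Several orders are valid, so the exact output may differ between runs of different Python versions; only the relative positions of dependent services are guaranteed.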
Port Mappings:
- MinIO: 9000 (API), 9001 (Console)
- PostgreSQL: 5432
- Hive Metastore: 9083
- Trino: 8080
- DataHub Frontend: 3000
- DataHub GMS: 8090
- Superset: 8088
- Airflow: 8081
- Jupyter: 8888
- Elasticsearch: 9200
- Neo4j: 7474 (HTTP), 7687 (Bolt)
- Kafka: 9092
- config.properties - Trino server config
- jvm.config - JVM memory settings
- node.properties - Node identification
- catalog/hive.properties - Hive connector config (connects to hive-metastore:9083)
- core-site.xml - Hadoop configuration for S3/MinIO access
Important: Trino uses the Hive connector to query data stored in MinIO, with table metadata served by the Hive Metastore catalog.
- comprehensive_csv_ingestion_dag.py - Main ingestion pipeline with README parsing, Parquet conversion, and DataHub integration
- Logs stored in airflow/logs/
- Custom operators/hooks go in airflow/plugins/
Within the MinIO bucket (default: data):
- raw/ - Original CSV files uploaded by users
- staging/ - Temporary CSV storage during processing
- hive/schema/table/ - Parquet files organized by Hive schema and table
- processed/ - Archived CSV files after successful ingestion
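The layout above implies that a file keeps moving between prefixes as it is processed. The sketch below shows one way to compute the target key for such a move. It assumes the folder structure under raw/ is preserved in staging/ and processed/, which the document does not confirm, and move_key() is a hypothetical helper, not a function from the DAG.

```python
# Prefixes inside the bucket, using the documented defaults.
PREFIXES = {"raw": "raw/", "staging": "staging/", "processed": "processed/"}


def move_key(object_key, source, target):
    """Return the object key after moving it from one stage prefix to another."""
    src, dst = PREFIXES[source], PREFIXES[target]
    if not object_key.startswith(src):
        raise ValueError(f"{object_key!r} is not under {src!r}")
    # Assumption: everything after the stage prefix is kept unchanged.
    return dst + object_key[len(src):]


if __name__ == "__main__":
    staged = move_key("raw/fisheries/catches/2024.csv", "raw", "staging")
    print(staged)
    print(move_key(staged, "staging", "processed"))
```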
The DAG supports README.txt files in the same folder as CSV files to define schema:
Schema:
- column_name (TYPE): Description
- another_column (VARCHAR): Another description
Primary Key: column_name
The DAG's readme_parser.py module extracts metadata including:
- Column definitions with types and descriptions
- Primary keys for MERGE mode deduplication
- Table name and description
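To make the README format concrete, here is a minimal parsing sketch for the Schema and Primary Key lines shown above. It is not the code in readme_parser.py: parse_readme() is a hypothetical function, support for comma-separated composite keys is an assumption, and table name and description are skipped because their README syntax is not shown here.

```python
import re

# Matches lines such as "- column_name (TYPE): Description".
COLUMN_RE = re.compile(r"^-\s*(\w+)\s*\(([^)]+)\)\s*:\s*(.*)$")


def parse_readme(text):
    """Extract column definitions and primary keys from README.txt content."""
    columns, primary_keys, in_schema = [], [], False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        lowered = line.lower()
        if lowered.startswith("schema:"):
            in_schema = True
            continue
        if lowered.startswith("primary key:"):
            in_schema = False
            # Assumption: composite keys are written as a comma-separated list.
            keys = line.split(":", 1)[1]
            primary_keys = [k.strip() for k in keys.split(",") if k.strip()]
            continue
        match = COLUMN_RE.match(line) if in_schema else None
        if match:
            name, col_type, description = match.groups()
            columns.append({
                "name": name,
                "type": col_type.strip().upper(),
                "description": description.strip(),
            })
    return {"columns": columns, "primary_keys": primary_keys}
```

Applied to the example README above, this returns two column entries (column_name with type TYPE, another_column with type VARCHAR) and the primary key list ["column_name"].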
When no README exists, the DAG automatically:
- Infers table name from folder structure
- Detects column types from CSV data using pandas
- Determines schema name from folder path (two-level folders use first level as schema)
- Uses configurable HIVE_DEFAULT_SCHEMA for single-level folders
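Column type detection can be illustrated without the full pipeline. The sketch below uses only the standard library instead of pandas (which the DAG uses), so it shows the idea rather than the actual behavior. The function names and the BIGINT/DOUBLE/VARCHAR mapping are assumptions chosen for illustration.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of BIGINT, DOUBLE, VARCHAR that fits every non-empty value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "VARCHAR"
    for cast, type_name in ((int, "BIGINT"), (float, "DOUBLE")):
        try:
            for value in non_empty:
                cast(value)
        except ValueError:
            continue  # at least one value does not fit, try the next wider type
        return type_name
    return "VARCHAR"


def infer_schema(csv_text):
    """Map each CSV header to an inferred type name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: infer_type([row.get(name) or "" for row in rows])
        for name in (reader.fieldnames or [])
    }
```

For a file with the header id,name,weight and rows such as 1,Ψάρι,2.5 the sketch reports id as BIGINT, name as VARCHAR, and weight as DOUBLE.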
The DAG determines Hive schema names from folder structure:
- Two-level: raw/fisheries/catches/ → Schema: fisheries, Table: catches
- Single-level: raw/eu-catches/ → Schema: tables (or HIVE_DEFAULT_SCHEMA)
- Configurable via HIVE_DEFAULT_SCHEMA environment variable
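The folder rule for CSV files can be written as a small function. This is a sketch under stated assumptions: resolve_schema_and_table() is a hypothetical name, the table of a single-level folder is taken to be the folder name itself, and any sanitizing of names (for example hyphens) that the DAG may apply is not shown.

```python
def resolve_schema_and_table(object_key, raw_prefix="raw/", default_schema="tables"):
    """Derive (schema, table) from the folders between raw/ and the file name."""
    if not object_key.startswith(raw_prefix):
        raise ValueError(f"{object_key!r} is not under {raw_prefix!r}")
    parts = [p for p in object_key[len(raw_prefix):].split("/") if p]
    folders = parts[:-1]  # drop the file name
    if not folders:
        raise ValueError("expected at least one folder under the raw prefix")
    if len(folders) == 1:
        # Single-level folder: fall back to the default schema.
        return default_schema, folders[0]
    # Two-level folder: first level is the schema, second is the table.
    return folders[0], folders[1]


if __name__ == "__main__":
    print(resolve_schema_and_table("raw/fisheries/catches/2024.csv"))
    print(resolve_schema_and_table("raw/eu-catches/2024.csv"))
```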
The DAG supports three ingestion modes (configurable via INGESTION_MODE):
- merge (default) - Deduplicates based on primary keys from README
- replace - Full table refresh, replaces all existing data
- append - No deduplication, appends all rows
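The difference between the three modes is easiest to see on a few rows. The sketch below models rows as plain dicts rather than Parquet data, so it illustrates the semantics only. apply_ingestion_mode() is a hypothetical function, and two behaviors are assumptions: in merge mode a newer row replaces an older row with the same key, and merge without primary keys raises an error.

```python
def apply_ingestion_mode(existing, incoming, mode="merge", primary_keys=()):
    """Combine existing and incoming rows according to the ingestion mode."""
    if mode == "replace":
        return list(incoming)  # full refresh: existing rows are discarded
    if mode == "append":
        return list(existing) + list(incoming)  # no deduplication
    if mode == "merge":
        if not primary_keys:
            raise ValueError("merge mode needs at least one primary key")
        merged = {}
        for row in list(existing) + list(incoming):
            key = tuple(row[k] for k in primary_keys)
            merged[key] = row  # assumption: the most recent row wins
        return list(merged.values())
    raise ValueError(f"Unsupported mode: {mode!r}")


if __name__ == "__main__":
    old = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
    new = [{"id": 2, "v": "c"}, {"id": 3, "v": "d"}]
    for mode in ("merge", "replace", "append"):
        print(mode, apply_ingestion_mode(old, new, mode, primary_keys=("id",)))
```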
The DAG handles multilingual data (Greek, German, French, English) with:
- UTF-8 encoding for all CSV files
- Proper handling of special characters in Parquet conversion
- Upload all shapefile components to MinIO under data/raw/<folder>/ (all files must share the same base name):
  mc cp rivers.shp rivers.dbf rivers.shx rivers.prj minio/data/raw/DataLakeFile/
  # Or use MinIO Console: http://localhost:9001
- The DAG auto-detects the bundle and derives schema from the .dbf attribute table.
  - Table name: base name of the .shp file (e.g., rivers)
  - Schema name: first folder segment after raw/ (e.g., DataLakeFile)
  - Geometry column: geometry (WKT string, VARCHAR in Hive)
- Wait for the next scheduled DAG run (every 5 minutes by default) or trigger manually:
  docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
- Query via Trino:
  SELECT geometry, * FROM hive.DataLakeFile.rivers LIMIT 10;
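The rule that all shapefile components share one base name can be sketched as a grouping step over object keys. This is an illustration, not the logic in scan_raw_bucket(): find_shapefile_bundles() is a hypothetical function, and treating .shp, .shx, and .dbf as required (they are the mandatory files of the shapefile format) while .prj and .cpg are optional is an assumption about what the DAG accepts.

```python
from collections import defaultdict
from pathlib import PurePosixPath

REQUIRED = {".shp", ".shx", ".dbf"}   # mandatory parts of a shapefile
KNOWN = REQUIRED | {".prj", ".cpg"}   # optional companions considered here


def find_shapefile_bundles(object_keys):
    """Group keys by shared base name and keep only complete bundles."""
    groups = defaultdict(dict)
    for key in object_keys:
        path = PurePosixPath(key)
        ext = path.suffix.lower()
        if ext in KNOWN:
            groups[str(path.with_suffix(""))][ext] = key
    return {
        base: sorted(parts.values())
        for base, parts in groups.items()
        if REQUIRED.issubset(parts)
    }


if __name__ == "__main__":
    keys = [
        "raw/DataLakeFile/rivers.shp",
        "raw/DataLakeFile/rivers.dbf",
        "raw/DataLakeFile/rivers.shx",
        "raw/DataLakeFile/rivers.prj",
        "raw/DataLakeFile/lakes.shp",   # incomplete bundle, ignored
    ]
    print(find_shapefile_bundles(keys))
```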
- Upload CSV file to MinIO at data/raw/your_dataset_name/
  # Using mc (MinIO client)
  mc cp your_file.csv minio/data/raw/your_dataset_name/
  # Or use MinIO Console: http://localhost:9001
- (Optional) Add README.txt with schema in same folder
- Wait for automatic ingestion (runs every 5 minutes) or trigger manually:
  docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
- Check Airflow UI for progress: http://localhost:8081
- Query data:
  docker exec trino trino --execute "SELECT * FROM hive.tables.your_dataset_name LIMIT 10"
- Edit DAG file: airflow/dags/comprehensive_csv_ingestion_dag.py
- Restart Airflow:
  docker-compose restart airflow-scheduler airflow-webserver
- Verify changes in Airflow UI: http://localhost:8081
- Open Airflow Web UI: http://localhost:8081
- Navigate to the DAG and check task logs
- Query audit table for failures:
  docker exec postgres psql -U postgres -d datahub -c \
    "SELECT * FROM ingestion_audit WHERE status = 'failed' ORDER BY processed_at DESC LIMIT 10;"
-- Connect to PostgreSQL
docker exec -it postgres psql -U postgres -d datahub
-- Total ingestion stats
SELECT
COUNT(*) as total_files,
SUM(rows_appended) as total_rows,
COUNT(DISTINCT table_name) as unique_tables,
COUNT(CASE WHEN status = 'success' THEN 1 END) as successful,
COUNT(CASE WHEN status = 'failed' THEN 1 END) as failed
FROM ingestion_audit;
-- Recent ingestions
SELECT
file_path,
table_name,
rows_appended,
status,
processed_at
FROM ingestion_audit
ORDER BY processed_at DESC
LIMIT 20;
-- Files per table
SELECT
table_name,
COUNT(*) as file_count,
SUM(rows_appended) as total_rows
FROM ingestion_audit
WHERE status = 'success'
GROUP BY table_name;
# From Trino container
docker exec -it trino trino --server localhost:8080
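# From Python inside the Jupyter container
docker exec -it jupyter bash
pip install trino
# Quote the heredoc delimiter so the shell does not expand anything in the script
python3 << 'EOF'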
from trino.dbapi import connect
conn = connect(host='trino', port=8080, user='admin', catalog='hive', schema='default')
cur = conn.cursor()
cur.execute('SHOW TABLES')
print(cur.fetchall())
EOF
# Use the reset script
./reset_ingestion_only.sh
# Or manually:
# 1. Clear audit table
docker exec postgres psql -U postgres -d datahub -c "TRUNCATE ingestion_audit;"
# 2. Trigger reprocessing
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
- docker-compose.yml - Orchestrates all services, defines volumes and networks
- airflow/dags/comprehensive_csv_ingestion_dag.py - Main ingestion pipeline with all logic
- airflow/dags/ingestion_scripts/readme_parser.py - README parsing module (used by DAG)
- trino/etc/catalog/hive.properties - Connects Trino to Hive Metastore (critical for queries)
- postgres/postgres-init.sql - PostgreSQL initialization script (creates databases)
- Tables are created with CREATE TABLE IF NOT EXISTS
- Schema evolution (adding columns) is not yet implemented
- Changing column types requires manual intervention (DROP/CREATE or ALTER)
- Parquet conversion is fully implemented in the DAG
- CSVs are converted to Parquet with three modes: MERGE (default), REPLACE, APPEND
- Parquet files are stored in the bucket/hive/schema/table/ folder
- External Hive tables point to Parquet files in MinIO
All services have health checks defined in docker-compose.yml:
- Services wait for dependencies to be healthy before starting
- If services repeatedly restart, check logs:
docker-compose logs [service-name]
Named volumes persist data across container restarts:
- minio_data - All object storage data (raw/, staging/, hive/, processed/)
- postgres_data - PostgreSQL databases (includes ingestion_audit table)
- hive_data - Hive Metastore warehouse
- airflow_db_data - Airflow metadata
- neo4j_data, elasticsearch_data - DataHub dependencies
To completely reset: docker-compose down -v (WARNING: deletes all data!)
All services communicate via the surimi-network bridge network. Services reference each other by container name (e.g., postgres:5432, minio:9000).
- README.md - Project overview and quick start (root)
- QUICKSTART.md - Detailed step-by-step setup guide (root)
- docs/DOCUMENTATION_INDEX.md - Master documentation roadmap
- docs/RECENT_UPDATES.md - Changelog and migration guide
- docs/SCHEMA_NAMING.md - Intelligent schema naming guide
- docs/SINGLE_BUCKET_REFACTORING.md - Single bucket architecture details
- docs/INGESTION_MODES.md - Complete ingestion modes reference
- docs/ARCHITECTURE_GUIDE.md - Deep dive into architecture
- docs/DEPLOYMENT_CHECKLIST.md - Production deployment considerations
- docs/OPERATIONS.md - Day-to-day operations and commands
- docs/DATAHUB_SETUP.md - DataHub setup guide
- docs/SUPERSET_SETUP.md - Superset setup guide
- docs/CONTRIBUTING.md - Development workflow and guidelines
SURIMI is a European Commission project focused on:
- Healthcare data interoperability across EU member states
- Caregiver and patient management systems
- Multilingual data handling (Greek, German, French, English)
- Privacy-preserving analytics
- Built on the EDITO platform infrastructure
This codebase provides the data lakehouse foundation for SURIMI's analytical needs.