Welcome to SURIMI DataLab, a comprehensive data infrastructure platform for the SURIMI project (European Commission/EDITO platform).
This guide will get you up and running in under 30 minutes.
- Docker & Docker Compose installed (v20.10+)
- At least 8GB RAM available
- 20GB free disk space
- Git (optional, for cloning)
# Clone the repository or download and extract files
cd DataGUI_Claude_version
# Verify you have these key files:
# - docker-compose.yml
# - .env
# - trino/etc/catalog/hive.properties
# - airflow/dags/comprehensive_csv_ingestion_dag.py
# - postgres/postgres-init.sql# Start all services in detached mode (builds images automatically)
docker-compose up -d
# Wait 3-5 minutes for all services to initialize
# Watch the startup progress (optional)
docker-compose logs -fExpected output after startup:
NAME SERVICE STATUS
minio minio Up (healthy)
postgres postgres Up (healthy)
hive-metastore hive-metastore Up (healthy)
trino trino Up (healthy)
superset superset Up (healthy)
datahub-gms datahub-gms Up (healthy)
datahub-frontend datahub-frontend Up (healthy)
airflow-webserver airflow-webserver Up (healthy)
airflow-scheduler airflow-scheduler Up (healthy)
jupyterhub jupyterhub Up
Note: All services are automatically initialized with default users and configurations. No manual setup required!
MinIO is bootstrapped automatically on startup by the minio-init service.
What it does:
- Creates the
databucket (if missing) - Adds folder documentation objects in
raw/,staging/,processed/ - Creates a processed-only read policy for MinIO
The airflow-init service creates the admin user on startup using values from .env
(AIRFLOW_ADMIN_USER, AIRFLOW_ADMIN_PASSWORD, and related fields).
If you change those values later, re-run the initializer:
docker-compose run --rm airflow-init| Service | URL | Credentials |
|---|---|---|
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin |
| Trino UI | http://localhost:8080 | (no auth required) |
| Superset | http://localhost:8088 | admin / admin |
| DataHub | http://localhost:3000 | datahub / datahub |
| Airflow | http://localhost:8083 | from .env (AIRFLOW_ADMIN_*) |
| JupyterHub | http://localhost:8000 | OAuth (see .env) |
You're ready! 🎉
SURIMI DataLab uses a single bucket architecture with folder-based organization. All data goes into the data bucket with different prefixes.
Storage Layout:
data/raw/- Upload CSV files here (automatically processed)data/staging/- Temporary CSV processing (auto-managed)data/processed/- Archived CSV files after ingestiondata/hive/- Parquet tables organized by schema (created automatically by the pipeline)
- Open http://localhost:9001
- Login with minioadmin/minioadmin
- Click on "data" bucket
- Navigate to "raw/" folder (create if doesn't exist)
- Create a subfolder (e.g., "test_data")
- Upload your CSV file
- View processing at http://localhost:8083 (Airflow UI)
# Create a test CSV
cat > test.csv << 'EOF'
id,name,value,date
1,Alice,100,2025-01-01
2,Bob,200,2025-01-02
3,Charlie,150,2025-01-03
EOF
# Create the folder structure and copy file
docker exec minio mkdir -p /bucket_data/data/raw/test_data
docker cp test.csv minio:/bucket_data/data/raw/test_data/
# Verify upload
docker exec minio ls /bucket_data/data/raw/test_data/# Install mc (MinIO Client)
brew install minio/stable/mc # macOS
# or download from https://min.io/docs/minio/linux/reference/minio-mc.html
# Configure alias
mc alias set minio http://localhost:9000 minioadmin minioadmin
# Upload file to raw folder
mc cp test.csv minio/data/raw/test_data/
# Upload entire folder recursively
mc cp --recursive ./my_datasets/ minio/data/raw/from minio import Minio
client = Minio(
'localhost:9000',
access_key='minioadmin',
secret_key='minioadmin',
secure=False
)
# Upload file to raw folder in data bucket
client.fput_object('data', 'raw/test_data/test.csv', 'test.csv')For automatic schema detection and primary key definition, add a README.txt file alongside your CSV:
# Dataset Title
Test Dataset
## Schema
- id (INTEGER): Unique identifier
- name (VARCHAR): Person name
- value (NUMERIC): Measurement value
- date (DATE): Observation date
## Primary Key
id
## Description
This is test data for SURIMI DataLab.
Upload both files to the same folder in MinIO:
data/raw/
└── test_data/
├── test.csv
└── README.txt
The Airflow ingestion DAG will:
- Detect the CSV file (scans every 5 minutes)
- Parse the README for schema metadata and primary keys
- Move CSV to
data/staging/for processing - Convert to Parquet with deduplication (MERGE mode by default)
- Store Parquet in
data/hive/schema/table/ - Create external Hive table automatically
- Emit metadata to DataHub
- Archive CSV to
data/processed/
Tables are automatically organized by schema based on folder structure:
Two-level folders:
data/raw/fisheries/catches/data.csv
→ Schema: fisheries, Table: catches
→ Query: SELECT * FROM hive.fisheries.catches
Single-level folders:
data/raw/eu_catches/data.csv
→ Schema: tables (default), Table: eu_catches
→ Query: SELECT * FROM hive.tables.eu_catches
Ingestion Modes:
- merge (default) - Deduplicates based on primary keys from README
- replace - Full table refresh
- append - No deduplication
Configure via INGESTION_MODE in .env file. See docs/INGESTION_MODES.md and docs/SCHEMA_NAMING.md for details.
Via Airflow UI:
- Open http://localhost:8083
- Login with airflow / airflow
- View the
comprehensive_csv_ingestionDAG - Check task execution and logs
Via Database:
# Check processing status
docker exec postgres psql -U postgres -d datahub -c \
"SELECT file_path, table_name, rows_appended, status, processed_at
FROM ingestion_audit
ORDER BY processed_at DESC
LIMIT 10;"
# View ingestion statistics
docker exec postgres psql -U postgres -d datahub -c \
"SELECT
COUNT(*) as total_files,
SUM(rows_appended) as total_rows,
COUNT(DISTINCT table_name) as unique_tables
FROM ingestion_audit
WHERE status = 'success';"Manually Trigger Ingestion:
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion# Access Trino CLI
docker exec -it trino trino
# Inside Trino CLI:
SHOW CATALOGS;
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.tables; -- Default schema
SHOW TABLES FROM hive.fisheries; -- Custom schema (if exists)
# Query your data
SELECT * FROM hive.tables.your_table_name LIMIT 10;
# Exit
quit;from trino.dbapi import connect
import pandas as pd
# Connect to Trino
conn = connect(
host='localhost',
port=8080,
user='admin',
catalog='hive',
schema='tables' # or your custom schema
)
# Query data into pandas DataFrame
df = pd.read_sql('SELECT * FROM your_table_name LIMIT 100', conn)
print(df.head())- Open http://localhost:8000
- Sign in with your configured OAuth provider (see
.env) - Launch a Python notebook and run:
from trino.dbapi import connect
import pandas as pd
# Note: Use 'trino' as hostname from inside container
conn = connect(
host='trino',
port=8080,
user='admin',
catalog='hive',
schema='tables'
)
# Show all tables
df = pd.read_sql('SHOW TABLES', conn)
print(df)
# Query specific table
df = pd.read_sql('SELECT * FROM your_table_name LIMIT 100', conn)
df.head()Note: Superset is automatically initialized with an admin user (admin/admin).
- Open http://localhost:8088
- Login: admin / admin
- Go to Settings → Database Connections
- Click + Database
- Select Trino from the list
- Fill in:
Display Name: SURIMI Trino SQLAlchemy URI: trino://trino:8080/hive - Click Test Connection (should succeed)
- Click Connect
- Go to SQL Lab → SQL Editor
- Select Database: "SURIMI Trino"
- Select Schema: "tables" (or your custom schema)
- Run a query:
SELECT * FROM your_table_name LIMIT 100
- Click Save → Save as Dataset
- Go to Charts → + Chart
- Select your dataset
- Choose visualization type (Table, Bar Chart, Line Chart, etc.)
- Configure and save
- Go to Dashboards → + Dashboard
- Add your chart to the dashboard
For detailed Superset configuration, see docs/SUPERSET_SETUP.md.
- Open http://localhost:8083
- Login: airflow / airflow
- You'll see the
comprehensive_csv_ingestionDAG
The platform includes a production-ready ingestion DAG at airflow/dags/comprehensive_csv_ingestion_dag.py.
Features:
- Automatic CSV detection in
data/raw/ - README parsing for schema and primary keys
- Parquet conversion with MERGE/REPLACE/APPEND modes
- DataHub metadata emission
- Automatic archival to
data/processed/
To trigger manually:
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestionOr use the Airflow UI:
- Click on "comprehensive_csv_ingestion" DAG
- Click the "Trigger DAG" button (play icon
▶️ ) - View progress in Graph View or Grid View
- Check task logs for detailed execution info
The DAG runs automatically every 5 minutes to detect new CSV files.
Single Bucket Data Lakehouse Architecture:
CSV Files (Upload)
↓
MinIO: data/raw/
↓
Airflow DAG (every 5 min)
├→ README Parser (schema + primary keys)
├→ Schema Validator
└→ File Tracker (audit table)
↓
MinIO: data/staging/ (temporary CSV)
↓
Parquet Conversion (MERGE/REPLACE/APPEND)
↓
MinIO: data/hive/schema/table/ (permanent Parquet)
↓
Hive Metastore (external table catalog)
↓
DataHub (metadata emission)
↓
MinIO: data/processed/ (archived CSV + README)
↓
Trino (SQL Query Engine)
├→ Superset (Dashboards)
├→ JupyterHub (Notebooks)
└→ Python/Pandas (Analytics)
| Service | Purpose | Technology | Port |
|---|---|---|---|
| MinIO | S3-compatible object storage | MinIO | 9000, 9001 |
| PostgreSQL | Metadata database | PostgreSQL 14 | 5432 |
| Hive Metastore | Table catalog | Apache Hive 3.1 | 9083 |
| Trino | Distributed SQL engine | Trino 414 | 8080 |
| Superset | BI & visualization | Apache Superset 2.1 | 8088 |
| DataHub | Metadata management | DataHub 0.11 | 3000, 8090 |
| Airflow | Workflow orchestration | Apache Airflow 2.7 | 8083 |
| JupyterHub | Notebooks | JupyterHub | 8000 |
| Elasticsearch | Search engine | Elasticsearch 7.17 | 9200 |
| Neo4j | Graph database (lineage) | Neo4j 4.4 | 7474, 7687 |
| Kafka | Message broker | Kafka 7.2 | 9092 |
# Check service status
docker-compose ps
# View logs for specific service
docker-compose logs trino
docker-compose logs airflow-scheduler
# Restart all services
docker-compose restart
# Restart specific service
docker-compose restart trinoIf Airflow reports PermissionError for /opt/airflow/logs, ensure the host logs directory is writable:
sudo chown -R 50000:0 airflow/logs
sudo chmod -R 775 airflow/logs# Check Hive Metastore is running
docker-compose ps hive-metastore
# Check logs
docker-compose logs hive-metastore
# Verify network connectivity
docker exec trino ping hive-metastore
# Restart both services
docker-compose restart hive-metastore trino# Check Airflow scheduler logs
docker-compose logs airflow-scheduler
# Verify MinIO is accessible from Airflow
docker exec airflow-scheduler ping minio
# Check environment variables
docker exec airflow-scheduler env | grep MINIO
# Manually trigger the DAG
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion
# Restart Airflow scheduler
docker-compose restart airflow-schedulerCheck the credentials in .env (AIRFLOW_ADMIN_*), then re-run:
docker-compose run --rm airflow-init# Check Trino is running and healthy
docker-compose ps trino
docker exec trino trino --execute "SHOW CATALOGS"
# Test connection from Superset container
docker exec superset ping trino
# Verify SQLAlchemy URI format:
# trino://trino:8080/hive# Check disk usage
df -h
docker system df
# Clean up unused Docker resources
docker system prune -f
# Remove old volumes (CAUTION: deletes data)
docker-compose down -v
docker volume prune -f# Check container resource usage
docker stats
# Stop non-essential services
docker-compose stop datahub-gms datahub-frontend
# Adjust JVM memory for Trino (edit .env)
# TRINO_JVM_MAX_HEAP=2G# Find what's using port 8080 (example)
lsof -i :8080
netstat -tulpn | grep 8080
# Kill the process or change port in docker-compose.yml
# ports:
# - "8083:8080" # Example: change Airflow webserver port# Start services
docker-compose up -d
# Stop services (keep data)
docker-compose down
# Stop services and delete data
docker-compose down -v
# View all logs
docker-compose logs -f
# View specific service logs
docker-compose logs -f trino
# Restart specific service
docker-compose restart airflow-scheduler
# Rebuild and restart service
docker-compose build airflow-scheduler
docker-compose up -d airflow-scheduler
# Check service status
docker-compose ps
# Execute command in container
docker exec -it trino bash
docker exec postgres psql -U postgres -d datahub
# View resource usage
docker stats- Upload your data: Copy CSV files to MinIO
data/raw/folder - Create dashboards: Use Superset to visualize your data
- Build pipelines: Extend the Airflow DAG or create new ones
- Explore in notebooks: Use JupyterHub for data analysis
- Document metadata: Use DataHub for data cataloging and lineage
For detailed information, see:
- docs/OPERATIONS.md - Day-to-day operations and commands
- docs/INGESTION_MODES.md - MERGE/REPLACE/APPEND modes explained
- docs/SCHEMA_NAMING.md - Intelligent schema naming from folders
- docs/SINGLE_BUCKET_REFACTORING.md - Single bucket architecture
- docs/ARCHITECTURE_GUIDE.md - Complete architecture with code examples
- docs/AIRFLOW_DATAHUB_INTEGRATION.md - DataHub integration details
- docs/SUPERSET_SETUP.md - Superset configuration details
- docs/DATAHUB_SETUP.md - DataHub configuration details
- docs/CONTRIBUTING.md - Development workflow and guidelines
- docs/DEPLOYMENT_CHECKLIST.md - Production deployment
- CLAUDE.md - AI assistant usage guide
- docs/DOCUMENTATION_INDEX.md - Complete documentation roadmap
SURIMI DataLab – Data Infrastructure for Healthcare Interoperability Built on EDITO Platform | European Commission