SURIMI DataLab - Quick Start Guide

Welcome to SURIMI DataLab, a comprehensive data infrastructure platform for the SURIMI project (European Commission/EDITO platform).

This guide will get you up and running in under 30 minutes.

Prerequisites

Docker & Docker Compose installed (v20.10+)
At least 8GB RAM available
20GB free disk space
Git (optional, for cloning)

Quick Start (3 Steps)

Step 1: Clone/Download Project

# Clone the repository or download and extract files
cd DataGUI_Claude_version

# Verify you have these key files:
# - docker-compose.yml
# - .env
# - trino/etc/catalog/hive.properties
# - airflow/dags/comprehensive_csv_ingestion_dag.py
# - postgres/postgres-init.sql

Step 2: Start All Services

# Start all services in detached mode (builds images automatically)
docker-compose up -d

# Wait 3-5 minutes for all services to initialize
# Watch the startup progress (optional)
docker-compose logs -f

Expected output after startup:

NAME                SERVICE             STATUS
minio               minio               Up (healthy)
postgres            postgres            Up (healthy)
hive-metastore      hive-metastore      Up (healthy)
trino               trino               Up (healthy)
superset            superset            Up (healthy)
datahub-gms         datahub-gms         Up (healthy)
datahub-frontend    datahub-frontend    Up (healthy)
airflow-webserver   airflow-webserver   Up (healthy)
airflow-scheduler   airflow-scheduler   Up (healthy)
jupyterhub          jupyterhub          Up

Note: All services are automatically initialized with default users and configurations. No manual setup required!

Step 3: MinIO Bootstrap (Automatic)

MinIO is bootstrapped automatically on startup by the minio-init service.

What it does:

Creates the data bucket (if missing)
Adds folder documentation objects in raw/, staging/, processed/
Creates a processed-only read policy for MinIO

Step 4: Airflow Admin User (Automatic)

The airflow-init service creates the admin user on startup using values from .env (AIRFLOW_ADMIN_USER, AIRFLOW_ADMIN_PASSWORD, and related fields).

If you change those values later, re-run the initializer:

docker-compose run --rm airflow-init

Step 5: Access the Platform

Service	URL	Credentials
MinIO Console	http://localhost:9001	minioadmin / minioadmin
Trino UI	http://localhost:8080	(no auth required)
Superset	http://localhost:8088	admin / admin
DataHub	http://localhost:3000	datahub / datahub
Airflow	http://localhost:8083	from `.env` (AIRFLOW_ADMIN_*)
JupyterHub	http://localhost:8000	OAuth (see `.env`)

You're ready! 🎉

Data Ingestion Guide

Upload Your First CSV

SURIMI DataLab uses a single bucket architecture with folder-based organization. All data goes into the data bucket with different prefixes.

Storage Layout:

data/raw/ - Upload CSV files here (automatically processed)
data/staging/ - Temporary CSV processing (auto-managed)
data/processed/ - Archived CSV files after ingestion
data/hive/ - Parquet tables organized by schema (created automatically by the pipeline)

Option 1: Using MinIO Console (Easiest)

Open http://localhost:9001
Login with minioadmin/minioadmin
Click on "data" bucket
Navigate to "raw/" folder (create if doesn't exist)
Create a subfolder (e.g., "test_data")
Upload your CSV file
View processing at http://localhost:8083 (Airflow UI)

Option 2: Using Docker Command

# Create a test CSV
cat > test.csv << 'EOF'
id,name,value,date
1,Alice,100,2025-01-01
2,Bob,200,2025-01-02
3,Charlie,150,2025-01-03
EOF

# Create the folder structure and copy file
docker exec minio mkdir -p /bucket_data/data/raw/test_data
docker cp test.csv minio:/bucket_data/data/raw/test_data/

# Verify upload
docker exec minio ls /bucket_data/data/raw/test_data/

Option 3: Using MinIO Client (mc)

# Install mc (MinIO Client)
brew install minio/stable/mc  # macOS
# or download from https://min.io/docs/minio/linux/reference/minio-mc.html

# Configure alias
mc alias set minio http://localhost:9000 minioadmin minioadmin

# Upload file to raw folder
mc cp test.csv minio/data/raw/test_data/

# Upload entire folder recursively
mc cp --recursive ./my_datasets/ minio/data/raw/

Option 4: Using Python

from minio import Minio

client = Minio(
    'localhost:9000',
    access_key='minioadmin',
    secret_key='minioadmin',
    secure=False
)

# Upload file to raw folder in data bucket
client.fput_object('data', 'raw/test_data/test.csv', 'test.csv')

CSV Ingestion with README Metadata

For automatic schema detection and primary key definition, add a README.txt file alongside your CSV:

# Dataset Title
Test Dataset

## Schema
- id (INTEGER): Unique identifier
- name (VARCHAR): Person name
- value (NUMERIC): Measurement value
- date (DATE): Observation date

## Primary Key
id

## Description
This is test data for SURIMI DataLab.

Upload both files to the same folder in MinIO:

data/raw/
  └── test_data/
      ├── test.csv
      └── README.txt

The Airflow ingestion DAG will:

Detect the CSV file (scans every 5 minutes)
Parse the README for schema metadata and primary keys
Move CSV to data/staging/ for processing
Convert to Parquet with deduplication (MERGE mode by default)
Store Parquet in data/hive/schema/table/
Create external Hive table automatically
Emit metadata to DataHub
Archive CSV to data/processed/

Schema and Table Naming

Tables are automatically organized by schema based on folder structure:

Two-level folders:

data/raw/fisheries/catches/data.csv
→ Schema: fisheries, Table: catches
→ Query: SELECT * FROM hive.fisheries.catches

Single-level folders:

data/raw/eu_catches/data.csv
→ Schema: tables (default), Table: eu_catches
→ Query: SELECT * FROM hive.tables.eu_catches

Ingestion Modes:

merge (default) - Deduplicates based on primary keys from README
replace - Full table refresh
append - No deduplication

Configure via INGESTION_MODE in .env file. See docs/INGESTION_MODES.md and docs/SCHEMA_NAMING.md for details.

Monitor Ingestion

Via Airflow UI:

Open http://localhost:8083
Login with airflow / airflow
View the comprehensive_csv_ingestion DAG
Check task execution and logs

Via Database:

# Check processing status
docker exec postgres psql -U postgres -d datahub -c \
  "SELECT file_path, table_name, rows_appended, status, processed_at
   FROM ingestion_audit
   ORDER BY processed_at DESC
   LIMIT 10;"

# View ingestion statistics
docker exec postgres psql -U postgres -d datahub -c \
  "SELECT
     COUNT(*) as total_files,
     SUM(rows_appended) as total_rows,
     COUNT(DISTINCT table_name) as unique_tables
   FROM ingestion_audit
   WHERE status = 'success';"

Manually Trigger Ingestion:

docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion

Query Data with Trino

Using Trino CLI

# Access Trino CLI
docker exec -it trino trino

# Inside Trino CLI:
SHOW CATALOGS;
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.tables;      -- Default schema
SHOW TABLES FROM hive.fisheries;   -- Custom schema (if exists)

# Query your data
SELECT * FROM hive.tables.your_table_name LIMIT 10;

# Exit
quit;

Using Python (from host machine)

from trino.dbapi import connect
import pandas as pd

# Connect to Trino
conn = connect(
    host='localhost',
    port=8080,
    user='admin',
    catalog='hive',
    schema='tables'  # or your custom schema
)

# Query data into pandas DataFrame
df = pd.read_sql('SELECT * FROM your_table_name LIMIT 100', conn)
print(df.head())

Using JupyterHub

Open http://localhost:8000
Sign in with your configured OAuth provider (see .env)
Launch a Python notebook and run:

from trino.dbapi import connect
import pandas as pd

# Note: Use 'trino' as hostname from inside container
conn = connect(
    host='trino',
    port=8080,
    user='admin',
    catalog='hive',
    schema='tables'
)

# Show all tables
df = pd.read_sql('SHOW TABLES', conn)
print(df)

# Query specific table
df = pd.read_sql('SELECT * FROM your_table_name LIMIT 100', conn)
df.head()

Create Dashboards in Superset

Note: Superset is automatically initialized with an admin user (admin/admin).

Add Trino Connection

Open http://localhost:8088
Login: admin / admin
Go to Settings → Database Connections
Click + Database
Select Trino from the list

Fill in:

Display Name: SURIMI Trino
SQLAlchemy URI: trino://trino:8080/hive

Click Test Connection (should succeed)
Click Connect

Create Your First Dashboard

Go to SQL Lab → SQL Editor
Select Database: "SURIMI Trino"
Select Schema: "tables" (or your custom schema)
Run a query:
```
SELECT * FROM your_table_name LIMIT 100
```
Click Save → Save as Dataset
Go to Charts → + Chart
Select your dataset
Choose visualization type (Table, Bar Chart, Line Chart, etc.)
Configure and save
Go to Dashboards → + Dashboard
Add your chart to the dashboard

For detailed Superset configuration, see docs/SUPERSET_SETUP.md.

Airflow Workflows

Access Airflow

Open http://localhost:8083
Login: airflow / airflow
You'll see the comprehensive_csv_ingestion DAG

The CSV Ingestion DAG

The platform includes a production-ready ingestion DAG at airflow/dags/comprehensive_csv_ingestion_dag.py.

Features:

Automatic CSV detection in data/raw/
README parsing for schema and primary keys
Parquet conversion with MERGE/REPLACE/APPEND modes
DataHub metadata emission
Automatic archival to data/processed/

To trigger manually:

docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion

Or use the Airflow UI:

Click on "comprehensive_csv_ingestion" DAG
Click the "Trigger DAG" button (play icon ▶️)
View progress in Graph View or Grid View
Check task logs for detailed execution info

The DAG runs automatically every 5 minutes to detect new CSV files.

Architecture Overview

Single Bucket Data Lakehouse Architecture:

CSV Files (Upload)
  ↓
MinIO: data/raw/
  ↓
Airflow DAG (every 5 min)
  ├→ README Parser (schema + primary keys)
  ├→ Schema Validator
  └→ File Tracker (audit table)
  ↓
MinIO: data/staging/ (temporary CSV)
  ↓
Parquet Conversion (MERGE/REPLACE/APPEND)
  ↓
MinIO: data/hive/schema/table/ (permanent Parquet)
  ↓
Hive Metastore (external table catalog)
  ↓
DataHub (metadata emission)
  ↓
MinIO: data/processed/ (archived CSV + README)
  ↓
Trino (SQL Query Engine)
  ├→ Superset (Dashboards)
  ├→ JupyterHub (Notebooks)
  └→ Python/Pandas (Analytics)

Services Breakdown

Service	Purpose	Technology	Port
MinIO	S3-compatible object storage	MinIO	9000, 9001
PostgreSQL	Metadata database	PostgreSQL 14	5432
Hive Metastore	Table catalog	Apache Hive 3.1	9083
Trino	Distributed SQL engine	Trino 414	8080
Superset	BI & visualization	Apache Superset 2.1	8088
DataHub	Metadata management	DataHub 0.11	3000, 8090
Airflow	Workflow orchestration	Apache Airflow 2.7	8083
JupyterHub	Notebooks	JupyterHub	8000
Elasticsearch	Search engine	Elasticsearch 7.17	9200
Neo4j	Graph database (lineage)	Neo4j 4.4	7474, 7687
Kafka	Message broker	Kafka 7.2	9092

Troubleshooting

Services Won't Start

# Check service status
docker-compose ps

# View logs for specific service
docker-compose logs trino
docker-compose logs airflow-scheduler

# Restart all services
docker-compose restart

# Restart specific service
docker-compose restart trino

Airflow Log Permission Denied

If Airflow reports PermissionError for /opt/airflow/logs, ensure the host logs directory is writable:

sudo chown -R 50000:0 airflow/logs
sudo chmod -R 775 airflow/logs

Trino Can't Connect to Hive Metastore

# Check Hive Metastore is running
docker-compose ps hive-metastore

# Check logs
docker-compose logs hive-metastore

# Verify network connectivity
docker exec trino ping hive-metastore

# Restart both services
docker-compose restart hive-metastore trino

Airflow DAG Not Detecting Files

# Check Airflow scheduler logs
docker-compose logs airflow-scheduler

# Verify MinIO is accessible from Airflow
docker exec airflow-scheduler ping minio

# Check environment variables
docker exec airflow-scheduler env | grep MINIO

# Manually trigger the DAG
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion

# Restart Airflow scheduler
docker-compose restart airflow-scheduler

Can't Login to Airflow

Check the credentials in .env (AIRFLOW_ADMIN_*), then re-run:

docker-compose run --rm airflow-init

Superset Database Connection Fails

# Check Trino is running and healthy
docker-compose ps trino
docker exec trino trino --execute "SHOW CATALOGS"

# Test connection from Superset container
docker exec superset ping trino

# Verify SQLAlchemy URI format:
# trino://trino:8080/hive

Out of Disk Space

# Check disk usage
df -h
docker system df

# Clean up unused Docker resources
docker system prune -f

# Remove old volumes (CAUTION: deletes data)
docker-compose down -v
docker volume prune -f

High Memory Usage

# Check container resource usage
docker stats

# Stop non-essential services
docker-compose stop datahub-gms datahub-frontend

# Adjust JVM memory for Trino (edit .env)
# TRINO_JVM_MAX_HEAP=2G

Port Already in Use

# Find what's using port 8080 (example)
lsof -i :8080
netstat -tulpn | grep 8080

# Kill the process or change port in docker-compose.yml
# ports:
#   - "8083:8080"  # Example: change Airflow webserver port

Common Commands Reference

# Start services
docker-compose up -d

# Stop services (keep data)
docker-compose down

# Stop services and delete data
docker-compose down -v

# View all logs
docker-compose logs -f

# View specific service logs
docker-compose logs -f trino

# Restart specific service
docker-compose restart airflow-scheduler

# Rebuild and restart service
docker-compose build airflow-scheduler
docker-compose up -d airflow-scheduler

# Check service status
docker-compose ps

# Execute command in container
docker exec -it trino bash
docker exec postgres psql -U postgres -d datahub

# View resource usage
docker stats

Next Steps

Upload your data: Copy CSV files to MinIO data/raw/ folder
Create dashboards: Use Superset to visualize your data
Build pipelines: Extend the Airflow DAG or create new ones
Explore in notebooks: Use JupyterHub for data analysis
Document metadata: Use DataHub for data cataloging and lineage

Additional Documentation

For detailed information, see:

Operations & Guides

docs/OPERATIONS.md - Day-to-day operations and commands
docs/INGESTION_MODES.md - MERGE/REPLACE/APPEND modes explained
docs/SCHEMA_NAMING.md - Intelligent schema naming from folders
docs/SINGLE_BUCKET_REFACTORING.md - Single bucket architecture

Technical Deep Dives

docs/ARCHITECTURE_GUIDE.md - Complete architecture with code examples
docs/AIRFLOW_DATAHUB_INTEGRATION.md - DataHub integration details

Component Setup

docs/SUPERSET_SETUP.md - Superset configuration details
docs/DATAHUB_SETUP.md - DataHub configuration details

Development

docs/CONTRIBUTING.md - Development workflow and guidelines
docs/DEPLOYMENT_CHECKLIST.md - Production deployment
CLAUDE.md - AI assistant usage guide

Master Index

docs/DOCUMENTATION_INDEX.md - Complete documentation roadmap

SURIMI DataLab – Data Infrastructure for Healthcare Interoperability Built on EDITO Platform | European Commission

FilesExpand file tree

QUICKSTART.md

Latest commit

History