Skip to content

Latest commit

 

History

History
662 lines (496 loc) · 16.5 KB

File metadata and controls

662 lines (496 loc) · 16.5 KB

SURIMI DataLab - Quick Start Guide

Welcome to SURIMI DataLab, a comprehensive data infrastructure platform for the SURIMI project (European Commission/EDITO platform).

This guide will get you up and running in under 30 minutes.

Prerequisites

  • Docker & Docker Compose installed (v20.10+)
  • At least 8GB RAM available
  • 20GB free disk space
  • Git (optional, for cloning)

Quick Start (3 Steps)

Step 1: Clone/Download Project

# Clone the repository or download and extract files
cd DataGUI_Claude_version

# Verify you have these key files:
# - docker-compose.yml
# - .env
# - trino/etc/catalog/hive.properties
# - airflow/dags/comprehensive_csv_ingestion_dag.py
# - postgres/postgres-init.sql

Step 2: Start All Services

# Start all services in detached mode (builds images automatically)
docker-compose up -d

# Wait 3-5 minutes for all services to initialize
# Watch the startup progress (optional)
docker-compose logs -f

Expected output after startup:

NAME                SERVICE             STATUS
minio               minio               Up (healthy)
postgres            postgres            Up (healthy)
hive-metastore      hive-metastore      Up (healthy)
trino               trino               Up (healthy)
superset            superset            Up (healthy)
datahub-gms         datahub-gms         Up (healthy)
datahub-frontend    datahub-frontend    Up (healthy)
airflow-webserver   airflow-webserver   Up (healthy)
airflow-scheduler   airflow-scheduler   Up (healthy)
jupyterhub          jupyterhub          Up

Note: All services are automatically initialized with default users and configurations. No manual setup required!

Step 3: MinIO Bootstrap (Automatic)

MinIO is bootstrapped automatically on startup by the minio-init service.

What it does:

  • Creates the data bucket (if missing)
  • Adds folder documentation objects in raw/, staging/, processed/
  • Creates a processed-only read policy for MinIO

Step 4: Airflow Admin User (Automatic)

The airflow-init service creates the admin user on startup using values from .env (AIRFLOW_ADMIN_USER, AIRFLOW_ADMIN_PASSWORD, and related fields).

If you change those values later, re-run the initializer:

docker-compose run --rm airflow-init

Step 5: Access the Platform

Service URL Credentials
MinIO Console http://localhost:9001 minioadmin / minioadmin
Trino UI http://localhost:8080 (no auth required)
Superset http://localhost:8088 admin / admin
DataHub http://localhost:3000 datahub / datahub
Airflow http://localhost:8083 from .env (AIRFLOW_ADMIN_*)
JupyterHub http://localhost:8000 OAuth (see .env)

You're ready! 🎉


Data Ingestion Guide

Upload Your First CSV

SURIMI DataLab uses a single bucket architecture with folder-based organization. All data goes into the data bucket with different prefixes.

Storage Layout:

  • data/raw/ - Upload CSV files here (automatically processed)
  • data/staging/ - Temporary CSV processing (auto-managed)
  • data/processed/ - Archived CSV files after ingestion
  • data/hive/ - Parquet tables organized by schema (created automatically by the pipeline)

Option 1: Using MinIO Console (Easiest)

  1. Open http://localhost:9001
  2. Login with minioadmin/minioadmin
  3. Click on "data" bucket
  4. Navigate to "raw/" folder (create if doesn't exist)
  5. Create a subfolder (e.g., "test_data")
  6. Upload your CSV file
  7. View processing at http://localhost:8083 (Airflow UI)

Option 2: Using Docker Command

# Create a test CSV
cat > test.csv << 'EOF'
id,name,value,date
1,Alice,100,2025-01-01
2,Bob,200,2025-01-02
3,Charlie,150,2025-01-03
EOF

# Create the folder structure and copy file
docker exec minio mkdir -p /bucket_data/data/raw/test_data
docker cp test.csv minio:/bucket_data/data/raw/test_data/

# Verify upload
docker exec minio ls /bucket_data/data/raw/test_data/

Option 3: Using MinIO Client (mc)

# Install mc (MinIO Client)
brew install minio/stable/mc  # macOS
# or download from https://min.io/docs/minio/linux/reference/minio-mc.html

# Configure alias
mc alias set minio http://localhost:9000 minioadmin minioadmin

# Upload file to raw folder
mc cp test.csv minio/data/raw/test_data/

# Upload entire folder recursively
mc cp --recursive ./my_datasets/ minio/data/raw/

Option 4: Using Python

from minio import Minio

client = Minio(
    'localhost:9000',
    access_key='minioadmin',
    secret_key='minioadmin',
    secure=False
)

# Upload file to raw folder in data bucket
client.fput_object('data', 'raw/test_data/test.csv', 'test.csv')

CSV Ingestion with README Metadata

For automatic schema detection and primary key definition, add a README.txt file alongside your CSV:

# Dataset Title
Test Dataset

## Schema
- id (INTEGER): Unique identifier
- name (VARCHAR): Person name
- value (NUMERIC): Measurement value
- date (DATE): Observation date

## Primary Key
id

## Description
This is test data for SURIMI DataLab.

Upload both files to the same folder in MinIO:

data/raw/
  └── test_data/
      ├── test.csv
      └── README.txt

The Airflow ingestion DAG will:

  1. Detect the CSV file (scans every 5 minutes)
  2. Parse the README for schema metadata and primary keys
  3. Move CSV to data/staging/ for processing
  4. Convert to Parquet with deduplication (MERGE mode by default)
  5. Store Parquet in data/hive/schema/table/
  6. Create external Hive table automatically
  7. Emit metadata to DataHub
  8. Archive CSV to data/processed/

Schema and Table Naming

Tables are automatically organized by schema based on folder structure:

Two-level folders:

data/raw/fisheries/catches/data.csv
→ Schema: fisheries, Table: catches
→ Query: SELECT * FROM hive.fisheries.catches

Single-level folders:

data/raw/eu_catches/data.csv
→ Schema: tables (default), Table: eu_catches
→ Query: SELECT * FROM hive.tables.eu_catches

Ingestion Modes:

  • merge (default) - Deduplicates based on primary keys from README
  • replace - Full table refresh
  • append - No deduplication

Configure via INGESTION_MODE in .env file. See docs/INGESTION_MODES.md and docs/SCHEMA_NAMING.md for details.

Monitor Ingestion

Via Airflow UI:

  • Open http://localhost:8083
  • Login with airflow / airflow
  • View the comprehensive_csv_ingestion DAG
  • Check task execution and logs

Via Database:

# Check processing status
docker exec postgres psql -U postgres -d datahub -c \
  "SELECT file_path, table_name, rows_appended, status, processed_at
   FROM ingestion_audit
   ORDER BY processed_at DESC
   LIMIT 10;"

# View ingestion statistics
docker exec postgres psql -U postgres -d datahub -c \
  "SELECT
     COUNT(*) as total_files,
     SUM(rows_appended) as total_rows,
     COUNT(DISTINCT table_name) as unique_tables
   FROM ingestion_audit
   WHERE status = 'success';"

Manually Trigger Ingestion:

docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion

Query Data with Trino

Using Trino CLI

# Access Trino CLI
docker exec -it trino trino

# Inside Trino CLI:
SHOW CATALOGS;
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.tables;      -- Default schema
SHOW TABLES FROM hive.fisheries;   -- Custom schema (if exists)

# Query your data
SELECT * FROM hive.tables.your_table_name LIMIT 10;

# Exit
quit;

Using Python (from host machine)

from trino.dbapi import connect
import pandas as pd

# Connect to Trino
conn = connect(
    host='localhost',
    port=8080,
    user='admin',
    catalog='hive',
    schema='tables'  # or your custom schema
)

# Query data into pandas DataFrame
df = pd.read_sql('SELECT * FROM your_table_name LIMIT 100', conn)
print(df.head())

Using JupyterHub

  1. Open http://localhost:8000
  2. Sign in with your configured OAuth provider (see .env)
  3. Launch a Python notebook and run:
from trino.dbapi import connect
import pandas as pd

# Note: Use 'trino' as hostname from inside container
conn = connect(
    host='trino',
    port=8080,
    user='admin',
    catalog='hive',
    schema='tables'
)

# Show all tables
df = pd.read_sql('SHOW TABLES', conn)
print(df)

# Query specific table
df = pd.read_sql('SELECT * FROM your_table_name LIMIT 100', conn)
df.head()

Create Dashboards in Superset

Note: Superset is automatically initialized with an admin user (admin/admin).

Add Trino Connection

  1. Open http://localhost:8088
  2. Login: admin / admin
  3. Go to Settings → Database Connections
  4. Click + Database
  5. Select Trino from the list
  6. Fill in:
    Display Name: SURIMI Trino
    SQLAlchemy URI: trino://trino:8080/hive
    
  7. Click Test Connection (should succeed)
  8. Click Connect

Create Your First Dashboard

  1. Go to SQL Lab → SQL Editor
  2. Select Database: "SURIMI Trino"
  3. Select Schema: "tables" (or your custom schema)
  4. Run a query:
    SELECT * FROM your_table_name LIMIT 100
  5. Click SaveSave as Dataset
  6. Go to Charts+ Chart
  7. Select your dataset
  8. Choose visualization type (Table, Bar Chart, Line Chart, etc.)
  9. Configure and save
  10. Go to Dashboards+ Dashboard
  11. Add your chart to the dashboard

For detailed Superset configuration, see docs/SUPERSET_SETUP.md.


Airflow Workflows

Access Airflow

  1. Open http://localhost:8083
  2. Login: airflow / airflow
  3. You'll see the comprehensive_csv_ingestion DAG

The CSV Ingestion DAG

The platform includes a production-ready ingestion DAG at airflow/dags/comprehensive_csv_ingestion_dag.py.

Features:

  • Automatic CSV detection in data/raw/
  • README parsing for schema and primary keys
  • Parquet conversion with MERGE/REPLACE/APPEND modes
  • DataHub metadata emission
  • Automatic archival to data/processed/

To trigger manually:

docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion

Or use the Airflow UI:

  1. Click on "comprehensive_csv_ingestion" DAG
  2. Click the "Trigger DAG" button (play icon ▶️)
  3. View progress in Graph View or Grid View
  4. Check task logs for detailed execution info

The DAG runs automatically every 5 minutes to detect new CSV files.


Architecture Overview

Single Bucket Data Lakehouse Architecture:

CSV Files (Upload)
  ↓
MinIO: data/raw/
  ↓
Airflow DAG (every 5 min)
  ├→ README Parser (schema + primary keys)
  ├→ Schema Validator
  └→ File Tracker (audit table)
  ↓
MinIO: data/staging/ (temporary CSV)
  ↓
Parquet Conversion (MERGE/REPLACE/APPEND)
  ↓
MinIO: data/hive/schema/table/ (permanent Parquet)
  ↓
Hive Metastore (external table catalog)
  ↓
DataHub (metadata emission)
  ↓
MinIO: data/processed/ (archived CSV + README)
  ↓
Trino (SQL Query Engine)
  ├→ Superset (Dashboards)
  ├→ JupyterHub (Notebooks)
  └→ Python/Pandas (Analytics)

Services Breakdown

Service Purpose Technology Port
MinIO S3-compatible object storage MinIO 9000, 9001
PostgreSQL Metadata database PostgreSQL 14 5432
Hive Metastore Table catalog Apache Hive 3.1 9083
Trino Distributed SQL engine Trino 414 8080
Superset BI & visualization Apache Superset 2.1 8088
DataHub Metadata management DataHub 0.11 3000, 8090
Airflow Workflow orchestration Apache Airflow 2.7 8083
JupyterHub Notebooks JupyterHub 8000
Elasticsearch Search engine Elasticsearch 7.17 9200
Neo4j Graph database (lineage) Neo4j 4.4 7474, 7687
Kafka Message broker Kafka 7.2 9092

Troubleshooting

Services Won't Start

# Check service status
docker-compose ps

# View logs for specific service
docker-compose logs trino
docker-compose logs airflow-scheduler

# Restart all services
docker-compose restart

# Restart specific service
docker-compose restart trino

Airflow Log Permission Denied

If Airflow reports PermissionError for /opt/airflow/logs, ensure the host logs directory is writable:

sudo chown -R 50000:0 airflow/logs
sudo chmod -R 775 airflow/logs

Trino Can't Connect to Hive Metastore

# Check Hive Metastore is running
docker-compose ps hive-metastore

# Check logs
docker-compose logs hive-metastore

# Verify network connectivity
docker exec trino ping hive-metastore

# Restart both services
docker-compose restart hive-metastore trino

Airflow DAG Not Detecting Files

# Check Airflow scheduler logs
docker-compose logs airflow-scheduler

# Verify MinIO is accessible from Airflow
docker exec airflow-scheduler ping minio

# Check environment variables
docker exec airflow-scheduler env | grep MINIO

# Manually trigger the DAG
docker exec airflow-scheduler airflow dags trigger comprehensive_csv_ingestion

# Restart Airflow scheduler
docker-compose restart airflow-scheduler

Can't Login to Airflow

Check the credentials in .env (AIRFLOW_ADMIN_*), then re-run:

docker-compose run --rm airflow-init

Superset Database Connection Fails

# Check Trino is running and healthy
docker-compose ps trino
docker exec trino trino --execute "SHOW CATALOGS"

# Test connection from Superset container
docker exec superset ping trino

# Verify SQLAlchemy URI format:
# trino://trino:8080/hive

Out of Disk Space

# Check disk usage
df -h
docker system df

# Clean up unused Docker resources
docker system prune -f

# Remove old volumes (CAUTION: deletes data)
docker-compose down -v
docker volume prune -f

High Memory Usage

# Check container resource usage
docker stats

# Stop non-essential services
docker-compose stop datahub-gms datahub-frontend

# Adjust JVM memory for Trino (edit .env)
# TRINO_JVM_MAX_HEAP=2G

Port Already in Use

# Find what's using port 8080 (example)
lsof -i :8080
netstat -tulpn | grep 8080

# Kill the process or change port in docker-compose.yml
# ports:
#   - "8083:8080"  # Example: change Airflow webserver port

Common Commands Reference

# Start services
docker-compose up -d

# Stop services (keep data)
docker-compose down

# Stop services and delete data
docker-compose down -v

# View all logs
docker-compose logs -f

# View specific service logs
docker-compose logs -f trino

# Restart specific service
docker-compose restart airflow-scheduler

# Rebuild and restart service
docker-compose build airflow-scheduler
docker-compose up -d airflow-scheduler

# Check service status
docker-compose ps

# Execute command in container
docker exec -it trino bash
docker exec postgres psql -U postgres -d datahub

# View resource usage
docker stats

Next Steps

  1. Upload your data: Copy CSV files to MinIO data/raw/ folder
  2. Create dashboards: Use Superset to visualize your data
  3. Build pipelines: Extend the Airflow DAG or create new ones
  4. Explore in notebooks: Use JupyterHub for data analysis
  5. Document metadata: Use DataHub for data cataloging and lineage

Additional Documentation

For detailed information, see:

Operations & Guides

Technical Deep Dives

Component Setup

Development

Master Index


SURIMI DataLab – Data Infrastructure for Healthcare Interoperability Built on EDITO Platform | European Commission