DataHub Setup and Configuration

This guide covers the setup, configuration, and integration of DataHub for metadata management in the SURIMI DataLab platform.

Overview

DataHub provides centralized metadata management, data discovery, lineage tracking, and ownership for the SURIMI DataLab platform.

What DataHub Provides

  • Data Discovery: Search and browse datasets
  • Lineage Tracking: Understand data dependencies
  • Data Ownership: Assign and track data owners
  • Data Quality: Track quality metrics
  • Tags and Glossary: Organize and categorize data

Services

DataHub consists of multiple services:

  • datahub-gms: Backend service (GraphQL API)
  • datahub-frontend-react: React-based UI
  • elasticsearch: Search and indexing
  • neo4j: Graph database for lineage
  • kafka: Message streaming
  • mysql: Metadata storage

Architecture

Network Configuration

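All DataHub services run on the surimi-network, which is shared with the other SURIMI DataLab services. Docker Compose prefixes the network with the project name, so it appears as surimi-datagui_surimi-network: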

surimi-datagui_surimi-network
    ├── trino
    ├── minio
    ├── postgres
    ├── hive-metastore
    ├── datahub-gms
    ├── datahub-frontend-react
    ├── elasticsearch
    ├── neo4j
    └── kafka

Because every service is attached to the same network, containers reach each other by service name (for example, datahub-gms connects to trino:8080 and kafka:29092).

Integration with Data Flow

CSV Files → MinIO → Ingestion → Parquet → Hive Metastore → Trino
                                                              ↓
                                                          DATAHUB
                                                    (Metadata & Lineage)

Bootstrap and Initialization

Initial Status

DataHub requires bootstrap/initialization to:

  1. Create Elasticsearch indices for metadata
  2. Initialize Neo4j graph database schema
  3. Set up Kafka topics
  4. Create default users and policies

Symptoms of Uninitialized DataHub

If you see a minimal UI with browse errors:

Exception while fetching data (/browse) : java.lang.RuntimeException: Failed to execute browse

This means DataHub needs to be bootstrapped.

Bootstrap Options

Option 1: Use DataHub Quickstart (Recommended)

The official DataHub quickstart handles all initialization automatically:

# Clone DataHub repository
git clone https://github.com/datahub-project/datahub.git /tmp/datahub
cd /tmp/datahub/docker/quickstart

# Run quickstart (handles bootstrap automatically)
./quickstart.sh

# This will:
# - Start all required services
# - Bootstrap the system
# - Create sample data

Option 2: Manual Bootstrap with Upgrade Container

Run the bootstrap explicitly:

docker run --rm \
  --network surimi-datagui_surimi-network \
  -e DATAHUB_GMS_HOST=datahub-gms \
  -e DATAHUB_GMS_PORT=8080 \
  -e ELASTIC_CLIENT_HOST=elasticsearch \
  -e ELASTIC_CLIENT_PORT=9200 \
  -e NEO4J_HOST=bolt://neo4j:7687 \
  -e NEO4J_USERNAME=neo4j \
  -e NEO4J_PASSWORD=datahub \
  -e KAFKA_BOOTSTRAP_SERVER=kafka:29092 \
  acryldata/datahub-upgrade:latest \
  -u SystemUpdate

Wait 5-10 minutes for completion, then refresh the DataHub UI.
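Instead of waiting a fixed time, the health endpoint can be polled until it answers. The following is a minimal sketch; the helper name retry_until and the attempt and delay values are illustrative assumptions, not part of DataHub.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# usage: retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only when another attempt remains.
    if [ "$attempt" -le "$max" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (60 attempts, 10 seconds apart, roughly 10 minutes in total):
#   retry_until 60 10 curl -fsS "http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}/health" \
#     && echo "GMS is healthy"
```

The function only reports success or failure, so it can wrap any of the check commands in this guide.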

Option 3: Use DataHub CLI

Install and use the DataHub CLI:

# Install DataHub CLI (the PyPI package is named acryl-datahub)
pip3 install --user acryl-datahub

# Add the user-level bin directory to PATH
export PATH="$PATH:$(python3 -m site --user-base)/bin"

# Configure connection (use .env values)
export DATAHUB_GMS_URL=http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}

# Check connection
datahub check gms

# Bootstrap DataHub
datahub docker bootstrap

# Or ingest sample data (triggers initialization)
datahub docker ingest-sample-data

Verify Bootstrap Success

After bootstrap completes:

# Check Elasticsearch indices (should list the DataHub entity indices)
curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2

# Expected indices:
# - datasetindex_v2
# - corpuserindex_v2
# - dashboardindex_v2
# - dataflowindex_v2
# - datajobindex_v2
# - And more...

# Check DataHub health
curl http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}/health

# Test browse API
curl "http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}/api/v2/search?type=DATASET&input=*"

Access DataHub UI

After successful bootstrap:

  1. Open http://localhost:9002 (or ${DATAHUB_MAPPED_FRONTEND_PORT:-9002})
  2. You should see the full DataHub interface
  3. Log in. The DataHub quickstart default is username datahub with password datahub; check your deployment's settings if these have been changed.

Trino Integration

Connection Details

When creating a Trino data source in DataHub, use these settings:

Basic Configuration:

  • Host: trino
  • Port: 8080
  • Catalog: hive
  • Schema: default
  • Username: admin
  • Authentication: None (development setup)

Connection URL:

trino://trino:8080/hive

Or for specific schema:

trino://trino:8080/hive/default
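The two URL forms above differ only in the optional schema segment. As a small sketch, a helper (the name trino_url is hypothetical) can assemble either form from its parts:

```shell
# Build a Trino connection URL; the schema segment is optional.
# usage: trino_url HOST PORT CATALOG [SCHEMA]
trino_url() {
  url="trino://$1:$2/$3"
  # Append the schema only when a fourth argument was given.
  if [ -n "${4:-}" ]; then
    url="$url/$4"
  fi
  printf '%s\n' "$url"
}

# Examples:
#   trino_url trino 8080 hive
#   trino_url trino 8080 hive default
```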

Create Trino Source in DataHub UI

  1. Log in to DataHub at http://localhost:9002 (or ${DATAHUB_MAPPED_FRONTEND_PORT:-9002})
  2. Navigate to Ingestion → Sources
  3. Click Create new source
  4. Select Trino as the source type
  5. Fill in the configuration:

    host_port: trino:8080
    catalog: hive
    schema_pattern:
      allow:
        - "default"
    username: admin
  6. Save and run the ingestion

CLI-Based Ingestion

Create a recipe file trino_ingestion_recipe.yml:

source:
  type: trino
  config:
    host_port: "trino:8080"
    catalog: "hive"
    username: "admin"
    schema_pattern:
      allow:
        - "default"
    table_pattern:
      allow:
        # Patterns are regular expressions, so "match everything" is ".*"
        - ".*"

sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
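When several schemas need their own recipe, the file can be generated rather than copied by hand. The sketch below prints a reduced version of the recipe above, keeping the host, catalog, and username values from this guide; the function name render_trino_recipe and the example schema analytics are hypothetical.

```shell
# Print a Trino ingestion recipe for one schema to stdout.
# usage: render_trino_recipe SCHEMA [GMS_URL]
render_trino_recipe() {
  schema=$1
  # Default sink address is the in-network GMS address used in this guide.
  gms_url=${2:-http://datahub-gms:8080}
  cat <<EOF
source:
  type: trino
  config:
    host_port: "trino:8080"
    catalog: "hive"
    username: "admin"
    schema_pattern:
      allow:
        - "$schema"

sink:
  type: datahub-rest
  config:
    server: "$gms_url"
EOF
}

# Example:
#   render_trino_recipe analytics > trino_analytics_recipe.yml
```

The schema name is inserted verbatim, so it should be a plain identifier or a valid regular expression.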

Run ingestion:

# Set environment variables
export DATAHUB_GMS_URL=http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}

# Check connection
datahub check gms

# Deploy ingestion
datahub ingest deploy -c trino_ingestion_recipe.yml

Verify Trino Connection

Test connectivity from DataHub to Trino:

# Check network connectivity
docker exec datahub-gms sh -c 'nc -zv trino 8080'

# Test Trino API
docker exec datahub-gms curl -s http://trino:8080/v1/info

# List catalogs from Trino
docker exec trino trino --execute "SHOW CATALOGS"
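A reachable port does not mean Trino has finished starting. The /v1/info response includes a starting flag, and the sketch below checks it without requiring jq. The function name trino_ready is an assumption made for illustration.

```shell
# Succeed when the JSON on stdin reports "starting": false.
trino_ready() {
  # Tolerate optional whitespace around the colon.
  grep -Eq '"starting"[[:space:]]*:[[:space:]]*false'
}

# Example:
#   docker exec datahub-gms curl -s http://trino:8080/v1/info | trino_ready \
#     && echo "Trino is ready"
```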

Common Operations

Access DataHub Services

# Access DataHub frontend
open http://localhost:9002

# Access DataHub GMS (GraphQL)
open http://localhost:8080

# Check GMS health
curl http://localhost:8080/health

Manage Metadata

Search for Datasets:

curl -X POST http://localhost:8080/api/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ search(input: {type: DATASET, query: \"*\", start: 0, count: 10}) { total searchResults { entity { urn } } } }"}'
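Escaping the nested quotes by hand is error-prone, so the request body can be produced by a small helper. This is a sketch only: the name graphql_search_payload and the GRAPHQL_URL placeholder are assumptions, and the searchResults { entity { urn } } selection reflects my understanding of the DataHub search result type.

```shell
# Print a JSON request body for a DataHub GraphQL dataset search.
# usage: graphql_search_payload QUERY [COUNT]
# Note: QUERY is inserted as-is, so it must not contain quotes or backslashes.
graphql_search_payload() {
  query=$1
  count=${2:-10}
  # \\" in the format string is printed as \" (an escaped quote inside JSON).
  printf '{"query": "{ search(input: {type: DATASET, query: \\"%s\\", start: 0, count: %s}) { total searchResults { entity { urn } } } }"}\n' "$query" "$count"
}

# Example (set GRAPHQL_URL to the GraphQL endpoint of your GMS):
#   curl -X POST "$GRAPHQL_URL" \
#     -H "Content-Type: application/json" \
#     -d "$(graphql_search_payload '*' 10)"
```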

Get Dataset Details:

# Replace the URN with an actual dataset URN; quote the URL so the shell does not interpret the parentheses
curl 'http://localhost:8080/api/v2/datasets/urn:li:dataset:(urn:li:dataPlatform:trino,hive.default.your_table,PROD)'
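A dataset URN has three parts: the platform, the dataset name (catalog.schema.table for Trino, as far as I understand the Trino source), and the environment. The helper below is a hypothetical sketch that assembles one:

```shell
# Build a DataHub dataset URN; the environment defaults to PROD.
# usage: dataset_urn PLATFORM NAME [ENV]
dataset_urn() {
  printf 'urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)\n' "$1" "$2" "${3:-PROD}"
}

# Example:
#   dataset_urn trino hive.default.your_table
```

Because the URN contains parentheses and commas, keep it inside quotes whenever it is passed on a command line.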

Ingest Metadata

From Trino:

datahub ingest deploy -c trino_ingestion_recipe.yml

From MinIO/S3:

# Create recipe for MinIO
cat > minio_recipe.yml << EOF
source:
  type: s3
  config:
    aws_config:
      aws_access_key_id: minioadmin
      aws_secret_access_key: minioadmin
      aws_endpoint_url: http://minio:9000
    path_specs:
      - include: s3://raw/**/*.csv
EOF

datahub ingest deploy -c minio_recipe.yml

View Lineage

Lineage shows data dependencies:

  1. Go to DataHub UI
  2. Search for a dataset
  3. Click on the dataset
  4. Navigate to Lineage tab
  5. View upstream and downstream dependencies

Add Tags and Glossary Terms

Add Tags:

  1. Navigate to dataset
  2. Click Add Tags
  3. Select or create tags
  4. Save

Add Glossary Terms:

  1. Go to Govern → Glossary
  2. Create terms and categories
  3. Apply to datasets

Set Ownership

Assign Owners:

  1. Navigate to dataset
  2. Click Edit Owners
  3. Add owners (users or groups)
  4. Set ownership type (Business Owner, Technical Owner, etc.)
  5. Save

Troubleshooting

DataHub Services Won't Start

Check service status:

docker-compose ps | grep -i datahub

View logs:

docker-compose logs datahub-gms
docker-compose logs datahub-frontend-react

Check dependencies:

# Elasticsearch
docker-compose ps elasticsearch
curl http://localhost:9200

# Neo4j
docker-compose ps neo4j
# The password must match NEO4J_PASSWORD in .env
docker exec neo4j cypher-shell -u neo4j -p datahub "RETURN 1"

# Kafka
docker-compose ps kafka
docker exec kafka kafka-topics --list --bootstrap-server localhost:9092

# MySQL
docker-compose ps mysql
docker exec mysql mysql -udatahub -pdatahub -e "SELECT 1"
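Running these checks one by one makes it easy to miss a failure. As a sketch, a wrapper can run each check quietly and print a one-line verdict; the name report_status and the output format are illustrative assumptions.

```shell
# Run a check command quietly and print a one-line verdict.
# usage: report_status LABEL COMMAND [ARGS...]
report_status() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# Examples:
#   report_status elasticsearch curl -fsS http://localhost:9200
#   report_status kafka docker exec kafka kafka-topics --list --bootstrap-server localhost:9092
```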

Kafka Connection Issues

If you see "Bootstrap broker localhost:9092 disconnected":

Check Kafka configuration:

# Verify Kafka is running
docker-compose ps kafka

# Check Kafka logs
docker-compose logs kafka | tail -50

# Test Kafka from GMS container
docker exec datahub-gms nc -zv kafka 29092

Update DataHub GMS environment:

Ensure KAFKA_BOOTSTRAP_SERVERS in .env points at the in-network listener rather than localhost:

KAFKA_BOOTSTRAP_SERVERS=kafka:29092
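The localhost:9092 error usually means a containerized client was given a loopback address, which resolves to the client's own container rather than the broker. A minimal sketch of that check follows; the function name is hypothetical, and the variable name read from the container in the example is an assumption about how the compose file passes the setting through.

```shell
# Succeed only if a bootstrap address is usable from another container,
# i.e. its host part is not empty and not a loopback name.
# usage: is_container_reachable HOST:PORT
is_container_reachable() {
  # ${1%%:*} strips everything from the first colon, leaving the host.
  case "${1%%:*}" in
    localhost|127.0.0.1|'') return 1 ;;
    *) return 0 ;;
  esac
}

# Example:
#   is_container_reachable "$(docker exec datahub-gms printenv KAFKA_BOOTSTRAP_SERVER)" \
#     || echo "GMS is configured with a loopback Kafka address"
```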

Elasticsearch Connection Issues

Check Elasticsearch:

# Verify ES is accessible
curl http://localhost:9200

# Check ES health
curl http://localhost:9200/_cluster/health

# Test from DataHub
docker exec datahub-gms curl http://elasticsearch:9200

Verify GMS environment:

ES_HOST=elasticsearch
ES_PORT=9200

Neo4j Connection Issues

Check Neo4j:

# Verify Neo4j is running
docker-compose ps neo4j

# Test connection
docker exec neo4j cypher-shell -u neo4j -p datahub "RETURN 1"

# Check from DataHub
docker exec datahub-gms nc -zv neo4j 7687

Verify credentials match:

In .env:

NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=datahub

Browse Errors in UI

If you see "Failed to execute browse":

  1. Check bootstrap completed:

    curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2
  2. If no indices exist, run bootstrap:

    # Using CLI
    datahub docker bootstrap
    
    # Or using upgrade container
    docker run --rm \
      --network surimi-datagui_surimi-network \
      -e DATAHUB_GMS_HOST=datahub-gms \
      -e DATAHUB_GMS_PORT=8080 \
      -e ELASTIC_CLIENT_HOST=elasticsearch \
      -e ELASTIC_CLIENT_PORT=9200 \
      -e NEO4J_HOST=bolt://neo4j:7687 \
      -e NEO4J_USERNAME=neo4j \
      -e NEO4J_PASSWORD=datahub \
      -e KAFKA_BOOTSTRAP_SERVER=kafka:29092 \
      acryldata/datahub-upgrade:latest \
      -u SystemUpdate
  3. Restart DataHub after bootstrap:

    docker-compose restart datahub-gms datahub-frontend-react

Roles Screen Error and Disabled Stats Tab

Symptoms:

  • Roles page shows "Failed to load roles! An unexpected error occurred."
  • Dataset Stats tab is disabled even though profiles are ingested.

Root cause:

  • DataHubPolicy.name is required by GraphQL, but policy records only have displayName.

Verify the error in GMS logs:

docker logs --tail=200 surimi-datagui-datahub-gms-1

Fix by backfilling name from displayName in MySQL:

docker exec -i surimi-datagui-mysql-1 mysql -udatahub -pdatahub -D datahub \
  -e "update metadata_aspect_v2 set metadata = JSON_SET(metadata,'$.name', JSON_EXTRACT(metadata,'$.displayName')) where aspect='dataHubPolicyInfo' and JSON_EXTRACT(metadata,'$.name') is null;"

Restart DataHub services after the update:

docker restart surimi-datagui-datahub-gms-1 surimi-datagui-datahub-frontend-react-1

Once Roles load, grant profile-view permissions so Stats is enabled.

Missing Indices

Expected Elasticsearch indices after bootstrap:

  • corpgroupindex_v2
  • corpuserindex_v2
  • dataflowindex_v2
  • datajobindex_v2
  • dataprocessindex_v2
  • datasetindex_v2
  • dashboardindex_v2
  • chartindex_v2
  • tagindex_v2
  • mlmodelindex_v2
  • mlfeatureindex_v2
  • mlfeaturetableindex_v2
  • mlprimarykeyindex_v2
  • glossarytermindex_v2
  • glossarynodeindex_v2

Check for missing indices:

curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2 | wc -l
# Should return 15+
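A count alone does not say which index is absent. The sketch below names the missing ones by comparing the Elasticsearch listing with a list of expected names; missing_indices is a hypothetical helper, and h=index is the _cat parameter that limits the output to the index column.

```shell
# Print each expected index that does not appear in the listing on stdin.
# stdin: output of the _cat/indices API (index name in the first column)
# args:  expected index names
missing_indices() {
  # Keep only the first column and drop any padding.
  present=$(awk '{print $1}')
  for idx in "$@"; do
    # -x matches whole lines, -F treats the name as a literal string.
    if ! printf '%s\n' "$present" | grep -qxF -- "$idx"; then
      printf '%s\n' "$idx"
    fi
  done
}

# Example:
#   curl -s 'http://localhost:9200/_cat/indices?h=index' \
#     | missing_indices datasetindex_v2 corpuserindex_v2 tagindex_v2
```

Empty output means every index passed as an argument exists.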

Reset DataHub

Complete reset (WARNING: deletes all metadata):

# Stop DataHub services
docker-compose stop datahub-gms datahub-frontend-react

# Stop and remove the containers that hold the data volumes
# (a volume cannot be removed while a container still references it)
docker-compose rm -s -f elasticsearch neo4j

# Remove DataHub data volumes
docker volume rm surimi-datagui_elasticsearch_data
docker volume rm surimi-datagui_neo4j_data

# Restart services
docker-compose up -d elasticsearch neo4j kafka
sleep 30  # Wait for services to be ready

docker-compose up -d datahub-gms datahub-frontend-react

# Run bootstrap
datahub docker bootstrap
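Because the reset is destructive, it may be worth guarding it behind an explicit confirmation. This is a sketch only; the confirm helper and the script name in the example are hypothetical.

```shell
# Ask for explicit confirmation; succeed only if the reply is exactly "yes".
# usage: confirm "prompt text" && destructive_command
confirm() {
  # The prompt goes to stderr so it does not pollute captured output.
  printf '%s [type yes to continue] ' "$1" >&2
  IFS= read -r reply || true
  [ "$reply" = "yes" ]
}

# Example:
#   confirm "Delete the Elasticsearch and Neo4j volumes?" && ./reset_datahub.sh
```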

Configuration

Environment Variables

Key DataHub environment variables in .env:

# DataHub GMS
DATAHUB_GMS_HOST=datahub-gms
DATAHUB_GMS_PORT=8080

# DataHub Frontend
DATAHUB_SECRET=your-secret-key-here

# Elasticsearch
ES_HOST=elasticsearch
ES_PORT=9200

# Neo4j
NEO4J_HOST=neo4j
NEO4J_PORT=7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=datahub

# Kafka
KAFKA_BOOTSTRAP_SERVERS=kafka:29092
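Several troubleshooting steps in this guide depend on these values, so it can help to read them from .env directly instead of retyping them. A minimal sketch, assuming plain KEY=value lines without quotes or inline comments (the helper name env_value is hypothetical):

```shell
# Print the value assigned to KEY in an env file (last assignment wins).
# usage: env_value FILE KEY
env_value() {
  # Print only the lines that start with KEY=, with that prefix removed.
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Example:
#   [ -n "$(env_value .env KAFKA_BOOTSTRAP_SERVERS)" ] \
#     || echo "KAFKA_BOOTSTRAP_SERVERS is not set in .env"
```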

Security Configuration

Change DataHub Secret (Production):

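Generate a random secret:

openssl rand -base64 42

Paste the generated value into .env. Docker Compose does not evaluate $(...) inside .env files, so the literal value must be written:

DATAHUB_SECRET=paste-generated-value-here

Recreate the DataHub containers so the new value is picked up:

docker-compose up -d datahub-gms datahub-frontend-react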

Enable Authentication:

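DataHub supports several authentication methods, including username/password (JAAS) and OIDC single sign-on. They are configured through environment variables on the datahub-frontend-react container.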

Troubleshooting: authentication.tokenService.signingKey must be set

This error from datahub-upgrade means the service started with auth enabled but no signing key. Both datahub-gms and datahub-upgrade need matching auth settings.

In .env, ensure:

# Disable token auth (allows Airflow to ingest without a token)
METADATA_SERVICE_AUTH_ENABLED=false
# Used as the signing key if auth is ever enabled
DATAHUB_SECRET=YouKnowNothing

Then recreate the containers (a restart is not enough, because environment variables are only read at container creation):

docker-compose -f docker-compose.datahub.quickstart.yml up -d

Alternative: Managed DataHub

Consider using Acryl Data's managed DataHub Cloud:

Benefits:

  • No bootstrap required
  • Automatic updates
  • Better performance
  • Professional support
  • Free tier available

Visit: https://acryldata.io/


Quick Reference

Common Commands

# Start DataHub services
docker-compose up -d datahub-gms datahub-frontend-react elasticsearch neo4j kafka

# View logs
docker-compose logs -f datahub-gms
docker-compose logs -f datahub-frontend-react

# Check health
curl http://localhost:8080/health

# Bootstrap DataHub
datahub docker bootstrap

# Ingest from Trino
datahub ingest deploy -c trino_ingestion_recipe.yml

# Search datasets
curl 'http://localhost:8080/api/v2/search?type=DATASET&query=*'

# List Elasticsearch indices
curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2

# Test connectivity
docker exec datahub-gms nc -zv trino 8080
docker exec datahub-gms nc -zv elasticsearch 9200
docker exec datahub-gms nc -zv neo4j 7687

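Service URLs

  • DataHub UI: http://localhost:9002
  • DataHub GMS: http://localhost:8080
  • Elasticsearch: http://localhost:9200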

Last Updated: December 2025
DataHub Version: 0.11+
Network: surimi-datagui_surimi-network