This guide covers the setup, configuration, and integration of DataHub for metadata management in the SURIMI DataLab platform.
- Overview
- Architecture
- Bootstrap and Initialization
- Trino Integration
- Common Operations
- Troubleshooting
DataHub provides centralized metadata management, data discovery, lineage tracking, and ownership for the SURIMI DataLab platform.
- Data Discovery: Search and browse datasets
- Lineage Tracking: Understand data dependencies
- Data Ownership: Assign and track data owners
- Data Quality: Track quality metrics
- Tags and Glossary: Organize and categorize data
DataHub consists of multiple services:
- datahub-gms: Backend service (GraphQL API)
- datahub-frontend-react: React-based UI
- elasticsearch: Search and indexing
- neo4j: Graph database for lineage
- kafka: Message streaming
- mysql: Metadata storage
All DataHub services run on the surimi-network shared with other SURIMI DataLab services:
surimi-datagui_surimi-network
├── trino
├── minio
├── postgres
├── hive-metastore
├── datahub-gms
├── datahub-frontend-react
├── elasticsearch
├── neo4j
└── kafka
Because every service is attached to the same network, containers can reach one another by service name (for example, trino:8080 or datahub-gms:8080).
CSV Files → MinIO → Ingestion → Parquet → Hive Metastore → Trino
↓
DATAHUB
(Metadata & Lineage)
DataHub requires bootstrap/initialization to:
- Create Elasticsearch indices for metadata
- Initialize Neo4j graph database schema
- Set up Kafka topics
- Create default users and policies
If you see a minimal UI with browse errors:
Exception while fetching data (/browse) : java.lang.RuntimeException: Failed to execute browse
This means DataHub needs to be bootstrapped.
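The same condition can be spotted without opening the UI by scanning the GMS logs for the browse failure. The following is a minimal sketch rather than official tooling: the function name has_browse_error is hypothetical, and it assumes the log text contains the same "Failed to execute browse" message quoted above.

```shell
# has_browse_error: reads log text on stdin and exits 0 if it contains the
# browse failure message quoted above. The name is illustrative only.
has_browse_error() {
  grep -q 'Failed to execute browse'
}

# Example usage (assumes the compose service is named datahub-gms):
#   docker-compose logs datahub-gms 2>&1 | has_browse_error && echo "bootstrap needed"
```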
The official DataHub quickstart handles all initialization automatically:
# Clone DataHub repository
git clone https://github.com/datahub-project/datahub.git /tmp/datahub
cd /tmp/datahub/docker/quickstart
# Run quickstart (handles bootstrap automatically)
./quickstart.sh
# This will:
# - Start all required services
# - Bootstrap the system
# - Create sample data
Run the bootstrap explicitly:
docker run --rm \
--network surimi-datagui_surimi-network \
-e DATAHUB_GMS_HOST=datahub-gms \
-e DATAHUB_GMS_PORT=8080 \
-e ELASTIC_CLIENT_HOST=elasticsearch \
-e ELASTIC_CLIENT_PORT=9200 \
-e NEO4J_HOST=bolt://neo4j:7687 \
-e NEO4J_USERNAME=neo4j \
-e NEO4J_PASSWORD=datahub \
-e KAFKA_BOOTSTRAP_SERVER=kafka:29092 \
acryldata/datahub-upgrade:latest \
-u SystemUpdate
Wait 5-10 minutes for completion, then refresh the DataHub UI.
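Instead of waiting a fixed time, the wait can be scripted by polling until a probe command succeeds. This is a sketch under assumptions: wait_until is a hypothetical helper, not a DataHub command, and the usage line assumes GMS is published on localhost:8080 and serves the /health endpoint used elsewhere in this guide.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds (returns 0) or
# until roughly TIMEOUT_SECONDS have elapsed (returns 1).
# Written for POSIX sh, so the helper variables are not function-local.
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Example usage: poll the GMS health endpoint for up to 10 minutes.
#   wait_until 600 10 curl -fsS http://localhost:8080/health && echo "GMS is up"
```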
Install and use the DataHub CLI:
# Install DataHub CLI
pip3 install --user acryldata-datahub
# Add the pip user bin directory to PATH (on macOS with Python 3.9 this is $HOME/Library/Python/3.9/bin)
export PATH="$PATH:$(python3 -m site --user-base)/bin"
# Configure connection (use .env values)
export DATAHUB_GMS_URL=http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}
# Check connection
datahub check gms
# Bootstrap DataHub
datahub docker bootstrap
# Or ingest sample data (triggers initialization)
datahub docker ingest-sample-data
After bootstrap completes:
# Check Elasticsearch indices (the entity indices listed below end in index_v2 and do not contain "datahub")
curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2
# Expected indices:
# - datasetindex_v2
# - corpuserindex_v2
# - dashboardindex_v2
# - dataflowindex_v2
# - datajobindex_v2
# - And more...
# Check DataHub health
curl http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}/health
# Test browse API
curl "http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}/api/v2/search?type=DATASET&input=*"
After successful bootstrap:
- Open http://localhost:9002 (or the port set in DATAHUB_MAPPED_FRONTEND_PORT)
- You should see the full DataHub interface
- Login credentials: defaults vary by deployment (the official quickstart uses datahub / datahub); check the documentation
When creating a Trino data source in DataHub, use these settings:
Basic Configuration:
- Host: trino
- Port: 8080
- Catalog: hive
- Schema: default
- Username: admin
- Authentication: None (development setup)
Connection URL:
trino://trino:8080/hive
Or for specific schema:
trino://trino:8080/hive/default
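The two URL forms above differ only in the optional schema segment. As a small illustration, the helper below assembles either form from its parts; trino_url is a hypothetical name used here for the sketch, not something DataHub or Trino provides.

```shell
# trino_url HOST PORT CATALOG [SCHEMA]
# Prints trino://HOST:PORT/CATALOG, with /SCHEMA appended when given.
trino_url() {
  host=$1
  port=$2
  catalog=$3
  schema=${4:-}
  url="trino://${host}:${port}/${catalog}"
  if [ -n "$schema" ]; then
    url="${url}/${schema}"
  fi
  printf '%s\n' "$url"
}

# Example usage:
#   trino_url trino 8080 hive
#   trino_url trino 8080 hive default
```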
- Log in to DataHub at http://localhost:9002 (or the port set in DATAHUB_MAPPED_FRONTEND_PORT)
- Navigate to Ingestion → Sources
- Click Create new source
- Select Trino as the source type
- Fill in the configuration:
host_port: trino:8080
catalog: hive
schema_pattern:
  allow:
    - "default"
username: admin
- Save and run the ingestion
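The schema_pattern allow entries in the configuration above are regular expressions, not shell globs. To my understanding DataHub matches them from the start of the name, so "default" would also allow a schema called default_backup. The sketch below approximates that behaviour with grep -E so patterns can be tried out locally; matches_allow is a hypothetical helper, and POSIX extended regular expressions differ from Python's in some details.

```shell
# matches_allow PATTERN NAME
# Exits 0 if NAME matches PATTERN anchored at the start of the name, which
# approximates how an allow pattern is applied. Illustrative helper only.
matches_allow() {
  pattern=$1
  name=$2
  printf '%s\n' "$name" | grep -Eq "^(${pattern})"
}

# Example usage:
#   matches_allow 'default'  default_backup && echo allowed   # prefix match
#   matches_allow 'default$' default_backup || echo filtered  # exact name only
```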
Create a recipe file trino_ingestion_recipe.yml:
source:
  type: trino
  config:
    host_port: "trino:8080"
    catalog: "hive"
    username: "admin"
    schema_pattern:
      allow:
        - "default"
    table_pattern:
      allow:
        # Patterns are regular expressions, so ".*" (not "*") matches every table
        - ".*"
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
Run ingestion:
# Set environment variables
export DATAHUB_GMS_URL=http://localhost:${DATAHUB_MAPPED_GMS_PORT:-8080}
# Check connection
datahub check gms
# Deploy ingestion
datahub ingest deploy -c trino_ingestion_recipe.yml
Test connectivity from DataHub to Trino:
# Check network connectivity
docker exec datahub-gms sh -c 'nc -zv trino 8080'
# Test Trino API
docker exec datahub-gms curl -s http://trino:8080/v1/info
# List catalogs from Trino
docker exec trino trino --execute "SHOW CATALOGS"
# Access DataHub frontend
open http://localhost:9002
# Access DataHub GMS (GraphQL)
open http://localhost:8080
# Check GMS health
curl http://localhost:8080/health
Search for Datasets:
curl -X POST http://localhost:8080/api/v2/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ search(input: {type: DATASET, query: \"*\", start: 0, count: 10}) { total searchResults { entity { urn } } } }"}'
Get Dataset Details:
# Replace URN with actual dataset URN (quoted so the shell does not interpret the parentheses)
curl 'http://localhost:8080/api/v2/datasets/urn:li:dataset:(urn:li:dataPlatform:trino,hive.tables.your_table,PROD)'
From Trino:
datahub ingest deploy -c trino_ingestion_recipe.yml
From MinIO/S3:
# Create recipe for MinIO
cat > minio_recipe.yml << EOF
source:
  type: s3
  config:
    aws_config:
      aws_access_key_id: minioadmin
      aws_secret_access_key: minioadmin
      aws_endpoint_url: http://minio:9000
    path_specs:
      - include: s3://raw/**/*.csv
EOF
datahub ingest deploy -c minio_recipe.yml
Lineage shows data dependencies:
- Go to DataHub UI
- Search for a dataset
- Click on the dataset
- Navigate to Lineage tab
- View upstream and downstream dependencies
Add Tags:
- Navigate to dataset
- Click Add Tags
- Select or create tags
- Save
Add Glossary Terms:
- Go to Govern → Glossary
- Create terms and categories
- Apply to datasets
Assign Owners:
- Navigate to dataset
- Click Edit Owners
- Add owners (users or groups)
- Set ownership type (Business Owner, Technical Owner, etc.)
- Save
Check service status:
docker-compose ps | grep -i datahub
View logs:
docker-compose logs datahub-gms
docker-compose logs datahub-frontend-react
Check dependencies:
# Elasticsearch
docker-compose ps elasticsearch
curl http://localhost:9200
# Neo4j (replace the password with the NEO4J_PASSWORD value from .env)
docker-compose ps neo4j
docker exec neo4j cypher-shell -u neo4j -p password "RETURN 1"
# Kafka
docker-compose ps kafka
docker exec kafka kafka-topics --list --bootstrap-server localhost:9092
# MySQL
docker-compose ps mysql
docker exec mysql mysql -udatahub -pdatahub -e "SELECT 1"
If you see "Bootstrap broker localhost:9092 disconnected":
Check Kafka configuration:
# Verify Kafka is running
docker-compose ps kafka
# Check Kafka logs
docker-compose logs kafka | tail -50
# Test Kafka from GMS container
docker exec datahub-gms nc -zv kafka 29092
Update DataHub GMS environment:
Ensure KAFKA_BOOTSTRAP_SERVERS is set correctly in .env:
KAFKA_BOOTSTRAP_SERVERS=kafka:29092
Check Elasticsearch:
# Verify ES is accessible
curl http://localhost:9200
# Check ES health
curl http://localhost:9200/_cluster/health
# Test from DataHub
docker exec datahub-gms curl http://elasticsearch:9200
Verify GMS environment:
ES_HOST=elasticsearch
ES_PORT=9200
Check Neo4j:
# Verify Neo4j is running
docker-compose ps neo4j
# Test connection
docker exec neo4j cypher-shell -u neo4j -p datahub "RETURN 1"
# Check from DataHub
docker exec datahub-gms nc -zv neo4j 7687
Verify credentials match:
In .env:
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=datahub
If you see "Failed to execute browse":
- Check bootstrap completed:
curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2
- If no indices exist, run bootstrap:
# Using CLI
datahub docker bootstrap
# Or using upgrade container (NEO4J_PASSWORD must match the value in .env)
docker run --rm \
--network surimi-datagui_surimi-network \
-e DATAHUB_GMS_HOST=datahub-gms \
-e DATAHUB_GMS_PORT=8080 \
-e ELASTIC_CLIENT_HOST=elasticsearch \
-e ELASTIC_CLIENT_PORT=9200 \
-e NEO4J_HOST=bolt://neo4j:7687 \
-e NEO4J_USERNAME=neo4j \
-e NEO4J_PASSWORD=password \
-e KAFKA_BOOTSTRAP_SERVER=kafka:29092 \
acryldata/datahub-upgrade:latest \
-u SystemUpdate
- Restart DataHub after bootstrap:
docker-compose restart datahub-gms datahub-frontend-react
Expected Elasticsearch indices after bootstrap:
- corpgroupindex_v2
- corpuserindex_v2
- dataflowindex_v2
- datajobindex_v2
- dataprocessindex_v2
- datasetindex_v2
- dashboardindex_v2
- chartindex_v2
- tagindex_v2
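Rather than comparing the index list above by eye, the expected names can be checked against the Elasticsearch output. The helper below is a sketch: missing_indices is a hypothetical name, and it assumes each index appears as a whitespace-separated word in the _cat/indices output.

```shell
# missing_indices EXPECTED...
# Reads `_cat/indices` output on stdin and prints every expected index name
# that does not appear in it as a whole word. Prints nothing if all exist.
missing_indices() {
  actual=$(cat)
  for index in "$@"; do
    if ! printf '%s\n' "$actual" | grep -qwF "$index"; then
      printf '%s\n' "$index"
    fi
  done
}

# Example usage (assumes Elasticsearch is published on localhost:9200):
#   curl -s 'http://localhost:9200/_cat/indices?v' |
#     missing_indices datasetindex_v2 corpuserindex_v2 tagindex_v2
```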
Symptoms:
- Roles page shows "Failed to load roles! An unexpected error occurred."
- Dataset Stats tab is disabled even though profiles are ingested.
Root cause:
DataHubPolicy.name is required by GraphQL, but policy records only have displayName.
Verify the error in GMS logs:
docker logs --tail=200 surimi-datagui-datahub-gms-1
Fix by backfilling name from displayName in MySQL:
docker exec -i surimi-datagui-mysql-1 mysql -udatahub -pdatahub -D datahub \
-e "update metadata_aspect_v2 set metadata = JSON_SET(metadata,'\$.name', JSON_EXTRACT(metadata,'\$.displayName')) where aspect='dataHubPolicyInfo' and JSON_EXTRACT(metadata,'\$.name') is null;"
Restart DataHub services after the update:
docker restart surimi-datagui-datahub-gms-1 surimi-datagui-datahub-frontend-react-1
Once Roles load, grant profile-view permissions so Stats is enabled.
Additional expected Elasticsearch indices after bootstrap:
- mlmodelindex_v2
- mlfeatureindex_v2
- mlfeaturetableindex_v2
- mlprimarykeyindex_v2
- glossarytermindex_v2
- glossarynodeindex_v2
Check for missing indices:
curl -s 'http://localhost:9200/_cat/indices?v' | grep -c index_v2
# Should return 15+
Complete reset (WARNING: deletes all metadata):
# Stop DataHub services
docker-compose stop datahub-gms datahub-frontend-react
# Remove DataHub data volumes
docker volume rm surimi-datagui_elasticsearch_data
docker volume rm surimi-datagui_neo4j_data
# Restart services
docker-compose up -d elasticsearch neo4j kafka
sleep 30 # Wait for services to be ready
docker-compose up -d datahub-gms datahub-frontend-react
# Run bootstrap
datahub docker bootstrap
Key DataHub environment variables in .env:
# DataHub GMS
DATAHUB_GMS_HOST=datahub-gms
DATAHUB_GMS_PORT=8080
# DataHub Frontend
DATAHUB_SECRET=your-secret-key-here
# Elasticsearch
ES_HOST=elasticsearch
ES_PORT=9200
# Neo4j
NEO4J_HOST=neo4j
NEO4J_PORT=7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
# Kafka
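KAFKA_BOOTSTRAP_SERVERS=kafka:29092
Change DataHub Secret (Production):
Generate a random secret with openssl rand -base64 42 and set it in .env (.env files do not evaluate $(...) command substitution, so paste the generated value):
DATAHUB_SECRET=<generated value>
Recreate the DataHub containers so they pick up the new value (a restart does not re-read .env):
docker-compose up -d datahub-gms datahub-frontend-react
Enable Authentication: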
DataHub supports various authentication methods. Configure in datahub-frontend-react settings.
Troubleshooting: authentication.tokenService.signingKey must be set
This error from datahub-upgrade means the service started with auth enabled but no signing key. Both datahub-gms and datahub-upgrade need matching auth settings.
In .env, ensure:
# Disable token auth (allows Airflow to ingest without a token)
METADATA_SERVICE_AUTH_ENABLED=false
# Used as the signing key if auth is ever enabled
DATAHUB_SECRET=YouKnowNothing
Then recreate the containers (restart is not enough; env vars are only read at container creation):
docker-compose -f docker-compose.datahub.quickstart.yml up -d
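Because environment variables are fixed when a container is created, it can help to confirm what the running container actually sees after recreating it. The sketch below extracts one variable from KEY=VALUE output such as that of docker exec <container> env; env_value is a hypothetical helper, and the container name in the usage line is an assumption that may differ in your deployment.

```shell
# env_value NAME
# Reads KEY=VALUE lines on stdin and prints the value of NAME, if present.
# NAME is expected to be a plain variable name (letters, digits, underscore).
env_value() {
  name=$1
  sed -n "s/^${name}=//p"
}

# Example usage (container name is an assumption):
#   docker exec datahub-gms env | env_value METADATA_SERVICE_AUTH_ENABLED
```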
Consider using Acryl Data's managed DataHub Cloud:
Benefits:
- No bootstrap required
- Automatic updates
- Better performance
- Professional support
- Free tier available
Visit: https://acryldata.io/
# Start DataHub services
docker-compose up -d datahub-gms datahub-frontend-react elasticsearch neo4j kafka
# View logs
docker-compose logs -f datahub-gms
docker-compose logs -f datahub-frontend-react
# Check health
curl http://localhost:8080/health
# Bootstrap DataHub
datahub docker bootstrap
# Ingest from Trino
datahub ingest deploy -c trino_ingestion_recipe.yml
# Search datasets (URL quoted so the shell does not interpret ?, & and *)
curl 'http://localhost:8080/api/v2/search?type=DATASET&query=*'
# List Elasticsearch indices
curl -s 'http://localhost:9200/_cat/indices?v' | grep index_v2
# Test connectivity
docker exec datahub-gms nc -zv trino 8080
docker exec datahub-gms nc -zv elasticsearch 9200
docker exec datahub-gms nc -zv neo4j 7687
- DataHub UI: http://localhost:9002 (or the port set in DATAHUB_MAPPED_FRONTEND_PORT)
- DataHub GMS: http://localhost:8080 (or the port set in DATAHUB_MAPPED_GMS_PORT)
- Elasticsearch: http://localhost:9200
- Neo4j Browser: http://localhost:7474
- Kafka: localhost:9092 (internal: kafka:29092)
- Official DataHub Docs: https://datahubproject.io
- Quickstart Guide: https://datahubproject.io/docs/quickstart
- Ingestion Sources: https://datahubproject.io/docs/metadata-ingestion
- GraphQL API: https://datahubproject.io/docs/api/graphql/overview
- QUICKSTART.md: Getting started with SURIMI DataLab
- OPERATIONS.md: Day-to-day operations
- DEPLOYMENT_CHECKLIST.md: Production deployment
Last Updated: December 2025
DataHub Version: 0.11+
Network: surimi-datagui_surimi-network