Skip to content

Latest commit

 

History

History
462 lines (359 loc) · 11.7 KB

File metadata and controls

462 lines (359 loc) · 11.7 KB

DataHub Integration Guide - SURIMI Ingestion Pipeline

Overview

DataHub is a metadata management platform that catalogs your data assets. In the SURIMI pipeline, DataHub provides:

  • 📊 Data Discovery - Search and browse all ingested datasets
  • 🔄 Data Lineage - Track data flow from CSV → Parquet → Hive tables
  • 📖 Schema Documentation - Column names, types, descriptions from README files
  • 👥 Data Governance - Ownership, tags, classifications
  • 🔍 Search - Full-text search across all metadata

How DataHub Fits in the Pipeline

CSV Upload → Process → Parquet → Hive Table → DataHub
                                                  ↓
                                    Searchable in DataHub UI
                                    Shows lineage, schema, docs

What Gets Sent to DataHub

From your ingestion pipeline, DataHub receives:

  1. Dataset Information

    • Dataset name (e.g., default.test)
    • Platform (Hive)
    • Environment (PROD)
  2. Schema Metadata

    • Column names
    • Column types (VARCHAR, BIGINT, etc.)
    • Column descriptions (from README files)
  3. Custom Properties

    • Source CSV file path
    • Row count
    • README file path
    • Processing timestamp
  4. Lineage Information

    • Source: CSV in MinIO
    • Transformation: Parquet conversion
    • Destination: Hive table

Current Implementation Status

✅ Implemented (Logging Only)

Currently in comprehensive_csv_ingestion_dag.py, the ingest_metadata_to_datahub task:

def ingest_metadata_to_datahub(**context):
    # Prepares DataHub metadata structure
    dataset_urn = f"urn:li:dataset:(urn:li:dataPlatform:hive,{schema_name}.{table_name},PROD)"

    dataset_properties = {
        "customProperties": {
            "source_file": metadata['object_name'],
            "row_count": str(metadata.get('row_count', 0)),
            "readme_path": metadata.get('readme_path', '')
        },
        "name": table_name,
        "description": description,
        "uri": f"s3a://data/hive/{metadata.get('parquet_path', '')}"
    }

    # Currently just logs - doesn't actually send to DataHub
    logger.info(f"Would ingest to DataHub: {dataset_urn}")

Status: ⚠️ Metadata is prepared but NOT sent to DataHub (logging only)


Three Methods to Send Metadata to DataHub

Method 1: DataHub Python SDK (Recommended)

Pros:

  • ✅ Official DataHub client
  • ✅ Full feature support
  • ✅ Type-safe API
  • ✅ Handles authentication

Cons:

  • ❌ Requires installing acryl-datahub package
  • ❌ More complex code

Implementation:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    SchemaMetadataClass,
    SchemaFieldClass,
    StringTypeClass
)

# Create emitter
emitter = DatahubRestEmitter(
    gms_server='http://datahub-gms:8080',
    token=None  # Add if auth enabled
)

# Create dataset URN
dataset_urn = make_dataset_urn(
    platform='hive',
    name=f'{schema_name}.{table_name}',
    env='PROD'
)

# Create metadata events
dataset_properties = DatasetPropertiesClass(
    name=table_name,
    description=description,
    customProperties={
        'source_file': object_name,
        'row_count': str(row_count)
    }
)

# Emit to DataHub
emitter.emit_mcp(
    entity_urn=dataset_urn,
    aspect=dataset_properties
)

emitter.close()

Method 2: DataHub REST API (Flexible)

Pros:

  • ✅ No special dependencies
  • ✅ Works with standard HTTP
  • ✅ Easy to debug

Cons:

  • ❌ Need to construct JSON payloads manually
  • ❌ More verbose

Implementation:

import requests
import json

datahub_gms_url = 'http://datahub-gms:8080'

# Prepare metadata change proposal (MCP)
mcp = {
    "entityType": "dataset",
    "entityUrn": f"urn:li:dataset:(urn:li:dataPlatform:hive,{schema_name}.{table_name},PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
        "value": json.dumps({
            "name": table_name,
            "description": description,
            "customProperties": {
                "source_file": object_name,
                "row_count": str(row_count),
                "readme_path": readme_path
            }
        })
    }
}

# Send to DataHub
response = requests.post(
    f'{datahub_gms_url}/entities?action=ingest',
    json={"proposals": [mcp]},
    headers={'Content-Type': 'application/json'}
)

if response.status_code == 200:
    logger.info(f"Successfully ingested {table_name} to DataHub")
else:
    logger.error(f"Failed to ingest: {response.text}")

Method 3: DataHub CLI (Manual/Batch)

Pros:

  • ✅ No code changes needed
  • ✅ Good for one-time bulk imports

Cons:

  • ❌ Not integrated with Airflow DAG
  • ❌ Manual process

Usage:

# Install DataHub CLI
pip install acryl-datahub

# Ingest via recipe file
datahub ingest -c recipe.yml

Example recipe.yml:

source:
  type: hive
  config:
    host_port: trino:8080
    database: default

sink:
  type: datahub-rest
  config:
    server: http://datahub-gms:8080

Recommended Implementation Plan

Step 1: Install DataHub SDK in Airflow

Add to airflow/requirements.txt:

acryl-datahub==0.12.0

Rebuild Airflow:

docker-compose build airflow-init airflow-scheduler airflow-webserver
docker-compose up -d airflow-init airflow-scheduler airflow-webserver

Step 2: Update DAG to Use DataHub SDK

Replace the logging code in ingest_metadata_to_datahub() with actual DataHub SDK calls.

Step 3: Test with One Dataset

Upload a small CSV and verify it appears in DataHub UI at http://localhost:9002


What You'll See in DataHub

After successful ingestion, in the DataHub UI you'll see:

1. Dataset Page

  • Dataset name: default.test
  • Platform: Hive
  • Description from README
  • Custom properties (source file, row count, etc.)

2. Schema Tab

  • All columns with their types
  • Column descriptions from README
  • Data types mapped correctly

3. Lineage Tab

CSV File (MinIO)
    ↓
Parquet File (MinIO hive)
    ↓
Hive Table (Trino)

4. Properties Tab

  • Source file path
  • Row count
  • Processing timestamp
  • README location

5. Search

  • Search by table name, column name, or description
  • Filter by platform, domain, or tags

DataHub Architecture in Your Stack

┌──────────────────────────────────────────────┐
│         Airflow DAG (comprehensive_csv)       │
│  Task: ingest_metadata_to_datahub            │
└──────────────┬───────────────────────────────┘
               │
               ↓ (HTTP/SDK)
┌──────────────────────────────────────────────┐
│         DataHub GMS (Generalized Metadata     │
│         Service) - Port 8080                  │
│                                               │
│  - REST API for metadata ingestion           │
│  - GraphQL API for queries                   │
│  - Stores metadata in backends                │
└──────────────┬───────────────────────────────┘
               │
               ↓ (Stores in)
┌──────────────────────────────────────────────┐
│         Backend Stores                        │
│                                               │
│  - PostgreSQL (main metadata)                 │
│  - Elasticsearch (search index)               │
│  - Neo4j (lineage graph)                      │
│  - Kafka (change events)                      │
└───────────────────────────────────────────────┘
               │
               ↑ (Read from)
┌──────────────────────────────────────────────┐
│         DataHub Frontend - Port 9002          │
│         http://localhost:9002                 │
│                                               │
│  - Web UI for browsing metadata              │
│  - Search interface                           │
│  - Lineage visualization                      │
└───────────────────────────────────────────────┘

Accessing DataHub UI

URL: http://localhost:9002

Default Credentials:

  • Username: datahub
  • Password: datahub

What to Check:

  1. Navigate to "Datasets"
  2. Filter by Platform: "Hive"
  3. Look for your tables (e.g., default.test)
  4. Click on a table to see schema, properties, lineage

DataHub Metadata Examples

Example 1: Dataset with README Documentation

Source: CSV with README.txt

README – Fish Landings Data
============================

Summary
-------
Commercial fish landings by country and species.

Schema
------
- country (VARCHAR): Country name
- year (INTEGER): Year of landing
- species (VARCHAR): Fish species code
- tonnage (DOUBLE): Landed weight in tonnes

In DataHub:

  • Dataset: default.fish_landings
  • Description: "Commercial fish landings by country and species"
  • Columns:
    • country (VARCHAR) - "Country name"
    • year (INTEGER) - "Year of landing"
    • species (VARCHAR) - "Fish species code"
    • tonnage (DOUBLE) - "Landed weight in tonnes"

Example 2: Dataset without README (Auto-detected)

Source: CSV without README

id,name,value
1,test,100
2,sample,200

In DataHub:

  • Dataset: default.my_data
  • Description: (empty)
  • Columns:
    • id (BIGINT) - (no description)
    • name (VARCHAR) - (no description)
    • value (BIGINT) - (no description)
  • Custom Properties:
    • source_file: my_data/data.csv
    • row_count: 2

Troubleshooting

DataHub Not Receiving Metadata

Check 1: Is DataHub GMS running?

docker-compose ps | grep datahub-gms
curl http://localhost:8080/health

Check 2: Check Airflow task logs

docker exec airflow-scheduler cat /opt/airflow/logs/.../ingest_metadata_to_datahub/attempt=1.log

Check 3: Check DataHub GMS logs

docker-compose logs datahub-gms | tail -50

Tables Not Appearing in DataHub UI

Possible causes:

  1. Metadata not actually sent (currently just logging)
  2. DataHub indexing lag (wait 1-2 minutes)
  3. Wrong platform/environment in URN
  4. Elasticsearch not indexed yet

Solution: Force reindex:

docker exec datahub-gms curl -X POST http://localhost:8080/operations?action=reindex

Authentication Errors

If DataHub has auth enabled:

  1. Create an access token in DataHub UI (Settings → Access Tokens)
  2. Pass token in SDK:
    emitter = DatahubRestEmitter(
        gms_server='http://datahub-gms:8080',
        token='your-token-here'
    )

Next Steps

  1. Decision: Choose Method 1 (DataHub SDK) for production use
  2. Install: Add acryl-datahub to airflow/requirements.txt
  3. Implement: Replace logging code with actual SDK calls
  4. Test: Upload a CSV and verify in DataHub UI
  5. Enhance: Add lineage, ownership, tags, glossary terms

Would you like me to implement the full DataHub SDK integration in your DAG?


Last Updated: 2025-12-12 DataHub Version: HEAD (latest) DataHub UI: http://localhost:9002