Releases: acryldata/datahub
v1.4.0.9
🚀 Ingestion Release Notes: v1.4.0.8 → v1.4.0.9rc1
🌟 New Features
- dbt — Extract and emit stats from catalog.json (datahub-project#16044) — @alfiyas-datahub
- BigQuery — Enrich external table metadata with source format, URIs, compression, and max bad records (datahub-project#16348) — @EladLeev
- Glue — Iceberg Lineage support (datahub-project#16562) — @ligfx
- Power BI — Add external URL for Power BI App entities (datahub-project#16572) — @alfiyas-datahub
🐛 Bug Fixes
- Ingestion — Bump authlib to >=1.6.9 to address JWE RSA1_5 padding oracle vulnerability (datahub-project#16633) — @david-leifker
- CLI — Add .gql files to the wheel build (datahub-project#16637) — @skrydal
🙌 Contributors
Thanks to @alfiyas-datahub, @EladLeev, @ligfx, @david-leifker, and @skrydal for their contributions to this release!
v1.4.0.9rc1
Full Changelog: v1.4.0.8...v1.4.0.9rc1
v1.4.0.8
DataHub Ingestion v1.4.0.8
🌟 New Features
Configurable ingestion report sample sizes — You can now control how many failure and warning entries appear in ingestion reports via environment variables DATAHUB_REPORT_FAILURE_SAMPLE_SIZE and DATAHUB_REPORT_WARNING_SAMPLE_SIZE (default: 10 each). Useful when debugging large ingestion runs where you need more failure context. (datahub-project#16165) — thanks @rob-1019!
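As a rough illustration of the two variables named above, the sketch below builds an environment mapping with larger sample sizes. The helper `report_sample_env` is hypothetical and not part of DataHub; only the two variable names and the default of 10 come from the note. It assumes the variables are read when the ingestion process starts, so they should be set before launching it.

```python
import os

# Variable names come from the release note; the helper itself is hypothetical.
FAILURE_VAR = "DATAHUB_REPORT_FAILURE_SAMPLE_SIZE"
WARNING_VAR = "DATAHUB_REPORT_WARNING_SAMPLE_SIZE"


def report_sample_env(failures: int = 10, warnings: int = 10, base=None) -> dict:
    """Return a copy of the environment with both sample sizes set."""
    if failures < 0 or warnings < 0:
        raise ValueError("sample sizes must be non-negative")
    env = dict(os.environ if base is None else base)
    env[FAILURE_VAR] = str(failures)  # environment values are always strings
    env[WARNING_VAR] = str(warnings)
    return env


# An empty base keeps the printed result small; omit it to inherit os.environ.
env = report_sample_env(failures=50, warnings=25, base={})
print(env)
```

The resulting mapping (built without `base={}` so that `PATH` is inherited) could then be passed as `env=` to `subprocess.run` when invoking `datahub ingest -c <recipe>`.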
🐛 Bug Fixes
Pin sqlglot dependency — Pinned sqlglotc to prevent unexpected version drift that could break SQL parsing in ingestion sources. (datahub-project#16614) — thanks @jayacryl!
📄 Documentation
Fix dead link on Integrations page — Added the missing request-connector documentation page, resolving a broken link on the Integrations page that pointed to a non-existent route. (datahub-project#16617) — thanks @shirshanka!
Full Changelog: v1.4.0.7...v1.4.0.8
v1.4.0.7
What's New in v1.4.0.7
Breaking Changes 🚨
- Browse paths for DataFlow/DataJob with platform_instance: When platform_instance is configured, DataFlow and DataJob entities now receive a browsePathsV2 aspect with the platform instance as the root path. Previously these entities were placed in a generic "Default" folder, mixing entities from multiple platform instances. Affects Fivetran, Glue, Kafka-Connect, and other sources that emit DataFlow/DataJob entities with platform_instance. Sources without platform_instance are unaffected. (datahub-project#15270) by @treff7es
- DataHubRestEmitter.emit_mcps() return type changed: The method now returns List[TraceData] instead of int. To get the previous chunk count, use len(result) on the returned list. emit_mcp() now returns Optional[TraceData] instead of None. (datahub-project#15744) by @pedro93
- Mode connector: SQL parsing behavior change: Join resolution for CTEs and subqueries was optimized. In rare edge cases with unusual CTE patterns, join metadata may differ from previous results. Affects all SQL-based lineage connectors. (datahub-project#16300) by @treff7es
- PowerBI: extract_column_level_lineage now defaults to true: Previously defaulted to false. Set extract_column_level_lineage: false in your recipe to restore previous behavior. (datahub-project#16568) by @ligfx
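For callers that relied on the old integer return value of emit_mcps(), a small compatibility shim along these lines may help during migration. The helper name `chunk_count` is hypothetical and not part of the SDK; the sketch only applies the len(result) advice from the note and uses plain stand-in values rather than real TraceData objects.

```python
from typing import Any, List, Union


def chunk_count(result: Union[int, List[Any]]) -> int:
    """Return the number of emitted chunks for either return style.

    Older releases returned an int; v1.4.0.7 returns a list of trace
    objects, so the count becomes the length of that list.
    """
    if isinstance(result, int):
        return result
    return len(result)


# Stand-ins for real return values; no DataHub import is needed here.
print(chunk_count(3))
print(chunk_count(["trace-a", "trace-b"]))
```

Code that runs only against v1.4.0.7 or later can skip the shim and call len() on the result directly.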
New Features ✨
- New RDF ingestion connector: Ingest metadata from RDF data sources including support for FIBO and BCBS-239 ontology dialects, glossary terms, domains, and relationships. Supports Turtle, JSON-LD, RDF/XML, and other formats, with SPARQL filtering. (datahub-project#15741) by @stephengoldbaum
- Trace ID support in REST emitter: The SDK now exposes trace IDs for SYNC_PRIMARY and ASYNC emit modes, enabling easier debugging and status checking of ingestion operations. (datahub-project#15744) by @pedro93
- Mode connector performance improvements: Major performance upgrade: concurrent API fetching with threading, SQL query response caching, improved rate limiting, and SQL parsing optimizations. Column-level lineage now falls back to table-level lineage when parsing times out. (datahub-project#16300) by @treff7es
- Kafka-Connect: Debezium and Confluent JDBC sink connector support: Added support for Debezium source connectors and Confluent JDBC sink connectors, expanding lineage coverage for Kafka Connect pipelines. (datahub-project#16483) by @acrylJonny
- datahub search CLI command: New command with semantic search, field projection, and agent context support for querying DataHub from the command line. (datahub-project#16471) by @shirshanka
- SQLAlchemy profiler feature parity: The SQLAlchemy profiler now achieves feature parity with the GE profiler, including additional column statistics and type mappings. (datahub-project#16529) by @sgomezvillamor
- Glue: lastModified from table UpdateTime: AWS Glue datasets now populate the lastModified field in dataset properties from the Glue table's UpdateTime. (datahub-project#16508) by @alokr-dhub
- Configurable ingestion report sample sizes: New options to control the number of failure/warning samples kept in ingestion reports, with improved failure logging for easier debugging. (datahub-project#16165) by @rob-1019
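The timeout fallback mentioned for the Mode connector can be pictured with a generic pattern like the one below. This is only a sketch of the idea, not the connector's actual code: `lineage_with_fallback` and both parser callables are hypothetical names, and the thread-based timeout is an assumption about how such a fallback could be built.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def lineage_with_fallback(parse_columns, parse_tables, sql, timeout_s):
    """Try column-level parsing; fall back to table-level on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(parse_columns, sql)
        try:
            return "column", future.result(timeout=timeout_s)
        except FutureTimeout:
            return "table", parse_tables(sql)
    finally:
        # Do not block on the abandoned parse; the worker still runs to completion.
        pool.shutdown(wait=False)


def slow_columns(sql):
    time.sleep(0.3)  # simulates an expensive column-level parse
    return [("orders.id", "report.order_id")]


def fast_tables(sql):
    return [("orders", "report")]


print(lineage_with_fallback(slow_columns, fast_tables, "SELECT 1", 0.05))
```

A thread cannot be cancelled once it has started, so the abandoned parse keeps running in the background; a real implementation would have to bound that cost some other way.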
Bug Fixes 🐛
- Glue: fine-grained lineage without a graph: Fixed an error that prevented column-level lineage generation when no graph service is configured. (datahub-project#16494) by @ligfx
- Glue: treat table UpdateTime as UTC: AWS Glue returns update times as UTC without timezone info; the connector now correctly interprets them as UTC. (datahub-project#16561) by @ligfx
- BigQuery: thread-safe GCP client credentials: Fixed a thread-safety issue by passing explicit credentials to GCP clients, preventing credential sharing between threads during concurrent ingestion. (datahub-project#16579) by @jayacryl
- Ingestion emit modes regression: Restored correct emit mode behavior (SYNC_PRIMARY, ASYNC, SYNC) after a regression in a prior release. (datahub-project#16521) by @askumar27
- Kafka-Connect: platform instance resolution: Fixed incorrect platform instance resolution in the Kafka Connect schema resolver. (datahub-project#16526) by @kevinkarchacryl
- Kafka-Connect: JDBC sink DataJobs when runtime topics API is empty: Fixed an issue where DataJobs were not produced for JDBC sink connectors when the runtime topics API returned no data. (datahub-project#16557) by @acrylJonny
- Schema resolver bulk-fetch caching: Fixed a caching bug in the schema resolver's bulk-fetch path that could cause redundant API calls and slow down SQL lineage resolution. (datahub-project#16499) by @treff7es
- Pin sqlglotc dependency: Pinned sqlglotc to prevent unexpected breakage from upstream updates. (datahub-project#16614) by @jayacryl
- PyArrow minimum version bumped for CVE: Updated the minimum pyarrow version to address a known security vulnerability. (datahub-project#16563) by @david-leifker
- Security dependency updates: Applied CVE minimum versions via constraints and bumped Authlib and filelock to address known vulnerabilities. (datahub-project#16517) by @david-leifker
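The Glue UpdateTime fix in this list comes down to how a timestamp without timezone info is interpreted. The sketch below shows the general technique in plain Python; the helper name `as_utc` and the sample value are illustrative and not taken from the connector.

```python
from datetime import datetime, timezone


def as_utc(ts: datetime) -> datetime:
    """Attach UTC to a naive timestamp; convert an aware one to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)  # label only, no clock shift
    return ts.astimezone(timezone.utc)


naive = datetime(2026, 1, 15, 12, 30, 0)  # made-up sample value
aware = as_utc(naive)
print(aware.isoformat())
print(int(aware.timestamp() * 1000))  # epoch milliseconds
```

Calling .timestamp() on a naive datetime makes Python assume local time, which is the kind of silent offset that labelling the value as UTC first avoids.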
Improvements 🔧
- Reproducible ingestion builds: Added uv.lock and constraints.txt to pin all transitive dependencies, enabling fully reproducible ingestion environment builds. (datahub-project#16489) by @kyungsoo-datahub
- Lock file freshness checks: Added CI validation to verify that uv.lock and constraints.txt stay in sync with dependency manifests. (datahub-project#16559) by @kyungsoo-datahub
- Dependency constraint fixes: Added missing dependency constraints to resolve installation conflicts in certain environments. (datahub-project#16513) by @kyungsoo-datahub
- DataPlex: updated datacatalog lineage and protobuf dependencies: Upgraded the DataPlex connector to use newer library versions. (datahub-project#16560) by @sgomezvillamor
- Kafka connector configurable replication factor: Replication factor is now configurable per topic for Kafka topic ingestion. (datahub-project#16585) by @david-leifker
Documentation 📚
- Connector docs structure consistency: Standardized the structure of all ingestion connector documentation pages. (datahub-project#16431) by @sgomezvillamor
- Power BI docs updated for Entra configuration: Updated Power BI ingestion documentation to reflect Microsoft Entra (formerly Azure AD) authentication setup. (datahub-project#16519) by @ligfx
- Streamlined integrations catalog: Improved the integrations page with an expanded connector catalog and updated logos for many platforms including RDF, Confluent, DataPlex, and more. (datahub-project#16597) by @shirshanka
- RDF connector documentation: Added documentation for the new RDF ingestion connector. (datahub-project#16589, datahub-project#16617) by @shirshanka
Contributors
Thanks to all contributors: @treff7es, @stephengoldbaum, @pedro93, @rob-1019, @acrylJonny, @ligfx, @shirshanka, @kyungsoo-datahub, @alokr-dhub, @sgomezvillamor, @jayacryl, @askumar27, @kevinkarchacryl, @david-leifker, @Dutt23
Full Changelog: v1.4.0.6...v1.4.0.7
v1.4.0.8rc1
DataHub Ingestion v1.4.0.8rc1
🌟 New Features
- Configurable ingestion report sample sizes — You can now control how many failure and warning entries appear in ingestion reports via environment variables DATAHUB_REPORT_FAILURE_SAMPLE_SIZE and DATAHUB_REPORT_WARNING_SAMPLE_SIZE (default: 10 each). Useful when debugging large ingestion runs where you need more failure context. (datahub-project#16165) — thanks @rob-1019!
🐛 Bug Fixes
- Pin sqlglot dependency — Pinned sqlglotc to prevent unexpected version drift that could break SQL parsing in ingestion sources. (datahub-project#16614) — thanks @jayacryl!
📄 Documentation
- Fix dead link on Integrations page — Added the missing request-connector documentation page, resolving a broken link on the Integrations page that pointed to a non-existent route. (datahub-project#16617) — thanks @shirshanka!
Full Changelog: v1.4.0.7...v1.4.0.8rc1
v1.4.0.7rc3
DataHub Ingestion — v1.4.0.7rc3
🌟 New Features
- Configurable ingestion report sample sizes — Control how many failure and warning samples appear in ingestion reports via environment variables (DATAHUB_REPORT_FAILURE_SAMPLE_SIZE, DATAHUB_REPORT_WARNING_SAMPLE_SIZE). Defaults remain unchanged at 10. (datahub-project#16165, @rob-1019)
- SQLAlchemy profiler feature parity with Great Expectations — The SQLAlchemy profiler now matches GE behavior: generates basic profiles even when row count fails due to permission errors, skips column profiling for empty tables (performance win), and adds support for DECIMAL/NUMERIC column types via a new ProfilerDataType.NUMERIC type. (datahub-project#16529, @sgomezvillamor)
- Improved Integrations catalog page — The integrations catalog is now auto-generated from docgen with descriptions, logos, support tiers, and platform type metadata. The page features category pill filters, support-level badges, and improved card layout. (datahub-project#16597, @shirshanka)
🐛 Bug Fixes
- Kafka Connect: column-level lineage URN fix — When building column-level lineage URNs for Kafka topics, the connect_to_platform_map was being ignored. The fix uses the existing get_platform_instance() helper that correctly checks both platform_instance_map and connect_to_platform_map. (datahub-project#16526, @kevinkarchacryl)
- Kafka Connect: JDBC sink DataJobs when runtime topics API is empty — When a JDBC sink connector hasn't yet processed messages or after a topic reset, the runtime topics API returns an empty list, causing lineage edges and DataJob entities to be silently dropped. The fix falls back to config-defined topics when the runtime API returns nothing. (datahub-project#16557, @acrylJonny)
- RDF connector: fixed docGen CI failure — docgen.py threw a KeyError: 'rdf' because it only iterated source_registry plugins. Now iterates the union of source_registry and connector_registry, fixing CI on master. Also adds missing lockfiles from the original RDF PR. (datahub-project#16589, @shirshanka)
- Pin sqlglot to a stable version to prevent unexpected breakage from upstream releases. (datahub-project#16614, @jayacryl)
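The JDBC sink fix above follows a simple "prefer runtime data, fall back to configuration" rule. A minimal sketch of that rule is shown here, assuming the configured topics come from the comma-separated `topics` setting of a sink connector; the function names `resolve_topics` and `configured_topics` are hypothetical, not the connector's own.

```python
from typing import List, Optional, Sequence


def configured_topics(connector_config: dict) -> List[str]:
    """Split the comma-separated 'topics' setting of a sink connector."""
    raw = connector_config.get("topics", "")
    return [t.strip() for t in raw.split(",") if t.strip()]


def resolve_topics(
    runtime_topics: Optional[Sequence[str]], config_topics: Sequence[str]
) -> List[str]:
    """Prefer topics reported at runtime; fall back to configured ones."""
    if runtime_topics:  # non-empty list from the runtime API
        return list(runtime_topics)
    return list(config_topics)  # None or [] both trigger the fallback


config = {"topics": "orders, payments"}  # sample connector config
print(resolve_topics([], configured_topics(config)))
print(resolve_topics(["orders"], configured_topics(config)))
```

Treating an empty runtime response the same as a missing one is what keeps lineage edges from being dropped for connectors that have not processed any messages yet.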
🔒 Security
- Protobuf upgraded to 5.x (CVE-2026-0994) — The google-cloud-datacatalog-lineage dependency was pinned to 0.2.2, constraining protobuf to vulnerable versions <5.x. Updated to >=0.5.0,<1.0.0 with migrated import paths and regenerated lockfiles. (datahub-project#16560, @sgomezvillamor)
📚 Documentation
- Added a request-connector page to fix a dead link on the Integrations page. The page guides users to request new connectors via the FeatureOS portal, GitHub issues, or by building their own. (datahub-project#16617, @shirshanka)
Full Changelog: v1.4.0.7rc2...v1.4.0.7rc3
v1.4.0.7rc2
DataHub v1.4.0.7rc2 Release Notes (Ingestion)
New Features
- RDF Connector (MVP) (#15741) — New connector for RDF/Linked Data metadata ingestion, supporting Turtle, N-Triples, and other RDF serialization formats. by @stephengoldbaum
- GraphQL Query Projection System (#16522) — Introduces a GraphQL query projection system for schema compatibility, improving the reliability of the CLI and Python SDK against varying server versions. by @shirshanka
- SQLAlchemy Profiler Feature Parity (#16529) — The SQLAlchemy profiler now achieves full feature parity with the Great Expectations profiler, including improved type mapping. by @sgomezvillamor
- Configurable Report Sample Sizes (#16165) — Adds configurable sample sizes for ingestion reports and enhanced failure logging for better observability. by @rob-1019
- PowerBI Column-Level Lineage Enabled by Default (#16568) — extract_column_level_lineage is now true by default for PowerBI. Set extract_column_level_lineage: false to restore the previous behavior. by @ligfx
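To keep the pre-upgrade PowerBI behavior, the flag has to be set explicitly in the recipe. The sketch below expresses a recipe as a Python dictionary; the credential fields are placeholders added for illustration, and only the extract_column_level_lineage setting comes from the note above.

```python
import json

recipe = {
    "source": {
        "type": "powerbi",
        "config": {
            # Placeholder credentials; replace with your own values.
            "tenant_id": "<tenant-id>",
            "client_id": "<client-id>",
            "client_secret": "<client-secret>",
            # Explicit opt-out of the new default.
            "extract_column_level_lineage": False,
        },
    },
}

print(json.dumps(recipe, indent=2))
```

In a YAML recipe the same opt-out is the single line `extract_column_level_lineage: false` under the source config.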
Bug Fixes
- Emit Modes Regression (#16521) — Restored correct async/sync/test emit modes in the DataHub REST sink following a regression introduced in datahub-project#15968. by @askumar27
- Kafka Connect JDBC Sink Datajobs (#16557) — Kafka Connect now correctly emits DataJob entities for JDBC sink connectors when the runtime topics API returns empty results. by @acrylJonny
- Kafka Platform Instance Helper (#16526) — Fixed platform instance resolution in the Kafka Connect schema resolver. by @kevinkarchacryl
- BigQuery Thread Safety (#16579) — BigQuery ingestion now passes explicit credentials to all GCP clients, preventing credential sharing across threads. by @jayacryl
- Glue Table UpdateTime Timezone (#16561) — Fixed AWS Glue table UpdateTime not being correctly interpreted as UTC. by @ligfx
Security
- PyArrow Minimum Version (#16563) — Bumped minimum pyarrow version to address CVE-2026-25087. by @david-leifker
Other Improvements
- Updated google-cloud-datacatalog lineage and protobuf dependencies (#16560) by @sgomezvillamor
- Pinned sqlglotc for ingestion stability (#16614) by @jayacryl
Documentation
- Streamlined Integrations page and improved connector catalog generation (#16597) by @shirshanka
- Added request-connector page, fixing a dead link on the Integrations page (#16617) by @shirshanka
- Fixed doc generation failure for the RDF connector (#16589) by @shirshanka
Contributors
Thanks to all contributors for this release: @acrylJonny, @askumar27, @david-leifker, @jayacryl, @kevinkarchacryl, @ligfx, @rob-1019, @sgomezvillamor, @shirshanka, @stephengoldbaum, and @alokr-dhub.
Full Changelog: v1.4.0.7rc1...v1.4.0.7rc2
v1.4.0.6
DataHub Ingestion v1.4.0.6
🚨 Breaking Changes
- Oracle connector URN update — When connecting via service_name to a multitenant Oracle database, dataset URNs now use the Pluggable Database (PDB) name instead of the Container Database (CDB) name. Set urn_db_name in your recipe to preserve old URNs. (datahub-project#16396) — @acrylJonny
- Python packaging migration — Dependency declarations now use pyproject.toml (PEP 621). setup.py remains the source of truth for now but will be deprecated in a future release. (datahub-project#16339) — @kyungsoo-datahub
🌟 New Features
- New Snowplow connector — Ingest metadata from Snowplow analytics pipelines (datahub-project#15735) — @treff7es
- dbt semantic models — Full support for ingesting dbt semantic model metadata (datahub-project#16236) — @alfiyas-datahub
- dbt convert_urns_to_lowercase — Opt-in flag to prevent duplicate entities from mixed-case identifiers on case-insensitive platforms like Snowflake (datahub-project#16358) — @alfiyas-datahub
- Snowflake pattern pushdown — Metadata pattern pushdown and table type filtering for improved performance (datahub-project#16100) — @rajatoss
- Trino column-level lineage — Column-level lineage support on upstreamLineage (datahub-project#16292) — @alfiyas-datahub
- Iceberg domain assignment — Ingestion-time domain assignment for Iceberg sources (datahub-project#16443) — @sergey-pozdnyakov-epam
- MongoDB AWS IAM auth — Added pymongo[aws] extra for AWS IAM authentication (datahub-project#16412) — @javabrett
- Kafka Avro validation toggle — Option to disable Avro schema name validation (datahub-project#16310) — @Devarsh23
- Kafka Connect bundled JVM — Bundle JVM via jdk4py to remove system Java dependency (datahub-project#16445) — @StanDmitrievAiven
- CLI agent improvements — Agent-friendly datahub graphql and datahub init enhancements (datahub-project#16476) — @shirshanka
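The duplicate-entity problem that convert_urns_to_lowercase addresses can be seen with a couple of sample identifiers. This is an illustrative sketch only: `dataset_urn` is a hypothetical helper that formats a dataset URN string, and the identifiers are made up.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD", lowercase: bool = False) -> str:
    """Build a dataset URN string, optionally lowercasing the dataset name."""
    if lowercase:
        name = name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The same table, spelled two ways by different tools (sample identifiers).
names = ["ANALYTICS.PUBLIC.ORDERS", "analytics.public.orders"]

as_is = {dataset_urn("snowflake", n) for n in names}
lowered = {dataset_urn("snowflake", n, lowercase=True) for n in names}

print(len(as_is), len(lowered))  # prints "2 1"
```

Because URNs are compared as exact strings, two spellings of one table become two entities unless the names are normalized before the URN is built.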
🐛 Bug Fixes
- Redshift — Boundary-aware segment stitching for query reconstruction (datahub-project#16253) — @kyungsoo-datahub
- Tableau — Apply project filters to embedded datasources when emit_all_embedded_datasources is enabled (datahub-project#16340) — @aviraj-gour
- Dagster — Preserve DataJob lineage on failed/canceled runs (datahub-project#16386) — @treff7es
- Snowflake — Quoting fix (datahub-project#16393), map COPY query type to INSERT (datahub-project#16461) — @treff7es
- Teradata — Set DATABASE context for view HELP commands (datahub-project#16208) — @JohnRTurner
- Kafka Connect — Use canonical mssql platform for Debezium SQL Server (datahub-project#16413) — @treff7es
- Oracle — Fix profiling crashes and silent table exclusions (datahub-project#16396) — @acrylJonny
- Snowplow — Add missing cachetools dependency (datahub-project#16442) — @treff7es
⚡ Performance
- Snowflake tags — Emulate tag inheritance in-memory to eliminate N+1 queries (datahub-project#16400) — @treff7es
- Dataplex — Streamline ingestion by removing unnecessary entity lookups (datahub-project#16063) — @NehaGslab
🔧 Maintenance
- Upgrade to urllib3 v2 (datahub-project#16464) — @sgomezvillamor
- Iceberg source recipe examples updated (datahub-project#16417) — @skrydal
📚 Documentation
- Microsoft Copilot Context Kit guide and misc docs improvements (datahub-project#16452) — @jjoyce0510
- Connector development guide for datahub-skills (datahub-project#16435) — @maggiehays
Contributors: @acrylJonny, @alfiyas-datahub, @aviraj-gour, @Devarsh23, @javabrett, @jjoyce0510, @JohnRTurner, @kyungsoo-datahub, @maggiehays, @NehaGslab, @rajatoss, @sergey-pozdnyakov-epam, @sgomezvillamor, @shirshanka, @skrydal, @StanDmitrievAiven, @treff7es
v1.4.0.7rc1
Full Changelog: v1.4.0.6...v1.4.0.7rc1
v1.4.0.6rc5
DataHub Ingestion v1.4.0.6rc5
🚨 Breaking Changes
- Dependency declarations migrated to pyproject.toml (PEP 621) — setup.py remains the source of truth for editing dependencies for now; pyproject.toml is auto-generated via ./gradlew :metadata-ingestion:generatePyprojectDeps. setup.py will be deprecated in a future release. (datahub-project#16339) — @kyungsoo-datahub
🌟 Features
- dbt: Semantic model support — Ingestion now extracts dbt semantic models, including entities, measures, and dimensions, from both dbt Cloud and dbt Core. (datahub-project#16236) — @alfiyas-datahub
- Iceberg: Ingestion-time domain assignment — You can now assign domains to Iceberg datasets at ingestion time via source configuration. (datahub-project#16443) — @sergey-pozdnyakov-epam
- CLI: Agent-friendly datahub graphql and datahub init — The CLI now supports schema introspection, operation discovery, and dry-run mode for GraphQL queries, making it easier to integrate with AI agents and automation. (datahub-project#16476) — @shirshanka
⚡ Performance
- Snowflake: In-memory tag inheritance — Tag inheritance is now emulated in-memory, eliminating N+1 queries against Snowflake and significantly improving ingestion speed for tagged environments. (datahub-project#16400) — @treff7es
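One way to picture in-memory tag inheritance is to load the directly assigned tags once and resolve each table's effective tags locally, instead of querying per object. The sketch below is a simplified illustration under that assumption, not the connector's implementation; `effective_tags`, the three-part naming, and the sample tags are all hypothetical.

```python
from typing import Dict

Tags = Dict[str, str]  # tag name -> tag value


def effective_tags(table: str, direct: Dict[str, Tags]) -> Tags:
    """Merge tags from database, schema, and table; the nearest level wins."""
    db, schema, _ = table.split(".")
    merged: Tags = {}
    for key in (db, f"{db}.{schema}", table):  # outermost to innermost
        merged.update(direct.get(key, {}))
    return merged


# Directly assigned tags, assumed to be loaded once up front.
direct_tags = {
    "sales": {"owner": "finance"},
    "sales.public": {"pii": "false"},
    "sales.public.customers": {"pii": "true"},
}

print(effective_tags("sales.public.customers", direct_tags))
print(effective_tags("sales.public.orders", direct_tags))
```

With the direct assignments held in a dictionary, resolving N tables costs N local lookups rather than N additional round trips, which is where the N+1 savings would come from.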
🐛 Bug Fixes
- Tableau: Project filters now apply to embedded datasources — When emit_all_embedded_datasources is enabled, project filters are correctly applied to embedded datasources. (datahub-project#16340) — @aviraj-gour
🛠️ Maintenance
- urllib3 upgraded to v2 — The ingestion framework now uses urllib3 v2, bringing improved connection handling and modern TLS defaults. (datahub-project#16464) — @sgomezvillamor
📖 Documentation
- DataHub Skills connector development guide added. (datahub-project#16435) — @maggiehays
- Microsoft Copilot Context Kit guide and miscellaneous docs improvements. (datahub-project#16452) — @jjoyce0510