Releases: acryldata/datahub

v1.4.0.9

17 Mar 20:29
a5e4380

🚀 Ingestion Release Notes: v1.4.0.8 → v1.4.0.9rc1

🌟 New Features

🐛 Bug Fixes

🙌 Contributors

Thanks to @alfiyas-datahub, @EladLeev, @ligfx, @david-leifker, and @skrydal for their contributions to this release!

v1.4.0.9rc1

17 Mar 19:12
a5e4380

v1.4.0.9rc1 Pre-release

Full Changelog: v1.4.0.8...v1.4.0.9rc1

v1.4.0.8

17 Mar 13:17
cbf7f9f

DataHub Ingestion v1.4.0.8

🌟 New Features
Configurable ingestion report sample sizes — You can now control how many failure and warning entries appear in ingestion reports via environment variables DATAHUB_REPORT_FAILURE_SAMPLE_SIZE and DATAHUB_REPORT_WARNING_SAMPLE_SIZE (default: 10 each). Useful when debugging large ingestion runs where you need more failure context. (datahub-project#16165) — thanks @rob-1019!
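
As a hedged illustration, the two variables can be set for a single run by launching the CLI with a modified environment. The helper names `ingestion_env` and `run_ingestion` and the recipe path are assumptions made for this sketch; only the variable names and the default of 10 come from the note above.

```python
import os
import subprocess


def ingestion_env(failure_samples: int = 10, warning_samples: int = 10) -> dict:
    """Return a copy of the current environment with the report sample sizes set."""
    env = dict(os.environ)
    env["DATAHUB_REPORT_FAILURE_SAMPLE_SIZE"] = str(failure_samples)
    env["DATAHUB_REPORT_WARNING_SAMPLE_SIZE"] = str(warning_samples)
    return env


def run_ingestion(recipe_path: str, env: dict) -> int:
    """Run `datahub ingest -c <recipe>` with the given environment.

    Requires the datahub CLI on PATH; returns the process exit code.
    """
    return subprocess.run(["datahub", "ingest", "-c", recipe_path], env=env).returncode


if __name__ == "__main__":
    # Keep 50 failure samples for a hard-to-debug run; warnings stay at the default.
    env = ingestion_env(failure_samples=50)
    print(env["DATAHUB_REPORT_FAILURE_SAMPLE_SIZE"])
```

Passing the environment explicitly keeps the larger sample size scoped to one run instead of changing it for the whole shell session.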

🐛 Bug Fixes
Pin sqlglot dependency — Pinned sqlglotc to prevent unexpected version drift that could break SQL parsing in ingestion sources. (datahub-project#16614) — thanks @jayacryl!

📄 Documentation
Fix dead link on Integrations page — Added the missing request-connector documentation page, resolving a broken link on the Integrations page that pointed to a non-existent route. (datahub-project#16617) — thanks @shirshanka!

Full Changelog: v1.4.0.7...v1.4.0.8

v1.4.0.7

16 Mar 17:27
92367ce

What's New in v1.4.0.7

Breaking Changes 🚨

  • Browse paths for DataFlow/DataJob with platform_instance: When platform_instance is configured, DataFlow and DataJob entities now receive a browsePathsV2 aspect with the platform instance as the root path. Previously these entities were placed in a generic "Default" folder, mixing entities from multiple platform instances. Affects Fivetran, Glue, Kafka-Connect, and other sources that emit DataFlow/DataJob entities with platform_instance. Sources without platform_instance are unaffected. (datahub-project#15270) by @treff7es

  • DataHubRestEmitter.emit_mcps() return type changed: The method now returns List[TraceData] instead of int. To get the previous chunk count, use len(result) on the returned list. emit_mcp() now returns Optional[TraceData] instead of None. (datahub-project#15744) by @pedro93

  • Mode connector: SQL parsing behavior change: Join resolution for CTEs and subqueries was optimized. In rare edge cases with unusual CTE patterns, join metadata may differ from previous results. Affects all SQL-based lineage connectors. (datahub-project#16300) by @treff7es

  • PowerBI: extract_column_level_lineage now defaults to true: Previously defaulted to false. Set extract_column_level_lineage: false in your recipe to restore previous behavior. (datahub-project#16568) by @ligfx
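
For the `emit_mcps()` return type change listed above, callers that relied on the integer chunk count may want a small compatibility shim while upgrading. The sketch below is an assumption about how such a shim could look; `chunk_count` is a hypothetical helper, not part of the SDK, and it treats the returned list entries as opaque objects.

```python
from typing import List, Union


def chunk_count(result: Union[int, List[object]]) -> int:
    """Return the number of chunks sent, for both the old and new return types."""
    if isinstance(result, int):
        # Older releases returned the chunk count directly.
        return result
    # Newer releases return a list; its length is the previous chunk count,
    # as described in the breaking-change note.
    return len(result)
```

With an emitter in hand, this would be called as `chunk_count(emitter.emit_mcps(mcps))`, which behaves the same before and after the upgrade.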


New Features ✨

  • New RDF ingestion connector: Ingest metadata from RDF data sources including support for FIBO and BCBS-239 ontology dialects, glossary terms, domains, and relationships. Supports Turtle, JSON-LD, RDF/XML, and other formats, with SPARQL filtering. (datahub-project#15741) by @stephengoldbaum

  • Trace ID support in REST emitter: The SDK now exposes trace IDs for SYNC_PRIMARY and ASYNC emit modes, enabling easier debugging and status checking of ingestion operations. (datahub-project#15744) by @pedro93

  • Mode connector performance improvements: Major performance upgrade — concurrent API fetching with threading, SQL query response caching, improved rate limiting, and SQL parsing optimizations. Column-level lineage now falls back to table-level lineage when parsing times out. (datahub-project#16300) by @treff7es

  • Kafka-Connect: Debezium and Confluent JDBC sink connector support: Added support for Debezium source connectors and Confluent JDBC sink connectors, expanding lineage coverage for Kafka Connect pipelines. (datahub-project#16483) by @acrylJonny

  • datahub search CLI command: New command with semantic search, field projection, and agent context support for querying DataHub from the command line. (datahub-project#16471) by @shirshanka

  • SQLAlchemy profiler feature parity: The SQLAlchemy profiler now achieves feature parity with the GE profiler, including additional column statistics and type mappings. (datahub-project#16529) by @sgomezvillamor

  • Glue: lastModified from table UpdateTime: AWS Glue datasets now populate the lastModified field in dataset properties from the Glue table's UpdateTime. (datahub-project#16508) by @alokr-dhub

  • Configurable ingestion report sample sizes: New options to control the number of failure/warning samples kept in ingestion reports, with improved failure logging for easier debugging. (datahub-project#16165) by @rob-1019


Bug Fixes 🐛

  • Glue: fine-grained lineage without a graph: Fixed an error that prevented column-level lineage generation when no graph service is configured. (datahub-project#16494) by @ligfx

  • Glue: treat table UpdateTime as UTC: AWS Glue returns update times as UTC without timezone info; the connector now correctly interprets them as UTC. (datahub-project#16561) by @ligfx

  • BigQuery: thread-safe GCP client credentials: Fixed a thread-safety issue by passing explicit credentials to GCP clients, preventing credential sharing between threads during concurrent ingestion. (datahub-project#16579) by @jayacryl

  • Ingestion emit modes regression: Restored correct emit mode behavior (SYNC_PRIMARY, ASYNC, SYNC) after a regression in a prior release. (datahub-project#16521) by @askumar27

  • Kafka-Connect: platform instance resolution: Fixed incorrect platform instance resolution in the Kafka Connect schema resolver. (datahub-project#16526) by @kevinkarchacryl

  • Kafka-Connect: JDBC sink DataJobs when runtime topics API is empty: Fixed an issue where DataJobs were not produced for JDBC sink connectors when the runtime topics API returned no data. (datahub-project#16557) by @acrylJonny

  • Schema resolver bulk-fetch caching: Fixed a caching bug in the schema resolver's bulk-fetch path that could cause redundant API calls and slow down SQL lineage resolution. (datahub-project#16499) by @treff7es

  • Pin sqlglotc dependency: Pinned sqlglotc to prevent unexpected breakage from upstream updates. (datahub-project#16614) by @jayacryl

  • PyArrow minimum version bumped for CVE: Updated the minimum pyarrow version to address a known security vulnerability. (datahub-project#16563) by @david-leifker

  • Security dependency updates: Applied CVE minimum versions via constraints and bumped Authlib and filelock to address known vulnerabilities. (datahub-project#16517) by @david-leifker


Improvements 🔧


Documentation 📚


Contributors

Thanks to all contributors: @treff7es, @stephengoldbaum, @pedro93, @rob-1019, @acrylJonny, @ligfx, @shirshanka, @kyungsoo-datahub, @alokr-dhub, @sgomezvillamor, @jayacryl, @askumar27, @kevinkarchacryl, @david-leifker, @Dutt23

Full Changelog: v1.4.0.6...v1.4.0.7

v1.4.0.8rc1

16 Mar 22:56
0b134a5

v1.4.0.8rc1 Pre-release

DataHub Ingestion v1.4.0.8rc1

🌟 New Features

  • Configurable ingestion report sample sizes — You can now control how many failure and warning entries appear in ingestion reports via environment variables DATAHUB_REPORT_FAILURE_SAMPLE_SIZE and DATAHUB_REPORT_WARNING_SAMPLE_SIZE (default: 10 each). Useful when debugging large ingestion runs where you need more failure context. (datahub-project#16165) — thanks @rob-1019!

🐛 Bug Fixes

  • Pin sqlglot dependency — Pinned sqlglotc to prevent unexpected version drift that could break SQL parsing in ingestion sources. (datahub-project#16614) — thanks @jayacryl!

📄 Documentation

  • Fix dead link on Integrations page — Added the missing request-connector documentation page, resolving a broken link on the Integrations page that pointed to a non-existent route. (datahub-project#16617) — thanks @shirshanka!

Full Changelog: v1.4.0.7...v1.4.0.8rc1

v1.4.0.7rc3

16 Mar 16:57
92367ce

v1.4.0.7rc3 Pre-release

DataHub Ingestion — v1.4.0.7rc3

🌟 New Features

  • Configurable ingestion report sample sizes — Control how many failure and warning samples appear in ingestion reports via environment variables (DATAHUB_REPORT_FAILURE_SAMPLE_SIZE, DATAHUB_REPORT_WARNING_SAMPLE_SIZE). Defaults remain unchanged at 10. (datahub-project#16165, @rob-1019)

  • SQLAlchemy profiler feature parity with Great Expectations — The SQLAlchemy profiler now matches GE behavior: generates basic profiles even when row count fails due to permission errors, skips column profiling for empty tables (performance win), and adds support for DECIMAL/NUMERIC column types via a new ProfilerDataType.NUMERIC type. (datahub-project#16529, @sgomezvillamor)

  • Improved Integrations catalog page — The integrations catalog is now auto-generated from docgen with descriptions, logos, support tiers, and platform type metadata. The page features category pill filters, support-level badges, and improved card layout. (datahub-project#16597, @shirshanka)

🐛 Bug Fixes

  • Kafka Connect: column-level lineage URN fix — When building column-level lineage URNs for Kafka topics, the connect_to_platform_map was being ignored. The fix uses the existing get_platform_instance() helper that correctly checks both platform_instance_map and connect_to_platform_map. (datahub-project#16526, @kevinkarchacryl)

  • Kafka Connect: JDBC sink DataJobs when runtime topics API is empty — When a JDBC sink connector hasn't yet processed messages or after a topic reset, the runtime topics API returns an empty list, causing lineage edges and DataJob entities to be silently dropped. The fix falls back to config-defined topics when the runtime API returns nothing. (datahub-project#16557, @acrylJonny)

  • RDF connector: fixed docGen CI failure — docgen.py threw a KeyError: 'rdf' because it only iterated source_registry plugins. Now iterates the union of source_registry and connector_registry, fixing CI on master. Also adds missing lockfiles from the original RDF PR. (datahub-project#16589, @shirshanka)

  • Pin sqlglot to a stable version to prevent unexpected breakage from upstream releases. (datahub-project#16614, @jayacryl)
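
The JDBC sink fix above describes a fallback from runtime topics to config-defined topics. A minimal sketch of that pattern follows; the function name `resolve_topics` and its parameters are hypothetical and are not taken from the connector's code.

```python
from typing import List, Optional, Sequence


def resolve_topics(
    runtime_topics: Optional[Sequence[str]],
    config_topics: Sequence[str],
) -> List[str]:
    """Prefer topics reported at runtime; fall back to the connector config.

    An empty or missing runtime list (for example, before the connector has
    processed any messages) no longer results in an empty topic set.
    """
    if runtime_topics:
        return list(runtime_topics)
    return list(config_topics)
```

Treating both `None` and an empty list as "no data" is the design choice that keeps lineage edges from being dropped silently.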

🔒 Security

  • Protobuf upgraded to 5.x (CVE-2026-0994) — The google-cloud-datacatalog-lineage dependency was pinned to 0.2.2, constraining protobuf to vulnerable versions <5.x. Updated to >=0.5.0,<1.0.0 with migrated import paths and regenerated lockfiles. (datahub-project#16560, @sgomezvillamor)

📚 Documentation

  • Added a request-connector page to fix a dead link on the Integrations page. The page guides users to request new connectors via the FeatureOS portal, GitHub issues, or by building their own. (datahub-project#16617, @shirshanka)

Full Changelog: v1.4.0.7rc2...v1.4.0.7rc3

v1.4.0.7rc2

15 Mar 18:36
adf9bd9

v1.4.0.7rc2 Pre-release

DataHub v1.4.0.7rc2 Release Notes (Ingestion)

New Features

  • RDF Connector (MVP) (#15741) — New connector for RDF/Linked Data metadata ingestion, supporting Turtle, N-Triples, and other RDF serialization formats. by @stephengoldbaum
  • GraphQL Query Projection System (#16522) — Introduces a GraphQL query projection system for schema compatibility, improving the reliability of the CLI and Python SDK against varying server versions. by @shirshanka
  • SQLAlchemy Profiler Feature Parity (#16529) — The SQLAlchemy profiler now achieves full feature parity with the Great Expectations profiler, including improved type mapping. by @sgomezvillamor
  • Configurable Report Sample Sizes (#16165) — Adds configurable sample sizes for ingestion reports and enhanced failure logging for better observability. by @rob-1019
  • PowerBI Column-Level Lineage Enabled by Default (#16568) — extract_column_level_lineage is now true by default for PowerBI. Set extract_column_level_lineage: false to restore the previous behavior. by @ligfx
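
To opt out of the new PowerBI default, the flag goes in the source config of the recipe. The sketch below expresses a recipe as a Python dictionary; the placeholder credentials, the `datahub-rest` sink, and the server URL are assumptions for illustration, and only `extract_column_level_lineage` comes from the note above.

```python
def powerbi_recipe(column_lineage: bool = False) -> dict:
    """Build a recipe dict that sets the PowerBI column-level lineage flag explicitly."""
    return {
        "source": {
            "type": "powerbi",
            "config": {
                # Placeholder values; replace with real credentials.
                "tenant_id": "<tenant-id>",
                "client_id": "<client-id>",
                "client_secret": "<client-secret>",
                # False restores the pre-change behavior described above.
                "extract_column_level_lineage": column_lineage,
            },
        },
        # Assumed sink for this sketch.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


if __name__ == "__main__":
    recipe = powerbi_recipe()
    print(recipe["source"]["config"]["extract_column_level_lineage"])
```

Setting the flag explicitly, rather than relying on the default, keeps recipe behavior stable across upgrades in either direction.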

Bug Fixes

  • Emit Modes Regression (#16521) — Restored correct async/sync/test emit modes in the DataHub REST sink following a regression introduced in datahub-project#15968. by @askumar27
  • Kafka Connect JDBC Sink DataJobs (#16557) — Kafka Connect now correctly emits DataJob entities for JDBC sink connectors when the runtime topics API returns empty results. by @acrylJonny
  • Kafka Platform Instance Helper (#16526) — Fixed platform instance resolution in the Kafka Connect schema resolver. by @kevinkarchacryl
  • BigQuery Thread Safety (#16579) — BigQuery ingestion now passes explicit credentials to all GCP clients, preventing credential sharing across threads. by @jayacryl
  • Glue Table UpdateTime Timezone (#16561) — Fixed AWS Glue table UpdateTime not being correctly interpreted as UTC. by @ligfx
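
The Glue UpdateTime fix above comes down to treating a timestamp without timezone info as UTC rather than local time. A minimal sketch of that interpretation follows; `glue_update_time_to_millis` is a hypothetical helper, not the connector's actual function.

```python
from datetime import datetime, timezone


def glue_update_time_to_millis(update_time: datetime) -> int:
    """Convert a Glue UpdateTime to epoch milliseconds, treating naive values as UTC."""
    if update_time.tzinfo is None:
        # A naive datetime would otherwise be read as local time by timestamp().
        update_time = update_time.replace(tzinfo=timezone.utc)
    return int(update_time.timestamp() * 1000)


if __name__ == "__main__":
    print(glue_update_time_to_millis(datetime(2000, 1, 1)))
```

Datetimes that already carry timezone info are converted as-is, so the helper only changes the result for naive inputs.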

Security

Other Improvements

Documentation

  • Streamlined Integrations page and improved connector catalog generation (#16597) by @shirshanka
  • Added request-connector page, fixing a dead link on the Integrations page (#16617) by @shirshanka
  • Fixed doc generation failure for the RDF connector (#16589) by @shirshanka

Contributors

Thanks to all contributors for this release: @acrylJonny, @askumar27, @david-leifker, @jayacryl, @kevinkarchacryl, @ligfx, @rob-1019, @sgomezvillamor, @shirshanka, @stephengoldbaum, and @alokr-dhub.

Full Changelog: v1.4.0.7rc1...v1.4.0.7rc2

v1.4.0.6

12 Mar 08:03
d1e9bb4

DataHub Ingestion v1.4.0.6

🚨 Breaking Changes

  • Oracle connector URN update — When connecting via service_name to a multitenant Oracle database, dataset URNs now use the Pluggable Database (PDB) name instead of the Container Database (CDB) name. Set urn_db_name in your recipe to preserve old URNs. (datahub-project#16396) — @acrylJonny
  • Python packaging migration — Dependency declarations now use pyproject.toml (PEP 621). setup.py remains the source of truth for now but will be deprecated in a future release. (datahub-project#16339) — @kyungsoo-datahub

🌟 New Features

🐛 Bug Fixes

⚡ Performance

🔧 Maintenance

📚 Documentation


Contributors: @acrylJonny, @alfiyas-datahub, @aviraj-gour, @Devarsh23, @javabrett, @jjoyce0510, @JohnRTurner, @kyungsoo-datahub, @maggiehays, @NehaGslab, @rajatoss, @sergey-pozdnyakov-epam, @sgomezvillamor, @shirshanka, @skrydal, @StanDmitrievAiven, @treff7es

v1.4.0.7rc1

13 Mar 04:08
d4e2a4b

v1.4.0.7rc1 Pre-release

Full Changelog: v1.4.0.6...v1.4.0.7rc1

v1.4.0.6rc5

12 Mar 07:38
d1e9bb4

v1.4.0.6rc5 Pre-release

DataHub Ingestion v1.4.0.6rc5

🚨 Breaking Changes

  • Dependency declarations migrated to pyproject.toml (PEP 621) — setup.py remains the source of truth for editing dependencies for now; pyproject.toml is auto-generated via ./gradlew :metadata-ingestion:generatePyprojectDeps. setup.py will be deprecated in a future release. (datahub-project#16339) — @kyungsoo-datahub

🌟 Features

  • dbt: Semantic model support — Ingestion now extracts dbt semantic models, including entities, measures, and dimensions, from both dbt Cloud and dbt Core. (datahub-project#16236) — @alfiyas-datahub
  • Iceberg: Ingestion-time domain assignment — You can now assign domains to Iceberg datasets at ingestion time via source configuration. (datahub-project#16443) — @sergey-pozdnyakov-epam
  • CLI: Agent-friendly datahub graphql and datahub init — The CLI now supports schema introspection, operation discovery, and dry-run mode for GraphQL queries, making it easier to integrate with AI agents and automation. (datahub-project#16476) — @shirshanka

⚡ Performance

  • Snowflake: In-memory tag inheritance — Tag inheritance is now emulated in-memory, eliminating N+1 queries against Snowflake and significantly improving ingestion speed for tagged environments. (datahub-project#16400) — @treff7es
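
To illustrate the idea of emulating tag inheritance in memory, the sketch below merges directly assigned tags from the top of the object hierarchy down to a given object, so no per-object query is needed once the direct assignments have been fetched in bulk. This is an assumption about the general technique, not the connector's implementation; the `Path` and `Tags` types and the `effective_tags` function are hypothetical.

```python
from typing import Dict, Tuple

Path = Tuple[str, ...]  # e.g. ("db", "schema", "table")
Tags = Dict[str, str]   # tag name -> tag value


def effective_tags(path: Path, direct_tags: Dict[Path, Tags]) -> Tags:
    """Merge tags from the root down to `path`; closer objects override ancestors."""
    merged: Tags = {}
    for depth in range(1, len(path) + 1):
        # path[:depth] walks database -> schema -> table in order.
        merged.update(direct_tags.get(path[:depth], {}))
    return merged


if __name__ == "__main__":
    direct = {
        ("db",): {"pii": "no", "owner": "data"},
        ("db", "sales", "orders"): {"pii": "yes"},
    }
    print(effective_tags(("db", "sales", "orders"), direct))
```

Each lookup is a handful of dictionary reads, which is where the saving over one query per object would come from.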

🐛 Bug Fixes

  • Tableau: Project filters now apply to embedded datasources — When emit_all_embedded_datasources is enabled, project filters are correctly applied to embedded datasources. (datahub-project#16340) — @aviraj-gour

🛠️ Maintenance

  • urllib3 upgraded to v2 — The ingestion framework now uses urllib3 v2, bringing improved connection handling and modern TLS defaults. (datahub-project#16464) — @sgomezvillamor

📖 Documentation