@MassyB MassyB commented May 28, 2025

Resolves #11750

TL;DR

This PR integrates dbt with OpenLineage. It unlocks lineage tracking and observability for dbt pipelines.

OpenLineage is an open-source standard. From its main page:

OpenLineage is an open platform for collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, giving users the information required to identify the root cause of complex issues and understand the impact of changes. OpenLineage contains an open standard for lineage data collection, a metadata repository reference implementation (Marquez), libraries for common languages, and integrations with data pipeline tools.

OpenLineage (OL) defines events according to a specification. This PR constructs those OL events by consuming the dbt structured logs and sends them to an endpoint.

The endpoint that consumes OL events is fully configurable by the user. It can be Marquez, Datadog, or something else. The examples in this PR use Datadog.
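To make the "configurable endpoint" idea concrete, here is a minimal sketch of how an OL event could be posted to whatever backend the user points at. It is not the code in this PR, which relies on the OpenLineage Python client. The sketch assumes the endpoint is read from the `OPENLINEAGE_URL` environment variable and that the backend accepts events on the `/api/v1/lineage` path, as Marquez does; `build_lineage_request` and the localhost default are hypothetical.

```python
import json
import os
import urllib.request


def build_lineage_request(event: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying one OL event.

    Assumption: the backend URL comes from OPENLINEAGE_URL and events are
    accepted on /api/v1/lineage. The default below is only a placeholder.
    """
    base_url = os.environ.get("OPENLINEAGE_URL", "http://localhost:5000")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Switching backends only requires changing the environment variable.
request = build_lineage_request({"eventType": "START"})
print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`; it is left out so the sketch runs without a backend.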

Problem

Let's build the jaffle shop project using the following command:

dbt build

This produces the following output:

root@1805096f0987:/usr/src/dbt# dbt build
12:05:17  Running with dbt=1.9.0
12:05:17  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
12:05:18  Registered adapter: postgres=1.9.0
12:05:21  Found 5 models, 1 snapshot, 3 seeds, 23 data tests, 1 source, 434 macros
12:05:21
12:05:21  Concurrency: 2 threads (target='pg')
12:05:21
12:05:22  1 of 32 START seed file public.raw_customers ................................... [RUN]
12:05:22  2 of 32 START seed file public.raw_orders ...................................... [RUN]
12:05:24  1 of 32 OK loaded seed file public.raw_customers ............................... [INSERT 100 in 1.33s]
12:05:24  2 of 32 OK loaded seed file public.raw_orders .................................. [INSERT 99 in 1.34s]
12:05:24  3 of 32 START seed file public.raw_payments .................................... [RUN]
12:05:24  4 of 32 START test source_unique_jaffle_shop_dim_date_id ....................... [RUN]
12:05:24  4 of 32 PASS source_unique_jaffle_shop_dim_date_id ............................. [PASS in 0.65s]
12:05:24  5 of 32 START sql view model public.stg_customers .............................. [RUN]
....
12:05:32  32 of 32 PASS unique_orders_order_id ........................................... [PASS in 0.35s]
12:05:32  31 of 32 PASS relationships_orders_customer_id__customer_id__ref_customers_ .... [PASS in 0.37s]
12:05:32
12:05:32  Finished running 3 seeds, 1 snapshot, 2 table models, 23 data tests, 3 view models in 0 hours 0 minutes and 10.92 seconds (10.92s).
12:05:33  Done. PASS=32 WARN=0 ERROR=0 SKIP=0 TOTAL=32

I've truncated the output, but we can see that:

  • Each resource (model, test, snapshot …) logs when it starts and when it ends. Some lines are interleaved because dbt is running with two threads
  • A summary of the whole execution is printed at the end

This output doesn't show the SQL queries executed for each model. We can use the --debug flag for that:

dbt build --debug

We have the following output:

12:11:28  Running with dbt=1.9.0
12:11:28  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'profiles_dir': '/usr/src/dbt/profiles', 'version_check': 'True', 'warn_error': 'None', 'log_path': '/usr/src/dbt/logs', 'fail_fast': 'False', 'debug': 'True', 'use_colors': 'False', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'empty': 'False', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'introspect': 'True', 'invocation_command': 'dbt build --debug', 'log_format': 'default', 'target_path': 'None', 'static_parser': 'True', 'send_anonymous_usage_stats': 'False'}
12:11:28  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
12:11:29  Registered adapter: postgres=1.9.0
12:11:30  checksum: c99e828bba267739642b5a3ce85f17518764ea526e0e6c4fdc649171c1a66bff, vars: {}, profile: , target: , version: 1.9.0
12:11:31  Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
12:11:31  Partial parsing enabled, no changes found, skipping parsing
12:11:32  Wrote artifact WritableManifest to /usr/src/dbt/target/manifest.json
12:11:32  Wrote artifact SemanticManifest to /usr/src/dbt/target/semantic_manifest.json
12:11:33  Found 5 models, 1 snapshot, 3 seeds, 23 data tests, 1 source, 434 macros
12:11:33
12:11:33  Concurrency: 2 threads (target='pg')
12:11:33
12:11:33  Acquiring new postgres connection 'master'
12:11:33  Acquiring new postgres connection 'list_postgres'
12:11:33  Acquiring new postgres connection 'list_postgres'
12:11:33  Using postgres connection "list_postgres"
12:11:33  Using postgres connection "list_postgres"
12:11:33  On list_postgres: /* {"app": "dbt", "dbt_version": "1.9.0", "profile_name": "jaffle_shop", "target_name": "pg", "connection_name": "list_postgres"} */

    select distinct nspname from pg_namespace

12:11:33  On list_postgres: /* {"app": "dbt", "dbt_version": "1.9.0", "profile_name": "jaffle_shop", "target_name": "pg", "connection_name": "list_postgres"} */

    select distinct nspname from pg_namespace

12:11:33  Opening a new connection, currently in state init
12:11:33  Opening a new connection, currently in state init
12:11:33  SQL status: SELECT 11 in 0.136 seconds
12:11:33  SQL status: SELECT 11 in 0.134 seconds
12:11:33  On list_postgres: Close
12:11:33  On list_postgres: Close
12:11:33  Re-using an available connection from the pool (formerly list_postgres, now list_postgres_snapshots)
12:11:33  Re-using an available connection from the pool (formerly list_postgres, now list_postgres_public)
12:11:33  Using postgres connection "list_postgres_snapshots"
12:11:33  Using postgres connection "list_postgres_public"
12:11:33  On list_postgres_snapshots: BEGIN
12:11:33  On list_postgres_public: BEGIN
12:11:33  Opening a new connection, currently in state closed
12:11:33  Opening a new connection, currently in state closed
12:11:33  SQL status: BEGIN in 0.086 seconds
12:11:33  SQL status: BEGIN in 0.084 seconds
12:11:33  Using postgres connection "list_postgres_snapshots"
12:11:33  Using postgres connection "list_postgres_public"
....
12:11:42  On test.jaffle_shop.unique_customers_customer_id.c5af1ff4b1: ROLLBACK
12:11:42  SQL status: SELECT 1 in 0.002 seconds
12:11:42  On test.jaffle_shop.unique_customers_customer_id.c5af1ff4b1: Close
12:11:42  On test.jaffle_shop.accepted_values_orders_status__placed__shipped__completed__return_pending__returned.be6b5b5ec3: ROLLBACK
12:11:42  21 of 32 PASS unique_customers_customer_id ..................................... [PASS in 0.43s]
12:11:42  On test.jaffle_shop.accepted_values_orders_status__placed__shipped__completed__return_pending__returned.be6b5b5ec3: Close
12:11:42  Finished running node test.jaffle_shop.unique_customers_customer_id.c5af1ff4b1
12:11:42  22 of 32 PASS accepted_values_orders_status__placed__shipped__completed__return_pending__returned  [PASS in 0.45s]
12:11:42  Began running node test.jaffle_shop.foo_bar_test_orders_bim.45fc81421f
12:11:42  Finished running node test.jaffle_shop.accepted_values_orders_status__placed__shipped__completed__return_pending__returned.be6b5b5ec3
12:11:42  23 of 32 START test foo_bar_test_orders_bim .................................... [RUN]
12:11:42  Began running node test.jaffle_shop.not_null_orders_amount.106140f9fd
...
12:11:45  Done. PASS=32 WARN=0 ERROR=0 SKIP=0 TOTAL=32
12:11:45  Resource report: {"command_name": "build", "command_success": true, "command_wall_clock_time": 17.469067, "process_in_blocks": "0", "process_kernel_time": 0.741213, "process_mem_max_rss": "176572", "process_out_blocks": "0", "process_user_time": 22.572857}
12:11:45  Command `dbt build` succeeded at 12:11:45.380948 after 17.48 seconds

  • The SQL queries are interleaved, making the output hard to read and the execution flow hard to follow

Observability of the dbt pipeline is not ideal:

  1. For projects with 100+ resources, the logs become difficult to parse
  2. With --debug and parallelism enabled, the execution is even harder to follow, even for small projects, because the logs are interleaved
  3. The execution time of SQL queries is not reported
  4. The lineage (input/output) of models is not reported

This PR enhances the observability of dbt pipelines and addresses the shortcomings listed above.

Solution

Instead of relying on textual logs to report the progress of the dbt pipeline, this PR integrates dbt-core with OpenLineage, similar to what has been done for Apache Airflow.

Below are examples of how we leverage those OL events in Datadog to report on the progress of dbt pipelines.
When running:

dbt run --select orders 

In the waterfall view we can see:

  • The duration of the whole pipeline
  • The model that has been executed
  • The SQL queries executed in the data platform

Pasted image 20250314155433

This is the view when we build the entire jaffle shop project:

  • Errors are marked on the specific resource where they occurred (a test in the example below)

Pasted image 20250314161040

Here is an interesting flame graph view of the jaffle shop project executed with two threads:

image

PR details

You can see a presentation of the integration in this short YouTube video (the relevant part is ~10 minutes long). Be sure to check the linked PRs for more context.

In a nutshell, this PR adds a callback that listens for particular dbt structured log events.
For each of those events, an OL event is generated and emitted.
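The callback idea above can be sketched as follows. This is an illustration only, not the handler added by this PR: events are modeled as plain dicts rather than dbt's event objects, the `NodeStart`/`NodeFinished` names and the `node_info` fields are assumed to match what dbt writes in its JSON logs, and `LineageCallback` and `EVENT_TO_STATE` are hypothetical names. The emitted dicts carry only a subset of the fields an OL run event has (for example, `producer` and `schemaURL` are omitted).

```python
import uuid
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Assumed mapping from dbt structured-log event names to OL run states.
EVENT_TO_STATE = {"NodeStart": "START", "NodeFinished": "COMPLETE"}


class LineageCallback:
    """Turns dict-shaped dbt log events into dict-shaped OL run events."""

    def __init__(self, emit: Callable[[dict], None], namespace: str = "dbt"):
        self.emit = emit
        self.namespace = namespace
        self.run_ids: Dict[str, str] = {}  # node unique_id -> OL runId

    def handle(self, event: dict) -> None:
        state = EVENT_TO_STATE.get(event["info"]["name"])
        if state is None:
            return  # not an event this sketch translates
        node = event["data"]["node_info"]
        unique_id = node["unique_id"]
        # The START and the terminal event of a node must share one runId.
        run_id = self.run_ids.setdefault(unique_id, str(uuid.uuid4()))
        # Assumption: a failed node reports node_status == "error".
        if state == "COMPLETE" and node.get("node_status") == "error":
            state = "FAIL"
        self.emit({
            "eventType": state,
            "eventTime": datetime.now(timezone.utc).isoformat(),
            "run": {"runId": run_id},
            "job": {"namespace": self.namespace, "name": unique_id},
            "inputs": [],   # lineage datasets would be filled in here
            "outputs": [],
        })


# Usage: collect emitted events in a list instead of sending them.
sent: List[dict] = []
callback = LineageCallback(sent.append)
node = {"unique_id": "model.jaffle_shop.orders", "node_status": "success"}
callback.handle({"info": {"name": "NodeStart"}, "data": {"node_info": node}})
callback.handle({"info": {"name": "NodeFinished"}, "data": {"node_info": node}})
print([e["eventType"] for e in sent])
```

Keying the run ID on the node's unique ID is what lets a backend pair the start and end of each resource, even when log lines from several threads are interleaved.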

How to test

This PR adds functional tests that check the generated OL events against the expected ones.
You can run them by setting up a dev environment and executing the following command:

pytest "tests/functional/openlineage/openlineage_project.py"

If there is a failure, you will get a JSON-path-like pointer to the attribute that has a discrepancy.

For unit tests, you can run:

pytest "tests/unit/openlineage/"

Linked PRs/Issues

Additional context from the OpenLineage repository

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

PS

Perhaps the most important motivation of this PR: tell your dbt teammate @le-brice Massy says hi.

cla-bot bot commented May 28, 2025

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @MassyB

Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from a84a7ec to 9ccb1ed Compare May 28, 2025 15:58
cla-bot bot commented May 28, 2025

CLA has not been signed by users: @MassyB

@MassyB MassyB closed this May 30, 2025
@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from 9ccb1ed to f26d822 Compare May 30, 2025 13:40
@cla-bot cla-bot bot added the cla:yes label May 30, 2025
@MassyB MassyB reopened this May 30, 2025
cla-bot bot commented May 30, 2025

CLA has not been signed by users: @MassyB

@cla-bot cla-bot bot removed the cla:yes label May 30, 2025
@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from 4fc0388 to d5c0d5c Compare May 30, 2025 16:25
cla-bot bot commented May 30, 2025

CLA has not been signed by users: @MassyB

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from d5c0d5c to d77afdd Compare May 30, 2025 16:28
cla-bot bot commented May 30, 2025

CLA has not been signed by users: @MassyB

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from d77afdd to 4167064 Compare June 3, 2025 21:27
cla-bot bot commented Jun 3, 2025

CLA has not been signed by users: @MassyB

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from 4167064 to 9e2209b Compare June 3, 2025 21:30
cla-bot bot commented Jun 3, 2025

CLA has not been signed by users: @MassyB

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from 9e2209b to a1d67cf Compare June 3, 2025 21:40
cla-bot bot commented Jun 3, 2025

CLA has not been signed by users: @MassyB

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from a1d67cf to fd8fea2 Compare June 3, 2025 21:44
cla-bot bot commented Jun 3, 2025

CLA has not been signed by users: @MassyB


def add_to_parser(self, parser: OptionParser, ctx: Context):
def parser_process(value: str, state: ParsingState):
@t.no_type_check
Author

My pre-commit mypy step was failing on this, so I added some annotations and mypy ignore comments.

@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from fd8fea2 to 4e338af Compare June 10, 2025 09:14
@cla-bot cla-bot bot added the cla:yes label Jun 10, 2025
ol_handler = OpenLineageHandler(ctx)
callbacks = ctx.obj.get("callbacks", [])
if is_runnable_dbt_command(flags):
    callbacks.append(ol_handler.handle)
Author

this is where the OL callback is added

)


ALL_PROTO_TYPES: Dict[str, Any] = {
Author

useful to convert a dict to an actual type defined in proto

return f"Artifacts skipped for command : {self.msg}"


class OpenLineageException(WarnLevel):
Author

Now that all the events have been moved to https://github.com/dbt-labs/proto-python-public,
how do we add a new event?
The documentation still references core_types.proto, but I couldn't find it.

@@ -0,0 +1,410 @@
import traceback
Author

this is the most important part of the PR where we construct OL events out of the dbt structured logs

"pydantic<2",
# ----
# OpenLineage Dependencies
"openlineage-python==1.30.1",
Author

Only the Python client is added from OL.

return ParseDict(e, msg_cls())


def assert_ol_events_match(expected_event, actual_event):
Author

This is the main function used in functional tests to assert that two sets of events are the same.

The interesting part is a regex-like feature, where a pattern such as {{ .* }} is used to match a given string.

You can use a regex by enclosing it like so:
{{<space><YOUR-REGEX-HERE><space>}}
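A minimal sketch of how such a matcher could work, assuming the convention described above (a string of the form `{{ REGEX }}` in the expected event is treated as a regex that must match the whole actual value). This is not the PR's `assert_ol_events_match`; `find_mismatches` is a hypothetical helper that returns JSON-path-like locations instead of raising, and it ignores keys that appear only in the actual event.

```python
import re
from typing import Any, List

# "{{", one space, the regex, one space, "}}"
PATTERN = re.compile(r"^\{\{ (.*) \}\}$")


def find_mismatches(expected: Any, actual: Any, path: str = "$") -> List[str]:
    """Return JSON-path-like locations where actual does not match expected."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [path]
        mismatches: List[str] = []
        for key, value in expected.items():
            if key not in actual:
                mismatches.append(f"{path}.{key}")
            else:
                mismatches.extend(find_mismatches(value, actual[key], f"{path}.{key}"))
        return mismatches
    if isinstance(expected, list):
        if not isinstance(actual, list) or len(expected) != len(actual):
            return [path]
        mismatches = []
        for index, (exp, act) in enumerate(zip(expected, actual)):
            mismatches.extend(find_mismatches(exp, act, f"{path}[{index}]"))
        return mismatches
    if isinstance(expected, str):
        wrapped = PATTERN.match(expected)
        if wrapped:
            # The enclosed regex must match the entire actual string.
            matched = isinstance(actual, str) and re.fullmatch(wrapped.group(1), actual)
            return [] if matched else [path]
    return [] if expected == actual else [path]


expected_event = {
    "eventTime": "{{ .* }}",
    "eventType": "START",
    "run": {"runId": "{{ [0-9a-f-]{36} }}"},
}
actual_event = {
    "eventTime": "2025-05-28T12:05:17Z",
    "eventType": "FAIL",
    "run": {"runId": "not-a-uuid"},
}
print(find_mismatches(expected_event, actual_event))
```

Volatile fields such as timestamps and run IDs change on every run, which is why the expected files need patterns rather than literal values.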

try:
    self.handle_unsafe(e)
except Exception as exception:
    self._handle_exception(exception)
Author

All exceptions related to OL are non-critical: they don't make dbt fail.
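The same "never fail the host process" pattern can be sketched in isolation. This is a generic illustration, not the PR's handler: `non_critical` and the logger name are hypothetical, and the failure is reported through the standard `logging` module rather than through a dbt event.

```python
import logging
import traceback
from typing import Callable

logger = logging.getLogger("openlineage_sketch")  # hypothetical logger name


def non_critical(handler: Callable[[dict], None]) -> Callable[[dict], None]:
    """Wrap a callback so that its failures are logged instead of raised."""

    def safe_handler(event: dict) -> None:
        try:
            handler(event)
        except Exception as exc:
            # Report the problem as a warning and let the caller continue.
            logger.warning("Lineage handler failed: %s\n%s", exc, traceback.format_exc())

    return safe_handler


def broken_handler(event: dict) -> None:
    raise ValueError("cannot build lineage event")


safe = non_critical(broken_handler)
safe({"info": {"name": "NodeStart"}})  # logs a warning, does not raise
```

Catching `Exception` rather than `BaseException` means that `KeyboardInterrupt` and `SystemExit` still propagate, so the run can still be interrupted normally.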

self._handle_exception(exception)

def _handle_exception(self, e: Exception):
    fire_event(OpenLineageException(exc=str(e), exc_info=traceback.format_exc()))
Author

I need help adding this new OpenLineageException event following the new public proto.

@@ -0,0 +1,1010 @@
[
Author

This is an example of the generated OL events.

@@ -0,0 +1,1010 @@
[
{
"eventTime":"{{ .* }}",
Author

Regexes have to be written in the following form:
{{<space><REGEX><space>}}

MassyB added 2 commits June 18, 2025 16:49
Signed-off-by: Massy Bourennani <[email protected]>
Signed-off-by: Massy Bourennani <[email protected]>
@MassyB MassyB force-pushed the massy.bourennani/openlineage branch from 6dd1a3d to b9b0e9c Compare June 18, 2025 14:49
@MassyB MassyB changed the title [WIP] contirbute Openlineage to dbt-core Contirbute Openlineage to dbt-core Jun 18, 2025
@MassyB MassyB marked this pull request as ready for review June 18, 2025 15:06
@MassyB MassyB requested a review from a team as a code owner June 18, 2025 15:06
@github-actions github-actions bot added the community This PR is from a community member label Jun 18, 2025
@MassyB MassyB changed the title Contirbute Openlineage to dbt-core Contribute Openlineage to dbt-core Jun 19, 2025
Labels

cla:yes community This PR is from a community member

Development

Successfully merging this pull request may close these issues.

[Feature] Integration of dbt with Openlineage

1 participant