Description
Is your feature request related to a problem? Please describe.
In its current state, DBT Snowplow only picks up recent events when you run dbt run to process new data.
This package offers the "incremental" materialization option to process only new events and not every event with each run.
However, this approach still makes it challenging to have fresh data with low latency (<1 minute).
For instance, consider this timeline:
- 08:40 am: an event is triggered in a browser
- 08:40 am: the Snowplow collector validates and enriches the event, then sends it to a stream
- 08:41 am: the event is stored in the data warehouse
- 08:45 am: a DBT job that runs every 5 minutes starts
- 08:47 am: the DBT job finishes running my custom model (that depends on snowplow_web_base_events_this_run)
So, my data is only available at 08:47 am, 7 minutes after the event was triggered.
These delays are very hard to compress: we can't realistically run DBT jobs every second, and each DBT job takes a few minutes to complete.
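To make the delay concrete, here is a rough sketch of the arithmetic behind the timeline above. The timestamps come from the example; the date, the variable names, and the worst_case_latency helper are made up for illustration, and the worst case assumes jobs start exactly on schedule with a constant runtime.

```python
from datetime import datetime, timedelta

# Timestamps copied from the example timeline (the date itself is arbitrary).
day = datetime(2024, 1, 1)
event_triggered = day.replace(hour=8, minute=40)
loaded_in_warehouse = day.replace(hour=8, minute=41)
job_started = day.replace(hour=8, minute=45)
job_finished = day.replace(hour=8, minute=47)

loading_delay = loaded_in_warehouse - event_triggered  # collector + load
scheduling_wait = job_started - loaded_in_warehouse    # waiting for the next run
job_runtime = job_finished - job_started               # the DBT job itself
end_to_end_latency = job_finished - event_triggered


def worst_case_latency(
    loading: timedelta, interval: timedelta, runtime: timedelta
) -> timedelta:
    """Hypothetical upper bound: the event lands just after a job started,
    so it waits one full schedule interval and then a full job run."""
    return loading + interval + runtime


print(end_to_end_latency)  # 0:07:00
print(worst_case_latency(loading_delay, timedelta(minutes=5), job_runtime))  # 0:08:00
```

Even with a loading delay of only one minute, the schedule interval and the job runtime dominate, which is why running the job more often only helps up to a point.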
Describe the solution you'd like
We could use the "lambda view" pattern and introduce a new materialization option that builds on materialized views and dynamic tables (for Snowflake).
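As I understand it, a lambda view unions the rows already processed by the last batch run with the rows that arrived since then, so readers always see fresh data. Below is a minimal sketch of that idea using SQLite from Python. It is not how the package works today: the raw_events, processed_events, and lambda_events names are hypothetical stand-ins, and a real version would apply the same transformations to the fresh rows as the batch model does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical stand-ins: raw_events is the landing table,
    -- processed_events is what the last incremental run materialized.
    CREATE TABLE raw_events (event_id TEXT PRIMARY KEY, collector_tstamp TEXT);
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY, collector_tstamp TEXT);

    -- Lambda view: batch rows plus any raw rows newer than the batch high-water mark.
    CREATE VIEW lambda_events AS
    SELECT event_id, collector_tstamp, 'batch' AS source
    FROM processed_events
    UNION ALL
    SELECT event_id, collector_tstamp, 'fresh' AS source
    FROM raw_events
    WHERE collector_tstamp > (
        SELECT COALESCE(MAX(collector_tstamp), '') FROM processed_events
    );
    """
)

# e1 was handled by the last batch run; e2 arrived afterwards.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("e1", "2024-01-01 08:30:00"), ("e2", "2024-01-01 08:41:00")],
)
conn.execute("INSERT INTO processed_events VALUES ('e1', '2024-01-01 08:30:00')")

rows = conn.execute(
    "SELECT event_id, source FROM lambda_events ORDER BY collector_tstamp"
).fetchall()
print(rows)  # [('e1', 'batch'), ('e2', 'fresh')]
```

With this shape, the batch job can keep running every few minutes while queries against the view already include events that have landed in the warehouse. The strict greater-than comparison is a simplification: events sharing the exact high-water-mark timestamp would need extra handling.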
Describe alternatives you've considered
Running DBT more frequently, but it's costly.
Are you interested in contributing towards this feature?
I am willing to help, but I am a newbie in DBT. I've tried to modify the materialization but didn't succeed in making it work.
However, I've found interesting resources that can help: