Description
Describe the bug
When building base_create_snowplow_sessions_lifecycle_manifest with the snowplow__user_sql
var populated, or if snowplow__user_identifiers
has > 1 value, if the value produced is inconsistent throughout a session, the value chosen as user_identifier
can be unexpected because the values are aggregated via max()
.
This can occur for instance if user_id
is one of the preferred values and one or more event in the session does not have a user_id
; e.g. with a snowplow__user_sql
like coalesce(user_id, domain_userid)
if some events are missing the user_id
, then the domain_userid
will be selected for those events instead; the final user_identifier
when will be whichever value is largest, which will not consistently be user_id
if the random domain_userid
happens to be larger.
This is unexpected because the intent between coalescing multiple values is generally that the first found value is preferred but you will randomly get other values.
For snowplow__user_identifiers
, it would be preferable if rather than max(coalesce(..., ..., ...))
the computation was coalesce(max(...), max(...), max(...))
so preferences would be more deterministic.
For snowplow__user_sql
you can not use this approach because you can not nest aggregate functions. As a temporary workaround we're using an approach like:
coalesce(
max(user_id),
max(domain_userid),
max(network_userid)
)) over (partition by session_identifier
to turn the outer max()
call into a window function and allow the inner aggregates; this is obviously undesirable.
Steps to reproduce
- Configure
user_identifier
options that are inconsistent throughout a session - Get unexpected
user_identifier
values
Expected results
snowplow__user_sql: "coalesce(max(user_id), max(domain_userid), max(network_userid))"
select session_identifier, coalesce(max(user_id), max(domain_userid)) user_identifier
from (
select 'a' session_identifier, null user_id, '1' domain_userid
union all
select 'a' session_identifier, '2' user_id, '1' domain_userid
union all
select 'b' session_identifier, null user_id, '3' domain_userid
union all
select 'b' session_identifier, '2' user_id, '3' domain_userid
)
group by 1
session_identifier | user_identifier |
---|---|
a | 2 |
b | 2 |
Actual results
snowplow__user_sql: "coalesce(user_id, domain_userid, network_userid)"
select session_identifier, max(coalesce(user_id, domain_userid)) user_identifier
from (
select 'a' session_identifier, null user_id, '1' domain_userid
union all
select 'a' session_identifier, '2' user_id, '1' domain_userid
union all
select 'b' session_identifier, null user_id, '3' domain_userid
union all
select 'b' session_identifier, '2' user_id, '3' domain_userid
)
group by 1
session_identifier | user_identifier |
---|---|
a | 2 |
b | 3 |
Which database are you using dbt with?
- postgres
- redshift
- bigquery
- snowflake
- databricks
- other (specify: ____________)
Additional context
This won't impact most users that use the default identifier settings, but for those that have their own surrogate values used for user identification if an event is missing the surrogates for any reason this issue occurs.