Skip to content

Deduplication in snowplow_web_page_context drops all events that have a duplicate #100

@philipherrmann

Description

@philipherrmann

If I don't get it wrong, the deduplication implemented here
https://github.com/fishtown-analytics/snowplow/blob/3795d06f365213ca4930d2447bd1580cb7031557/models/page_views/default/snowplow_web_page_context.sql#L43
drops all events that have a duplicated event_id (named root_id there), instead of keeping only the first one of those. Seems strange to me, is there a reason?

Also, for bigquery it would be quite easy to implement an incremental model for this since all the timestamps are there, right? Should I try to submit a PR?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions