Fix/sources yml dbt2 compatibility#816
Fix/sources yml dbt2 compatibility#816Derrick-Ryan-Giggs wants to merge 7 commits intoDataTalksClub:mainfrom
Conversation
…mpatibility fix: remove freshness and loaded_at_field from sources.yml for dbt 2.…
…ineering-zoomcamp into fix/sources-yml-dbt2-compatibility
…mpatibility Fix/sources yml dbt2 compatibility
There was a problem hiding this comment.
Pull request overview
This pull request adds support for FHV (For-Hire Vehicle) trip data and refactors the staging models' date filtering logic to align with homework requirements. The changes include modifications to data source definitions, staging SQL models, and dependency configurations.
Changes:
- Added new stg_fhv_tripdata staging model with 2019 data filtering
- Refactored date filtering in yellow and green taxi staging models from conditional (dev-only) to unconditional 2019-2020 filtering with additional dev sampling
- Removed data freshness checks and loaded_at_field configurations from sources.yml, simplified column descriptions, and added fhv_tripdata source definition
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| package-lock.yml | Alphabetically reordered package dependencies (codegen, dbt_utils) with updated hash |
| stg_yellow_tripdata.sql | Added unconditional 2019-2020 date filter with extended dev environment sampling logic |
| stg_green_tripdata.sql | Added unconditional 2019-2020 date filter with extended dev environment sampling logic |
| stg_fhv_tripdata.sql | New staging model for FHV trip data with 2019-only filtering and column transformations |
| sources.yml | Removed freshness/loaded_at_field configs, simplified descriptions, hardcoded GCP project ID, added fhv_tripdata source |
| .gitignore | Added dbt_internal_packages/ and trailing whitespace to profiles.yml |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -36,36 +32,35 @@ sources: | |||
| - name: dolocationid | |||
| description: TLC Taxi Zone where the meter was disengaged | |||
| - name: ratecodeid | |||
| description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group) | |||
| description: Rate code | |||
| - name: store_and_fwd_flag | |||
| description: Trip record held in vehicle memory (Y/N) | |||
| - name: payment_type | |||
| description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided) | |||
| description: Payment method | |||
| - name: fare_amount | |||
| description: Time and distance fare | |||
| - name: extra | |||
| description: Miscellaneous extras and surcharges | |||
| - name: mta_tax | |||
| description: MTA tax | |||
| - name: tip_amount | |||
| description: Tip amount (credit card only) | |||
| description: Tip amount | |||
| - name: tolls_amount | |||
| description: Total tolls paid | |||
| - name: improvement_surcharge | |||
| description: Improvement surcharge | |||
| - name: total_amount | |||
| description: Total amount charged | |||
| - name: trip_type | |||
| description: Trip type (1=Street-hail, 2=Dispatch) | |||
| description: Trip type | |||
| - name: ehail_fee | |||
| description: E-hail fee | |||
|
|
|||
| - name: yellow_tripdata | |||
| description: Raw yellow taxi trip records | |||
| loaded_at_field: tpep_pickup_datetime | |||
| columns: | |||
| - name: vendorid | |||
| description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging" | |||
| description: "Taxi technology provider" | |||
| - name: tpep_pickup_datetime | |||
| description: Date and time when the meter was engaged | |||
| - name: tpep_dropoff_datetime | |||
| @@ -79,22 +74,38 @@ sources: | |||
| - name: dolocationid | |||
| description: TLC Taxi Zone where the meter was disengaged | |||
| - name: ratecodeid | |||
| description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group) | |||
| description: Rate code | |||
| - name: store_and_fwd_flag | |||
| description: Trip record held in vehicle memory (Y/N) | |||
| - name: payment_type | |||
| description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided) | |||
| description: Payment method | |||
There was a problem hiding this comment.
Valuable information removed from column descriptions. The original descriptions provided specific details about codes and their meanings:
- vendorid: Removed "(1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging"
- ratecodeid: Removed "(1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)"
- payment_type: Removed "(1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)"
- trip_type: Removed "(1=Street-hail, 2=Dispatch)"
- tip_amount: Removed "(credit card only)" clarification
While the schema.yml file retains these detailed descriptions, removing them from the source documentation reduces the usefulness of the sources.yml file as a reference. Consider keeping these details in sources.yml for better documentation of the raw data.
| database: | | ||
| {%- if target.type == 'bigquery' -%} | ||
| {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }} | ||
| {{ env_var('GCP_PROJECT_ID', 'dtc-de-course-484520') }} |
There was a problem hiding this comment.
A hardcoded GCP project ID 'dtc-de-course-484520' has been added as the default value for the env_var function. This creates a security and maintainability issue as it:
- Exposes a specific project ID in version control
- May cause unintended behavior if the environment variable is not set - queries will run against this hardcoded project instead of failing
- Makes the code less portable across different environments
The default value should remain a placeholder (like 'please-add-your-gcp-project-id-here') to force users to set the environment variable explicitly, or be removed entirely to fail fast if not configured.
| {{ env_var('GCP_PROJECT_ID', 'dtc-de-course-484520') }} | |
| {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }} |
| from source | ||
| where dispatching_base_num is not null | ||
| and pickup_datetime >= '2019-01-01' | ||
| and pickup_datetime < '2020-01-01' |
There was a problem hiding this comment.
Inconsistent date filtering: The fhv_tripdata model filters to 2019 only (pickup_datetime < '2020-01-01'), while the yellow_tripdata and green_tripdata models filter to 2019-2020 (pickup_datetime < '2021-01-01').
According to the comments in the yellow/green models, the homework requirement is to filter to "2019-2020 data only". The fhv_tripdata model should use the same date range for consistency, changing line 16 to: and pickup_datetime < '2021-01-01'
| and pickup_datetime < '2020-01-01' | |
| and pickup_datetime < '2021-01-01' |
| dbt_packages/ | ||
| logs/ | ||
| profiles.yml | ||
| profiles.yml |
There was a problem hiding this comment.
Unnecessary trailing whitespace added to the profiles.yml entry. This should be removed to maintain clean code.
| profiles.yml | |
| profiles.yml |
| and pickup_datetime < '2020-01-01' | ||
| ) | ||
|
|
||
| select * from renamed |
There was a problem hiding this comment.
Missing model documentation: The stg_fhv_tripdata model lacks documentation in schema.yml. Both stg_green_tripdata and stg_yellow_tripdata have comprehensive documentation including model descriptions and column definitions with data tests in models/staging/schema.yml. The new stg_fhv_tripdata model should follow the same pattern to maintain consistency with the codebase conventions.
| select * from renamed | |
| select | |
| dispatching_base_num, | |
| pickup_datetime, | |
| dropoff_datetime, | |
| pickup_location_id, | |
| dropoff_location_id, | |
| sr_flag | |
| from renamed |
| from source | ||
| where dispatching_base_num is not null | ||
| and pickup_datetime >= '2019-01-01' | ||
| and pickup_datetime < '2020-01-01' |
There was a problem hiding this comment.
Missing dev environment sampling: The stg_fhv_tripdata model lacks the dev environment sampling logic that is present in stg_green_tripdata and stg_yellow_tripdata. Both of those models have additional filtering when target.name == 'dev' to limit data to January 2019 only (pickup_datetime < '2019-02-01').
For consistency, consider adding after line 19:
{% if target.name == 'dev' %}
and pickup_datetime < '2019-02-01'
{% endif %}
This would ensure faster development iterations with smaller data volumes, consistent with the pattern established by other staging models.
| and pickup_datetime < '2020-01-01' | |
| and pickup_datetime < '2020-01-01' | |
| {% if target.name == 'dev' %} | |
| and pickup_datetime < '2019-02-01' | |
| {% endif %} |
No description provided.