Skip to content

Fix/sources yml dbt2 compatibility#816

Open
Derrick-Ryan-Giggs wants to merge 7 commits intoDataTalksClub:mainfrom
Derrick-Ryan-Giggs:fix/sources-yml-dbt2-compatibility
Open

Fix/sources yml dbt2 compatibility#816
Derrick-Ryan-Giggs wants to merge 7 commits intoDataTalksClub:mainfrom
Derrick-Ryan-Giggs:fix/sources-yml-dbt2-compatibility

Conversation

@Derrick-Ryan-Giggs
Copy link

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for FHV (For-Hire Vehicle) trip data and refactors the staging models' date filtering logic to align with homework requirements. The changes include modifications to data source definitions, staging SQL models, and dependency configurations.

Changes:

  • Added new stg_fhv_tripdata staging model with 2019 data filtering
  • Refactored date filtering in yellow and green taxi staging models from conditional (dev-only) to unconditional 2019-2020 filtering with additional dev sampling
  • Removed data freshness checks and loaded_at_field configurations from sources.yml, simplified column descriptions, and added fhv_tripdata source definition

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
package-lock.yml Alphabetically reordered package dependencies (codegen, dbt_utils) with updated hash
stg_yellow_tripdata.sql Added unconditional 2019-2020 date filter with extended dev environment sampling logic
stg_green_tripdata.sql Added unconditional 2019-2020 date filter with extended dev environment sampling logic
stg_fhv_tripdata.sql New staging model for FHV trip data with 2019-only filtering and column transformations
sources.yml Removed freshness/loaded_at_field configs, simplified descriptions, hardcoded GCP project ID, added fhv_tripdata source
.gitignore Added dbt_internal_packages/ and trailing whitespace to profiles.yml

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 21 to 81
@@ -36,36 +32,35 @@ sources:
- name: dolocationid
description: TLC Taxi Zone where the meter was disengaged
- name: ratecodeid
description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
description: Rate code
- name: store_and_fwd_flag
description: Trip record held in vehicle memory (Y/N)
- name: payment_type
description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)
description: Payment method
- name: fare_amount
description: Time and distance fare
- name: extra
description: Miscellaneous extras and surcharges
- name: mta_tax
description: MTA tax
- name: tip_amount
description: Tip amount (credit card only)
description: Tip amount
- name: tolls_amount
description: Total tolls paid
- name: improvement_surcharge
description: Improvement surcharge
- name: total_amount
description: Total amount charged
- name: trip_type
description: Trip type (1=Street-hail, 2=Dispatch)
description: Trip type
- name: ehail_fee
description: E-hail fee

- name: yellow_tripdata
description: Raw yellow taxi trip records
loaded_at_field: tpep_pickup_datetime
columns:
- name: vendorid
description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging"
description: "Taxi technology provider"
- name: tpep_pickup_datetime
description: Date and time when the meter was engaged
- name: tpep_dropoff_datetime
@@ -79,22 +74,38 @@ sources:
- name: dolocationid
description: TLC Taxi Zone where the meter was disengaged
- name: ratecodeid
description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
description: Rate code
- name: store_and_fwd_flag
description: Trip record held in vehicle memory (Y/N)
- name: payment_type
description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)
description: Payment method
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valuable information removed from column descriptions. The original descriptions provided specific details about codes and their meanings:

  • vendorid: Removed "(1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging"
  • ratecodeid: Removed "(1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)"
  • payment_type: Removed "(1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)"
  • trip_type: Removed "(1=Street-hail, 2=Dispatch)"
  • tip_amount: Removed "(credit card only)" clarification

While the schema.yml file retains these detailed descriptions, removing them from the source documentation reduces the usefulness of the sources.yml file as a reference. Consider keeping these details in sources.yml for better documentation of the raw data.

Copilot uses AI. Check for mistakes.
database: |
{%- if target.type == 'bigquery' -%}
{{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}
{{ env_var('GCP_PROJECT_ID', 'dtc-de-course-484520') }}
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A hardcoded GCP project ID 'dtc-de-course-484520' has been added as the default value for the env_var function. This creates a security and maintainability issue as it:

  1. Exposes a specific project ID in version control
  2. May cause unintended behavior if the environment variable is not set - queries will run against this hardcoded project instead of failing
  3. Makes the code less portable across different environments

The default value should remain a placeholder (like 'please-add-your-gcp-project-id-here') to force users to set the environment variable explicitly, or be removed entirely to fail fast if not configured.

Suggested change
{{ env_var('GCP_PROJECT_ID', 'dtc-de-course-484520') }}
{{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}

Copilot uses AI. Check for mistakes.
from source
where dispatching_base_num is not null
and pickup_datetime >= '2019-01-01'
and pickup_datetime < '2020-01-01'
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent date filtering: The fhv_tripdata model filters to 2019 only (pickup_datetime < '2020-01-01'), while the yellow_tripdata and green_tripdata models filter to 2019-2020 (pickup_datetime < '2021-01-01').

According to the comments in the yellow/green models, the homework requirement is to filter to "2019-2020 data only". The fhv_tripdata model should use the same date range for consistency, changing line 16 to: and pickup_datetime < '2021-01-01'

Suggested change
and pickup_datetime < '2020-01-01'
and pickup_datetime < '2021-01-01'

Copilot uses AI. Check for mistakes.
dbt_packages/
logs/
profiles.yml
profiles.yml
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary trailing whitespace added to the profiles.yml entry. This should be removed to maintain clean code.

Suggested change
profiles.yml
profiles.yml

Copilot uses AI. Check for mistakes.
and pickup_datetime < '2020-01-01'
)

select * from renamed
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing model documentation: The stg_fhv_tripdata model lacks documentation in schema.yml. Both stg_green_tripdata and stg_yellow_tripdata have comprehensive documentation including model descriptions and column definitions with data tests in models/staging/schema.yml. The new stg_fhv_tripdata model should follow the same pattern to maintain consistency with the codebase conventions.

Suggested change
select * from renamed
select
dispatching_base_num,
pickup_datetime,
dropoff_datetime,
pickup_location_id,
dropoff_location_id,
sr_flag
from renamed

Copilot uses AI. Check for mistakes.
from source
where dispatching_base_num is not null
and pickup_datetime >= '2019-01-01'
and pickup_datetime < '2020-01-01'
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing dev environment sampling: The stg_fhv_tripdata model lacks the dev environment sampling logic that is present in stg_green_tripdata and stg_yellow_tripdata. Both of those models have additional filtering when target.name == 'dev' to limit data to January 2019 only (pickup_datetime < '2019-02-01').

For consistency, consider adding after line 19:

{% if target.name == 'dev' %}
  and pickup_datetime < '2019-02-01'
{% endif %}

This would ensure faster development iterations with smaller data volumes, consistent with the pattern established by other staging models.

Suggested change
and pickup_datetime < '2020-01-01'
and pickup_datetime < '2020-01-01'
{% if target.name == 'dev' %}
and pickup_datetime < '2019-02-01'
{% endif %}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant