Fix: Remove customer_id causing Cartesian join explosion in agg_monthly_loans #11

Open
jjoyce0510 wants to merge 1 commit into main from fix/revert-cartesian-join-bug

@jjoyce0510

🐛 Bug Fix

This PR fixes a critical data quality issue introduced in PR #3 that has been causing inflated values in the Risk Analytics Dashboard.

🔍 Problem

PR #3 added a customer_id field to the monthly aggregation table by joining to the loans table using only loan_type_name as the join condition. This created a Cartesian join explosion:

  • Expected behavior: ~30 rows (months × loan types)
  • Actual behavior: Thousands of rows (months × loan types × individual customers)
  • Impact: Dashboard shows loan origination values that are hundreds of times too high

💥 Root Cause

left join loans
    on orig.loan_type_name = loans.loan_type_name

This join matches each monthly summary row to EVERY individual loan of that type, causing:

  • Each amount_originated value to be repeated hundreds of times
  • Row count assertion (10-30 expected) failing continuously since Jan 27
  • Incorrect dashboard metrics for all stakeholders
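
The fan-out can be reproduced on toy data. The sketch below uses Python's built-in sqlite3 module with made-up rows. Only the join condition mirrors the snippet above; the table layout, the `month` and `loan_id` columns, and all values are assumptions made for illustration, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical monthly summary: one row per (month, loan_type_name).
    create table orig (month text, loan_type_name text, amount_originated real);
    insert into orig values
        ('2024-01', 'auto',     1000.0),
        ('2024-01', 'mortgage', 5000.0);

    -- Hypothetical loan-level table: many loans share a loan type.
    create table loans (loan_id integer, customer_id integer, loan_type_name text);
    insert into loans values
        (1, 101, 'auto'), (2, 102, 'auto'), (3, 103, 'auto'),
        (4, 104, 'mortgage'), (5, 105, 'mortgage');
""")

# Aggregate table on its own: the correct grain.
rows_before, total_before = conn.execute(
    "select count(*), sum(amount_originated) from orig"
).fetchone()

# Same table after the join on loan_type_name only:
# every summary row is repeated once per loan of that type.
rows_after, total_after = conn.execute("""
    select count(*), sum(orig.amount_originated)
    from orig
    left join loans
        on orig.loan_type_name = loans.loan_type_name
""").fetchone()

print(rows_before, total_before)  # 2 6000.0
print(rows_after, total_after)    # 5 13000.0
```

With three auto loans and two mortgage loans, the two summary rows become five, and the summed amount_originated grows from 6000 to 13000 even though no new loans were originated.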

✅ Solution

This PR reverts the problematic changes from PR #3:

  • ❌ Removes customer_id field from SELECT
  • ❌ Removes LEFT JOIN to loans table
  • ✅ Returns model to correct aggregation level (month + loan_type)

Note: customer_id is a loan-level field and cannot logically exist in a monthly aggregation without causing row multiplication. If customer-level monthly reporting is needed, a separate model should be created with proper grouping by customer_id.
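
A separate customer-level model, as suggested in the note, might look roughly like the query below. This is a sketch only: the `loans` columns `origination_month` and `amount` are hypothetical names, and the real source tables may be shaped differently. The point it illustrates is that customer_id belongs in the GROUP BY, so the grain becomes month + loan type + customer and totals are preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical loan-level source; column names are illustrative only.
    create table loans (
        loan_id integer, customer_id integer, loan_type_name text,
        origination_month text, amount real
    );
    insert into loans values
        (1, 101, 'auto',     '2024-01', 400.0),
        (2, 101, 'auto',     '2024-01', 600.0),
        (3, 102, 'auto',     '2024-01', 250.0),
        (4, 102, 'mortgage', '2024-02', 5000.0);
""")

# One row per (month, loan type, customer): customer_id is part of the grain,
# so no aggregate row is duplicated.
customer_monthly = conn.execute("""
    select origination_month, loan_type_name, customer_id,
           sum(amount) as amount_originated
    from loans
    group by origination_month, loan_type_name, customer_id
    order by origination_month, loan_type_name, customer_id
""").fetchall()

# Sanity check: the customer-level rows add up to the loan-level total.
grand_total = conn.execute("select sum(amount) from loans").fetchone()[0]
assert sum(row[3] for row in customer_monthly) == grand_total
```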

📊 Expected Impact

After merging and rebuilding the table:

  • ✅ Table returns to ~30 rows
  • ✅ Dashboard values become accurate
  • ✅ Data quality assertions pass
  • ✅ Risk Analytics Dashboard shows correct loan metrics

🔗 Related

✋ Review Checklist

  • Code review completed
  • SQL logic verified for correct aggregation level
  • After merge, rebuild table with dbt run --select agg_monthly_loans
  • Verify row count returns to expected range (10-30)
  • Confirm dashboard values are correct
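
The row count check in the list above could be paired with a grain check after the rebuild. The sketch below is a stand-in, not the project's actual data quality assertion: the helper `check_agg_monthly_loans` and the `month` column name are assumptions, and it runs against an in-memory sqlite3 table rather than the warehouse.

```python
import sqlite3

def check_agg_monthly_loans(conn, min_rows=10, max_rows=30):
    """Return (row_count_ok, duplicate_keys) for the aggregate table."""
    (row_count,) = conn.execute(
        "select count(*) from agg_monthly_loans"
    ).fetchone()
    # Any (month, loan_type_name) pair appearing more than once means
    # the table is no longer at its intended grain.
    duplicate_keys = conn.execute("""
        select month, loan_type_name, count(*) as n
        from agg_monthly_loans
        group by month, loan_type_name
        having count(*) > 1
    """).fetchall()
    return min_rows <= row_count <= max_rows, duplicate_keys

# Toy table: 6 months x 2 loan types = 12 rows, all keys unique.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table agg_monthly_loans "
    "(month text, loan_type_name text, amount_originated real)"
)
rows = [(f"2024-{m:02d}", t, 100.0) for m in range(1, 7) for t in ("auto", "mortgage")]
conn.executemany("insert into agg_monthly_loans values (?, ?, ?)", rows)

ok, dupes = check_agg_monthly_loans(conn)
print(ok, dupes)  # True []
```

The uniqueness check is worth having alongside the row count range because a small fan-out could stay within 10-30 rows and still duplicate aggregate values.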

This commit fixes a critical data quality bug introduced in PR #3 where
adding customer_id to the monthly aggregation table caused a Cartesian
product explosion.

**Problem:**
- The LEFT JOIN to the loans table on loan_type_name created a many-to-many
  relationship, multiplying each monthly aggregate row by the number of
  individual loans of that type
- This caused the table to grow from ~30 rows to thousands of rows
- Each aggregated value (amount_originated, etc.) was repeated hundreds
  of times, making dashboard totals appear massively inflated
- Data quality assertion (row count between 10-30) has been failing since
  the change was deployed

**Root Cause:**
customer_id is a loan-level field and cannot logically exist in a
monthly aggregation without causing row multiplication. The original
join condition (loan_type_name only) matched each monthly summary to
EVERY loan of that type.

**Solution:**
- Remove customer_id field from the SELECT
- Remove the LEFT JOIN to loans table
- Return the model to its correct aggregation level (month + loan_type)

**Impact:**
- Table will return to expected ~30 rows
- Dashboard values will be accurate again
- Data quality assertions will pass

Fixes issues caused by PR #3