
Fix: Remove customer_id causing Cartesian product in monthly loan aggregations #10

Open
jjoyce0510 wants to merge 1 commit into main from fix/remove-customer-id-cartesian-product

🐛 Bug Fix: Data Duplication in Risk Analytics Dashboard

This PR fixes the critical data quality issue causing inflated loan values in the Risk Analytics Reporting Dashboard.


Problem Summary

PR #3 (merged Jan 27, 2026) introduced a join that fans out like a Cartesian product and duplicates aggregated loan data:

  • ❌ Dashboard shows loan values far above those of previous months
  • ❌ Row count assertion fails on every run (10 of the 10 most recent checks)
  • ❌ Expected 10-30 rows; the model returns hundreds

Root Cause

The bug was in this join logic added by PR #3:

left join loans
    on orig.loan_type_name = loans.loan_type_name  -- ⚠️ loan_type_name is not unique in loans!

What happened:

  • Each monthly aggregate row (e.g., "Mortgages - January 2026") was joined to every loans row with the same loan type, i.e., once per customer holding that loan type
  • If 50 customers have mortgages → 1 row becomes 50 duplicate rows
  • Dashboard sums these duplicates → loan amounts inflated 50×
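
The fan-out described above can be reproduced on toy data. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the warehouse. Table and column names follow this PR, but the loans columns other than customer_id and loan_type_name, and all data values, are assumptions made up for illustration.

```python
import sqlite3

# Toy reproduction of the join fan-out. Names follow the PR; the schema
# details and data values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table monthly_originations (
        month_start text, loan_type_name text, total_amount_originated real
    );
    create table loans (
        loan_id integer, customer_id integer, loan_type_name text
    );
    insert into monthly_originations
        values ('2026-01-01', 'Mortgage', 1000000.0);
""")

# 50 customers, each holding one mortgage.
conn.executemany(
    "insert into loans values (?, ?, 'Mortgage')",
    [(i, i) for i in range(1, 51)],
)

# The join from PR #3: matches on loan_type_name only.
fanned_out = conn.execute("""
    select count(*), sum(orig.total_amount_originated)
    from monthly_originations orig
    left join loans
        on orig.loan_type_name = loans.loan_type_name
""").fetchone()

# The same aggregate without the join.
without_join = conn.execute("""
    select count(*), sum(total_amount_originated)
    from monthly_originations
""").fetchone()

print(fanned_out)    # (50, 50000000.0): one aggregate row repeated per loan
print(without_join)  # (1, 1000000.0): the single original aggregate row
```

Any downstream sum over the joined result counts the same monthly amount once per matching loans row, which is the inflation the dashboard showed.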

Why customer_id doesn't belong here:

  • This table aggregates data by month + loan type
  • One month has multiple customers → adding customer_id breaks the aggregation grain
  • An aggregated row covers many customers, so no single customer_id can describe it
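
One way to make the grain explicit is to check that each (month, loan_type_name) pair appears only once. This is a hedged sketch, again using sqlite3 as a stand-in. The helper name duplicate_grain_rows is hypothetical, the table name agg_monthly_loans is taken from the dbt command in this PR, and the sample rows are invented.

```python
import sqlite3

def duplicate_grain_rows(conn, table="agg_monthly_loans"):
    """Return (month, loan_type_name, n) for each grain key seen more than once."""
    # The table name is interpolated directly, so pass only trusted constants.
    return conn.execute(f"""
        select month, loan_type_name, count(*) as n
        from {table}
        group by month, loan_type_name
        having count(*) > 1
        order by month, loan_type_name
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table agg_monthly_loans (
        month text, loan_type_name text, customer_id integer,
        amount_originated real
    )
""")
# Hypothetical post-PR #3 state: the Mortgage aggregate is repeated per customer.
conn.executemany(
    "insert into agg_monthly_loans values (?, ?, ?, ?)",
    [
        ("2026-01-01", "Mortgage", 1, 1000000.0),
        ("2026-01-01", "Mortgage", 2, 1000000.0),
        ("2026-01-01", "Auto", 3, 250000.0),
    ],
)

duplicates = duplicate_grain_rows(conn)
print(duplicates)  # [('2026-01-01', 'Mortgage', 2)]
```

An empty result means the table still has one row per month and loan type.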

The Fix

✅ Removed customer_id field
✅ Removed LEFT JOIN loans causing duplication
✅ Restored original aggregation logic (pre-PR #3)


What Changed

combined as (
    select
        coalesce(orig.month_start, pay.month_start) as month,
        orig.loan_type_name,
-       loans.customer_id,  ⬅️ REMOVED
        coalesce(orig.loans_originated, 0) as new_loans,
        coalesce(orig.total_amount_originated, 0) as amount_originated,
        ...
    from monthly_originations orig
    full outer join monthly_payments pay
        on orig.month_start = pay.month_start
-   left join loans                      ⬅️ REMOVED
-       on orig.loan_type_name = loans.loan_type_name  ⬅️ REMOVED
)

Impact

Fixes: Risk Analytics Reporting Dashboard will show correct loan values
Fixes: Row count assertion will pass (returns to expected 10-30 rows)
Fixes: Accurate monthly loan aggregations restored


Testing Recommendations

After merge:

  1. Re-run dbt model: dbt run --select agg_monthly_loans
  2. Verify row count is back to ~10-30 rows
  3. Check dashboard values match pre-Jan 27 trends
  4. Confirm row count assertion passes
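
Step 2 could be scripted along these lines. This is only a sketch: it assumes the model materializes as a table named agg_monthly_loans, uses sqlite3 in place of the real warehouse connection, and the helper row_count_in_range and its default bounds (taken from the 10-30 range above) are hypothetical.

```python
import sqlite3

def row_count_in_range(conn, table="agg_monthly_loans", low=10, high=30):
    """Return (row_count, ok) where ok is True if low <= row_count <= high."""
    # The table name is interpolated directly, so pass only trusted constants.
    (n,) = conn.execute(f"select count(*) from {table}").fetchone()
    return n, low <= n <= high

conn = sqlite3.connect(":memory:")
conn.execute("create table agg_monthly_loans (month text, loan_type_name text)")

# Healthy state: 12 aggregate rows (invented data).
conn.executemany(
    "insert into agg_monthly_loans values (?, ?)",
    [(f"2026-01-{d:02d}", "Mortgage") for d in range(1, 13)],
)
healthy = row_count_in_range(conn)

# Fanned-out state: 200 extra duplicate rows push the count out of range.
conn.executemany(
    "insert into agg_monthly_loans values (?, ?)",
    [("2026-01-01", "Mortgage")] * 200,
)
inflated = row_count_in_range(conn)

print(healthy)   # (12, True)
print(inflated)  # (212, False)
```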

Related

This fixes the data duplication bug introduced in PR #3 where adding
customer_id caused a Cartesian product explosion.

PROBLEM:
- The LEFT JOIN to loans table only matched on loan_type_name
- This created duplicates: each monthly aggregate row was multiplied
  by the number of customers with that loan type
- Example: 1 row for "Mortgage - January" × 50 mortgage customers = 50 rows
- Result: Dashboard showed inflated loan values (duplicated amounts)

ROOT CAUSE:
- customer_id doesn't belong in an aggregated monthly table
- One month has multiple customers, so adding customer_id breaks aggregation
- The join key was not unique in the loans table (no date or ID matching)

FIX:
- Removed customer_id field from SELECT
- Removed LEFT JOIN to loans table
- Restored original aggregation logic

IMPACT:
- Fixes Risk Analytics Reporting Dashboard
- Resolves failing row count assertion (expected 10-30 rows)
- Corrects loan origination amounts back to accurate values

Related: PR #3, Issue reported by john.joyce@acryldata.io