-
Hey team, so I tried to fit the data using the BG/NBD and Pareto/NBD models, and it looks like they're not fitting well for reasons I don't quite understand. I thought the issue might be an extremely heavy-tailed distribution of frequencies, but that doesn't seem to be the case. The posterior predictive check also looks good. What might be the issue, and how would you approach it? Thanks in advance!
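For reference, here is a minimal sketch of the kind of fit being described, using pymc-marketing's clv module (file and column names here are placeholders, not the original code):

```python
import pandas as pd
from pymc_marketing import clv

# Placeholder path; the model expects an RFM summary frame with
# customer_id, frequency, recency, and T columns.
data = pd.read_csv("rfm_summary.csv")

# Fit the BG/NBD model; clv.ParetoNBDModel can be swapped in the same way.
model = clv.BetaGeoModel(data=data)
model.fit()
print(model.fit_summary())
```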
-
Thanks for sharing.
-
Hey @lixgl, looks like your model was only fit to the past year of data, but I see transactions going back more than three years in the CSV you provided. Excluding those earlier years will bias the results. I also see some strong weekly trends in your graph, so you may want to aggregate by weeks rather than days. On that note, is there a reason why you didn't use …
This is correct. The transaction models assume a lot of non-repeat customers, so excluding them isn't recommended.
-
@lixgl Since you're using the whole data set for training, are you validating your forecast against a holdout set that isn't included in the data you uploaded? @ColtAllen Thanks for this clarification.
The tutorials for pymc-marketing are consistent but don't necessarily call this out explicitly. This may merit a separate conversation, but I've encountered situations where fitting a model fails with a divide-by-zero error, and they are always fixed by removing customer records with 0 frequency or 0 monetary_value. Is there guidance on how to avoid the divide-by-zero error otherwise?
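For concreteness, the workaround looks something like this (a sketch with assumed column names; per the note above, the unfiltered frame should still be used for the transaction models):

```python
import pandas as pd

# Hypothetical path; columns follow the usual RFM summary layout.
rfm = pd.read_csv("rfm_summary.csv")

# Keep the full frame (including non-repeat customers) for BG/NBD or
# Pareto/NBD. Only the spend model gets the filtered rows: records with
# frequency == 0 or monetary_value == 0 are what trigger the
# divide-by-zero.
repeat_only = rfm[(rfm["frequency"] > 0) & (rfm["monetary_value"] > 0)]
```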
-
So the first plot is just for the first year of results, then?
-
Aggregating by week would only rescale the T and recency variables (i.e., T = 52 instead of T = 365); it would not change the size of the dataset. To reduce the amount of effort required, I would suggest using clv.rfm_summary for this.
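Something along these lines (a sketch; column names are placeholders and keyword arguments may vary by version):

```python
import pandas as pd
from pymc_marketing import clv

# Raw transaction log with one row per purchase (placeholder path/columns).
transactions = pd.read_csv("transactions.csv")

# time_unit="W" expresses recency and T in weeks (T = 52 rather than
# T = 365) while building the RFM summary in one step.
rfm = clv.rfm_summary(
    transactions,
    customer_id_col="customer_id",
    datetime_col="date",
    monetary_value_col="amount",
    time_unit="W",
)
```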