# RetentionCast - Real-time E-commerce Analytics using Change Data Capture


Built using: Python, Postgres, Kafka, Debezium

Read the blog post about RetentionCast.

RetentionCast is a scalable, performant real-time streaming analytics architecture that provides up-to-date predictions of when a customer is at risk of churning, based on their purchase activity.

I was a big fan of the lifetimes library before it was archived. PyMC-Marketing has since incorporated its functionality, but this project was in part an excuse to dig deeper into the library's inner workings by reimplementing some of its algorithms from scratch.

## Architecture

```mermaid
architecture-beta
    group api(cloud)[Ecommerce Simulation]

    service db(database)[Database] in api
    service simulator(disk)[Simulator] in api
    service consumer(disk)[Consumer] in api
    service debezium(server)[Debezium Connect] in api
    service kafka(server)[Kafka Topic] in api

    simulator:R --> L:db
    db:R --> L:debezium
    debezium:T --> B:kafka
    consumer:R -- L:kafka
    consumer:B --> T:db
```

This project uses Debezium to implement a Change Data Capture (CDC) model. As web shop activity happens, transactions are inserted into the database. Debezium streams those changes into a Kafka topic, and the consumer operates on each individual change (in this case, calculating and updating a table used to estimate churn likelihood).
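
As a sketch of how the CDC pipeline gets wired up, the snippet below registers a Debezium Postgres connector through Kafka Connect's REST API. Hostnames, credentials, and table names here are illustrative assumptions, not this project's actual configuration:

```python
# Sketch: register a Debezium Postgres connector via Kafka Connect's REST API.
# All names, hosts, and credentials below are illustrative assumptions.
import requests

connector = {
    "name": "retentioncast-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "db",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "shop",
        # Topics are named <prefix>.<schema>.<table>, e.g. shop.public.transactions
        "topic.prefix": "shop",
        "table.include.list": "public.transactions",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```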

The Simulator generates sales transactions to insert into the database. It's a simple script that creates random transactions and, with some probability, causes the purchasing user to churn.
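
A minimal sketch of what such a simulator could look like, assuming psycopg2, a `transactions` table, and an illustrative 5% per-purchase churn probability (the real script's schema and probabilities may differ):

```python
# Sketch: generate random sales transactions; occasionally a customer churns.
import random
import time
import uuid

import psycopg2

conn = psycopg2.connect("dbname=shop user=postgres password=postgres host=localhost")
cur = conn.cursor()

active_customers = [str(uuid.uuid4()) for _ in range(100)]

while active_customers:
    customer_id = random.choice(active_customers)
    cur.execute(
        "INSERT INTO transactions (customer_id, amount, created_at) "
        "VALUES (%s, %s, now())",
        (customer_id, round(random.uniform(5.0, 200.0), 2)),
    )
    conn.commit()
    # Each purchase carries a small chance that this customer silently churns.
    if random.random() < 0.05:
        active_customers.remove(customer_id)
    time.sleep(0.1)
```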

The Consumer reads sale transactions from Kafka and either creates a new entry in the analytics (Recency, Frequency & Monetary value) table for that user or updates their existing entry. This table is used to calculate churn for individual users and for the store overall.
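
A minimal sketch of the consume-and-upsert loop, assuming kafka-python, the default Debezium JSON envelope (with schemas enabled), and hypothetical topic and table names (`shop.public.transactions`, `rfm`):

```python
# Sketch: consume Debezium change events and upsert into an RFM analytics table.
import json

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "shop.public.transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
conn = psycopg2.connect("dbname=shop user=postgres password=postgres host=localhost")
cur = conn.cursor()

for message in consumer:
    if message.value is None:  # tombstone record
        continue
    row = message.value["payload"]["after"]  # the inserted transaction
    if row is None:  # delete events carry no "after" image
        continue
    cur.execute(
        """
        INSERT INTO rfm (customer_id, first_purchase_date, last_purchase_date,
                         frequency, total_order_value)
        VALUES (%(customer_id)s, %(created_at)s, %(created_at)s, 0, %(amount)s)
        ON CONFLICT (customer_id) DO UPDATE SET
            last_purchase_date = EXCLUDED.last_purchase_date,
            frequency          = rfm.frequency + 1,
            total_order_value  = rfm.total_order_value + EXCLUDED.total_order_value
        """,
        row,
    )
    conn.commit()
```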

Survival analysis for churn can be done directly in SQL, as detailed here.
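
The linked article covers the SQL approach; as a complementary sketch of the underlying math, here is the BG/NBD "probability alive" formula (one of the algorithms the lifetimes library implements) as a standalone Python function, assuming model parameters `r`, `alpha`, `a`, `b` fitted elsewhere:

```python
# Sketch: BG/NBD P(alive), the quantity the analytics table ultimately feeds.
# r, alpha, a, b are model parameters assumed to have been fitted elsewhere
# (e.g. by maximum likelihood, as the lifetimes library did).
def p_alive(frequency: int, recency: float, T: float,
            r: float, alpha: float, a: float, b: float) -> float:
    """P(customer is still "alive") given `frequency` repeat purchases,
    the last at `recency`, over an observation window of length `T`."""
    if frequency == 0:
        # With no repeat purchases, the model assumes the customer is alive.
        return 1.0
    ratio = (a / (b + frequency - 1)) * (
        (alpha + T) / (alpha + recency)
    ) ** (r + frequency)
    return 1.0 / (1.0 + ratio)
```

Under this model, the consumer could stamp `retention_campaign_target_date` on the first day this probability drops below 0.70 for a customer.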

## Analytics table schema

- `first_purchase_date`: When the customer first converted
- `last_purchase_date`: When the customer made their most recent purchase
- `frequency`: Number of repeat purchases
- `total_order_value`: Total amount the customer has spent
- `avg_order_value`: Average amount spent per order
- `retention_campaign_target_date`: Updated by consumer.py; the date when the user's alive probability drops below 70% (see the DDL sketch below)
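
A hypothetical DDL for this table, with column names taken from the list above; the types, the table name `rfm`, and the generated `avg_order_value` column are assumptions (it divides by `frequency + 1` because `frequency` counts repeat purchases):

```python
# Sketch: create the analytics table described above; types are assumptions.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS rfm (
    customer_id                    TEXT PRIMARY KEY,
    first_purchase_date            TIMESTAMPTZ NOT NULL,
    last_purchase_date             TIMESTAMPTZ NOT NULL,
    frequency                      INTEGER NOT NULL DEFAULT 0,
    total_order_value              NUMERIC(12, 2) NOT NULL DEFAULT 0,
    avg_order_value                NUMERIC(12, 2)
        GENERATED ALWAYS AS (total_order_value / (frequency + 1)) STORED,
    retention_campaign_target_date TIMESTAMPTZ
);
"""

conn = psycopg2.connect("dbname=shop user=postgres password=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```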

## Advantages of this approach

- A scalable, performant real-time architecture.
- An approach to churn that is flexible and derived from customer activity, but that can also support business-led churn definitions (e.g. "if we haven't seen someone in 30/60/90 days, we count them as churned").

## Disadvantages of this approach

- More difficult and time-consuming to manage than a batch architecture; a trade-off that only makes sense at significant scale, and when this service is a significant focus for the business.
- Potentially more difficult to debug and fix in production.
- Real-time is a seductive concept, but it should be applied carefully.

## Progress

- TODO: Documentation
- TODO: Frontend
- TODO: Performance benchmarks

## Running

```bash
docker compose up
./setup.sh
```