A production ML pipeline that eliminates training-serving skew using Feast, Kubeflow, MLflow, and KServe
This project demonstrates train-serve consistency using Feast Feature Services:
- Training: `get_historical_features()` via KubeRay (distributed PIT joins)
- Inference: `get_online_features()` via Redis (low-latency lookups)
- Same features, zero skew: both paths use identical FeatureService definitions
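In SDK terms, the two paths differ only in which retrieval call they make. A minimal sketch, assuming a configured Feast `FeatureStore` handle and the `training_features` / `inference_features` services this repo registers (entity keys `store_id` / `dept_id` as used elsewhere in this project; the sample timestamps are illustrative):

```python
from datetime import datetime, timedelta

import pandas as pd


def build_training_frame(store):
    """Training path: point-in-time-correct join, executed on the Ray offline store."""
    entity_df = pd.DataFrame({
        "store_id": [1, 1, 2],
        "dept_id": [5, 5, 7],
        "event_timestamp": [datetime(2012, 10, 26) - timedelta(weeks=w) for w in (3, 2, 1)],
    })
    return store.get_historical_features(
        entity_df=entity_df,
        features=store.get_feature_service("training_features"),  # same registry object as serving
    ).to_df()


def fetch_serving_features(store, store_id: int, dept_id: int) -> dict:
    """Serving path: low-latency Redis lookup against the identical definitions."""
    return store.get_online_features(
        features=store.get_feature_service("inference_features"),
        entity_rows=[{"store_id": store_id, "dept_id": dept_id}],
    ).to_dict()
```

Because both functions resolve features through `get_feature_service()`, a definition change in the registry propagates to training and serving together.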
```
 Time
  │                           Monolithic (OOM)
  │                                 X
  │                                /
  │ ─────────────────────────────/─────────────  Feast + Ray
  │                        /    /
  │                   /   Monolithic
  │              /
  │         /
  │    /
  └────────────────────────────────────────────►  Data Size
      100K      1M       10M      100M      1B
                          ↑
                Crossover (~5-10M rows)
```
Real-World Comparison (100M rows, 50 features):
| Approach | Time | Feasibility |
|---|---|---|
| Pandas | OOM | ❌ Impossible |
| Spark | ~45 min | ✅ Works |
| Feast + Ray (4 nodes) | ~30 min | ✅ Works |
Hidden Benefits: Feature versioning, train-serve consistency, cached materialization, MLflow tracking.
Machine learning models that work in notebooks often fail silently in production due to training-serving skew — features computed differently during training vs inference.
Google's Hidden Technical Debt in ML Systems established that most ML failures stem from data inconsistencies. Companies like Uber, DoorDash, and Airbnb built feature platforms to solve this.
| Failure Mode | Symptom | Impact |
|---|---|---|
| Stale features | Serving uses old data | Predictions drift |
| Different aggregations | Inconsistent rolling averages | Accuracy drops |
| Missing features | Serving omits a feature | Silent errors |
| Type mismatches | `float64` vs `float32` | Numerical differences |
Predicting weekly sales for store-department combinations. According to IHL Group, retailers lose $1.77 trillion annually due to inventory distortion.
| Metric | Value |
|---|---|
| Dataset | 65,520 samples (45 stores × 14 depts × 104 weeks) |
| Features | 22 (lag, rolling, temporal, economic, store) |
| Model | MLP [512, 256, 128, 64] with BatchNorm + Dropout |
| MAPE | ~3-5% |
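The model from the table above is small enough to write down directly. A sketch in PyTorch (the layer sizes and 22-feature input come from the table; the dropout probability is an assumption):

```python
import torch
import torch.nn as nn


class SalesMLP(nn.Module):
    """MLP [512, 256, 128, 64] with BatchNorm + Dropout, per the metrics table."""

    def __init__(self, n_features: int = 22, hidden=(512, 256, 128, 64), p_drop: float = 0.2):
        super().__init__()
        layers, prev = [], n_features
        for width in hidden:
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop)]
            prev = width
        layers.append(nn.Linear(prev, 1))  # single regression head: weekly sales
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

BatchNorm before ReLU and a single linear output head is a standard choice for tabular regression; nothing here depends on the distributed setup.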
```
┌──────────────────────────────────────────────────────────────────────┐
│     Phase 1                Phase 2                   Phase 3         │
│  ┌────────────┐      ┌─────────────────┐      ┌─────────────────┐    │
│  │   Feast    │─────▶│    Kubeflow     │─────▶│     KServe      │    │
│  │  Apply +   │      │    TrainJob     │      │ InferenceService│    │
│  │ Materialize│      │  (PyTorch DDP)  │      │   + Feast SDK   │    │
│  └─────┬──────┘      └────────┬────────┘      └────────┬────────┘    │
│        │                      │                        │             │
│        ▼                      ▼                        ▼             │
│  ┌───────────┐          ┌───────────┐            ┌───────────┐       │
│  │PostgreSQL │          │  MLflow   │            │   Redis   │       │
│  │ Registry  │          │ Tracking  │            │  Online   │       │
│  └───────────┘          └───────────┘            └───────────┘       │
│        │                      │                        │             │
│        └──────────────────────┴────────────────────────┘             │
│                               │                                      │
│                        ┌──────┴──────┐                               │
│                        │   KubeRay   │                               │
│                        │ Offline PIT │                               │
│                        └─────────────┘                               │
└──────────────────────────────────────────────────────────────────────┘
```
| Component | Technology | Purpose |
|---|---|---|
| Feature Store | Feast Operator | PostgreSQL registry, Redis online, Ray offline |
| Training | Kubeflow Trainer | Multi-node PyTorch DDP orchestration |
| Experiment Tracking | MLflow Operator | Workspace isolation, model registry |
| Model Serving | KServe | Auto-scaling with Feast SDK integration |
| Platform | OpenShift AI | Managed ML infrastructure |
```bash
# Clone repository
git clone https://github.com/abhijeet-dhumal/sales-demand-forecasting.git
cd sales-demand-forecasting

# Deploy all components
oc apply -k manifests/

# Wait for pods
oc wait --for=condition=ready pod -l app=postgres -n feast-trainer-demo --timeout=120s
oc wait --for=condition=ready pod -l ray.io/node-type=head -n feast-trainer-demo --timeout=180s

# Verify
oc get pods -n feast-trainer-demo
```

Expected output:

```
NAME                                  READY   STATUS    AGE
feast-ray-head-xxxxx                  1/1     Running   2m
feast-ray-worker-xxxxx                1/1     Running   2m
feast-salesforecasting-server-xxxxx   4/4     Running   2m
postgres-xxxxx                        1/1     Running   2m
redis-xxxxx                           1/1     Running   2m
mlflow-xxxxx                          1/1     Running   2m
```
Create Workbench:
- Image: `PyTorch | CUDA | Python 3.12`
- Storage: Attach `shared` PVC with RWX
- Feature Store: Connect to `salesforecasting`
| Notebook | Purpose | Time |
|---|---|---|
| `01_feature_store/01a-local.ipynb` | Generate data → Feast apply → Materialize | ~2 min |
| `02_training/02-training.ipynb` | Distributed training with Kubeflow | ~3 min |
| `03_inferencing/03-inference.ipynb` | Deploy model, test predictions | ~1 min |
What Feast Operator manages:
| Component | Purpose |
|---|---|
| PostgreSQL Registry | Durable metadata for feature definitions |
| Redis Online Store | Low-latency serving (~5ms) |
| Ray Offline Store | Distributed historical queries |
| Client ConfigMaps | Auto-generated configuration |
Two FeatureServices for consistency:
| Service | Use Case | Includes Target? |
|---|---|---|
| `training_features` | Historical retrieval for training | ✅ Yes |
| `inference_features` | Real-time lookup for predictions | ❌ No |
Why Kubeflow Trainer:
| Capability | Description |
|---|---|
| Multi-node | Scale across 2, 4, 8+ nodes |
| Multi-GPU | Utilize all GPUs per node |
| Multi-accelerator | NVIDIA (CUDA) and AMD (ROCm) |
| Auto-coordination | Environment variables handled automatically |
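On the PyTorch side, "auto-coordination" means the per-worker training function can simply trust the rendezvous environment variables the Trainer injects. A minimal sketch (the `Linear` layer is a stand-in for the real MLP, and the data loop is elided):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_func():
    """Per-worker entrypoint. Kubeflow Trainer sets RANK / WORLD_SIZE /
    MASTER_ADDR / MASTER_PORT, so this code contains no manual coordination."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)  # env:// rendezvous from injected variables
    rank, world = dist.get_rank(), dist.get_world_size()

    model = torch.nn.Linear(22, 1)  # stand-in for the real MLP
    if torch.cuda.is_available():
        model = model.cuda(rank % torch.cuda.device_count())
    model = DDP(model)  # gradients sync across all workers automatically

    # ... build DistributedSampler-backed loaders and run the usual loop ...

    dist.destroy_process_group()
    return rank, world
```

Scaling from 2 to 8 nodes is then a manifest change, not a code change.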
MLflow Integration:
The serving pattern:
| Step | Action |
|---|---|
| 1 | Client sends entity IDs (store_id, dept_id) |
| 2 | KServe receives request |
| 3 | Feast SDK fetches features from Redis |
| 4 | Model predicts |
| 5 | Return result |
Why this matters:
| Approach | Client Sends | Skew Risk |
|---|---|---|
| Without Feast | All features | ⚠️ High |
| With Feast | Entity IDs only | ✅ Zero |
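From the client's side, the contract is visible in the payload: entity IDs in, predictions out. A hedged request example (the URL and model name are assumptions based on this repo's manifests):

```python
import json
import urllib.request

# The request body carries entity IDs only -- there are no feature values to go stale.
url = "http://salesforecasting.feast-trainer-demo.svc/v1/models/sales-forecaster:predict"
body = json.dumps({"instances": [{"store_id": 1, "dept_id": 5}]}).encode()

req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:   # uncomment inside the cluster
#     print(json.load(resp)["predictions"])
```

Any feature logic change ships once, server-side; no client redeploys required.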
```
sales-demand-forecasting/
├── manifests/
│   ├── kustomization.yaml      # oc apply -k manifests/
│   ├── base/                   # Namespace, PVC
│   ├── databases/              # PostgreSQL, Redis
│   ├── ray/                    # RayCluster
│   ├── feast/                  # FeatureStore CR
│   └── mlflow/                 # MLflow Operator
├── notebooks/
│   ├── 01_feature_store/       # Feature engineering
│   │   ├── 01a-local.ipynb     # Admin: register features
│   │   └── 01b-remote.ipynb    # User: use existing features
│   ├── 02_training/            # Distributed training
│   └── 03_inferencing/         # Model serving
├── feature_repo/
│   └── features.py             # Feature definitions
└── docs/
    ├── diagrams/               # Architecture diagrams
    └── images/                 # Screenshots
```
| Setting | Default | Notes |
|---|---|---|
| Namespace | `feast-trainer-demo` | All components deployed here |
| Training nodes | 2 | PyTorch DDP workers |
| GPUs per node | 1 | Configurable in TrainJob |
| Redis latency | ~5ms | Online feature lookup |
| Ray workers | 2 | Distributed offline queries |
| Metric | Value |
|---|---|
| Dataset | 45 stores × 14 depts × 104 weeks |
| Features | 22 engineered features |
| Model | MLP [512, 256, 128, 64] |
| MAPE | ~3-5% |
| Training time | ~45s (2 nodes × 1 GPU) |
| Inference latency | ~50ms (including feature fetch) |
| Decision | Benefit |
|---|---|
| Feast as single source of truth | Same definitions for training + serving = no skew |
| Ray offline store | Scales PIT joins from thousands to millions of rows |
| Kubeflow Trainer | Declarative distributed training, no manual coordination |
| KServe + Feast SDK | Consistent feature retrieval at inference time |
| MLflow Operator | Workspace isolation, model versioning |
| Issue | Fix |
|---|---|
| Feature Store connection not found | Edit workbench → Add Feature Store → Restart |
| Pods not starting | `oc describe pod <name> -n feast-trainer-demo` |
| PVC not bound | Check storage class supports RWX |
| DDP timeout | Increase RDZV_TIMEOUT in TrainJob |
| Ray FileNotFoundError | Restart Ray cluster |
```bash
oc delete namespace feast-trainer-demo
```

This Project:
- Detailed Blog Post — Full technical write-up with code examples
Documentation:
Industry References:
- Hidden Technical Debt in ML Systems — Google NIPS 2015
- Uber Michelangelo | DoorDash Feature Store | Airbnb Chronon
Apache 2.0








