An end-to-end dynamic pricing system for rideshare platforms, covering:
- Data validation and feature engineering
- Model benchmarking and diagnostics
- Model versioning and artifact management
- Counterfactual pricing policy simulation
- Probabilistic customer acceptance modeling
- SHAP-based explainability
- FastAPI model serving
- Logging and Prometheus-style monitoring
- Docker containerization
- Kubernetes-ready deployment configuration
This project is structured as a production-oriented machine learning system rather than a notebook-only analysis.
A detailed project report is available in:
Project Report.pdf
The report explains the complete system design, methodology, results, diagnostics, simulation assumptions, explainability layer, API deployment, Docker setup, Kubernetes-ready architecture, limitations, and future work.
It is the best place to start if you want a full narrative explanation of the project.
Key report sections include:
- Dataset and problem definition
- Feature engineering pipeline
- Model benchmarking and selection
- v1 vs v2 model comparison
- Counterfactual pricing simulation
- SHAP explainability
- FastAPI API deployment
- Docker and Kubernetes deployment
- Key results, limitations, and future extensions
This project predicts rideshare prices using contextual ride features such as demand, supply, customer information, booking time, vehicle type, and ride duration.
The final selected production model is:
Ridge Regression v1
Ridge v1 was selected because it provides:
- Strong predictive performance
- Stable coefficient shrinkage under correlated engineered features
- Interpretability
- Simpler deployment compared with the log-transformed v2 model
Although the v2 log-target experiment achieved a marginally lower RMSE with Lasso, the improvement was small and did not justify the additional deployment complexity.
The system follows this high-level flow:
Raw rideshare data
↓
CSV validation + FeaturePipeline
↓
Model benchmarking and diagnostics
↓
Versioned ModelBundle artifact
↓
Simulation + Explainability + API serving
↓
Docker + Kubernetes-ready deployment
Main deployed API flow:
API request
↓
Pydantic validation
↓
Saved FeaturePipeline transformation
↓
Ridge v1 model prediction
↓
Price response and optional explanation
Rideshare_Dynamic_Pricing/
├─ src/
│  └─ dynamic_pricing/
│     ├─ api/
│     │  ├─ main.py  # FastAPI app entrypoint
│     │  ├─ routes.py  # API routes: /health, /predict, /explain, /metrics
│     │  ├─ deps.py  # ModelBundle loader
│     │  ├─ explain.py  # Local linear explanation logic
│     │  └─ middleware.py  # Logging + Prometheus metrics middleware
│     │
│     ├─ config/
│     │  ├─ settings.py  # Environment-driven settings
│     │  └─ logging.py  # Logging configuration
│     │
│     ├─ schemas/
│     │  ├─ api.py  # API request/response schemas
│     │  ├─ features.py  # Ride context schema
│     │  └─ responses.py  # Prediction response schemas
│     │
│     ├─ features/
│     │  ├─ io.py  # CSV loading + schema validation
│     │  └─ pipeline.py  # FeaturePipeline: scaling, encoding, derived features
│     │
│     ├─ models/
│     │  ├─ bundle.py  # ModelBundle dataclass
│     │  ├─ evaluate.py  # Metrics and residual diagnostics
│     │  ├─ registry.py  # Save/load model artifacts
│     │  └─ train.py  # Model benchmarking + tuning
│     │
│     ├─ simulation/
│     │  ├─ acceptance.py  # Probabilistic acceptance model
│     │  ├─ policies.py  # Historical, Ridge, demand-supply pricing policies
│     │  └─ simulator.py  # Counterfactual policy simulation engine
│     │
│     └─ explainability/
│        └─ shap_analysis.py  # Offline SHAP analysis
│
├─ scripts/
│  ├─ smoke_features.py  # Feature pipeline smoke test
│  ├─ train_benchmark.py  # Train v1/v2 benchmark models
│  ├─ run_simulation.py  # Run pricing policy simulation
│  └─ run_shap_analysis.py  # Generate SHAP outputs
│
├─ docs/
│  ├─ data_dictionary.md
│  ├─ feature_pipeline.md
│  ├─ model_selection.md
│  ├─ simulation_methodology.md
│  └─ explainability.md
│
├─ reports/
│  ├─ model_benchmark_v1.md
│  ├─ model_benchmark_v2.md
│  ├─ simulation_results.md
│  ├─ shap_summary.md
│  └─ figures/
│     ├─ v1/
│     ├─ v2/
│     └─ shap/
│
├─ artifacts/
│  └─ model/
│     ├─ v1/
│     │  ├─ model_bundle.joblib
│     │  └─ metadata.json
│     └─ v2/
│        ├─ model_bundle.joblib
│        └─ metadata.json
│
├─ deploy/
│  └─ k8s/
│     └─ base/
│        ├─ deployment.yaml
│        ├─ service.yaml
│        └─ configmap.yaml
│
├─ Dockerfile
├─ docker-compose.yml
├─ pyproject.toml
├─ Project Report.pdf
└─ README.md
Place the raw dataset at:
data/raw/dynamic_pricing.csv
The CSV loader expects the following case-sensitive columns:
Number_of_Riders
Number_of_Drivers
Location_Category
Customer_Loyalty_Status
Number_of_Past_Rides
Average_Ratings
Time_of_Booking
Vehicle_Type
Expected_Ride_Duration
Historical_Cost_of_Ride
The loader performs:
- Strict column validation
- Header whitespace stripping
- Numeric type coercion
- Categorical whitespace trimming
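The four loader steps above can be illustrated with a minimal sketch that uses only the standard library. The function name `load_rides`, the numeric/categorical grouping, and the choice to reject only missing columns are assumptions made for illustration; the project's actual loader in `features/io.py` may differ.

```python
import csv
import io

# Column names come from the list above; the numeric/categorical split is an
# assumption that mirrors the feature groups used by the FeaturePipeline.
NUMERIC_COLUMNS = [
    "Number_of_Riders",
    "Number_of_Drivers",
    "Number_of_Past_Rides",
    "Average_Ratings",
    "Expected_Ride_Duration",
    "Historical_Cost_of_Ride",
]
CATEGORICAL_COLUMNS = [
    "Location_Category",
    "Customer_Loyalty_Status",
    "Time_of_Booking",
    "Vehicle_Type",
]
REQUIRED_COLUMNS = set(NUMERIC_COLUMNS) | set(CATEGORICAL_COLUMNS)


def load_rides(handle):
    """Hypothetical loader: validate headers, coerce numerics, trim categoricals."""
    reader = csv.reader(handle)
    # Header whitespace stripping.
    header = [name.strip() for name in next(reader)]
    # Strict (case-sensitive) column validation.
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        record = dict(zip(header, raw))
        row = {}
        for name in NUMERIC_COLUMNS:
            # Numeric type coercion; float() raises ValueError on bad input.
            row[name] = float(record[name])
        for name in CATEGORICAL_COLUMNS:
            # Categorical whitespace trimming.
            row[name] = record[name].strip()
        rows.append(row)
    return rows


sample = io.StringIO(
    " Number_of_Riders ,Number_of_Drivers,Location_Category,"
    "Customer_Loyalty_Status,Number_of_Past_Rides,Average_Ratings,"
    "Time_of_Booking,Vehicle_Type,Expected_Ride_Duration,Historical_Cost_of_Ride\n"
    "60,25, Urban ,Gold,10,4.2,Evening,Economy,40,200.0\n"
)
rides = load_rides(sample)
```

Failing fast on a missing column keeps a malformed file from reaching training or simulation.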
The project uses a reusable FeaturePipeline class as the single source of truth for preprocessing.
The same pipeline is used for:
- Model training
- Simulation
- API inference
- Explainability
This prevents training-serving skew.
Numeric features:
Number_of_Riders
Number_of_Drivers
Number_of_Past_Rides
Average_Ratings
Expected_Ride_Duration
Transformation:
StandardScaler
Derived features:
riders_per_driver = Number_of_Riders / max(Number_of_Drivers, 1)
driver_supply_gap = Number_of_Drivers - Number_of_Riders
log_riders = log1p(Number_of_Riders)
log_drivers = log1p(Number_of_Drivers)
log_duration = log1p(Expected_Ride_Duration)
Transformation:
StandardScaler
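As a minimal sketch, the derived-feature formulas above can be computed for a single ride as follows. The function name `derive_features` and its argument names are illustrative assumptions, and the subsequent StandardScaler step is not shown.

```python
import math


def derive_features(riders, drivers, duration):
    """Compute the derived features listed above for one ride (before scaling)."""
    return {
        # max(..., 1) guards against division by zero when no drivers are available.
        "riders_per_driver": riders / max(drivers, 1),
        "driver_supply_gap": drivers - riders,
        # log1p(x) = log(1 + x), which stays defined at x = 0.
        "log_riders": math.log1p(riders),
        "log_drivers": math.log1p(drivers),
        "log_duration": math.log1p(duration),
    }


features = derive_features(riders=60, drivers=25, duration=40)
```

For the example ride, `riders_per_driver` is 60 / 25 = 2.4 and `driver_supply_gap` is 25 - 60 = -35.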
Categorical features:
Location_Category
Customer_Loyalty_Status
Time_of_Booking
Vehicle_Type
Transformation:
OneHotEncoder(handle_unknown="ignore")
The following models were benchmarked:
- Ridge Regression
- Lasso Regression
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
Hyperparameter tuning was performed using:
- GridSearchCV
- RandomizedSearchCV
Evaluation metrics and residual diagnostics included:
- MAE
- RMSE
- R²
- QQ plots
- Residuals vs fitted plots
- Residual histograms
- Breusch-Pagan heteroskedasticity test
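For reference, the three scalar metrics in the list above can be computed with the standard library alone. This is a sketch of the textbook definitions, not the project's `evaluate.py`; the function name `regression_metrics` is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired actual and predicted prices."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Illustrative numbers only, not project results.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

RMSE is expressed in the same units as the ride price, so it penalizes large pricing errors more heavily than MAE does.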
The v1 experiment directly predicts:
Historical_Cost_of_Ride
Best model:
Ridge Regression
Approximate benchmark result:
RMSE: 67.435
R²: 0.875
Artifacts:
artifacts/model/v1/
The v2 experiment trains models on:
log(Historical_Cost_of_Ride)
Predictions are back-transformed for evaluation.
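The log-target setup can be sketched as a thin wrapper that fits on log(y) and applies exp to predictions. `LogTargetModel` and the toy `MeanModel` are hypothetical names used only for illustration; scikit-learn's `TransformedTargetRegressor` offers a similar wrapper, and the project may implement this differently.

```python
import math


class MeanModel:
    """Toy stand-in for a real regressor: predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LogTargetModel:
    """Fit the inner model on log(y); back-transform predictions with exp."""

    def __init__(self, inner):
        self.inner = inner

    def fit(self, X, y):
        self.inner.fit(X, [math.log(v) for v in y])
        return self

    def predict(self, X):
        # Predictions return to the original price scale before evaluation.
        return [math.exp(v) for v in self.inner.predict(X)]


model = LogTargetModel(MeanModel()).fit([[0], [1]], [100.0, 400.0])
preds = model.predict([[2]])
```

With the toy inner model, the prediction is the geometric mean of the training prices (200 for 100 and 400). The inverse transform is the extra serving-time step that the v1 model avoids.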
Best model:
Lasso Regression
Approximate benchmark result:
RMSE: 67.201
R²: 0.876
Artifacts:
artifacts/model/v2/
The production candidate is:
Ridge Regression v1
Rationale:
- v2 only marginally improved RMSE
- Ridge v1 is simpler to deploy
- Ridge is stable under correlated engineered features
- No inverse target transformation is required
- Residual diagnostics were acceptable for the project objective
See:
docs/model_selection.md
reports/Project Report.pdf
Models are saved as ModelBundle objects using joblib.
Each bundle contains:
- Trained model
- Fitted FeaturePipeline
- Model name
- Model version
- Feature version
- Metadata
Example:
artifacts/model/v1/
├─ model_bundle.joblib
└─ metadata.json
This ensures API inference uses the exact same preprocessing pipeline as training.
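A bundle with the fields listed above might look like the following sketch. The attribute names, the `describe` helper, and the metadata key are assumptions; the actual dataclass in `models/bundle.py` and the contents of `metadata.json` may differ, and the joblib save/load step is omitted here.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ModelBundle:
    """Hypothetical bundle: model and preprocessing travel together."""

    model: Any               # trained estimator
    pipeline: Any            # fitted FeaturePipeline
    model_name: str
    model_version: str
    feature_version: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def describe(self) -> str:
        """Return a JSON summary, e.g. for a metadata.json sidecar file."""
        return json.dumps(
            {
                "model_name": self.model_name,
                "model_version": self.model_version,
                "feature_version": self.feature_version,
                **self.metadata,
            },
            indent=2,
        )


bundle = ModelBundle(
    model=None,      # placeholder; a fitted Ridge model in practice
    pipeline=None,   # placeholder; a fitted FeaturePipeline in practice
    model_name="ridge",
    model_version="v1",
    feature_version="v1",
    metadata={"target": "Historical_Cost_of_Ride"},
)
summary = json.loads(bundle.describe())
```

Keeping the fitted pipeline inside the same object as the model means a serving process cannot load one without the other.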
The project includes a simulation framework that compares pricing policies.
Implemented policies:
- HistoricalPolicy: Uses the original historical ride price
- RidgeModelPolicy: Uses the selected Ridge v1 model to predict price
- DemandSupplyPolicy: Applies a rule-based multiplier using rider/driver imbalance
Simulation flow:
Ride context
↓
Pricing policy
↓
Proposed price
↓
Acceptance model
↓
Expected revenue
Since the dataset does not contain true acceptance/rejection labels, customer acceptance is simulated using a logistic function:
P(accept) = 1 / (1 + exp(k * delta))
where:
delta = (proposed_price - base_price) / base_price
Expected revenue is then:
Expected Revenue = Proposed Price × P(accept)
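The two formulas above translate directly into a short sketch. The function names and the default sensitivity `k=5.0` are assumptions chosen for illustration; the README does not state the value of `k` used in the project.

```python
import math


def acceptance_probability(proposed_price, base_price, k=5.0):
    """Logistic acceptance curve: P(accept) = 1 / (1 + exp(k * delta))."""
    delta = (proposed_price - base_price) / base_price  # relative price change
    return 1.0 / (1.0 + math.exp(k * delta))


def expected_revenue(proposed_price, base_price, k=5.0):
    """Expected revenue = proposed price × P(accept)."""
    return proposed_price * acceptance_probability(proposed_price, base_price, k)


p_same = acceptance_probability(200.0, 200.0)    # delta = 0
p_higher = acceptance_probability(240.0, 200.0)  # delta = 0.2
revenue_same = expected_revenue(200.0, 200.0)
```

With a positive `k`, acceptance is exactly 0.5 when the proposed price equals the base price and falls as the proposed price rises above it, so a higher price does not automatically mean higher expected revenue.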
Simulation outputs:
reports/simulation_results.md
reports/simulation_detail_Historical.csv
reports/simulation_detail_ridge_model.csv
reports/simulation_detail_demand_supply.csv
Important limitation:
The simulation is not causal. It is a controlled policy comparison framework, not a real-world revenue forecast.
See:
docs/simulation_methodology.md
Explainability is implemented in two forms:
Offline, the project generates SHAP-based global explanations for the selected Ridge v1 model.
Outputs:
reports/figures/shap/shap_feature_importance_bar.png
reports/figures/shap/shap_summary_beeswarm.png
reports/figures/shap/shap_feature_importance.csv
reports/shap_summary.md
The top feature importance results show that Expected_Ride_Duration is the dominant driver of predicted price, followed by vehicle type and demand-related engineered features.
Online, the FastAPI service exposes a local explanation endpoint:
POST /explain
For linear models, local contribution is computed as:
contribution_j = transformed_feature_j × coefficient_j
This returns the top feature contributions for an individual prediction.
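The contribution formula can be sketched as below. The function name `top_contributions`, the `top_k` parameter, and ranking by absolute contribution are assumptions; the feature names and numbers in the usage example are made up for illustration.

```python
def top_contributions(feature_names, transformed_values, coefficients, top_k=2):
    """Return the top_k features ranked by |transformed value × coefficient|."""
    rows = [
        {"feature": name, "value": value, "contribution": value * coef}
        for name, value, coef in zip(feature_names, transformed_values, coefficients)
    ]
    # Largest absolute effect first, regardless of sign.
    rows.sort(key=lambda row: abs(row["contribution"]), reverse=True)
    return rows[:top_k]


# Illustrative inputs only; real values come from the fitted pipeline and model.
top = top_contributions(
    feature_names=["feature_a", "feature_b", "feature_c"],
    transformed_values=[2.0, 1.0, -1.0],
    coefficients=[1.5, -10.0, 0.5],
)
```

For a linear model, the sum of all contributions plus the intercept equals the prediction, which is why this decomposition is exact rather than approximate.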
See:
docs/explainability.md
The model is deployed through a FastAPI service.
GET /health
GET /metrics
POST /predict
POST /explain
GET /health checks whether the service is running.
POST /predict returns the predicted ride price.
POST /explain returns the predicted ride price plus the top local feature contributions.
GET /metrics returns Prometheus-compatible metrics, including:
- Request count by method, endpoint, and status
- Request latency histogram
Example request body:
{
"Number_of_Riders": 60,
"Number_of_Drivers": 25,
"Location_Category": "Urban",
"Customer_Loyalty_Status": "Gold",
"Number_of_Past_Rides": 10,
"Average_Ratings": 4.2,
"Time_of_Booking": "Evening",
"Vehicle_Type": "Economy",
"Expected_Ride_Duration": 40,
"Historical_Cost_of_Ride": 200.0
}
Example /predict response:
{
"predicted_price": 132.66,
"model_name": "ridge",
"model_version": "v1",
"feature_version": "v1"
}
Example /explain response:
{
"predicted_price": 132.66,
"model_name": "ridge",
"model_version": "v1",
"feature_version": "v1",
"top_contributions": [
{
"feature": "Expected_Ride_Duration",
"value": -1.2066,
"contribution": -215.1185
},
{
"feature": "Vehicle_Type_Economy",
"value": 1.0,
"contribution": -21.0976
}
]
}
The API includes request logging middleware.
Each request logs:
- HTTP method
- Endpoint path
- Response status
- Request duration
Prometheus-style metrics are available at:
GET /metrics
Tracked metrics include:
dynamic_pricing_http_requests_total
dynamic_pricing_http_request_duration_seconds
The project includes a Dockerfile for containerized API serving.
Build the image:
docker build -t dynamic-pricing-api:v1 .
Run the container:
docker run --rm -p 8000:8000 dynamic-pricing-api:v1
Open:
http://127.0.0.1:8000/docs
Run locally with Docker Compose:
docker compose up --build
The API will be available at:
http://127.0.0.1:8000/docs
Kubernetes manifests are stored in:
deploy/k8s/base/
Included manifests:
deployment.yaml
service.yaml
configmap.yaml
The Kubernetes setup defines:
- FastAPI deployment
- Multiple API replicas
- ClusterIP service
- ConfigMap for model/service configuration
This prepares the system for deployment using Minikube, kind, or a cloud Kubernetes cluster.
Create a virtual environment and install the package:
python -m venv .venv
source .venv/bin/activate
pip install -e .
For development dependencies:
pip install -e ".[dev]"
For API dependencies:
pip install -e ".[api]"
For explainability dependencies:
pip install -e ".[explain]"
Or install all optional dependencies:
pip install -e ".[dev,api,explain]"
Run the feature pipeline smoke test:
python scripts/smoke_features.py
Train the v1 and v2 benchmark models:
python scripts/train_benchmark.py --version v1
python scripts/train_benchmark.py --version v2
Run the pricing policy simulation:
python scripts/run_simulation.py
Generate the SHAP outputs:
python scripts/run_shap_analysis.py
Start the API locally:
uvicorn dynamic_pricing.api.main:app --reload
Then open:
http://127.0.0.1:8000/docs
Run the tests:
pytest
Important generated outputs:
reports/Project Report.pdf
reports/model_benchmark_v1.md
reports/model_benchmark_v2.md
reports/simulation_results.md
reports/shap_summary.md
Important figures:
reports/figures/v1/
reports/figures/v2/
reports/figures/shap/
Important artifacts:
artifacts/model/v1/
artifacts/model/v2/
Detailed documentation is available in:
docs/data_dictionary.md
docs/feature_pipeline.md
docs/model_selection.md
docs/simulation_methodology.md
docs/explainability.md
Recommended reading order:
1. reports/Project Report.pdf
2. docs/feature_pipeline.md
3. docs/model_selection.md
4. docs/simulation_methodology.md
5. docs/explainability.md
Completed:
- Data validation
- Feature engineering pipeline
- Model benchmarking
- Hyperparameter tuning
- Residual diagnostics
- v1/v2 model versioning
- Ridge v1 production model selection
- Counterfactual pricing simulation
- Probabilistic acceptance model
- SHAP explainability
- FastAPI deployment
- /predict endpoint
- /explain endpoint
- /metrics endpoint
- Request logging
- Docker containerization
- Docker Compose setup
- Kubernetes-ready manifests
Current limitations:
- The dataset does not include true customer acceptance or rejection outcomes.
- The simulation acceptance model is heuristic, not learned from observed conversion data.
- The pricing model predicts historical price rather than directly optimizing profit.
- The system does not currently model long-term customer behavior, driver response, competition, or retention effects.
- SHAP and local feature contributions explain model behavior, not causal relationships.
This project prioritizes:
- Reproducibility
- Versioned artifacts
- Strict data contracts
- Training-serving consistency
- Transparent model selection
- Explainability
- Deployment readiness
- Clear separation of concerns
The result is a complete applied machine learning system that goes beyond prediction and includes simulation, interpretability, monitoring, and deployment.