The purpose of the project

The aim of this project is to build a full end-to-end ML project. Important: the main focus of the project is to show the MLOps flow, not to build the best model.

The underlying ML task is to predict bike ride duration given the start and end station, start time, bike type, and type of membership.
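
To make the task concrete, here is a minimal sketch of how one raw ride record could be split into features and a target. The column names (started_at, ended_at, start_station_id, end_station_id, rideable_type, member_casual) are assumed to match the post-April-2020 Capital Bikeshare CSV layout. The function name ride_to_example, the derived hour and weekday features, the choice of minutes as the unit, and the sample values are all hypothetical and not necessarily what this repo's scripts do.

```python
from datetime import datetime


def ride_to_example(ride: dict) -> tuple[dict, float]:
    """Split one raw ride record into model features and the target in minutes."""
    started = datetime.fromisoformat(ride["started_at"])
    ended = datetime.fromisoformat(ride["ended_at"])
    features = {
        "start_station_id": ride["start_station_id"],
        "end_station_id": ride["end_station_id"],
        "start_hour": started.hour,
        "start_weekday": started.weekday(),  # Monday == 0
        "rideable_type": ride["rideable_type"],
        "member_casual": ride["member_casual"],
    }
    # Target: ride duration in minutes (the unit is an assumption for this sketch).
    duration_min = (ended - started).total_seconds() / 60
    return features, duration_min


# Illustrative record with made-up values.
example_ride = {
    "started_at": "2020-04-25 17:28:39",
    "ended_at": "2020-04-25 17:35:04",
    "start_station_id": "31200",
    "end_station_id": "31229",
    "rideable_type": "docked_bike",
    "member_casual": "casual",
}
features, duration_min = ride_to_example(example_ride)
```

The end time is only used to compute the target; it is not available at prediction time and so is not part of the features.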

A potential use case is the following:

A customer takes a bike from a station and wants to know how long it will take to get to the destination station. They enter the destination station, and the rest of the features are logged automatically. The request is sent to the web service, which returns the predicted duration, and the customer can then decide whether to take the bike.
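
As an illustration of this flow, the sketch below assembles the body of such a request. The web service lives in a separate repo and its real request schema is not described here, so the field names, the function build_prediction_request, and the sample values are all assumptions.

```python
import json
from datetime import datetime


def build_prediction_request(end_station_id, *, start_station_id, rideable_type,
                             member_casual, started_at=None):
    """Assemble a JSON body for one duration-prediction request (hypothetical schema)."""
    started_at = started_at or datetime.now()
    payload = {
        "end_station_id": end_station_id,      # the only value the customer enters
        "start_station_id": start_station_id,  # known from where the bike was taken
        "started_at": started_at.isoformat(timespec="seconds"),
        "rideable_type": rideable_type,
        "member_casual": member_casual,
    }
    return json.dumps(payload)


# Made-up values, with a fixed timestamp so the result is reproducible.
body = build_prediction_request(
    "31229",
    start_station_id="31200",
    rideable_type="classic_bike",
    member_casual="member",
    started_at=datetime(2023, 7, 1, 8, 30),
)
```

Only the destination comes from the customer; everything else is filled in by the client, which is what "logged automatically" amounts to in this sketch.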

Whole project structure

The project consists of 3 repos:

This repo - contains the code for data preparation, model training, and registering the model in the model registry
The web service repo - contains the code for the web service that serves the model
The web service infra repo - contains the IaC code for the web service infrastructure, created with Terraform

Data and Modelling flow

This project uses Weights and Biases (W&B) for experiment tracking and as the model registry. The W&B project is public: https://wandb.ai/aaalex-lit/capitalbikeshare-mlops/overview

The data

The data is provided by Capital Bikeshare and contains information about bike rides in Washington DC. Downloadable files are available at the following link: https://s3.amazonaws.com/capitalbikeshare-data/index.html The project uses data from April 2020 to today (the scripts pick up new data automatically). April 2020 is the starting point because the data format changed that month and the scripts are not compatible with the old format.
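
A minimal sketch of how the list of monthly archives from April 2020 onwards could be generated. The file naming pattern (YYYYMM-capitalbikeshare-tripdata.zip) is an assumption based on the bucket listing, and the function name is hypothetical; the actual download script may work differently.

```python
from datetime import date

BASE_URL = "https://s3.amazonaws.com/capitalbikeshare-data"
FIRST_MONTH = date(2020, 4, 1)  # first month published in the new data format


def monthly_archive_urls(today: date) -> list[str]:
    """Return one archive URL per month from April 2020 up to the month of `today`."""
    urls = []
    year, month = FIRST_MONTH.year, FIRST_MONTH.month
    while (year, month) <= (today.year, today.month):
        # Assumed naming pattern: YYYYMM-capitalbikeshare-tripdata.zip
        urls.append(f"{BASE_URL}/{year}{month:02d}-capitalbikeshare-tripdata.zip")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls
```

The archive for the current month is usually not published yet, so a downloader built on this would need to tolerate a missing file for the last URL.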

The flow

Raw data download
Raw data combination
Data preparation
Modelling
1. Baseline model
2. Hyperparameter tuning using Weights and Biases Sweeps
3. Training the model with the best hyperparameters
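
Step 2 of the modelling flow can be sketched as a W&B sweep configuration. The search space, the metric name rmse, and the helper run_sweep are assumptions for illustration; the values used in this repo may differ.

```python
# Hypothetical search space for an XGBoost regressor.
sweep_config = {
    "method": "bayes",  # W&B also supports "grid" and "random"
    "metric": {"name": "rmse", "goal": "minimize"},
    "parameters": {
        "max_depth": {"values": [4, 6, 8, 10]},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 0.01,
            "max": 0.3,
        },
        "subsample": {"distribution": "uniform", "min": 0.5, "max": 1.0},
    },
}


def run_sweep(train_fn, project, count=10):
    """Register the sweep with W&B and run `count` trials of `train_fn` locally."""
    import wandb  # imported lazily so the config can be inspected without wandb installed

    sweep_id = wandb.sweep(sweep=sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)
    return sweep_id
```

Inside train_fn, each trial would call wandb.init(), read its hyperparameters from wandb.config, and log the metric named in the configuration so the sweep can rank the runs.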

The repository structure

The project structure is inspired by the Cookiecutter Data Science template (but not directly created from it).

Steps to reproduce:

Important: this project is intended to be easily reproducible, but it relies on a conda environment, so you need a working conda installation.

General

A .env file needs to be created in the root of the project with the following content. Alternatively, set the same environment variables on the command line. Note: you can try using some other project name, but I haven't tested it properly and something might eventually break.
```
WANDB_API_KEY=<your_wandb_api_key>
PROJECT_NAME=capitalbikeshare-mlops
```
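
A quick way to check that both settings are visible before running anything is sketched below. It uses only the standard library and a deliberately simple parser (no quoting or export handling); the function names are hypothetical and this is not how the project's scripts load their configuration.

```python
import os
from pathlib import Path

REQUIRED = ("WANDB_API_KEY", "PROJECT_NAME")


def load_env_file(path=".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_settings(path=".env", environ=None) -> list:
    """Return the required names found neither in the environment nor in the file."""
    environ = os.environ if environ is None else environ
    file_values = load_env_file(path) if Path(path).exists() else {}
    return [key for key in REQUIRED if not (environ.get(key) or file_values.get(key))]
```

Calling missing_settings() from the project root returns an empty list when both variables are set in either place.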
Create a Python 3.10-based conda environment
```
make create_environment
```
Activate the created environment
```
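# $(PROJECT_NAME) is Makefile syntax; in a shell, pass the environment name directly
# (assumed here to match PROJECT_NAME from the .env file)
conda activate capitalbikeshare-mlops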
```
Install the dependencies from requirements.txt:
```
make requirements
```
To be able to execute the code in the packages, run the following from the root of the project:
```
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```
Log in to Prefect Cloud
```
prefect cloud login
```

Data downloading and preparation:

All the processes can be run either as Python scripts or as Prefect deployments. For Prefect deployments, see the separate documentation in prefect.md.

Download raw data:
```
python src/data/download_raw.py
```
Combine raw data into one file:
```
python src/data/combine_raw.py
```
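
Conceptually, the combination step concatenates the monthly CSV files into a single file with one header row. The sketch below does that with the standard library only; the function name and the strict header check are assumptions, and the actual combine_raw.py may be implemented differently.

```python
import csv
from pathlib import Path


def combine_csv_files(raw_dir, out_path) -> int:
    """Concatenate all CSV files in raw_dir into out_path, writing the header once.

    Returns the number of data rows written. out_path should live outside
    raw_dir so the output is not picked up as an input.
    """
    rows_written = 0
    header = None
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for csv_path in sorted(Path(raw_dir).glob("*.csv")):
            with open(csv_path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    # Files in a different format cannot be combined safely.
                    raise ValueError(f"{csv_path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing on a header mismatch reflects the reason given for starting at April 2020: files in the older format cannot simply be appended to the newer ones.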
Prepare data for modelling:
```
python src/data/prepare.py
```

Modelling

Baseline xgboost model
```
python src/models/xgb_baseline.py
```
Hyperparameter tuning for xgboost model using Weights and Biases Sweeps
```
python src/models/xgb_sweep.py
```
Retrain a model with the best parameters from a sweep and add it to the model registry
```
python src/models/register_best_model.py
```

Running tests

Run unit tests

```
make test
```

Lint, format, sort imports

```
make quality_checks
```

