- mlops-01-data-versioning-dvc
- mlops-02-DVC-model-tracking
- mlops-03-ci-cd-ml-pipeline
- mlops-04-docker-k8s-serving

cd project1
git add .
git commit -m "update"
git push

git submodule add https://github.com/yourusername/project1.git
git submodule add https://github.com/yourusername/project2.git
| 🚀 Projects | 🧩 Task | 🎯 Objective | 🛠️ Prominent Techniques / Tools |
|---|---|---|---|
| 1. MLflow Tracking & Model Registry | Regression & experiment tracking | Learn MLflow tracking, inference, model registry, versioning | Python, Scikit‑learn, MLflow, Pandas, NumPy, MLflow PyFunc |
| 2. House Price Prediction (MLflow) | End‑to‑end regression | Train, tune, compare, and register best model | Random Forest, GridSearchCV, MLflow Tracking & Registry |
| 3. ANN with MLflow (End‑to‑End MLOps) | Neural network regression | Build production‑ready ANN with full ML lifecycle | Keras, TensorFlow, MLflow, Hyperopt (TPE), PyFunc |
| 4. ML Pipeline with DVC & MLflow | Reproducible ML pipeline | Version data, models, and experiments together | DVC, MLflow, Random Forest, Git, DagsHub |
| 5. Hello Docker Project | Containerization basics | Learn Docker image build & container execution | Docker, Dockerfile, Container Lifecycle |
| 6. Airflow Math Sequence DAG | Workflow orchestration | Learn DAGs, dependencies, and XComs | Apache Airflow 2.x, TaskFlow API, Astro CLI |
| 7. Airflow MLOps Pipeline | End‑to‑end MLOps workflow | Simulate real‑world ML pipeline with deploy decisions | Airflow, Python, MLOps Concepts, Astro CLI |
This project demonstrates an end-to-end machine learning workflow for house price prediction using the California Housing dataset and MLflow for experiment tracking, hyperparameter tuning, and model registration.
The main goals of this project are:
- Train a regression model with hyperparameter tuning
- Track all experiments, parameters, and metrics using MLflow
- Compare multiple runs in the MLflow UI
- Register the best-performing model in the MLflow Model Registry
- California Housing Dataset
- Source: `sklearn.datasets.fetch_california_housing`
- Number of samples: 20,640
- Features: MedInc (median income), HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
- Target: Price (median house value in units of $100,000)
- Algorithm: Random Forest Regressor
- Evaluation metric: Mean Squared Error (MSE)
- Hyperparameter tuning: GridSearchCV
- Dataset loaded using `fetch_california_housing`
- Converted into a pandas DataFrame
- Target variable added as `Price`
- Independent variables (`X`) created by dropping the `Price` column
- Dependent variable (`y`) set as `Price`
- Train-test split performed (80% training / 20% testing)
- Hyperparameter tuning implemented using `GridSearchCV`
- Parameters tuned: `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`
- 3-fold cross-validation used
- Scoring metric: `neg_mean_squared_error`
- Best model selected from GridSearchCV
- Predictions generated on test data
- Mean Squared Error calculated
- MLflow tracking server used (`http://127.0.0.1:5000`)
- Logged to MLflow: best hyperparameters, Mean Squared Error (MSE), model artifacts
- Model input/output signature inferred using `infer_signature`
- Best-performing model registered in MLflow Model Registry
- Registered model name:
Best Hyperparameters:
n_estimators: 200
max_depth: None
min_samples_split: 2
min_samples_leaf: 1
Mean Squared Error (MSE): ~0.25
Model successfully tracked and registered in MLflow
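
To make the tuning setup described above more concrete, here is a minimal plain-Python sketch of what an exhaustive grid search with 3-fold cross-validation enumerates, and why the scoring metric is the negated MSE. The grid values and the toy predictions are assumptions for illustration only (the project names the four parameters but not the values searched), and `expand_grid` and `pick_best` are hypothetical helpers, not scikit-learn functions.

```python
from itertools import product

# Assumed grid values for illustration only; the project lists the four
# parameter names but not the values that were searched.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
N_FOLDS = 3  # 3-fold cross-validation, as stated in the project description


def expand_grid(grid):
    """Return every parameter combination an exhaustive grid search would try."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def mean_squared_error(y_true, y_pred):
    """Average of squared residuals; lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_squared_error(y_true, y_pred):
    """Negated MSE, so that 'higher is better' holds as for any other score."""
    return -mean_squared_error(y_true, y_pred)


def pick_best(scores):
    """Index of the highest score, i.e. the lowest MSE when scores are negated."""
    return max(range(len(scores)), key=scores.__getitem__)


candidates = expand_grid(param_grid)
total_fits = len(candidates) * N_FOLDS  # one fit per candidate per fold

# Toy targets and predictions (made up) to show the sign convention.
y_true = [2.0, 1.5, 3.0]
good = neg_mean_squared_error(y_true, [2.0, 1.5, 2.0])  # MSE = 1/3
bad = neg_mean_squared_error(y_true, [0.0, 1.5, 3.0])   # MSE = 4/3
best_index = pick_best([bad, good])  # selects the candidate with the smaller error
```

With this assumed grid there are 24 candidates and 72 fits in total. In scikit-learn the same ideas correspond to passing `cv=3` and `scoring="neg_mean_squared_error"` to `GridSearchCV`, which maximizes the score and therefore minimizes the MSE.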
Production-oriented MLOps pipeline demonstrating how to train, tune, track, register, and serve an Artificial Neural Network (ANN) using MLflow.
The project covers the full ML lifecycle from experimentation to deployment-ready inference.
This project demonstrates hands-on experience with:
- ✅ Experiment tracking at scale
- ✅ Hyperparameter optimization
- ✅ Model registry & versioning
- ✅ Reproducible ML workflows
- ✅ Deployment-ready model artifacts
It reflects real-world ML engineering and MLOps practices, not just model training.
- ANN built with Keras
- Regression task (Wine Quality prediction)
- Feature normalization inside the model graph
- Metric-driven model selection (RMSE)
- Hyperopt + TPE for hyperparameter search
- Search space:
  - Learning rate (log-uniform)
  - Momentum (uniform)
- Best model selected automatically based on validation RMSE
- MLflow Experiments
  - Parameters
  - Metrics
  - Model artifacts
  - Nested runs for hyperparameter sweeps
- MLflow Model Registry
  - Versioned models
  - Promotion-ready artifacts
- Model loaded using MLflow PyFunc
- Serving input validated prior to deployment
- Compatible with:
  - REST API serving
  - Batch inference
  - Cloud ML platforms
- MLflow-compatible model format
- Can be packaged into:
  - Docker containers
  - Cloud-native serving endpoints
- Clear separation of training, evaluation, and inference
- Python 3.10
- Keras / TensorFlow
- MLflow
- Hyperopt
- Scikit-learn
- Pandas / NumPy
- Automated experiment comparison
- Best-performing ANN selected and registered
- Fully reproducible ML pipeline
- Production-aligned workflow (training → registry → inference)
- MLOps & ML Engineering
- Experiment tracking & reproducibility
- Hyperparameter optimization
- Model versioning & governance
- Deployment-oriented ML design
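
The search space listed earlier in this section combines a log-uniform learning rate with a uniform momentum, and the best model is chosen by validation RMSE. As a rough plain-Python illustration of what those two distributions and the selection rule mean, consider the sketch below. It uses plain random sampling, not Hyperopt's TPE algorithm, and the bounds are assumptions because the project does not state the actual ranges; all function names are hypothetical.

```python
import math
import random

# Assumed bounds for illustration; the project does not list the actual ranges.
LR_LOW, LR_HIGH = 1e-5, 1e-1
MOMENTUM_LOW, MOMENTUM_HIGH = 0.0, 1.0


def sample_log_uniform(rng, low, high):
    """Draw x so that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_params(rng):
    """One candidate configuration from the two-parameter search space."""
    return {
        "lr": sample_log_uniform(rng, LR_LOW, LR_HIGH),
        "momentum": rng.uniform(MOMENTUM_LOW, MOMENTUM_HIGH),
    }


def rmse(y_true, y_pred):
    """Root mean squared error, the selection metric named in this section."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def select_best(trials):
    """Return the trial with the lowest validation RMSE."""
    return min(trials, key=lambda trial: trial["val_rmse"])


rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_params(rng) for _ in range(1000)]

# Roughly half of the log-uniform draws land below the geometric midpoint (1e-3),
# which is why this distribution suits parameters spanning orders of magnitude.
share_below_midpoint = sum(s["lr"] < 1e-3 for s in samples) / len(samples)
```

In Hyperopt the corresponding search-space expressions are `hp.loguniform` (whose bounds are given in log space) and `hp.uniform`; TPE then proposes new candidates based on earlier trial results instead of sampling independently as this sketch does.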
📌 Project Overview
This project demonstrates how to build an end-to-end machine learning pipeline using DVC (Data Version Control) for data and model versioning and MLflow for experiment tracking. The pipeline trains and evaluates a Random Forest Classifier on the Pima Indians Diabetes Dataset, following best practices for reproducibility and MLOps.
The goal of this project is to show how data, code, models, and experiments can be tracked together in a structured and scalable way.
🚀 Key Features
✅ End-to-end ML pipeline
✅ Data and model versioning with DVC
✅ Experiment tracking with MLflow
✅ Reproducible pipeline stages
✅ Modular and scalable project structure
✅ Integration-ready with cloud storage (S3 / GCS / Azure)
🧰 Tech Stack
Python
Scikit-learn
DVC – data & pipeline versioning
MLflow – experiment tracking
Git / DagsHub – code, data, and collaboration
📂 Project Structure
machinelearningpipeline/
│
├── data/                 # Raw and processed datasets (DVC tracked)
├── src/                  # Source code for pipeline stages
│   ├── data_preprocessing.py
│   ├── train_model.py
│   └── evaluate_model.py
│
├── models/               # Trained models (DVC tracked)
├── dvc.yaml              # DVC pipeline definition
├── params.yaml           # Model and pipeline parameters
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation
🔄 Pipeline Stages
Data Preprocessing
Cleans and prepares the dataset
Splits data into training and test sets
Model Training
Trains a Random Forest Classifier
Logs parameters and metrics to MLflow
Model Evaluation
Evaluates model performance
Tracks evaluation metrics for comparison
1️⃣ Clone the Repository
git clone https://dagshub.com/venky.sarangi/machinelearningpipeline
cd machinelearningpipeline
2️⃣ Create Environment & Install Dependencies
pip install -r requirements.txt
3️⃣ Run the DVC Pipeline
dvc repro
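
`dvc repro` re-executes only the stages whose dependencies have changed since the last run, which it detects by comparing content hashes recorded in `dvc.lock`. The following is a simplified plain-Python sketch of that idea, not DVC's actual implementation: the `Stage` class, the in-memory `lock` dictionary, the use of SHA-256, and the file names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Content hash of one dependency file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class Stage:
    """A pipeline step with input files and a command (here, a Python callable)."""

    def __init__(self, name, deps, run):
        self.name = name
        self.deps = [Path(d) for d in deps]
        self.run = run

    def fingerprint(self):
        """Map each dependency path to the hash of its current content."""
        return {str(d): file_hash(d) for d in self.deps}


def repro(stages, lock):
    """Run stages in order, skipping any whose dependency hashes match the lock."""
    executed = []
    for stage in stages:
        current = stage.fingerprint()
        if lock.get(stage.name) == current:
            continue  # nothing changed, so the previous result is reused
        stage.run()
        lock[stage.name] = current  # remember what this run was based on
        executed.append(stage.name)
    return executed


# Hypothetical two-stage pipeline working in a temporary directory.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
processed = workdir / "processed.csv"
model = workdir / "model.txt"
raw.write_text("a,b\n1,2\n")

stages = [
    Stage("preprocess", [raw], lambda: processed.write_text(raw.read_text().upper())),
    Stage("train", [processed], lambda: model.write_text("trained on " + processed.read_text())),
]

lock = {}
first = repro(stages, lock)    # both stages run
second = repro(stages, lock)   # nothing changed, nothing runs
raw.write_text("a,b\n3,4\n")
third = repro(stages, lock)    # new raw data invalidates both stages
```

The second call returns an empty list because every recorded hash still matches, which is the behavior that makes repeated `dvc repro` runs cheap and reproducible.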
📊 Experiment Tracking
All experiments, metrics, and model artifacts are logged using MLflow. You can compare runs and analyze performance directly from the DagsHub UI.
☁️ Data & Model Versioning
Datasets and models are tracked using DVC
Supports remote storage (AWS S3, GCS, Azure, or DagsHub Storage)
Ensures full reproducibility across environments
🎯 Use Cases
Learning MLOps fundamentals
Building reproducible ML workflows
Experiment tracking and model comparison
Production-ready ML pipeline templates
🤝 Contributing
Contributions are welcome! Feel free to open issues or submit pull requests to improve the pipeline.
📜 License
This project is intended for educational and learning purposes.
⭐ If you find this project useful, consider starring it!
- `docker --version` → Verify Docker installation
- `docker build -t welcome-app .` → Build Docker image
- `docker run -p 5000:5000 welcome-app` → Run the container
- `docker ps` → Check running containers
This repository contains beginner‑to‑intermediate Apache Airflow examples designed to help you understand real‑world workflows and MLOps pipelines using modern Airflow 2.x (TaskFlow API).
✅ Can be run locally using Astronomer (Astro CLI)
✅ Easy to understand and production‑oriented
✅ Fully compatible with Airflow UI
📂 Project Structure
.
├── dags/
│   ├── math_sequence_dag.py
│   └── mlops_basic_pipeline.py
├── README.md
└── requirements.txt
📌 Project 1: Math Sequence DAG
🔢 Description
This DAG demonstrates task dependencies and XCom value passing using a simple math workflow.
✅ Workflow Steps
Start with number 10
Add 5
Multiply by 2
Subtract 3
Square the final result
✅ Concepts Learned
DAG creation
TaskFlow API (@task)
Task dependencies
Automatic XComs
Logs in Airflow UI
✅ Final Result
10 → 15 → 30 → 27 → 729
📄 DAG File
dags/math_sequence_dag.py
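
The DAG file itself is not reproduced in this README. As a hedged sketch of the arithmetic it chains, the workflow steps can be written as plain Python functions. In an Airflow 2.x DAG, each function would typically be decorated with `@task` (from `airflow.decorators`) so that return values are passed between tasks through XCom automatically. The function names below are assumptions, and this sketch runs without Airflow.

```python
def start_number():
    """Step 1: the starting value."""
    return 10


def add_five(value):
    """Step 2: add 5."""
    return value + 5


def multiply_by_two(value):
    """Step 3: multiply by 2."""
    return value * 2


def subtract_three(value):
    """Step 4: subtract 3."""
    return value - 3


def square(value):
    """Step 5: square the result."""
    return value ** 2


def run_sequence():
    """Call the steps in dependency order and keep every intermediate value."""
    trace = [start_number()]
    for step in (add_five, multiply_by_two, subtract_three, square):
        # Each step receives the previous step's return value, which is the
        # role XCom plays between TaskFlow tasks.
        trace.append(step(trace[-1]))
    return trace


trace = run_sequence()  # [10, 15, 30, 27, 729]
```

The list of intermediate values matches the final result shown above, which is a quick way to check the task logs in the Airflow UI against the expected numbers.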
🤖 Description
This project simulates a production‑style MLOps workflow of the kind commonly used in industry.
✅ Workflow Steps
Extract data (mock)
Validate data
Train ML model
Evaluate model performance
Decide whether to deploy
✅ Concepts Learned
End‑to‑end ML pipelines
Data validation
Model training & evaluation
Deployment decision logic
Failure handling
📄 DAG File
dags/mlops_basic_pipeline.py
🔄 MLOps Workflow Diagram (Logical)
Extract Data
   ↓
Validate Data
   ↓
Train Model
   ↓
Evaluate Model
   ↓
Deploy Model (Yes / No)
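
The five steps in the diagram can be sketched in plain Python, with the deployment decision expressed as a threshold check. Everything here is an assumption for illustration: the mock data, the stand-in model (a one-parameter least-squares fit), the `MAX_ALLOWED_MSE` threshold, and the function names. The actual DAG file may differ, and in Airflow each function would be its own task so that a failed validation stops the downstream tasks.

```python
MAX_ALLOWED_MSE = 1.0  # hypothetical deployment threshold


def extract_data():
    """Mock extraction: return (feature, target) pairs instead of reading a source."""
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def validate_data(rows):
    """Fail fast on empty or malformed input so later steps never see bad data."""
    if not rows:
        raise ValueError("no rows extracted")
    for row in rows:
        if len(row) != 2 or any(value is None for value in row):
            raise ValueError(f"malformed row: {row!r}")
    return rows


def train_model(rows):
    """Fit y = slope * x by least squares through the origin (a stand-in model)."""
    numerator = sum(x * y for x, y in rows)
    denominator = sum(x * x for x, _ in rows)
    return {"slope": numerator / denominator}


def evaluate_model(model, rows):
    """Mean squared error of the model on the given rows.

    Evaluated on the training rows for brevity; a real pipeline would use held-out data.
    """
    errors = [(y - model["slope"] * x) ** 2 for x, y in rows]
    return sum(errors) / len(errors)


def decide_deploy(mse, threshold=MAX_ALLOWED_MSE):
    """Deploy only when the error is within the threshold."""
    return mse <= threshold


def run_pipeline():
    """Run the steps in the order shown in the diagram."""
    rows = validate_data(extract_data())
    model = train_model(rows)
    mse = evaluate_model(model, rows)
    return {"slope": model["slope"], "mse": mse, "deploy": decide_deploy(mse)}


result = run_pipeline()
```

Keeping the decision in its own function makes the deploy rule easy to change or to replace with a branching task later.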
🛠️ Prerequisites
Docker
Git
Astro CLI
✅ Install Astro CLI
curl -sSL https://install.astronomer.io | sudo bash
Verify:
astro version
1️⃣ Clone the Repository
git clone https://github.com/your-username/airflow-learning-projects.git
cd airflow-learning-projects
2️⃣ Start Airflow with Astro
astro dev start
3️⃣ Open Airflow UI
Login:
Username: admin
Password: admin
4️⃣ Trigger the DAGs
math_sequence_dag
mlops_basic_pipeline
🎯 What You Will Learn
✅ Apache Airflow
DAG lifecycle
TaskFlow API
Scheduling
Logging & retries
XComs
UI debugging
✅ MLOps Foundations
Pipeline design
Model validation
Deployment decision logic
Production‑style workflows
🧠 How This Maps to Real Production Systems

| Learning Example | Real Production |
|---|---|
| extract_data | S3 / BigQuery / APIs |
| validate_data | Great Expectations |
| train_model | sklearn / PyTorch |
| evaluate_model | MLflow |
| deploy_model | Kubernetes / SageMaker |

🚀 Future Enhancements
Planned improvements:
MLflow integration
Branching DAGs
Sensors for new data
CI/CD for DAGs
Feature engineering pipelines
Retraining schedules
🤝 Contributions
Contributions are welcome! Feel free to:
Open issues
Submit pull requests
Suggest improvements
📜 License
This project is licensed under the MIT License.
⭐ Support
If this repository helped you learn Airflow or MLOps:
⭐ Star the repo
🔁 Share it with others
Happy Learning & Orchestrating 🚀
📊 Project Summary Table

| Project Name | Task | Objective | Prominent Techniques / Tools |
|---|---|---|---|
| Project 1: Linear Regression with Iris Dataset (MLflow) | Regression modeling & experiment tracking | Learn MLflow fundamentals: tracking, inference, model registry, versioning | Python, Scikit‑learn, MLflow, Pandas, NumPy, MLflow PyFunc |
| Project 2: House Price Prediction (MLflow) | End‑to‑end regression with tuning | Train, tune, compare, and register best model using MLflow | Random Forest, GridSearchCV, MLflow Tracking & Registry, sklearn |
| Project 3: ANN with MLflow (End‑to‑End MLOps) | Neural network regression | Build production‑ready ANN with full ML lifecycle | Keras/TensorFlow, MLflow, Hyperopt (TPE), PyFunc, Model Registry |
| Project 4: ML Pipeline with DVC & MLflow | Reproducible ML pipeline | Version data, models, and experiments together | DVC, MLflow, Random Forest, Git, DagsHub |
| Project 5: Hello Docker Project | Containerization basics | Learn Docker image build & container execution | Docker CLI, Dockerfile, Container lifecycle |
| Project 6.1: Airflow Math Sequence DAG | Workflow orchestration | Learn DAGs, dependencies, and XComs | Apache Airflow 2.x, TaskFlow API, Astro CLI |
| Project 6.2: Airflow MLOps Pipeline | MLOps workflow orchestration | Simulate real‑world ML pipeline with deploy decisions | Airflow, Python, MLOps concepts, Astro CLI |

🎯 Skill Coverage Map
Machine Learning
Regression, Classification, ANN
Model evaluation (MSE, RMSE)
Feature handling & preprocessing
MLOps
MLflow tracking & registry
Model inference & validation
Hyperparameter optimization
Reproducibility & governance
Data & Pipeline Engineering
DVC pipelines
Versioned datasets & models
Parameterized workflows
DevOps / Platform
Docker fundamentals
Airflow orchestration
Astronomer (Astro CLI)