**File:** `docs/usecases/index.md`

# CMF Use Cases: Optimizing Machine Learning Pipelines

## 1. Overview

The **Common Metadata Framework (CMF)** is a metadata tracking and versioning system purpose-built for ML pipelines. It captures code versions, data artifacts, execution parameters, and performance metrics—providing a unified, queryable, and shareable record of every experiment across distributed teams.

This document describes real-world use cases for CMF, explains how ML pipelines operate **with** and **without** CMF, and provides a detailed comparison to help teams understand where CMF delivers value.

---

## 2. The Problem: ML Pipelines Without CMF

In a typical unmanaged ML workflow, teams rely on ad-hoc practices to track experiments, data, and models. This leads to a range of operational and reproducibility problems.

### 2.1 Typical Unmanaged ML Pipeline

```
Raw Data → Preprocessing → Feature Engineering → Model Training → Evaluation → Deployment
↓ ↓ ↓ ↓ ↓ ↓
manual manual manual manual manual manual
copies notebooks CSV files .pkl saves spreadsheets scripts
```

**Common pain points:**

- **No lineage**: Impossible to trace which data version produced which model.
- **Manual bookkeeping**: Teams use spreadsheets, notebooks, or comments to log parameters and metrics.
- **Reproducibility failure**: Re-running an "old experiment" is unreliable because data, code, or environment may have changed silently.
- **Collaboration gaps**: Metadata stays on individual laptops; team members cannot see each other's runs.
- **Storage waste**: Multiple copies of datasets are saved under ad-hoc names (`data_v1_final_REAL.csv`).
- **No distributed tracking**: Edge/cloud/datacenter nodes produce metadata that never gets consolidated.

---

## 3. The Solution: ML Pipelines With CMF

CMF introduces a **structured metadata layer** over the entire pipeline. Every stage, artifact, and parameter is automatically versioned and tracked in a queryable store.

### 3.1 CMF-Managed ML Pipeline

```
Raw Data → Preprocessing → Feature Engineering → Model Training → Evaluation → Deployment
↓ ↓ ↓ ↓ ↓ ↓
cmf.log_ cmf.log_ cmf.log_ cmf.log_ cmf.log_ cmf.log_
dataset() dataset() dataset() model() metrics() execution()
+ DVC hash + DVC hash + Git SHA + artifact
linkage
↓_______________↓_________________↓__________________↓___________↓___________↓
CMF Metadata Store (SQLite / PostgreSQL)
cmf metadata push/pull
CMF Server + Web UI
```

**What changes:**

- Artifacts are identified by **content hash** (via DVC), not file names.
- Code is tied to a **Git commit SHA** automatically.
- Execution parameters, environment info, and custom properties are logged per-run.
- All metadata is **queryable** and **syncable** across teams and environments.
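The content-hash idea can be sketched with the standard library: an artifact's identity is derived from its bytes, not its name (DVC uses MD5 content hashes; the sample bytes below are illustrative):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content-addressable identifier: hash the bytes, not the filename."""
    return hashlib.md5(data).hexdigest()

# Two ad-hoc copies of the same dataset resolve to one identity...
v1 = b"id,label\n1,0\n2,1\n"
v1_final_real = b"id,label\n1,0\n2,1\n"
assert content_id(v1) == content_id(v1_final_real)

# ...while a silent one-row edit yields a different identity.
edited = b"id,label\n1,1\n2,1\n"
assert content_id(v1) != content_id(edited)
```

Because the identifier depends only on content, renaming, copying, or moving a file never changes what CMF records about it.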

---

## 4. Comparison Chart: ML Pipeline With vs. Without CMF

| Aspect | Without CMF | With CMF |
|--------|-------------|----------|
| **Artifact versioning** | Manual copies, ad-hoc naming | Content-addressable hashing via DVC |
| **Code versioning** | Developer must remember to note Git SHA | Automatically captured per execution |
| **Experiment tracking** | Spreadsheets, notebooks, comments | Structured metadata store (SQLite/PostgreSQL) |
| **Reproducibility** | Often broken; dependencies implicit | Full lineage: data + code + params per run |
| **Data lineage** | Not tracked | Input/output artifact graph per stage |
| **Collaboration** | Metadata siloed on individual machines | Push/pull metadata like Git branches |
| **Distributed execution** | Incompatible log formats per site; custom ETL required to consolidate metadata centrally | Metadata synced across CMF servers via the **Metahub** feature's push/pull functionality |
| **Model traceability** | Hard to link model back to training data | Every model artifact links to exact dataset hash + execution |
| **Querying metadata** | Manual search through logs/files | `CmfQuery` API or Web UI for structured queries |
| **Visualization** | Custom scripts or none | Built-in lineage graphs, Web UI dashboards |
| **Storage efficiency** | Duplicate dataset files across runs | Deduplicated artifacts by content hash |
| **Audit & compliance** | Difficult to reconstruct history | Full immutable execution history |
| **API access** | None (or custom solutions) | REST API + MCP Server for AI assistant access |
| **Multi-stage pipelines** | Manual orchestration notes | Contexts and executions map each stage automatically |
| **Onboarding new members** | Share notebooks / explain verbally | Pull metadata + artifacts; instantly see full history |
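The storage-efficiency row can be illustrated with a toy content-addressable store, a simplification of what a DVC-style cache does: two logical names for identical bytes share one physical copy. All names below are illustrative:

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: one physical copy per unique content."""

    def __init__(self):
        self._blobs = {}   # content hash -> bytes (the single physical copy)
        self._names = {}   # logical name  -> content hash

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.md5(data).hexdigest()
        self._blobs[digest] = data   # identical bytes land on the same key
        self._names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self._blobs[self._names[name]]

store = ContentStore()
h1 = store.put("data_v1_final_REAL.csv", b"id,label\n1,0\n")
h2 = store.put("run7/train.csv", b"id,label\n1,0\n")
assert h1 == h2                # same content, same identity
assert len(store._blobs) == 1  # two names, one stored copy
```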

---

## 5. Use Cases

1. [Reproducible Experiment Management](./usecase1.md)
2. [Multi-Stage Pipeline Tracking](./usecase2.md)
3. [Distributed / Federated ML](./usecase3.md)
4. [Dataset Deduplication and Storage Optimization](./usecase4.md)

**File:** `docs/usecases/usecase1.md`

**Scenario**: A data scientist runs multiple experiments with different hyperparameters (learning rate, tree depth, regularization) and needs to reproduce the best result six weeks later.

**Without CMF**: The best model file exists but the exact data split, preprocessing parameters, and random seed used are lost. Reproducing it requires guesswork.

```python
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load data — but which version? No record exists of how this file was generated.
df = pd.read_parquet("data/train_split.parquet")
X, y = df.drop("label", axis=1), df["label"]

# Parameters hardcoded in the script — never persisted alongside the model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# Save model — filename carries no provenance information
with open("models/rf_v3.pkl", "wb") as f:
    pickle.dump(model, f)

# Accuracy manually noted in a spreadsheet (or simply forgotten)
print("Accuracy:", model.score(X, y)) # 0.974 — but linked to nothing
```

Six weeks later, the problems become clear:

- The spreadsheet records `accuracy=0.974` but not which version of `train_split.parquet` was used.

- `train_split.parquet` may have been silently overwritten by a newer preprocessing run.

- The random seed (`42`) and `n_estimators` (`200`) were never saved next to `rf_v3.pkl`.

- There is no link from the model file back to the Git commit of the training script.

- Re-running the script may produce a slightly different model if the data changed.
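The missing piece in the script above is a single record that binds data identity, code identity, parameters, and metrics together. A minimal stdlib sketch of the bookkeeping CMF automates (the field names, payload, and commit SHA are illustrative):

```python
import datetime
import hashlib

def run_record(data_bytes: bytes, git_sha: str, params: dict, metrics: dict) -> dict:
    """Bind data identity, code identity, parameters, and metrics in one record."""
    return {
        "logged_at": datetime.datetime.now().isoformat(),
        "data_hash": hashlib.md5(data_bytes).hexdigest(),  # pins the exact split
        "git_sha": git_sha,                                # pins the training code
        "params": params,
        "metrics": metrics,
    }

record = run_record(
    b"...bytes of train_split.parquet...",        # illustrative payload
    git_sha="3f2a9c1",                            # illustrative commit SHA
    params={"n_estimators": 200, "random_state": 42},
    metrics={"accuracy": 0.974},
)
```

With such a record, "which data produced `accuracy=0.974`?" has a definite answer; without it, the question is unanswerable.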

**With CMF**:
```python
from cmflib.cmf import Cmf

metawriter = Cmf(filepath="mlmd", pipeline_name="fraud_detection")

context = metawriter.create_context("train")
execution = metawriter.create_execution(
"RandomForest",
    custom_properties={"n_estimators": 200, "seed": 42}
)
metawriter.log_dataset("data/train_split.parquet", "input")
metawriter.log_model("models/rf_v3.pkl", "output")
metawriter.log_execution_metrics("metrics", {"accuracy": 0.974})
```

Every run records dataset hash, parameters, and metrics. Reproduction is a single `cmf metadata pull` + re-run with logged parameters.

---

**File:** `docs/usecases/usecase2.md`

**Scenario**: A four-stage NLP pipeline where each stage consumes outputs from the previous one.

**Pipeline stages:**

1. `parse` — splits raw corpus into train/test sets

2. `featurize` — generates TF-IDF or embedding features

3. `train` — trains a machine learning model

4. `evaluate` — computes precision, recall, F1

**Without CMF**: There is no shared view of how stages connect. Stage outputs are saved as files with no record of which input version they came from. If featurization logic changes, there is no way to tell which downstream models were affected. Re-running a single stage risks using stale inputs from a different run. Cross-stage debugging requires manually correlating file timestamps and notebook outputs.

```python
import json, pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# ── Stage 1: parse ────────────────────────────────────────────────────────────
# Output written to disk — no record of which raw_corpus.jsonl version was used
with open("raw_corpus.jsonl") as f:
records = [json.loads(l) for l in f]
train_records = records[:8000]
test_records = records[8000:]
with open("train_raw.jsonl", "w") as f:
f.writelines(json.dumps(r) + "\n" for r in train_records)
with open("test_raw.jsonl", "w") as f:
f.writelines(json.dumps(r) + "\n" for r in test_records)
# Who knows if train_raw.jsonl on disk is from today's run or last week's?

# ── Stage 2: featurize ────────────────────────────────────────────────────────
# max_features=50000 is hardcoded — never recorded anywhere alongside the output
with open("train_raw.jsonl") as f:
train_texts = [json.loads(l)["text"] for l in f]
vec = TfidfVectorizer(max_features=50000)
X_train = vec.fit_transform(train_texts)
pickle.dump(X_train, open("train_feats.pkl", "wb"))
pickle.dump(vec, open("vectorizer.pkl", "wb"))
# If this logic changes, all downstream models are silently stale — no alert

# ── Stage 3: train ────────────────────────────────────────────────────────────
# No record of which train_feats.pkl this model was trained on
X = pickle.load(open("train_feats.pkl", "rb"))
y = [r["label"] for r in train_records] # train_records may be from a different run!
clf = LogisticRegression(max_iter=500)
clf.fit(X, y)
pickle.dump(clf, open("model.pkl", "wb"))

# ── Stage 4: evaluate ─────────────────────────────────────────────────────────
with open("test_raw.jsonl") as f:
test_texts = [json.loads(l)["text"] for l in f]
vec_loaded = pickle.load(open("vectorizer.pkl", "rb")) # may be from a different featurize run!
X_test = vec_loaded.transform(test_texts)
y_test = [r["label"] for r in test_records]
preds = clf.predict(X_test)
# Metrics printed to stdout — never stored, never linked to model.pkl
print(f"F1: {f1_score(y_test, preds, average='macro'):.3f}")
```

Problems visible above:

- `train_records` in Stage 3 may differ from what was written to `train_raw.jsonl` if any stage is re-run in isolation.

- `vectorizer.pkl` loaded in Stage 4 could be from an earlier featurize run with different `max_features`.

- F1 is printed but never persisted alongside `model.pkl`.

- There is no way to trace which `raw_corpus.jsonl` produced the model currently on disk.
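The staleness problems above have a mechanical fix, and it is what CMF's hash-based lineage provides: snapshot the content hash of every input a stage consumes, then compare before reuse. A stdlib sketch (the function names and artifact bytes are illustrative):

```python
import hashlib

def snapshot(artifacts: dict) -> dict:
    """Record the content hash of every input a stage ran against."""
    return {name: hashlib.md5(data).hexdigest() for name, data in artifacts.items()}

def stale(manifest: dict, artifacts: dict) -> list:
    """Names whose bytes changed since the manifest was taken."""
    return [name for name, digest in manifest.items()
            if hashlib.md5(artifacts[name]).hexdigest() != digest]

# Stage 3 snapshots its inputs at train time...
inputs = {"train_feats.pkl": b"feats-v1", "train_raw.jsonl": b"rows-v1"}
manifest = snapshot(inputs)

# ...then featurize is silently re-run, changing train_feats.pkl.
inputs["train_feats.pkl"] = b"feats-v2"
assert stale(manifest, inputs) == ["train_feats.pkl"]
```

A downstream stage that checks `stale()` before running can refuse to train on inputs that no longer match what was recorded.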

**With CMF**, each stage is a `Context`, and each run is an `Execution`. Artifacts flow automatically:

```python
from cmflib.cmf import Cmf

metawriter = Cmf(filepath="mlmd", pipeline_name="nlp_pipeline")

# Stage 1 - Parse
ctx_parse = metawriter.create_context("parse")
exec_parse = metawriter.create_execution("SplitData")
metawriter.log_dataset("raw_corpus.jsonl", "input")
metawriter.log_dataset("train_raw.jsonl", "output")
metawriter.log_dataset("test_raw.jsonl", "output")

# Stage 2 - Featurize
ctx_feat = metawriter.create_context("featurize")
exec_feat = metawriter.create_execution("TFIDFVectorizer", custom_properties={"max_features": 50000})
metawriter.log_dataset("train_raw.jsonl", "input")
metawriter.log_dataset("test_raw.jsonl", "input")
metawriter.log_dataset("train_feats.npz", "output")
metawriter.log_dataset("test_feats.npz", "output")

# Stage 3 - Train
ctx_train = metawriter.create_context("train")
exec_train = metawriter.create_execution("BERTFinetune", custom_properties={"epochs": 3, "lr": 2e-5})
metawriter.log_dataset("train_feats.npz", "input")
metawriter.log_model("bert_finetuned.pt", "output")

# Stage 4 - Evaluate
ctx_eval = metawriter.create_context("evaluate")
exec_eval = metawriter.create_execution("Evaluate")
metawriter.log_model("bert_finetuned.pt", "input")
metawriter.log_dataset("test_feats.npz", "input")
metawriter.log_execution_metrics("metrics", {"f1": 0.89, "precision": 0.91, "recall": 0.87})
```

The full lineage graph—from raw corpus to final F1 score—is automatically queryable through the Web UI or `CmfQuery`.

---

**File:** `docs/usecases/usecase3.md`

**Scenario**: A company trains models at multiple edge locations (hospitals, factories) and needs to aggregate metadata centrally without sharing raw data.

**Without CMF**: Each site maintains local logs in incompatible formats. Merging them into a central view requires custom ETL.

```python
# ── Site A (Hospital) — logs to CSV ──────────────────────────────────────────
import csv, datetime, torch

def train_site_a(data_path, model_save_path):
# ... training code ...
auc = 0.91 # manually computed

# Logged to a local CSV — schema invented by Site A's engineer
with open("site_a_runs.csv", "a") as f:
csv.writer(f).writerow([
datetime.datetime.now().isoformat(),
model_save_path, # just a filename, no content hash
data_path, # just a path, could be overwritten tomorrow
auc
])
# Raw data never leaves Site A, but metadata is trapped here too

train_site_a("data/hospital_data.csv", "models/hospital_model.pt")


# ── Site B (Factory) — logs to JSONL with a different schema ─────────────────
import json

def train_site_b(checkpoint_path, input_file, metrics):
# ... training code ...

# Different key names, different nesting — incompatible with Site A
entry = {
"ts": str(datetime.datetime.now()), # "ts" vs "timestamp" in Site A
"checkpoint": checkpoint_path, # "checkpoint" vs "model_save_path"
"input_file": input_file, # "input_file" vs "data_path"
"results": metrics # nested dict vs flat columns
}
with open("site_b_log.jsonl", "a") as f:
f.write(json.dumps(entry) + "\n")

train_site_b("chkpt_factory.pt", "factory_data_may.csv", {"auc": 0.88, "loss": 0.34})


# ── Central team tries to merge ───────────────────────────────────────────────
import pandas as pd

site_a = pd.read_csv("site_a_runs.csv",
names=["timestamp", "model_save_path", "data_path", "auc"])

site_b_records = []
with open("site_b_log.jsonl") as f:
for line in f:
r = json.loads(line)
site_b_records.append({
"timestamp": r["ts"], # rename to match Site A
"model_save_path": r["checkpoint"], # rename to match Site A
"data_path": r["input_file"], # rename to match Site A
"auc": r["results"]["auc"] # unnest to match Site A
})
site_b = pd.DataFrame(site_b_records)

combined = pd.concat([site_a, site_b], ignore_index=True)
# ❌ No content hashes — cannot tell if hospital_data.csv and factory_data_may.csv
# are related, or whether sites accidentally trained on the same underlying data.
# ❌ When Site C joins, this ETL script must be updated again.
# ❌ No way to verify data_path still points to the same file used during training.
print(combined)
```

Problems visible above:

- Site A uses flat CSV columns; Site B uses nested JSONL — the central team writes a manual ETL every time a new site is added.

- Paths like `hospital_data.csv` are just strings; if the file is overwritten, provenance is permanently lost.

- No content hashes mean it is impossible to tell whether two sites trained on identical data.

- Metrics are stored separately from model artifacts with no automated link between them.
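Content hashing also answers the question the ETL above cannot: whether two sites trained on the same underlying data. Each site hashes its bytes locally and shares only the digest, so comparison requires no data transfer. A stdlib sketch with illustrative sample rows:

```python
import hashlib

def digest(data: bytes) -> str:
    """Local content digest; only this 32-char string leaves the site."""
    return hashlib.md5(data).hexdigest()

site_a_data = b"patient_id,outcome\n17,1\n42,0\n"
site_b_data = b"patient_id,outcome\n17,1\n42,0\n"
site_c_data = b"machine_id,fault\n3,0\n"

assert digest(site_a_data) == digest(site_b_data)  # same underlying data
assert digest(site_a_data) != digest(site_c_data)  # genuinely different data
```

This is why the CMF approach below works for privacy-sensitive sites: digests, metrics, and parameters are sufficient for central lineage, and the raw bytes never move.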

**With CMF**:

- Each site runs CMF locally and tracks metadata to a local SQLite database.

- Periodically, each site runs `cmf metadata push` to the central CMF Server.

- The server merges metadata from all sites into a unified PostgreSQL database.

- No raw data ever leaves the site—only metadata (hashes, metrics, parameters) is transferred.

```
Site A (Hospital) ─── cmf metadata push ──▶ ┐
Site B (Factory) ─── cmf metadata push ──▶ CMF Server ──▶ Unified Metadata DB
Site C (Research) ─── cmf metadata push ──▶ ┘ ↕
CMF Web UI / Query API
```