**File:** `docs/usecases/index.md`

# CMF Use Cases: Optimizing Machine Learning Pipelines

## 1. Overview

The **Common Metadata Framework (CMF)** is a metadata tracking and versioning system purpose-built for ML pipelines. It captures code versions, data artifacts, execution parameters, and performance metrics—providing a unified, queryable, and shareable record of every experiment across distributed teams.

This document describes real-world use cases for CMF, explains how ML pipelines operate **with** and **without** CMF, and provides a detailed comparison to help teams understand where CMF delivers value.

---

## 2. The Problem: ML Pipelines Without CMF

In a typical unmanaged ML workflow, teams rely on ad-hoc practices to track experiments, data, and models. This leads to a range of operational and reproducibility problems.

### 2.1 Typical Unmanaged ML Pipeline

```
Raw Data → Preprocessing → Feature Engineering → Model Training → Evaluation → Deployment
↓ ↓ ↓ ↓ ↓ ↓
manual manual manual manual manual manual
copies notebooks CSV files .pkl saves spreadsheets scripts
```

**Common pain points:**

- **No lineage**: Impossible to trace which data version produced which model.
- **Manual bookkeeping**: Teams use spreadsheets, notebooks, or comments to log parameters and metrics.
- **Reproducibility failure**: Re-running an "old experiment" is unreliable because data, code, or environment may have changed silently.
- **Collaboration gaps**: Metadata stays on individual laptops; team members cannot see each other's runs.
- **Storage waste**: Multiple copies of datasets are saved under ad-hoc names (`data_v1_final_REAL.csv`).
- **No distributed tracking**: Edge/cloud/datacenter nodes produce metadata that never gets consolidated.

---

## 3. The Solution: ML Pipelines With CMF

CMF introduces a **structured metadata layer** over the entire pipeline. Every stage, artifact, and parameter is automatically versioned and tracked in a queryable store.

### 3.1 CMF-Managed ML Pipeline

```
Raw Data → Preprocessing → Feature Engineering → Model Training → Evaluation → Deployment
↓ ↓ ↓ ↓ ↓ ↓
cmf.log_ cmf.log_ cmf.log_ cmf.log_ cmf.log_ cmf.log_
dataset() dataset() dataset() model() metrics() execution()
+ DVC hash + DVC hash + Git SHA + artifact
linkage
↓_______________↓_________________↓__________________↓___________↓___________↓
CMF Metadata Store (SQLite / PostgreSQL)
cmf metadata push/pull
CMF Server + Web UI
```

**What changes:**

- Artifacts are identified by **content hash** (via DVC), not file names.
- Code is tied to a **Git commit SHA** automatically.
- Execution parameters, environment info, and custom properties are logged per-run.
- All metadata is **queryable** and **syncable** across teams and environments.
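The content-hash idea can be sketched with the standard library: an artifact's identity is derived from its bytes, not its name (DVC uses MD5 content hashes; the sample bytes below are illustrative):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content-addressable identifier: hash the bytes, not the filename."""
    return hashlib.md5(data).hexdigest()

# Two ad-hoc copies of the same dataset resolve to one identity...
v1 = b"id,label\n1,0\n2,1\n"
v1_final_real = b"id,label\n1,0\n2,1\n"
assert content_id(v1) == content_id(v1_final_real)

# ...while a silent one-row edit yields a different identity.
edited = b"id,label\n1,1\n2,1\n"
assert content_id(v1) != content_id(edited)
```

Because the identifier depends only on content, renaming, copying, or moving a file never changes what CMF records about it.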

---

## 4. Comparison Chart: ML Pipeline With vs. Without CMF

| Aspect | Without CMF | With CMF |
|--------|-------------|----------|
| **Artifact versioning** | Manual copies, ad-hoc naming | Content-addressable hashing via DVC |
| **Code versioning** | Developer must remember to note Git SHA | Automatically captured per execution |
| **Experiment tracking** | Spreadsheets, notebooks, comments | Structured metadata store (SQLite/PostgreSQL) |
| **Reproducibility** | Often broken; dependencies implicit | Full lineage: data + code + params per run |
| **Data lineage** | Not tracked | Input/output artifact graph per stage |
| **Collaboration** | Metadata siloed on individual machines | Push/pull metadata like Git branches |
| **Distributed execution** | Incompatible log formats per site; custom ETL required to consolidate metadata centrally | Metadata synced across CMF servers via the **Metahub** feature's push/pull functionality |
| **Model traceability** | Hard to link model back to training data | Every model artifact links to exact dataset hash + execution |
| **Querying metadata** | Manual search through logs/files | `CmfQuery` API or Web UI for structured queries |
| **Visualization** | Custom scripts or none | Built-in lineage graphs, Web UI dashboards |
| **Storage efficiency** | Duplicate dataset files across runs | Deduplicated artifacts by content hash |
| **Audit & compliance** | Difficult to reconstruct history | Full immutable execution history |
| **API access** | None (or custom solutions) | REST API + MCP Server for AI assistant access |
| **Multi-stage pipelines** | Manual orchestration notes | Contexts and executions map each stage automatically |
| **Onboarding new members** | Share notebooks / explain verbally | Pull metadata + artifacts; instantly see full history |
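The storage-efficiency row can be illustrated with a toy content-addressable store, a simplification of what a DVC-style cache does: two logical names for identical bytes share one physical copy. All names below are illustrative:

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: one physical copy per unique content."""

    def __init__(self):
        self._blobs = {}   # content hash -> bytes (the single physical copy)
        self._names = {}   # logical name  -> content hash

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.md5(data).hexdigest()
        self._blobs[digest] = data   # identical bytes land on the same key
        self._names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self._blobs[self._names[name]]

store = ContentStore()
h1 = store.put("data_v1_final_REAL.csv", b"id,label\n1,0\n")
h2 = store.put("run7/train.csv", b"id,label\n1,0\n")
assert h1 == h2                # same content, same identity
assert len(store._blobs) == 1  # two names, one stored copy
```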

---

## 5. Use Cases

1. [Reproducible Experiment Management](./usecase1.md)
2. [Multi-Stage Pipeline Tracking](./usecase2.md)
3. [Distributed / Federated ML](./usecase3.md)
4. [Dataset Deduplication and Storage Optimization](./usecase4.md)

**File:** `docs/usecases/usecase1.md`

**Scenario**: A data scientist runs multiple experiments with different hyperparameters (learning rate, tree depth, regularization) and needs to reproduce the best result six weeks later.

**Without CMF**: The best model file exists but the exact data split, preprocessing parameters, and random seed used are lost. Reproducing it requires guesswork.

```python
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load data — but which version? No record exists of how this file was generated.
df = pd.read_parquet("data/train_split.parquet")
X, y = df.drop("label", axis=1), df["label"]

# Parameters hardcoded in the script — never persisted alongside the model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# Save model — filename carries no provenance information
with open("models/rf_v3.pkl", "wb") as f:
    pickle.dump(model, f)

# Accuracy manually noted in a spreadsheet (or simply forgotten)
print("Accuracy:", model.score(X, y)) # 0.974 — but linked to nothing
```

Six weeks later, the problems become clear:

- The spreadsheet records `accuracy=0.974` but not which version of `train_split.parquet` was used.

- `train_split.parquet` may have been silently overwritten by a newer preprocessing run.

- The random seed (`42`) and `n_estimators` (`200`) were never saved next to `rf_v3.pkl`.

- There is no link from the model file back to the Git commit of the training script.

- Re-running the script may produce a slightly different model if the data changed.
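The missing piece in the script above is a single record that binds data identity, code identity, parameters, and metrics together. A minimal stdlib sketch of the bookkeeping CMF automates (the field names, payload, and commit SHA are illustrative):

```python
import datetime
import hashlib

def run_record(data_bytes: bytes, git_sha: str, params: dict, metrics: dict) -> dict:
    """Bind data identity, code identity, parameters, and metrics in one record."""
    return {
        "logged_at": datetime.datetime.now().isoformat(),
        "data_hash": hashlib.md5(data_bytes).hexdigest(),  # pins the exact split
        "git_sha": git_sha,                                # pins the training code
        "params": params,
        "metrics": metrics,
    }

record = run_record(
    b"...bytes of train_split.parquet...",        # illustrative payload
    git_sha="3f2a9c1",                            # illustrative commit SHA
    params={"n_estimators": 200, "random_state": 42},
    metrics={"accuracy": 0.974},
)
```

With such a record, "which data produced `accuracy=0.974`?" has a definite answer; without it, the question is unanswerable.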

**With CMF**:
```python
from cmflib.cmf import Cmf

metawriter = Cmf(filepath="mlmd", pipeline_name="fraud_detection")

context = metawriter.create_context("train")
execution = metawriter.create_execution(
"RandomForest",
    custom_properties={"n_estimators": 200, "seed": 42}
)
metawriter.log_dataset("data/train_split.parquet", "input")
metawriter.log_model("models/rf_v3.pkl", "output")
metawriter.log_execution_metrics("metrics", {"accuracy": 0.974})
```

Every run records dataset hash, parameters, and metrics. Reproduction is a single `cmf metadata pull` + re-run with logged parameters.

---

**File:** `docs/usecases/usecase2.md`

**Scenario**: A four-stage NLP pipeline where each stage consumes outputs from the previous one.

**Pipeline stages:**

1. `parse` — splits raw corpus into train/test sets

2. `featurize` — generates TF-IDF or embedding features

3. `train` — trains a machine learning model

4. `evaluate` — computes precision, recall, F1

**Without CMF**: There is no shared view of how stages connect. Stage outputs are saved as files with no record of which input version they came from. If featurization logic changes, there is no way to tell which downstream models were affected. Re-running a single stage risks using stale inputs from a different run. Cross-stage debugging requires manually correlating file timestamps and notebook outputs.

```python
import json, pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# ── Stage 1: parse ────────────────────────────────────────────────────────────
# Output written to disk — no record of which raw_corpus.jsonl version was used
with open("raw_corpus.jsonl") as f:
records = [json.loads(l) for l in f]
train_records = records[:8000]
test_records = records[8000:]
with open("train_raw.jsonl", "w") as f:
f.writelines(json.dumps(r) + "\n" for r in train_records)
with open("test_raw.jsonl", "w") as f:
f.writelines(json.dumps(r) + "\n" for r in test_records)
# Who knows if train_raw.jsonl on disk is from today's run or last week's?

# ── Stage 2: featurize ────────────────────────────────────────────────────────
# max_features=50000 is hardcoded — never recorded anywhere alongside the output
with open("train_raw.jsonl") as f:
train_texts = [json.loads(l)["text"] for l in f]
vec = TfidfVectorizer(max_features=50000)
X_train = vec.fit_transform(train_texts)
pickle.dump(X_train, open("train_feats.pkl", "wb"))
pickle.dump(vec, open("vectorizer.pkl", "wb"))
# If this logic changes, all downstream models are silently stale — no alert

# ── Stage 3: train ────────────────────────────────────────────────────────────
# No record of which train_feats.pkl this model was trained on
X = pickle.load(open("train_feats.pkl", "rb"))
y = [r["label"] for r in train_records] # train_records may be from a different run!
clf = LogisticRegression(max_iter=500)
clf.fit(X, y)
pickle.dump(clf, open("model.pkl", "wb"))

# ── Stage 4: evaluate ─────────────────────────────────────────────────────────
with open("test_raw.jsonl") as f:
test_texts = [json.loads(l)["text"] for l in f]
vec_loaded = pickle.load(open("vectorizer.pkl", "rb")) # may be from a different featurize run!
X_test = vec_loaded.transform(test_texts)
y_test = [r["label"] for r in test_records]
preds = clf.predict(X_test)
# Metrics printed to stdout — never stored, never linked to model.pkl
print(f"F1: {f1_score(y_test, preds, average='macro'):.3f}")
```

Problems visible above:

- `train_records` in Stage 3 may differ from what was written to `train_raw.jsonl` if any stage is re-run in isolation.

- `vectorizer.pkl` loaded in Stage 4 could be from an earlier featurize run with different `max_features`.

- F1 is printed but never persisted alongside `model.pkl`.

- There is no way to trace which `raw_corpus.jsonl` produced the model currently on disk.
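The staleness problems above have a mechanical fix, and it is what CMF's hash-based lineage provides: snapshot the content hash of every input a stage consumes, then compare before reuse. A stdlib sketch (the function names and artifact bytes are illustrative):

```python
import hashlib

def snapshot(artifacts: dict) -> dict:
    """Record the content hash of every input a stage ran against."""
    return {name: hashlib.md5(data).hexdigest() for name, data in artifacts.items()}

def stale(manifest: dict, artifacts: dict) -> list:
    """Names whose bytes changed since the manifest was taken."""
    return [name for name, digest in manifest.items()
            if hashlib.md5(artifacts[name]).hexdigest() != digest]

# Stage 3 snapshots its inputs at train time...
inputs = {"train_feats.pkl": b"feats-v1", "train_raw.jsonl": b"rows-v1"}
manifest = snapshot(inputs)

# ...then featurize is silently re-run, changing train_feats.pkl.
inputs["train_feats.pkl"] = b"feats-v2"
assert stale(manifest, inputs) == ["train_feats.pkl"]
```

A downstream stage that checks `stale()` before running can refuse to train on inputs that no longer match what was recorded.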

**With CMF**, each stage is a `Context`, and each run is an `Execution`. Artifacts flow automatically:

```python
from cmflib.cmf import Cmf

metawriter = Cmf(filepath="mlmd", pipeline_name="nlp_pipeline")

# Stage 1 - Parse
ctx_parse = metawriter.create_context("parse")
exec_parse = metawriter.create_execution("SplitData")
metawriter.log_dataset("raw_corpus.jsonl", "input")
metawriter.log_dataset("train_raw.jsonl", "output")
metawriter.log_dataset("test_raw.jsonl", "output")

# Stage 2 - Featurize
ctx_feat = metawriter.create_context("featurize")
exec_feat = metawriter.create_execution("TFIDFVectorizer", custom_properties={"max_features": 50000})
metawriter.log_dataset("train_raw.jsonl", "input")
metawriter.log_dataset("test_raw.jsonl", "input")
metawriter.log_dataset("train_feats.npz", "output")
metawriter.log_dataset("test_feats.npz", "output")

# Stage 3 - Train
ctx_train = metawriter.create_context("train")
exec_train = metawriter.create_execution("BERTFinetune", custom_properties={"epochs": 3, "lr": 2e-5})
metawriter.log_dataset("train_feats.npz", "input")
metawriter.log_model("bert_finetuned.pt", "output")

# Stage 4 - Evaluate
ctx_eval = metawriter.create_context("evaluate")
exec_eval = metawriter.create_execution("Evaluate")
metawriter.log_model("bert_finetuned.pt", "input")
metawriter.log_dataset("test_feats.npz", "input")
metawriter.log_execution_metrics("metrics", {"f1": 0.89, "precision": 0.91, "recall": 0.87})
```

The full lineage graph—from raw corpus to final F1 score—is automatically queryable through the Web UI or `CmfQuery`.

---

**File:** `docs/usecases/usecase3.md`

**Scenario**: A company trains models at multiple edge locations (hospitals, factories) and needs to aggregate metadata centrally without sharing raw data.

**Without CMF**: Each site maintains local logs in incompatible formats. Merging them into a central view requires custom ETL.

```python
# ── Site A (Hospital) — logs to CSV ──────────────────────────────────────────
import csv, datetime, torch

def train_site_a(data_path, model_save_path):
# ... training code ...
auc = 0.91 # manually computed

# Logged to a local CSV — schema invented by Site A's engineer
with open("site_a_runs.csv", "a") as f:
csv.writer(f).writerow([
datetime.datetime.now().isoformat(),
model_save_path, # just a filename, no content hash
data_path, # just a path, could be overwritten tomorrow
auc
])
# Raw data never leaves Site A, but metadata is trapped here too

train_site_a("data/hospital_data.csv", "models/hospital_model.pt")


# ── Site B (Factory) — logs to JSONL with a different schema ─────────────────
import json

def train_site_b(checkpoint_path, input_file, metrics):
# ... training code ...

# Different key names, different nesting — incompatible with Site A
entry = {
"ts": str(datetime.datetime.now()), # "ts" vs "timestamp" in Site A
"checkpoint": checkpoint_path, # "checkpoint" vs "model_save_path"
"input_file": input_file, # "input_file" vs "data_path"
"results": metrics # nested dict vs flat columns
}
with open("site_b_log.jsonl", "a") as f:
f.write(json.dumps(entry) + "\n")

train_site_b("chkpt_factory.pt", "factory_data_may.csv", {"auc": 0.88, "loss": 0.34})


# ── Central team tries to merge ───────────────────────────────────────────────
import pandas as pd

site_a = pd.read_csv("site_a_runs.csv",
names=["timestamp", "model_save_path", "data_path", "auc"])

site_b_records = []
with open("site_b_log.jsonl") as f:
for line in f:
r = json.loads(line)
site_b_records.append({
"timestamp": r["ts"], # rename to match Site A
"model_save_path": r["checkpoint"], # rename to match Site A
"data_path": r["input_file"], # rename to match Site A
"auc": r["results"]["auc"] # unnest to match Site A
})
site_b = pd.DataFrame(site_b_records)

combined = pd.concat([site_a, site_b], ignore_index=True)
# ❌ No content hashes — cannot tell if hospital_data.csv and factory_data_may.csv
# are related, or whether sites accidentally trained on the same underlying data.
# ❌ When Site C joins, this ETL script must be updated again.
# ❌ No way to verify data_path still points to the same file used during training.
print(combined)
```

Problems visible above:

- Site A uses flat CSV columns; Site B uses nested JSONL — the central team writes a manual ETL every time a new site is added.

- Paths like `hospital_data.csv` are just strings; if the file is overwritten, provenance is permanently lost.

- No content hashes mean it is impossible to tell whether two sites trained on identical data.

- Metrics are stored separately from model artifacts with no automated link between them.
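Content hashing also answers the question the ETL above cannot: whether two sites trained on the same underlying data. Each site hashes its bytes locally and shares only the digest, so comparison requires no data transfer. A stdlib sketch with illustrative sample rows:

```python
import hashlib

def digest(data: bytes) -> str:
    """Local content digest; only this 32-char string leaves the site."""
    return hashlib.md5(data).hexdigest()

site_a_data = b"patient_id,outcome\n17,1\n42,0\n"
site_b_data = b"patient_id,outcome\n17,1\n42,0\n"
site_c_data = b"machine_id,fault\n3,0\n"

assert digest(site_a_data) == digest(site_b_data)  # same underlying data
assert digest(site_a_data) != digest(site_c_data)  # genuinely different data
```

This is why the CMF approach below works for privacy-sensitive sites: digests, metrics, and parameters are sufficient for central lineage, and the raw bytes never move.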

**With CMF**:

- Each site runs CMF locally and tracks metadata to a local SQLite database.

- Periodically, each site runs `cmf metadata push` to the central CMF Server.

- The server merges metadata from all sites into a unified PostgreSQL database.

- No raw data ever leaves the site—only metadata (hashes, metrics, parameters) is transferred.

```
Site A (Hospital) ─── cmf metadata push ──▶ ┐
Site B (Factory) ─── cmf metadata push ──▶ CMF Server ──▶ Unified Metadata DB
Site C (Research) ─── cmf metadata push ──▶ ┘ ↕
CMF Web UI / Query API
```