Prompt abstraction #3862

@@ -0,0 +1,290 @@
# ZenML Prompt Abstraction: Learnings and Vision

Based on analysis of the prompt abstraction feature and working with ZenML's architecture, here are key insights and learnings for ZenML's evolution into a leading LLMOps platform.

## **Key Learnings from Prompt Implementation**

### **1. The Complexity of Prompt Management**

**What I Learned:**
- Prompts are deceptively simple but operationally complex
- They exist at multiple abstraction levels: templates, instances, variants, evaluations
- The lifecycle is non-linear: create → test → iterate → compare → deploy → monitor → drift

**Implementation Challenge:**
```python
# This looks simple...
prompt = Prompt(template="Answer: {question}")

# But operationally requires:
# - Version management
# - A/B testing infrastructure
# - Performance tracking
# - Lineage tracking
# - Rollback capabilities
# - Multi-model evaluation
```

### **2. ZenML Philosophy vs. LLMOps Reality**

**ZenML's Core Strength (MLOps):**
- "Everything is a pipeline step"
- Artifacts are immutable and versioned
- Reproducible, traceable workflows

**LLMOps Challenge:**
- Prompts are **both** code and data
- Need real-time iteration (not just batch processing)
- Require human-in-the-loop validation
- Performance is subjective and context-dependent

**The Tension:**
```python
# ZenML way (good for MLOps):
@step
def evaluate_prompt(prompt: Prompt, dataset: Dataset) -> Metrics:
    # Batch evaluation, reproducible
    ...

# LLMOps reality (also needed):
def interactive_prompt_playground(prompt: Prompt):
    # Real-time testing, human feedback
    # Doesn't fit the pipeline paradigm well
    ...
```

### **3. Artifacts vs. Entities Dilemma**

**What We Discovered:**
The current implementation suffers from an **identity crisis**:

- **As Artifacts**: Immutable, versioned, pipeline-native ✅
- **As Entities**: Need CRUD operations, real-time updates ❌

**Better Approach:**
```python
# Prompt Templates = Entities (mutable, managed)
class PromptTemplate(BaseEntity):
    template: str
    metadata: Dict[str, Any]

# Prompt Instances = Artifacts (immutable, versioned)
class PromptInstance(BaseArtifact):
    template_id: UUID
    variables: Dict[str, Any]
    formatted_text: str
```

## **What Would Be Done Differently**

### **1. Embrace the Dual Nature**

**Current Problem:** Trying to force prompts into a pure artifact model
**Better Solution:**
```python
# Management Layer (Entity-like)
@step
def create_prompt_template(template: str) -> PromptTemplate:
    # Lives in the ZenML server, has CRUD operations
    ...

# Execution Layer (Artifact-like)
@step
def instantiate_prompt(template: PromptTemplate, **vars) -> PromptInstance:
    # Immutable, versioned, pipeline-native
    ...
```

### **2. Built-in Evaluation Framework**

**Current:** Examples show manual evaluation steps
**Better:** Native evaluation infrastructure:

```python
@prompt_evaluator(metrics=["accuracy", "relevance", "safety"])
def evaluate_qa_prompt(prompt: PromptInstance, ground_truth: Dataset):
    # Auto-tracked, comparable across experiments
    ...

@pipeline
def prompt_optimization_pipeline():
    variants = generate_prompt_variants(base_template)
    results = evaluate_variants_parallel(variants)  # Built-in parallelization
    best_prompt = select_optimal_variant(results)
    deploy_prompt(best_prompt)  # Integrated deployment
```

### **3. Context-Aware Prompt Management**

**Current:** Static prompt templates
**Better:** Dynamic, context-aware prompts:

```python
class ContextualPrompt(BaseModel):
    base_template: str
    context_adapters: List[ContextAdapter]

    def adapt_for_context(self, context: Context) -> str:
        # Domain adaptation, user personalization, etc.
        ...
```

## **Vision for ZenML as a Leading LLMOps Platform**

### **1. Prompt-Native Architecture**

**What This Means:**
- Prompts are first-class citizens, not afterthoughts
- Native prompt versioning, not generic artifact versioning
- Built-in prompt evaluation, not custom step implementations

**Implementation:**
```python
# Native prompt pipeline decorator
@prompt_pipeline
def optimize_customer_service_prompts():
    # Auto-handles prompt-specific concerns:
    # - A/B testing
    # - Human evaluation collection
    # - Performance monitoring
    # - Automatic rollback on degradation
    ...
```

### **2. Multi-Modal Prompt Management**

**Beyond Text:**
```python
class MultiModalPrompt(BaseModel):
    text_component: str
    image_components: List[ImagePrompt]
    audio_components: List[AudioPrompt]

    # Unified evaluation across modalities
    def evaluate_multimodal_performance(self, test_cases: MultiModalDataset):
        # Cross-modal consistency checking
        ...
```

### **3. Production-Ready LLMOps Features**

**What's Missing (but needed for leadership):**

```python
# 1. Prompt Drift Detection
@step
def detect_prompt_drift(
    current_prompt: PromptInstance,
    production_logs: ConversationLogs,
) -> DriftReport:
    # Automatic detection of performance degradation
    ...

# 2. Prompt Security & Safety
@step
def validate_prompt_safety(prompt: PromptInstance) -> SafetyReport:
    # Built-in jailbreak detection, bias checking
    ...

# 3. Cost Optimization
@step
def optimize_prompt_cost(
    prompt: PromptInstance,
    performance_threshold: float,
) -> OptimizedPrompt:
    # Automatic prompt compression while maintaining quality
    ...
```

### **4. Human-in-the-Loop Integration**

**Current Gap:** No native human feedback integration
**Vision:**
```python
@human_evaluation_step
def collect_human_feedback(
    prompt_responses: List[Response],
) -> HumanFeedback:
    # Integrated UI for human evaluation
    # Automatic feedback aggregation
    # Bias detection in human evaluations
    ...
```

## **Specific Recommendations for ZenML Leadership**

### **1. Architectural Changes**

**Immediate (6 months):**
- Split prompt management into Template (entity) + Instance (artifact)
- Native prompt evaluation framework
- Built-in A/B testing infrastructure (see the sketch after this roadmap)

**Medium-term (1 year):**
- Multi-modal prompt support
- Prompt drift detection
- Cost optimization tools

**Long-term (2+ years):**
- AI-assisted prompt optimization
- Cross-model prompt portability
- Prompt marketplace/sharing

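To make the A/B-testing roadmap item concrete, here is a minimal, framework-agnostic sketch of the kind of helper such infrastructure could provide. Everything in it (`ab_test_prompts`, the `score_fn` callback, the averaging strategy) is assumed for illustration and is not part of the current implementation; inside ZenML it would presumably run as a `@step`.

```python
from typing import Callable, Dict, List, Tuple


def ab_test_prompts(
    template_a: str,
    template_b: str,
    examples: List[Dict[str, str]],
    score_fn: Callable[[str, Dict[str, str]], float],
) -> Tuple[str, Dict[str, float]]:
    """Score two prompt templates on the same examples and pick a winner."""
    totals = {"A": 0.0, "B": 0.0}
    for example in examples:
        # Render each template with the example's variables and score the output.
        totals["A"] += score_fn(template_a.format(**example), example)
        totals["B"] += score_fn(template_b.format(**example), example)
    averages = {name: total / len(examples) for name, total in totals.items()}
    winner = template_a if averages["A"] >= averages["B"] else template_b
    return winner, averages
```

A native version would presumably add what this sketch leaves out: artifact logging of both variants, dashboard comparison, and statistical significance checks.
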
### **2. Developer Experience**

**What Would Make ZenML the Go-To LLMOps Platform:**

```python
# This should be possible with 5 lines of code:
from zenml.llmops import PromptOptimizer

optimizer = PromptOptimizer(
    base_template="Summarize: {text}",
    evaluation_dataset=my_dataset,
    target_metrics=["accuracy", "conciseness"],
)

best_prompt = optimizer.optimize()  # Handles everything automatically
```

### **3. Integration Ecosystem**

**Missing Pieces:**
- Native LangChain/LlamaIndex integration
- Built-in vector database connectors
- Prompt sharing/marketplace
- Model provider abstractions (see the sketch below)

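For the last item in the list, one possible shape of a model provider abstraction is a small protocol that every backend implements. The sketch below is illustrative only: `ModelProvider`, `EchoProvider`, and `run_prompt` are invented names, not existing ZenML APIs.

```python
from typing import Protocol


class ModelProvider(Protocol):
    """Minimal interface a provider abstraction might expose."""

    def complete(self, prompt: str, **params: object) -> str:
        """Send a formatted prompt to the model and return its text output."""
        ...


class EchoProvider:
    """Toy provider used only to show that any backend can satisfy the protocol."""

    def complete(self, prompt: str, **params: object) -> str:
        return f"[echo] {prompt}"


def run_prompt(provider: ModelProvider, template: str, **variables: str) -> str:
    # The same prompt template can be routed to any provider implementation.
    return provider.complete(template.format(**variables))


print(run_prompt(EchoProvider(), "Answer: {question}", question="What is ZenML?"))
```

The point of such a design is that prompt templates stay provider-agnostic, so swapping models does not require touching prompt code.
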
## **Core Insight: The Prompt Paradox**

**The Challenge:** Prompts are simultaneously:
- **Engineering artifacts** (need versioning, testing, deployment)
- **Creative content** (need iteration, human judgment, contextual adaptation)
- **Business logic** (need governance, compliance, performance monitoring)

**ZenML's Opportunity:** Be the first platform to solve this paradox elegantly by:
1. Embracing the complexity rather than oversimplifying
2. Building prompt-native infrastructure, not generic artifact management
3. Integrating human feedback as a first-class citizen
4. Providing end-to-end prompt lifecycle management

## **Critical Review Summary**

### **Current Implementation Issues:**
- **Architectural Inconsistency**: Can't decide whether prompts are entities or artifacts
- **Overcomplicated Core Class**: 434 lines of business logic in the `Prompt` class
- **Violation of ZenML Philosophy**: Logic that should live in steps sits in the core class (illustrated below)
- **Poor Server Integration**: Generic artifact handling instead of prompt-specific logic

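As a purely schematic illustration of the second and third bullets (hypothetical names, not code quoted from the PR), the contrast the review is drawing looks roughly like this:

```python
# Schematic sketch only: behaviour baked into the core class vs. kept in steps.

# Pattern the review argues against: business logic inside the data class.
class HeavyPrompt:
    def __init__(self, template: str) -> None:
        self.template = template

    def evaluate(self, dataset: list) -> float:
        # Evaluation logic living in the core class.
        return 0.0


# Pattern that fits ZenML better: a thin value object...
class ThinPrompt:
    def __init__(self, template: str) -> None:
        self.template = template


# ...with the behaviour in a step (the @step decorator is omitted so the
# sketch stays self-contained and runnable).
def evaluate_prompt_step(prompt: ThinPrompt, dataset: list) -> float:
    # Placeholder score; a real step would call a model and compute metrics.
    return float(len(dataset))
```
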
### **Rating: 4/10** - Needs significant refactoring

**Strengths:**
- Good conceptual foundation
- Comprehensive examples
- Solid utility functions

**Critical Issues:**
- Overcomplicated core class
- Architectural inconsistency
- Security vulnerabilities
- Poor separation of concerns

## **Conclusion**

The current implementation is a good start, but to become the **leading LLMOps platform**, ZenML needs to think bigger and solve the unique challenges of prompt management rather than just applying traditional MLOps patterns to a fundamentally different problem.

The path forward requires:
1. **Architectural clarity** - Choose the entity or artifact approach and stick to it
2. **Prompt-native features** - Build for LLMOps, not generic MLOps
3. **Human-in-the-loop integration** - Essential for prompt workflows
4. **Production-ready tooling** - Drift detection, safety validation, cost optimization

ZenML has the opportunity to define the LLMOps category the same way it helped define MLOps, but only if it embraces the unique challenges of prompt management rather than trying to force them into existing MLOps patterns.

@@ -0,0 +1,59 @@
---
description: Simple prompt engineering with ZenML - version control, A/B testing, and dashboard visualization.
icon: edit
---

# Prompt Engineering

ZenML's prompt engineering focuses on the three things teams actually need: **simple versioning**, **A/B testing**, and **dashboard visualization**.

## Quick Start

1. **Run the example**:
   ```bash
   cd examples/prompt_engineering
   python run_simple_comparison.py
   ```

2. **Check your dashboard** to see prompt artifacts with rich visualizations.

## Core Features

### Git-like Versioning
```python
prompt_v1 = Prompt(template="Answer: {question}", version="1.0")
prompt_v2 = Prompt(template="Detailed answer: {question}", version="2.0")
```

### A/B Testing
```python
# Pipeline automatically compares versions and determines winner
result = simple_prompt_comparison()
print(f"Winner: {result['winner']}")
```
> **Review comment:** Not sure how this is implemented in the code, but it seems a bit magic to me :) like the two prompts aren't even passed in as arguments, so what's happening here?

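For readers wondering what a pipeline like `simple_prompt_comparison()` might be doing internally, here is a purely hypothetical sketch; the variants, the scoring, and the return shape are invented for illustration and are not taken from the example code.

```python
# Hypothetical sketch, not the actual implementation in examples/prompt_engineering.
from typing import Dict


def simple_prompt_comparison() -> Dict[str, str]:
    # The variants are defined inside the pipeline rather than passed in,
    # which is presumably why the one-liner above looks "magic".
    variants = {
        "v1": "Answer: {question}",
        "v2": "Detailed answer: {question}",
    }
    questions = ["What is ZenML?", "What is an artifact?"]

    # Stand-in scoring: longer formatted prompts win. A real pipeline would
    # call a model and score its responses instead.
    scores = {
        name: sum(len(template.format(question=q)) for q in questions)
        for name, template in variants.items()
    }
    winner = max(scores, key=scores.get)
    return {"winner": winner}


print(simple_prompt_comparison()["winner"])
```
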
### Dashboard Integration
- Syntax-highlighted templates
- Variable tables and validation
- Version tracking across runs
- Side-by-side comparisons

## Why This Approach?

User research shows that teams handling millions of daily requests use **simple Git-based versioning**, not complex management systems. This approach focuses on what actually works in production.

## Documentation

* [Quick Start](quick-start.md) - Working example walkthrough
* [Understanding Prompt Management](understanding-prompt-management.md) - Research and philosophy
* [Best Practices](best-practices.md) - Production guidance

## Example Structure

The `examples/prompt_engineering/` directory demonstrates proper organization:
- `pipelines/` - Pipeline definitions
- `steps/` - Individual step implementations
- `utils/` - Helper functions
- Clean separation of concerns

Start with the quick start example to see all features in action.