Description
# Task 8: Documentation
## Overview
Create comprehensive documentation for PVC storage support, including user guides, API documentation, troubleshooting guides, and architecture diagrams.
## Scope
- User documentation for PVC storage usage
- Developer documentation for implementation details
- Troubleshooting guide for common issues
- Architecture diagrams
- Migration guide from other storage types
## Files to Create/Modify
- `docs/user-guide/storage/pvc-storage.md` - User guide
- `docs/api/storage-types.md` - API documentation update
- `docs/troubleshooting/pvc-storage.md` - Troubleshooting guide
- `docs/architecture/pvc-storage-flow.md` - Architecture documentation
- `README.md` - Update main README with PVC storage mention
## Documentation Content
### User Guide: `docs/user-guide/storage/pvc-storage.md`
# PVC Storage for Models
OME supports using Kubernetes Persistent Volume Claims (PVCs) as storage for machine learning models. This allows you to leverage existing models stored in PVCs without copying them to object storage.
## Overview
PVC storage is ideal when you:
- Have models already stored in PVCs backed by NFS, block storage, or other CSI drivers
- Want to avoid duplicating large models across multiple storage systems
- Need to use enterprise storage systems exposed through Kubernetes PVCs
- Require specific storage performance characteristics
## Prerequisites
1. A Kubernetes cluster with PVC support
2. Pre-existing PVCs containing model files
3. Models must include a `config.json` file for automatic metadata extraction
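For readers who do not yet have a claim, the sketch below shows what a PVC meeting these prerequisites might look like. The claim name reuses `llama2-model-pvc` from the BaseModel examples in this guide; the storage class `nfs-client` and the `100Gi` size are placeholders, not OME requirements.

```yaml
# Hypothetical PVC for holding model files; adjust names and sizes to your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama2-model-pvc        # referenced as pvc://llama2-model-pvc/...
  namespace: default            # same namespace as the BaseModel that uses it
spec:
  accessModes:
  - ReadWriteMany               # assumes the storage class supports RWX
  storageClassName: nfs-client  # placeholder; use a class available in your cluster
  resources:
    requests:
      storage: 100Gi            # placeholder; size it for your model files
```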
## PVC URI Format
pvc://{pvc-name}/{sub-path}
Examples:
- `pvc://model-storage/llama2-7b`
- `pvc://shared-models/foundation/nlp/bert-large`
## Creating a BaseModel with PVC Storage
### Basic Example
```yaml
apiVersion: ome.io/v1beta1
kind: BaseModel
metadata:
  name: llama2-7b-pvc
  namespace: default
spec:
  storage:
    storageUri: "pvc://llama2-model-pvc/models/llama2-7b"
  modelFormat:
    name: "safetensors"
```
The model metadata (type, architecture, parameters) is extracted automatically from the model's `config.json` file.
### Manual Metadata Specification
If automatic extraction is not possible or desired, you can specify metadata manually:
```yaml
apiVersion: ome.io/v1beta1
kind: BaseModel
metadata:
  name: llama2-7b-pvc
  namespace: default
spec:
  storage:
    storageUri: "pvc://llama2-model-pvc/models/llama2-7b"
  modelFormat:
    name: "safetensors"
  modelType: "llama"
  modelArchitecture: "LlamaForCausalLM"
  modelParameterSize: "7B"
  modelCapabilities:
  - "text-generation"
  - "chat"
  maxTokens: 4096
```
## PVC Access Modes
The access mode of a PVC determines how it can be mounted and where InferenceService pods can run:
### ReadWriteMany (RWX)
- Multiple pods can mount the PVC simultaneously
- Ideal for shared model repositories
- InferenceService pods can run on any node
### ReadWriteOnce (RWO)
- The PVC can be mounted read-write by a single node at a time, so pods that use it must run on that node
- Suitable for high-performance block storage
- InferenceService scheduling is handled by Kubernetes
### ReadOnlyMany (ROX)
- Multiple pods can mount the PVC read-only
- Good for immutable model storage
- Similar behavior to RWX for model serving
## Using PVC Models in InferenceService
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-serving
spec:
  model:
    name: llama2-7b-pvc
    runtime: vllm-runtime
  predictor:
    containers:
    - name: predictor
      args:
      - "--model"
      - "$(MODEL_PATH)"
      resources:
        limits:
          nvidia.com/gpu: "1"
```
The InferenceService will automatically mount the PVC and set the `MODEL_PATH` environment variable.
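To make the automatic mount more concrete, the fragment below sketches how `pvc://llama2-model-pvc/models/llama2-7b` could translate into a plain pod spec. It is an illustration under assumptions, not the spec OME actually generates: the mount path `/mnt/models`, the volume name, the pod name, and the image are all hypothetical.

```yaml
# Illustrative only: a hand-written pod equivalent of the pvc:// URI.
apiVersion: v1
kind: Pod
metadata:
  name: pvc-mount-sketch                      # hypothetical name
spec:
  containers:
  - name: predictor
    image: example.com/vllm-runtime:latest    # hypothetical image
    args:
    - "--model"
    - "$(MODEL_PATH)"                         # expanded from the env var below
    env:
    - name: MODEL_PATH
      value: /mnt/models                      # hypothetical mount path
    volumeMounts:
    - name: model-volume
      mountPath: /mnt/models
      subPath: models/llama2-7b               # the {sub-path} part of the URI
      readOnly: true
  volumes:
  - name: model-volume
    persistentVolumeClaim:
      claimName: llama2-model-pvc             # the {pvc-name} part of the URI
      readOnly: true
```

Mounting with `subPath` exposes only the model directory to the container, which is one way several models can share a single PVC.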
## ClusterBaseModel with PVC
For cluster-wide models, the PVC must exist in the configured cluster model namespace (typically `ome-system`):
```yaml
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: shared-llama2-7b
spec:
  storage:
    storageUri: "pvc://cluster-models-pvc/foundation/llama2-7b"
  modelFormat:
    name: "safetensors"
```
## Preparing PVCs for Model Storage
### Model Directory Structure
Your PVC should contain models with this structure:
```
/models/
├── llama2-7b/
│   ├── config.json          # Required for metadata extraction
│   ├── model.safetensors    # Model weights
│   ├── tokenizer.json       # Tokenizer configuration
│   └── tokenizer_config.json
└── bert-large/
    ├── config.json
    ├── pytorch_model.bin
    └── vocab.txt
```
### Example: Populating a PVC
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model-to-pvc
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: huggingface/transformers-cli:latest
        command:
        - sh
        - -c
        - |
          # Download model from HuggingFace to PVC
          cd /models
          git lfs install
          git clone https://huggingface.co/meta-llama/Llama-2-7b-hf llama2-7b
        # volumeMounts belongs to the container, not to the pod spec
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage-pvc
      restartPolicy: OnFailure
```
## Monitoring PVC Model Status
Check BaseModel status:
```bash
kubectl get basemodel llama2-7b-pvc -o yaml
```
Look for:
- `status.state`: Should be "Ready" when the model is verified
- `status.nodesReady`: Nodes where the model is available (for RWX PVCs)
- `spec.modelType`, `spec.modelArchitecture`: Auto-populated metadata
## Best Practices
- Use RWX PVCs for shared models that multiple services will use
- Pre-validate model files: ensure `config.json` exists before creating the BaseModel
- Set resource limits on metadata extraction jobs if using large PVCs
- Use subpaths to organize multiple models in a single PVC
- Monitor PVC usage to ensure sufficient storage capacity
## Limitations
- PVCs must be in the same namespace as the BaseModel (or the cluster model namespace for a ClusterBaseModel)
- Cross-namespace PVC access is not supported
- Model files must be readable by the OME service accounts
- Metadata extraction requires `config.json` in standard format
## Migration from Other Storage Types
To migrate existing models to PVC storage:
- Copy model files to the PVC, maintaining the directory structure
- Ensure `config.json` is present
- Update the BaseModel `storageUri` to use the `pvc://` format
- Delete old model files from nodes if using host storage
### Troubleshooting Guide: `docs/troubleshooting/pvc-storage.md`
The troubleshooting guide should include:
**Common Issues**:
1. **Model Status Stuck in "MetadataPending"**
   - Symptoms and possible causes
   - Diagnostic commands
   - Solutions for each cause
2. **InferenceService Pod Fails to Start**
   - PVC mounting issues
   - Access mode conflicts
   - Node affinity problems
3. **"PVC Not Found" Errors**
   - Namespace mismatches
   - RBAC permission issues
   - URI typos
4. **Metadata Extraction Job Failures**
   - Missing config.json
   - JSON parsing errors
   - File permission issues
5. **Performance Issues**
   - Storage latency
   - I/O bottlenecks
   - Optimization strategies
**Debugging Tools**:
- kubectl commands for diagnostics
- Log analysis techniques
- Test pod creation for verification
- Job monitoring commands
**Error Reference Table**:
- Common error messages
- Root causes
- Quick solutions
**Advanced Debugging**:
- Creating debug pods
- Accessing PVC contents
- Verifying file permissions
- RBAC troubleshooting
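As a possible starting point for the debug-pod material, the sketch below mounts a PVC read-only and prints the user ID, the file listing, and the start of `config.json`. The pod name, the `busybox` image tag, and the `/pvc` mount path are illustrative choices; the claim name and model sub-path reuse the values from the user guide examples.

```yaml
# Hypothetical debug pod for inspecting PVC contents and file permissions.
apiVersion: v1
kind: Pod
metadata:
  name: pvc-debug               # illustrative name
  namespace: default            # must be the namespace that holds the PVC
spec:
  restartPolicy: Never
  containers:
  - name: inspect
    image: busybox:1.36
    command:
    - sh
    - -c
    - |
      # Show who we run as, then the model files and their permissions
      id
      ls -l /pvc/models/llama2-7b
      head -c 500 /pvc/models/llama2-7b/config.json
    volumeMounts:
    - name: model-storage
      mountPath: /pvc
      readOnly: true
  volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: llama2-model-pvc
```

The pod runs once and exits, so its output stays available in the pod logs until the pod is deleted.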
### Architecture Documentation: `docs/architecture/pvc-storage-flow.md`
The architecture documentation should include:
**Overview**:
- Purpose of PVC storage support
- High-level architecture description
- Benefits over existing storage methods
**Architecture Diagram**:
- Sequence diagram showing the complete flow
- Component interactions
- Data flow between components
**Component Details**:
1. **Model Agent**:
   - Role in PVC verification
   - Why it doesn't mount PVCs
   - Status updates it performs
2. **BaseModel Controller**:
   - PVC-specific reconciliation logic
   - Job creation and monitoring
   - Status management
3. **Metadata Extraction Job**:
   - Purpose and lifecycle
   - Security context
   - Resource limits and timeouts
4. **InferenceService Controller**:
   - PVC volume handling
   - Scheduling considerations
   - Subpath support
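To anchor the discussion of the metadata extraction Job's lifecycle, security context, and limits, here is a hedged sketch of what such a Job could look like. It is not the Job the BaseModel controller creates: the image, the arguments, the names, and every numeric value are hypothetical, and only standard Kubernetes `batch/v1` fields are used.

```yaml
# Illustrative metadata extraction Job; all names and values are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: metadata-extract-sketch
  namespace: default
spec:
  activeDeadlineSeconds: 300        # hard timeout for the whole Job
  backoffLimit: 2                   # retry a failed pod at most twice
  ttlSecondsAfterFinished: 600      # clean up the finished Job automatically
  template:
    spec:
      restartPolicy: Never
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000             # files on the PVC must be readable by this UID
      containers:
      - name: extractor
        image: example.com/metadata-extractor:latest   # hypothetical image
        args: ["--config", "/model/config.json"]       # hypothetical flags
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
        resources:
          limits:
            cpu: 500m
            memory: 256Mi
        volumeMounts:
        - name: model
          mountPath: /model
          subPath: models/llama2-7b
          readOnly: true
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llama2-model-pvc
          readOnly: true
```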
**Design Decisions**:
- Why jobs for metadata extraction
- Why no PVC mounting in model agent
- Node selector implications
- Security boundaries
**Storage Type Comparison**:
- Table comparing all storage types
- Trade-offs and use cases
- Performance characteristics
**Security Model**:
- RBAC permissions required
- PVC access patterns
- Isolation boundaries
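The RBAC portion of the security model could be illustrated with a namespaced Role and RoleBinding like the sketch below. The exact resources and verbs OME needs are not stated in this task, so the rules here are assumptions, as are the Role name and the `ome-controller-manager` service account.

```yaml
# Illustrative RBAC for reading PVCs and managing extraction Jobs in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pvc-storage-sketch
  namespace: default
rules:
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]          # assumed: read-only access to PVCs
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "watch", "delete"]   # assumed: Job lifecycle
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pvc-storage-sketch
  namespace: default
subjects:
- kind: ServiceAccount
  name: ome-controller-manager             # hypothetical service account
  namespace: ome-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pvc-storage-sketch
```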
**Performance Analysis**:
- One-time vs recurring costs
- Caching behavior
- Scaling considerations
**Future Enhancements**:
- Cross-namespace support
- Metadata caching
- Dynamic provisioning
- Advanced scheduling
### README Update
Update the main README.md to include:
**In Supported Storage Types section**:
- Add PVC to the list of supported storage backends
- Brief description of PVC storage benefits
- Link to detailed documentation
**PVC Storage Summary**:
- Key features and benefits
- Simple example showing BaseModel with PVC URI
- Reference to the user guide
- Version information (when feature was added)
## Documentation Standards
1. **Clear Examples**: Every concept should have a working example
2. **Troubleshooting**: Common errors and solutions
3. **Diagrams**: Visual representation of complex flows
4. **Cross-references**: Link between related topics
5. **Version Notes**: Indicate when features were added
## Acceptance Criteria
- [ ] User guide covers all PVC storage scenarios
- [ ] Troubleshooting guide addresses common issues
- [ ] Architecture documentation explains design decisions
- [ ] API documentation is updated
- [ ] Examples are tested and working
- [ ] Documentation is reviewed by team
## Dependencies
- Implementation tasks completed
- E2E tests for validating examples
## Estimated Effort
4-5 hours