Open
Labels
enhancement: Improvement to an existing feature
Description
Describe the enhancement
This enhancement proposes a monitoring infrastructure for collecting, storing, and providing access to render farm statistics. The system would address several current limitations:
- Limited Historical Data Access: The existing pycue API provides only 3 days of job history because of database recycling policies, which limits long-term analysis and memory prediction.
- Insufficient Real-time Visibility: Production Support and Resources (PSR) Teams lack comprehensive real-time views into farm operations for resource forecasting and troubleshooting.
- Fragmented Monitoring Solutions: Multiple disparate systems create maintenance overhead and data silos.
Proposed Architecture
The solution would introduce an event-driven architecture with the following components:
- Kafka Event Publishing: Real-time capture and publishing of Job/Layer/Frame/Host/Proc lifecycle events to Kafka topics
- Elasticsearch Storage: Long-term historical data storage (1-2 year retention) for analytical queries
- Enhanced Prometheus Metrics: Extended metrics for frame completion rates, runtime histograms, and memory usage patterns
- Extended pycue API: New methods for querying historical data beyond the current 3-day limitation
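To make the event-driven part concrete, here is a minimal sketch of what a lifecycle event envelope could look like before it is handed to a Kafka producer. Everything in it is an assumption for illustration: the `FarmEvent` class, its field names, and the `cue.<entity>.events` topic naming are hypothetical and not part of Cuebot or pycue. The sketch only covers serialization; the producer itself is left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class FarmEvent:
    """Hypothetical envelope for one lifecycle event (not an existing Cuebot type)."""

    entity: str       # "job", "layer", "frame", "host" or "proc"
    action: str       # e.g. "started" or "completed" (assumed names)
    entity_id: str    # identifier of the job/layer/frame/host/proc
    timestamp: float  # seconds since the Unix epoch
    payload: Dict[str, Any] = field(default_factory=dict)

    def topic(self) -> str:
        # Assumed convention: one topic per entity type.
        return f"cue.{self.entity}.events"

    def to_bytes(self) -> bytes:
        # JSON with sorted keys so identical events serialize identically.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


event = FarmEvent(
    entity="frame",
    action="completed",
    entity_id="frame-0001",
    timestamp=1700000000.0,
    payload={"max_rss_kb": 2048},
)
print(event.topic(), event.to_bytes())
```

A consumer that writes to Elasticsearch could decode the same bytes with `json.loads` and index the resulting dictionary directly.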
Proposed Features
| Feature | Description |
|---|---|
| Event Types | Job, Layer, Frame, Host, and Proc lifecycle events |
| Historical Queries | getJobHistory(), getFrameHistory(), getLayerMemoryHistory() |
| Prometheus Metrics | Frame/job completion counters, runtime/memory histograms |
| Configuration | Fully opt-in via properties (disabled by default) |
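As a rough illustration of how a historical query such as `getJobHistory()` might translate into an Elasticsearch request, the sketch below builds a search body with a `bool` filter over a show name and a time range. The helper name `build_job_history_query` and the document fields `show` and `stop_time` are assumptions; the actual index schema is not defined in this proposal.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def build_job_history_query(
    show: str,
    days: int,
    now: Optional[datetime] = None,
    size: int = 100,
) -> Dict[str, Any]:
    """Build a search body for jobs of `show` finished in the last `days` days.

    The field names "show" and "stop_time" are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    return {
        "size": size,
        # Most recently finished jobs first.
        "sort": [{"stop_time": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"show": show}},
                    {
                        "range": {
                            "stop_time": {
                                "gte": start.isoformat(),
                                "lte": now.isoformat(),
                            }
                        }
                    },
                ]
            }
        },
    }


fixed_now = datetime(2025, 1, 31, tzinfo=timezone.utc)
query = build_job_history_query("testing", days=30, now=fixed_now)
print(query)
```

Using `filter` rather than `must` keeps the clauses out of relevance scoring, which is usually what an analytical query wants.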
Use Cases
- Enhanced Memory Prediction: Access up to 1 year of historical job data to improve memory prediction accuracy for DCC applications (e.g. Nuke)
- Operational Dashboard for Production Support and Resources (PSR) Teams: Real-time farm status, resource forecasting, and capacity planning
- Analytics: Long-term trend analysis for render farm optimization
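One simple way the memory prediction use case could consume historical data is a percentile estimate over past peak memory values for a layer, plus some headroom. This is only a sketch under stated assumptions: the function `predict_memory_kb`, the nearest-rank percentile method, and the default values are illustrative choices, not the prediction method used in production.

```python
import math
from typing import Sequence


def predict_memory_kb(
    peaks_kb: Sequence[int],
    percentile: float = 95.0,
    headroom: float = 1.25,
) -> int:
    """Estimate a memory reservation from historical per-frame peak usage.

    Uses the nearest-rank percentile of the samples, scaled by `headroom`.
    """
    if not peaks_kb:
        raise ValueError("no historical samples")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(peaks_kb)
    # Nearest-rank method: 1-based rank of the requested percentile.
    rank = math.ceil(percentile * len(ordered) / 100)
    return math.ceil(ordered[rank - 1] * headroom)


history = [1000, 1200, 1100, 5000, 1300, 1250, 1150, 1050, 1400, 1350]
print(predict_memory_kb(history, percentile=90, headroom=1.5))
```

A percentile below 100 keeps a single outlier frame (the 5000 KB sample above) from inflating every future reservation, at the cost of occasionally under-reserving.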
Proposed Configuration Properties
# Kafka Event Publishing
monitoring.kafka.enabled=false
monitoring.kafka.bootstrap.servers=localhost:9092
# Elasticsearch Historical Storage
monitoring.elasticsearch.enabled=false
monitoring.elasticsearch.host=localhost
monitoring.elasticsearch.port=9200
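The opt-in behaviour can be sketched as follows: every component stays disabled unless its `enabled` property is explicitly set to `true`. The parser below is a deliberately simplified stand-in for Java properties handling (it supports only `key=value` lines and `#` or `!` comments, with no line continuations), and the helper names `parse_properties` and `is_enabled` are hypothetical.

```python
from typing import Dict

# Defaults mirror the proposed properties: everything is off unless enabled.
DEFAULTS: Dict[str, str] = {
    "monitoring.kafka.enabled": "false",
    "monitoring.kafka.bootstrap.servers": "localhost:9092",
    "monitoring.elasticsearch.enabled": "false",
    "monitoring.elasticsearch.host": "localhost",
    "monitoring.elasticsearch.port": "9200",
}


def parse_properties(text: str) -> Dict[str, str]:
    """Overlay key=value lines from `text` on top of DEFAULTS."""
    props = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def is_enabled(props: Dict[str, str], component: str) -> bool:
    """Return True only if monitoring.<component>.enabled is 'true'."""
    return props.get(f"monitoring.{component}.enabled", "false").lower() == "true"


props = parse_properties("# Kafka Event Publishing\nmonitoring.kafka.enabled=true\n")
print(is_enabled(props, "kafka"), is_enabled(props, "elasticsearch"))
```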
Version Number
Additional context
Implementation Phases
- Phase 1 (Foundation): Kafka infrastructure, Cuebot event generation, basic Elasticsearch setup
- Phase 2 (Storage Integration): Elasticsearch schema, data ingestion pipelines, Prometheus integration
- Phase 3 (API Enhancement): Extended pycue API, performance optimization
- Phase 4 (Visualization): Grafana dashboard development, end-to-end testing