
[cuebot/pycue/proto] Add render farm monitoring system with Kafka, Elasticsearch, and enhanced Prometheus metrics #2085

@ramonfigueiredo

Describe the enhancement
This enhancement proposes a monitoring infrastructure for collecting, storing, and providing access to render farm statistics. The system would address several current limitations:

  1. Limited Historical Data Access: The existing pycue API only provides 3 days of job history due to database recycling policies, limiting long-term analysis and memory prediction capabilities.
  2. Insufficient Real-time Visibility: Production Support and Resources (PSR) Teams lack comprehensive real-time views into farm operations for resource forecasting and troubleshooting.
  3. Fragmented Monitoring Solutions: Multiple disparate systems create maintenance overhead and data silos.

Proposed Architecture

The solution would introduce an event-driven architecture with the following components:

  • Kafka Event Publishing: Real-time capture and publishing of Job/Layer/Frame/Host lifecycle events to Kafka topics
  • Elasticsearch Storage: Long-term historical data storage (1-2 year retention) for analytical queries
  • Enhanced Prometheus Metrics: Extended metrics for frame completion rates, runtime histograms, and memory usage patterns
  • Extended pycue API: New methods for querying historical data beyond the current 3-day limitation
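The event publishing component could be sketched roughly as follows. This is only an illustration: the `FarmEvent` envelope, its field names, the `farm.<entity>.events` topic scheme, and the `EventPublisher` wrapper are all hypothetical and not part of Cuebot or pycue. The `send` callable stands in for a real Kafka producer so the sketch runs without a broker.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FarmEvent:
    """Hypothetical envelope for one lifecycle event (field names are assumptions)."""
    entity: str       # "job", "layer", "frame", "host", or "proc"
    action: str       # e.g. "started", "completed", "failed"
    entity_id: str
    payload: Dict[str, object] = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))

    def topic(self) -> str:
        # Assumed naming scheme: one topic per entity type.
        return f"farm.{self.entity}.events"

    def to_bytes(self) -> bytes:
        # JSON is used here for readability; the real wire format is undecided.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


class EventPublisher:
    """Routes events to a send function; does nothing unless explicitly enabled."""

    def __init__(self, send: Callable[[str, bytes, bytes], None], enabled: bool = False):
        self._send = send
        self._enabled = enabled

    def publish(self, event: FarmEvent) -> bool:
        if not self._enabled:
            return False
        # Keying by entity id lets a broker keep one entity's events in order.
        self._send(event.topic(), event.entity_id.encode("utf-8"), event.to_bytes())
        return True


# In-memory stand-in for a Kafka producer: records (topic, key, value) tuples.
sent: List[Tuple[str, bytes, bytes]] = []
publisher = EventPublisher(lambda t, k, v: sent.append((t, k, v)), enabled=True)
publisher.publish(FarmEvent("frame", "completed", "frame-0001", {"max_rss_kb": 2048000}))
```

Defaulting `enabled` to `False` mirrors the opt-in behaviour proposed for the configuration: with publishing switched off, `publish()` returns without touching the producer.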

Proposed Features

| Feature | Description |
| --- | --- |
| Event Types | Job, Layer, Frame, Host, and Proc lifecycle events |
| Historical Queries | getJobHistory(), getFrameHistory(), getLayerMemoryHistory() |
| Prometheus Metrics | Frame/job completion counters, runtime/memory histograms |
| Configuration | Fully opt-in via properties (disabled by default) |
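As a rough idea of what a call such as getLayerMemoryHistory() might send to Elasticsearch, the sketch below builds a query body using the standard query DSL (`bool` filter, `term`, `range`, and a `max` aggregation). The function name `build_layer_memory_query` and the document fields `show`, `layer_name`, `timestamp`, and `max_rss_kb` are assumptions; the actual index schema has not been defined yet.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def build_layer_memory_query(
    show: str,
    layer_name: str,
    days: int = 365,
    now: Optional[datetime] = None,
    size: int = 1000,
) -> Dict[str, Any]:
    """Build an Elasticsearch search body for one layer's memory history.

    All field names are hypothetical placeholders for the eventual schema.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    return {
        "size": size,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter context: exact matches and a time window, no scoring.
                "filter": [
                    {"term": {"show": show}},
                    {"term": {"layer_name": layer_name}},
                    {"range": {"timestamp": {
                        "gte": start.isoformat(),
                        "lte": now.isoformat(),
                    }}},
                ]
            }
        },
        # Peak memory over the window, useful as a memory prediction input.
        "aggs": {"peak_memory_kb": {"max": {"field": "max_rss_kb"}}},
    }


body = build_layer_memory_query(
    "demo_show", "comp_nuke", days=30,
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

The `days` parameter is what would lift the current 3-day limit: the window is bounded only by how long Elasticsearch retains the data.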

Use Cases

  1. Enhanced Memory Prediction: Access up to 1 year of historical job data to improve memory prediction accuracy for DCC applications (e.g. Nuke)
  2. Operational Dashboard for Production Support and Resources (PSR) Teams: Real-time farm status, resource forecasting, and capacity planning
  3. Analytics: Long-term trend analysis for render farm optimization

Proposed Configuration Properties

# Kafka Event Publishing
monitoring.kafka.enabled=false
monitoring.kafka.bootstrap.servers=localhost:9092

# Elasticsearch Historical Storage
monitoring.elasticsearch.enabled=false
monitoring.elasticsearch.host=localhost
monitoring.elasticsearch.port=9200
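
To illustrate the opt-in behaviour of these properties, here is a minimal Python sketch that parses the block above and checks the `enabled` flags. The helpers `parse_properties` and `is_enabled` are hypothetical, and the parser handles only simple `key=value` lines and `#` comments, not the full Java properties syntax. Cuebot would read these values through its own configuration mechanism.

```python
from typing import Dict

PROPERTIES_TEXT = """\
# Kafka Event Publishing
monitoring.kafka.enabled=false
monitoring.kafka.bootstrap.servers=localhost:9092

# Elasticsearch Historical Storage
monitoring.elasticsearch.enabled=false
monitoring.elasticsearch.host=localhost
monitoring.elasticsearch.port=9200
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def is_enabled(props: Dict[str, str], component: str) -> bool:
    """A component is on only if its flag is explicitly 'true' (missing means off)."""
    return props.get(f"monitoring.{component}.enabled", "false").lower() == "true"


props = parse_properties(PROPERTIES_TEXT)
kafka_on = is_enabled(props, "kafka")
elasticsearch_on = is_enabled(props, "elasticsearch")
```

Treating a missing flag the same as `false` keeps existing deployments unaffected unless an operator turns the feature on.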

Version Number

Additional context

Implementation Phases

  1. Phase 1 (Foundation): Kafka infrastructure, Cuebot event generation, basic Elasticsearch setup
  2. Phase 2 (Storage Integration): Elasticsearch schema, data ingestion pipelines, Prometheus integration
  3. Phase 3 (API Enhancement): Extended pycue API, performance optimization
  4. Phase 4 (Visualization): Grafana dashboard development, end-to-end testing

Metadata

Labels

enhancement: Improvement to an existing feature
