fix progress #695

Open
xutongNV wants to merge 1 commit into feature/PROJ-147-operator-redesign from xutongr/progress

Contributor

xutongNV commented Mar 11, 2026

Description

Issue #147

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a Go-based backend listener (v2) as an alternative to the Python implementation, selectable via a feature flag
    • Introduced message worker service for handling operator messages from Redis streams
    • Added operator service with gRPC support for managing Kubernetes events and node conditions
    • Enabled event listener for tracking Pod events with deduplication and TTL-based filtering
    • Added node condition rules management and monitoring capabilities
  • Documentation

    • Updated Helm chart documentation for new services and configuration options
  • Tests

    • Expanded test coverage for node helpers, usage tracking, and pod status calculations
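
The release notes mention Pod event deduplication with TTL-based filtering but do not show how it works. The following is a minimal sketch of one common approach, not the PR's actual code: remember when each event key was last forwarded and drop repeats that arrive inside the TTL window. The names `eventKey`, `Deduper`, `ShouldForward`, `Prune`, and `fakeClock` are hypothetical, and the choice of namespace, pod, and reason as the key is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// eventKey identifies a Pod event for deduplication.
// Assumption: the real listener may key on different fields.
type eventKey struct {
	Namespace, Pod, Reason string
}

// fakeClock is a manually advanced clock so the demo does not need to sleep.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// Deduper remembers recently forwarded keys and drops repeats inside the TTL.
type Deduper struct {
	mu   sync.Mutex
	ttl  time.Duration
	now  func() time.Time // injectable clock
	seen map[eventKey]time.Time
}

func NewDeduper(ttl time.Duration, now func() time.Time) *Deduper {
	return &Deduper{ttl: ttl, now: now, seen: make(map[eventKey]time.Time)}
}

// ShouldForward reports whether the key is new or was last forwarded at
// least ttl ago. It records the key whenever it returns true.
func (d *Deduper) ShouldForward(k eventKey) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	now := d.now()
	if last, ok := d.seen[k]; ok && now.Sub(last) < d.ttl {
		return false
	}
	d.seen[k] = now
	return true
}

// Prune deletes expired entries so the map does not grow without bound.
// It returns the number of entries removed.
func (d *Deduper) Prune() int {
	d.mu.Lock()
	defer d.mu.Unlock()
	now := d.now()
	removed := 0
	for k, last := range d.seen {
		if now.Sub(last) >= d.ttl {
			delete(d.seen, k)
			removed++
		}
	}
	return removed
}

func main() {
	clk := &fakeClock{}
	d := NewDeduper(time.Minute, clk.Now)
	k := eventKey{Namespace: "default", Pod: "pod-a", Reason: "BackOff"}

	fmt.Println(d.ShouldForward(k)) // first sighting is forwarded
	fmt.Println(d.ShouldForward(k)) // repeat inside the TTL is dropped
	clk.Advance(2 * time.Minute)
	fmt.Println(d.ShouldForward(k)) // TTL has elapsed, forwarded again
}
```

Injecting the clock keeps the TTL logic testable without real waits; a production listener would presumably pass `time.Now` and call `Prune` on a timer.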

xutongNV requested a review from a team as a code owner March 11, 2026 23:55
xutongNV changed the base branch from main to feature/PROJ-147-operator-redesign March 11, 2026 23:55
coderabbitai bot commented Mar 11, 2026

Caution

Review failed

Failed to post review comments

📝 Walkthrough

This pull request introduces a comprehensive Go-based operator redesign with a new backend listener service (v2), message worker service, and gRPC-based streaming interfaces for pod and node updates. It includes extensive protobuf definitions, Kubernetes-native listeners, database schema extensions, Helm chart updates, and supporting build configuration across multiple architectures.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Operator Service Core | `src/service/operator/{main.go, listener_service/..., utils/...}` | New gRPC operator service with streaming listeners for backend initialization, pod status, and node condition updates backed by Redis and PostgreSQL. Includes utilities for backend management, helpers, and comprehensive test coverage. |
| Go-based Backend Listener | `src/operator/{listener.go, event_listener.go, node_listener.go, node_usage_listener.go, workflow_listener.go, backend_listener.py}` | New Go implementation of backend listeners for Kubernetes events, node tracking, pod usage aggregation, and workflow pod updates. Replaces Python listener with v2 architecture using gRPC streaming and progress tracking. |
| Listener Utilities | `src/operator/utils/{base_listener.go, helpers.go, instruments.go, k8s_helpers.go, listener_args.go, login.go, metric_attrs.go, node_condition_rules.go, node_helpers.go, node_usage_aggregator.go, otel.go, task_status.go, unack_messages.go}` | Comprehensive utility library for listener implementations including Kubernetes client wrappers, OpenTelemetry metrics, gRPC connection management, node state tracking, usage aggregation, and unacknowledged message handling with full test coverage. |
| Message Worker Service | `src/service/agent/{message_worker.py, message_worker.go, BUILD}` | New Redis Stream-based message worker service for processing operator messages (pod updates, node changes, events) with consumer group semantics, abandoned message recovery, and backend operation handling. |
| Protocol Buffers | `src/proto/operator/{messages.proto, services.proto, BUILD}` | New protobuf definitions for operator messaging including pod/node/event updates, listener service RPCs (streaming and unary), initialization, and heartbeat messages with gRPC service definitions. |
| Build Configuration | `src/operator/BUILD`, `src/service/operator/BUILD`, `src/service/agent/BUILD`, `src/utils/progress_check/BUILD`, `MODULE.bazel`, `BUILD_AND_TEST.md` | Added Bazel targets for Go libraries, tests, binaries, and OCI images for operator/listener services and message worker across x86_64 and arm64 architectures with protobuf compilation rules. |
| Listener Tests | `src/operator/{event_listener_test.go, node_condition_rule_listener_test.go, node_listener_test.go, node_usage_listener_test.go, workflow_listener_test.go, tests/*_test.go}` | Comprehensive Go test suites covering event deduplication, pod state tracking, node helpers, resource aggregation, and listener initialization with JSON-driven test cases and benchmarks. |
| Helm Charts & Deployment | `deployments/charts/{backend-operator, quick-start, service}/{README.md, values.yaml, templates/*.yaml}` | Updated Helm charts for new v2 backend listener toggle, message worker, operator service configurations with scaling, resource limits, sidecars, and pod monitoring integration. |
| Database & Configuration | `src/utils/connectors/postgres.py`, `deployments/charts/service/migrations/005_v6_3_0_schema.json`, `src/service/core/{config, tests, workflow}` | Added database migration for last_updated/last_usage_updated timestamps on resources table; updated helpers to use context-based database access instead of singleton pattern; node condition rule validation. |
| Utilities & Infrastructure | `src/lib/utils/{version.go, BUILD}`, `src/utils/{backoff.go, backend_messages.py, progress_check/progress.go, roles/action_registry*.go}` | New utilities for version management, exponential backoff with jitter, protobuf message types for operator communication, atomic progress file writing, and gRPC action registry for listener service endpoints. |
| Development & Configuration | `.github/workflows/docs-deploy.yaml`, `.gitignore`, `run/{start_backend_bazel.py, start_service_bazel.py, start_service_kind.py, minimal/*.yaml}`, `src/go.mod` | Updated workflow triggers, gitignore patterns for protobuf, development scripts for platform-specific listener binaries and operator/message worker services, and Go module dependencies for gRPC/protobuf/Kubernetes/OpenTelemetry. |
| Documentation & Benchmarks | `projects/PROJ-147-operator-redesign/benchmark.md`, `deployments/charts/service/README.md`, `deployments/charts/quick-start/README.md` | Updated benchmarks showing improved latency (~8ms vs ~20ms), documentation for new message worker and operator services in quick-start and service charts with configuration options. |
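
The Utilities & Infrastructure entry lists exponential backoff with jitter (`src/utils/backoff.go`) without showing the calculation. As a rough illustration only, here is a sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and a capped exponential ceiling. The type `Backoff` and the functions `NewBackoff`, `Ceiling`, and `Delay` are hypothetical names, and the actual file may use a different jitter strategy.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Backoff computes retry delays with "full jitter": each delay is drawn
// uniformly from [0, min(Max, Base * 2^attempt)].
// Assumption: this is one common variant, not necessarily the one in the PR.
// A Backoff is not safe for concurrent use because *rand.Rand is not.
type Backoff struct {
	Base time.Duration
	Max  time.Duration
	rng  *rand.Rand
}

func NewBackoff(base, maxDelay time.Duration, seed int64) *Backoff {
	return &Backoff{Base: base, Max: maxDelay, rng: rand.New(rand.NewSource(seed))}
}

// Ceiling returns the un-jittered upper bound for a 0-based attempt number.
// Doubling stops as soon as the cap is reached, which also avoids overflow
// for large attempt counts.
func (b *Backoff) Ceiling(attempt int) time.Duration {
	d := b.Base
	for i := 0; i < attempt && d < b.Max; i++ {
		d *= 2
	}
	if d > b.Max {
		d = b.Max
	}
	return d
}

// Delay returns a random duration in [0, Ceiling(attempt)].
func (b *Backoff) Delay(attempt int) time.Duration {
	c := b.Ceiling(attempt)
	if c <= 0 {
		return 0
	}
	return time.Duration(b.rng.Int63n(int64(c) + 1))
}

func main() {
	b := NewBackoff(100*time.Millisecond, 5*time.Second, 1)
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: ceiling=%v delay=%v\n",
			attempt, b.Ceiling(attempt), b.Delay(attempt))
	}
}
```

Jitter spreads reconnect attempts out so that many listeners restarting at once do not retry in lockstep; the cap keeps the worst-case wait bounded.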

Sequence Diagram(s)

sequenceDiagram
    participant Backend as Backend Server
    participant Listener as Go Listener
    participant K8s as Kubernetes API
    participant Redis as Redis Stream
    participant MessageWorker as Message Worker
    participant Database as PostgreSQL
    
    Backend->>Listener: InitBackend(gRPC)
    Listener->>Database: Create/Update Backend
    Listener->>Backend: InitBackendResponse
    
    K8s->>Listener: Pod/Node Events
    Listener->>Listener: Build UpdatePod/UpdateNode Message
    Listener->>Redis: Push to Stream
    Listener->>Backend: Send ACK
    
    MessageWorker->>Redis: Consume from Stream
    MessageWorker->>MessageWorker: Parse Protobuf Message
    MessageWorker->>Database: Update Task/Node/Usage
    MessageWorker->>Redis: Acknowledge Message
    
    Backend->>Listener: NodeConditionStream (heartbeat)
    Listener->>Database: Fetch Node Conditions
    Listener->>Backend: Send Conditions + ACKs
    
    Listener->>Listener: Report Progress
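
The "Report Progress" step, together with the "atomic progress file writing" noted for `progress_check/progress.go`, suggests a write-then-rename pattern. The sketch below shows that general technique under stated assumptions: the function names `writeFileAtomic` and `readProgress` are hypothetical, and the timestamp format written here is a guess, since the real file contents are not shown in this PR page.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// writeFileAtomic writes data to a temporary file in the same directory as
// path, flushes it, and renames it over path. On POSIX file systems the
// rename is atomic, so a reader sees either the old or the new contents.
// Note: os.CreateTemp creates the file with mode 0600.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".progress-*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Cleans up on failure; fails harmlessly once the rename has succeeded.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// readProgress returns the current contents of the progress file.
func readProgress(path string) (string, error) {
	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	dir, err := os.MkdirTemp("", "progress-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "progress")
	for i := 0; i < 3; i++ {
		// Assumption: the progress file holds a timestamp string.
		stamp := time.Now().UTC().Format(time.RFC3339Nano)
		if err := writeFileAtomic(path, []byte(stamp)); err != nil {
			panic(err)
		}
	}
	got, err := readProgress(path)
	if err != nil {
		panic(err)
	}
	fmt.Println("last progress:", got)
}
```

Creating the temporary file in the destination directory matters because a rename across file systems is not atomic and may fail outright.
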
sequenceDiagram
    participant Client as gRPC Client<br/>(Backend)
    participant LS as ListenerService
    participant Redis as Redis
    participant PG as PostgreSQL
    
    Client->>LS: ListenerStream(messages)
    activate LS
    LS->>Redis: pushMessageToRedis(ListenerMessage)
    LS->>Client: AckMessage
    LS->>LS: reportProgress()
    Client->>LS: More Messages...
    LS->>Redis: pushMessageToRedis(ListenerMessage)
    Client->>LS: Close Stream
    deactivate LS
    
    Client->>LS: InitBackend(InitBackendRequest)
    activate LS
    LS->>PG: CreateOrUpdateBackend
    LS->>Redis: Push backend operation notification
    LS->>Client: InitBackendResponse
    deactivate LS
    
    Client->>LS: NodeConditionStream(heartbeats)
    activate LS
    LS->>PG: FetchBackendNodeConditions
    LS->>Client: Send NodeConditionsMessage
    Client->>LS: Heartbeat
    LS->>PG: UpdateBackendLastHeartbeat
    LS->>Redis: BLPOP for node condition actions
    LS->>Client: Send updated conditions
    deactivate LS
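
The `AckMessage` replies in the diagram, together with the `unack_messages.go` utility listed earlier, imply that the client keeps sent messages until they are acknowledged. A minimal in-memory sketch of that idea follows; it is an illustration, not the PR's implementation. `Message`, `UnackedTracker`, `Track`, `Ack`, and `Pending` are hypothetical names, and the sequential numeric IDs are an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Message is a stand-in for a listener message sent over the stream.
type Message struct {
	ID   uint64
	Body string
}

// UnackedTracker holds messages that were sent but not yet acknowledged,
// so they can be replayed after a stream reconnect.
type UnackedTracker struct {
	mu      sync.Mutex
	nextID  uint64
	pending map[uint64]Message
}

func NewUnackedTracker() *UnackedTracker {
	return &UnackedTracker{pending: make(map[uint64]Message)}
}

// Track assigns the next ID, stores the message, and returns it for sending.
func (t *UnackedTracker) Track(body string) Message {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextID++
	m := Message{ID: t.nextID, Body: body}
	t.pending[m.ID] = m
	return m
}

// Ack removes a message. It returns false for an unknown ID, for example
// when the same ACK arrives twice.
func (t *UnackedTracker) Ack(id uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.pending[id]; !ok {
		return false
	}
	delete(t.pending, id)
	return true
}

// Pending returns the unacknowledged messages in ID order.
func (t *UnackedTracker) Pending() []Message {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]Message, 0, len(t.pending))
	for _, m := range t.pending {
		out = append(out, m)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	t := NewUnackedTracker()
	first := t.Track("UpdatePod pod-a")
	t.Track("UpdateNode node-1")
	t.Ack(first.ID) // the server acknowledged the first message only

	// After a reconnect, everything still pending would be resent.
	for _, m := range t.Pending() {
		fmt.Printf("replay id=%d body=%q\n", m.ID, m.Body)
	}
}
```

Replaying in ID order preserves the original send order, which matters when later updates for the same pod or node supersede earlier ones.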

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 Hops of gopher code now bound,
Streaming listeners spinning 'round,
Redis queues and Kubernetes grace,
Messages race through backend space—
From Python nests to Go's bright light,
The operator leaps, node by node, just right!

xutongNV enabled auto-merge March 11, 2026 23:56