| Author | Jeongseok Kang (jskang@lablup.com) |
|---|---|
| Status | Accepted |
| Created | 2025-06-27 |
| Created-Version | |
| Target-Version | |
| Implemented-Version | |
This proposal aims to enhance the Backend.AI Model Service deployment process by providing more sophisticated, zero- or near-zero-downtime strategies. Specifically, it outlines how to implement rolling updates and blue-green deployments to ensure minimal service interruption. In addition, it proposes a way to dynamically update environment variables and to roll back when needed.
As model serving usage has been growing, Backend.AI introduced the “Model Service” in version 23.09 and Backend.AI FastTrack introduced the “Model Serving Task” in 25.09 to meet evolving market needs. However, the current deployment process does not fully support strategies for zero-downtime deployment, an important requirement for end-user-facing services. To address this limitation, we propose:

- A rolling update mechanism that updates services incrementally while keeping a subset of instances operational.
- A blue-green deployment flow suitable for use cases requiring isolated testing or strict regulatory requirements.
- A rollback capability that preserves service continuity in case of errors in the new version.

- Deployments only support simple replacement or upgrades, causing short service interruptions.
- Environment variables are fixed at deployment time and cannot be dynamically updated.
- The system lacks fine-grained versioning (e.g., a dedicated “version” field in the `Routing` object) for easier rollback or co-existence of multiple versions.
- Introduce a Rolling Update Strategy:
- When a user triggers a promotion or update, Backend.AI attempts to create new model service sessions incrementally.
- New instances are launched and must pass health checks before receiving a portion of traffic. Once a new instance is deemed healthy, traffic is gradually shifted. This continues until all instances run the new version without causing significant service downtime.
- In the event of a failure during rollout, rollback can occur automatically or manually. However, automatic rollback applies only to the newly deployed version, and specific rules (e.g., timeout, error rate thresholds, resource constraints) govern exactly when it triggers. This helps ensure issues are contained early, while giving users control over rollback policies.
- Manual rollback is always possible through versioned deployments, allowing users to revert to a previous stable version with minimal disruption.
- Introduce a Blue-Green Deployment Mode:
- A user initiates a deployment update via API (UI and CLI support planned).
- The existing active deployment is referred to as “Blue.” The new one is “Green.”
- The system creates as many new “Green” routings as the desired replica count, each with an initial traffic ratio of 0.0.
- Once the new “Green” routings are all healthy (or have passed custom validation checks), the system updates `traffic_ratio` to 1.0 for Green and to 0.0 for the old Blue.
- Blue routings are retained temporarily for rollback purposes, or removed entirely once Green is deemed stable.
- The Model Service returns to a `HEALTHY` (or “active”) status.
- Enable Updating of Environment Variables:
- Allow environment variables (envs) to be modified during rolling or blue-green updates, removing the need for redeployment solely for env changes.
- Provide graceful handling of env changes so that sessions have time to reload configurations and verify correctness.
- Enhance Versioning:
- Each `Routing` gains a new field: `version`.
- The system can track which version is currently in production (and previous stable ones), making rollbacks simpler.
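
To make the versioning idea concrete, here is a minimal sketch of how a `version` field could be used to find the production version and a rollback target. The `Routing` dataclass below is a hypothetical stand-in, not the actual Backend.AI model; `session_id`, `healthy`, and both helper functions are assumptions introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Routing:
    """Hypothetical stand-in for a Model Service routing entry."""

    session_id: str
    version: int          # the new field proposed above
    traffic_ratio: float  # share of traffic, from 0.0 to 1.0
    healthy: bool = False


def production_version(routings: list[Routing]) -> int | None:
    """Return the version that currently receives the most traffic, if any."""
    totals: dict[int, float] = {}
    for r in routings:
        totals[r.version] = totals.get(r.version, 0.0) + r.traffic_ratio
    live = {v: t for v, t in totals.items() if t > 0.0}
    return max(live, key=live.get) if live else None


def rollback_candidate(routings: list[Routing], current: int) -> int | None:
    """Return the newest older version that still has a healthy routing retained."""
    older = {r.version for r in routings if r.version < current and r.healthy}
    return max(older) if older else None
```

With one retained routing at version 1 (ratio 0.0) and one serving routing at version 2 (ratio 1.0), `production_version` returns 2 and `rollback_candidate(..., current=2)` returns 1.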
Below is a unified architectural approach that supports both rolling updates and blue-green deployments. While the steps share a common foundation, they differ mainly in how traffic is shifted between the old and new versions.
flowchart TD
subgraph Rolling Update
A[Set small traffic_ratio for new version]
B[Wait for new instance to become HEALTHY]
C[Gradually increase traffic_ratio for new version]
D[Reduce traffic_ratio for old version]
E[Repeat until new version traffic_ratio=1.0 and all HEALTHY]
A --> B --> C --> D --> E
end
subgraph Blue-Green Update
F[Create Green with minimal traffic_ratio]
G[Green becomes HEALTHY]
H[Set Green traffic_ratio=1.0, Blue=0.0]
I[Green stays HEALTHY at 100% traffic_ratio]
F --> G --> H --> I
end
- User triggers or submits a promotion (API/UI/CLI) to deploy a new version of the Model Service.
- The system checks resource availability (e.g., replicas × required resources). If insufficient, the deployment request is rejected.
- New routing objects (pointing to the updated version) are created with an initial traffic ratio.
- The system monitors each new instance’s health. Once the health checks pass (or other validations are satisfied), traffic is gradually or fully diverted according to the chosen strategy (rolling or blue-green).
- If at any point validation fails, the process may pause or roll back to a stable version.
- Upon successful completion, the new version handles all (or the majority) of traffic, and the old version is either terminated or retained briefly for rollback.
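
The resource check in the flow above (replicas × required resources) can be sketched as follows. The function name and the resource keys (`cpu`, `mem_gib`) are hypothetical; the real scheduler's resource accounting is not described in this proposal.

```python
def has_capacity(
    replicas: int,
    per_replica: dict[str, float],
    available: dict[str, float],
) -> bool:
    """Return True if `available` covers `replicas` copies of `per_replica`.

    A resource missing from `available` is treated as zero (an assumption).
    """
    return all(
        available.get(name, 0) >= replicas * amount
        for name, amount in per_replica.items()
    )
```

Because the old version keeps running during an update, `available` here would need to reflect what is free in addition to the resources the old version already holds.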
- Set a small initial `traffic_ratio` on the new version.
- Wait for each new instance to become 'HEALTHY'.
- Gradually increase the `traffic_ratio` for the new version while reducing it for the old one.
- Continue until the new version’s `traffic_ratio` reaches 1.0 and all instances are 'HEALTHY'.
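
The rolling update steps above can be sketched as a traffic schedule plus a loop that stops when a health check fails. This is an illustration under stated assumptions: the step sizes, the `is_healthy` and `apply_ratios` callbacks, and both function names are hypothetical and not part of any existing Backend.AI API.

```python
from typing import Callable


def rollout_schedule(initial_pct: int, step_pct: int) -> list[tuple[float, float]]:
    """Build (new_ratio, old_ratio) pairs that end at (1.0, 0.0).

    Percentages are integers so the steps are not affected by float drift.
    """
    if not 0 < initial_pct <= 100 or step_pct <= 0:
        raise ValueError("initial_pct must be in 1..100 and step_pct positive")
    pcts = list(range(initial_pct, 100, step_pct)) + [100]
    return [(p / 100, (100 - p) / 100) for p in pcts]


def run_rollout(
    schedule: list[tuple[float, float]],
    is_healthy: Callable[[], bool],
    apply_ratios: Callable[[float, float], None],
) -> bool:
    """Apply each step, then require a healthy new version before moving on.

    Returns False at the first failed health check so that the caller can
    pause or roll back; returns True once the final step is healthy.
    """
    for new_ratio, old_ratio in schedule:
        apply_ratios(new_ratio, old_ratio)
        if not is_healthy():
            return False
    return True
```

For example, `rollout_schedule(10, 30)` yields the steps 10%, 40%, 70%, and 100% for the new version, with the old version receiving the remainder at each step.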
- Create a new version (Green) with a minimal `traffic_ratio` (e.g., 0.0 or very low).
- Once Green is 'HEALTHY', adjust `traffic_ratio` to route traffic fully to Green.
- Simultaneously reduce the old (Blue) version’s `traffic_ratio` to 0.0.
- When Green remains 'HEALTHY' at 100% `traffic_ratio`, the update is complete.
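
The blue-green cutover above amounts to an all-or-nothing switch, sketched below. The `Routing` dataclass and `cut_over` function are hypothetical; how ratios are normalized across replicas and how the switch is made atomic in the real system are left open here.

```python
from dataclasses import dataclass


@dataclass
class Routing:
    """Hypothetical stand-in for a routing entry."""

    version: int
    traffic_ratio: float
    healthy: bool


def cut_over(blue: list[Routing], green: list[Routing]) -> bool:
    """Route all traffic to Green only if every Green routing is healthy.

    If any Green routing is unhealthy (or there are none), nothing changes
    and Blue keeps serving.
    """
    if not green or not all(r.healthy for r in green):
        return False
    for r in green:
        r.traffic_ratio = 1.0
    for r in blue:
        r.traffic_ratio = 0.0  # retained at zero traffic for possible rollback
    return True
```

Keeping the Blue routings at a ratio of 0.0 rather than deleting them means a rollback is the same operation with the two arguments swapped.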
- Both rolling and blue-green strategies can revert to a prior stable version if errors arise.
- Retaining the old version (with its associated environment variables and configuration) until final confirmation ensures a seamless fallback option.
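
One possible shape for the automatic rollback rules mentioned earlier (timeout and error-rate thresholds) is sketched below. The `RollbackPolicy` fields, their default values, and the `should_roll_back` function are illustrative assumptions, not settings defined by this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    """Hypothetical user-configurable thresholds for automatic rollback."""

    health_timeout_sec: float = 300.0  # illustrative default
    max_error_rate: float = 0.05       # illustrative default


def should_roll_back(
    policy: RollbackPolicy,
    seconds_unhealthy: float,
    errors: int,
    requests: int,
) -> bool:
    """Decide whether the newly deployed version should be rolled back."""
    if seconds_unhealthy > policy.health_timeout_sec:
        return True
    if requests == 0:
        return False  # no traffic yet, so no error rate to judge
    return errors / requests > policy.max_error_rate
```

Such a check would apply only to the newly deployed version; manual rollback remains available regardless of what the policy decides.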
- Introduce a subset of new routes (e.g., 5-10% traffic) to test the new version in production-like conditions.
- If error rates exceed a defined threshold, the canary is rolled back before full deployment.
- This feature would require additional monitoring and alerting capabilities to be fully effective.
Introducing rolling updates and blue-green deployments will help achieve near-zero downtime, minimize risk during releases, and allow for dynamic configuration changes. These enhancements will significantly improve the deployment experience for both customers and end users of the Backend.AI Model Service.
- Kubernetes - Performing a Rolling Update: https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/
Rolling updates allow Deployments to take place with zero downtime by incrementally updating Pods instances with new ones.
- Kubernetes - ReplicaSet: https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/