MEP: Channel Exclusive Mode for QueryCoord #11
weiliu1031 wants to merge 4 commits into milvus-io:main
issue: #47500, #47505

Add comprehensive design document for Channel Exclusive Mode feature in QueryCoord. This MEP describes:
- Architecture overview and system components
- Core data structures (ChannelNodeInfo, Replica, mutableReplica)
- Channel exclusive mode lifecycle and activation conditions
- Node assignment algorithm with even distribution
- ChannelLevelScoreBalancer implementation details
- Configuration parameters and examples
- Complete data flow examples including node removal scenarios
- Migration and rollback procedures
- Rolling upgrade considerations
- Resource impact warnings and best practices
- Test plan and verification checklist

The design enables channel-level resource isolation with automatic enable/disable based on cluster state and runtime configuration.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Replace MEP format with the original design document format from the milvus repository. This version maintains the original chapter structure (1-10) and formatting without MEP metadata header.

The document provides comprehensive coverage of:
- Background and motivation
- Architecture overview and system components
- Core data structures with design rationale
- Channel exclusive mode lifecycle
- Node assignment algorithm
- ChannelLevelScoreBalancer implementation
- Configuration parameters and examples
- Complete data flow examples
- Migration and rollback procedures
- Rolling upgrade considerations
- Conclusion and references

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
…cklist

Add the following enhancements to support complete documentation requirements:

1. Related Issues and PRs section at the beginning:
   - Reference to issue #47500
   - Reference to implementation PR #47505
   - Release version (2.6.0)
2. New Chapter 9: Test Plan and Verification
   - Unit tests for replica logic, balancer, and observer
   - Integration test scenarios for end-to-end validation
   - System tests for load testing and resource impact
   - Production verification checklist (pre/during/post-deployment)
   - Rolling upgrade verification steps
   - Regression testing critical paths
   - Performance benchmarks and baselines
3. Enhanced References (Chapter 11)
   - Added related issues and PRs section
   - Added documentation references
   - Added related work references
4. Updated metadata:
   - Document version 1.0 → 1.1
   - Added status and last updated timestamp

The document now provides:
- ✓ Related issues and PRs explicitly referenced
- ✓ Comprehensive test plan with detailed test cases
- ✓ Production verification checklist
- ✓ Performance benchmarks and impact assessment
- ✓ All sections from checklist: Migration, Rollback, Rolling Upgrade, Performance Warnings, Test Plan

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Design Review Comments

Thanks for the comprehensive design document! I have two questions/suggestions:

1. How to avoid massive handoff when toggling exclusive mode?

The document acknowledges that enabling/disabling channel exclusive mode triggers massive rebalancing with 50-80% CPU spikes and 2-5x latency increase. This could be a significant operational risk.

Question: Has the design considered any strategies to reduce the handoff storm during mode transitions? Some potential approaches to consider:

2. Can the "remainder" nodes be shared across all channels?

Current design (7 nodes, 3 channels): The first channel always gets the extra node(s), which seems unfair.

Suggested alternative: Each channel gets 2 dedicated nodes, and node 7 becomes a "shared/overflow" node that all channels can use.

Benefits:
Considerations:
Looking forward to your thoughts on these!
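
To make the two layouts in question 2 concrete, here is a minimal Python sketch. It is not taken from the Milvus code base: the function names, node IDs and channel names are hypothetical, and it assumes that remainder nodes are handed out one per channel starting from the first channel, which matches the 7-node, 3-channel case described above.

```python
def assign_exclusive(nodes, channels):
    """Split nodes evenly across channels; earlier channels absorb the remainder.

    Hypothetical helper, not the QueryCoord implementation.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    base, extra = divmod(len(nodes), len(channels))
    assignment, start = {}, 0
    for i, ch in enumerate(channels):
        # The first `extra` channels each take one additional node.
        size = base + (1 if i < extra else 0)
        assignment[ch] = nodes[start:start + size]
        start += size
    return assignment


def assign_with_shared(nodes, channels):
    """Give every channel the same number of dedicated nodes; leftovers are shared.

    Hypothetical helper modelling the reviewer's suggested alternative.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    base = len(nodes) // len(channels)
    assignment = {
        ch: nodes[i * base:(i + 1) * base] for i, ch in enumerate(channels)
    }
    shared = nodes[base * len(channels):]
    return assignment, shared


nodes = [1, 2, 3, 4, 5, 6, 7]
channels = ["ch0", "ch1", "ch2"]

current = assign_exclusive(nodes, channels)
suggested, shared = assign_with_shared(nodes, channels)

print(current)    # {'ch0': [1, 2, 3], 'ch1': [4, 5], 'ch2': [6, 7]}
print(suggested)  # {'ch0': [1, 2], 'ch1': [3, 4], 'ch2': [5, 6]}
print(shared)     # [7]
```

In the first layout every node belongs to exactly one channel; in the second, node 7 would serve all three channels.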
**Document Version**: 1.1 (with Test Plan and Verification)
**Date**: 2026-02-04
**Author**: Milvus QueryCoord Team
…n document

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Thanks for the detailed review and thoughtful suggestions! Let me address both questions:

Response to Question 1: Avoiding Massive Handoff During Mode Transitions

The current implementation already includes several mechanisms to mitigate the handoff storm:

Gradual Balancing (Already Implemented):
Existing Rate Limiting Capabilities:
Remaining Concerns:
The current design strikes a balance between:
Response to Question 2: Shared "Remainder" Nodes

The shared node approach is not recommended for the following reasons:

1. Design Complexity and Philosophical Inconsistency

Hybrid Mode Challenges:
Loss of "Exclusive" Semantics:
2. Severe Query Performance Bottleneck

Hotspot Problem:
The shared node becomes an artificial hotspot:
Defeats the Purpose:
3. Current Design Rationale

Why "First Channel Gets Extra Nodes" is Acceptable:

The current design (e.g., 7 nodes / 3 channels →
Alternative Solutions for Better Fairness (if needed in the future):
However, these enhancements add complexity and are not critical for the initial implementation.

Summary

Let me know if you'd like to discuss these trade-offs further or if there are other aspects of the design you'd like me to clarify!
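
As a rough illustration of the hotspot argument, the following Python sketch estimates per-node load for the 7-node, 3-channel case. It rests on simplifying assumptions that are not part of the design document: every channel carries the same query load (one unit), and that load is spread evenly over all nodes serving the channel. The function name and the node and channel labels are hypothetical.

```python
from fractions import Fraction


def per_node_load(channel_nodes, channel_load=Fraction(1)):
    """Sum the load each node receives when every channel's load is split
    evenly across the nodes that serve it (an assumed, simplified model)."""
    load = {}
    for nodes in channel_nodes.values():
        share = channel_load / len(nodes)
        for node in nodes:
            load[node] = load.get(node, Fraction(0)) + share
    return load


# Fully exclusive layout: every node serves exactly one channel.
exclusive_layout = {"ch0": [1, 2, 3], "ch1": [4, 5], "ch2": [6, 7]}

# Shared-remainder layout: node 7 serves all three channels.
shared_layout = {"ch0": [1, 2, 7], "ch1": [3, 4, 7], "ch2": [5, 6, 7]}

load_exclusive = per_node_load(exclusive_layout)
load_shared = per_node_load(shared_layout)

print(max(load_exclusive.values()))  # 1/2
print(load_shared[1])                # 1/3
print(load_shared[7])                # 1
```

Under this model the shared node carries three times the load of any dedicated node, while the busiest node in the exclusive layout carries only 1.5 times the load of the least busy one. Real workloads are unlikely to be this uniform, so the numbers indicate direction rather than magnitude.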
Summary
Add comprehensive design document for Channel Exclusive Mode feature in QueryCoord.
This MEP describes the architecture, implementation, and operational considerations for channel-level resource isolation in Milvus 2.6.
Key Features
Design Highlights
Architecture
Key Components
Configuration
Migration Path
Related
Document Structure
Checklist