enhance: Optimize shard serviceable mechanism #41937

Merged

Conversation

weiliu1031
Contributor

issue: #41690

  • Merge leader view and channel management into ChannelDistManager, allowing a channel to have multiple delegators.
  • Improve shard leader switching to ensure a single replica only has one shard leader per channel. The shard leader handles all resource loading and query requests.
  • Refine the serviceable mechanism: after QC completes loading, sync the query view to the delegator. The delegator then determines its serviceable status based on the query view.
  • When a delegator encounters forwarding query or deletion failures, mark the corresponding segment as offline and transition it to an unserviceable state.
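The last two points above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `QueryView`, `Delegator`, `SyncQueryView`, `MarkLoaded`, `MarkOffline`, and `Serviceable` are hypothetical, simplified names, and the real delegator tracks far more state. It assumes a delegator is serviceable only once a query view has been synced and every segment in that view is loaded and not marked offline.

```go
package main

import (
	"fmt"
	"sync"
)

// QueryView is a hypothetical, simplified stand-in for the view that QC
// syncs to a delegator: a target version plus the segments it must serve.
type QueryView struct {
	TargetVersion int64
	Segments      []int64
}

// Delegator is a hypothetical, simplified shard delegator.
type Delegator struct {
	mu      sync.RWMutex
	view    *QueryView
	loaded  map[int64]bool // segments currently loaded
	offline map[int64]bool // segments marked offline after a failure
}

func NewDelegator() *Delegator {
	return &Delegator{loaded: map[int64]bool{}, offline: map[int64]bool{}}
}

// SyncQueryView records the view pushed by the coordinator after loading.
func (d *Delegator) SyncQueryView(v *QueryView) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.view = v
}

// MarkLoaded records a successful load and clears any offline mark.
func (d *Delegator) MarkLoaded(segmentID int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.loaded[segmentID] = true
	delete(d.offline, segmentID)
}

// MarkOffline is what a forwarding failure (query or delete) would trigger.
func (d *Delegator) MarkOffline(segmentID int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.offline[segmentID] = true
}

// Serviceable reports whether a view has been synced and every segment in
// it is loaded and online.
func (d *Delegator) Serviceable() bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	if d.view == nil {
		return false
	}
	for _, id := range d.view.Segments {
		if !d.loaded[id] || d.offline[id] {
			return false
		}
	}
	return true
}

func main() {
	d := NewDelegator()
	d.MarkLoaded(1)
	d.MarkLoaded(2)
	fmt.Println(d.Serviceable()) // false: no query view synced yet

	d.SyncQueryView(&QueryView{TargetVersion: 1, Segments: []int64{1, 2}})
	fmt.Println(d.Serviceable()) // true: every segment in the view is loaded

	d.MarkOffline(2) // e.g. forwarding a delete to segment 2 failed
	fmt.Println(d.Serviceable()) // false: segment 2 is offline
}
```

Deriving the status from the synced view on every call, rather than caching a flag, keeps the sketch simple: reloading the offline segment makes the delegator serviceable again without any extra bookkeeping.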

@sre-ci-robot sre-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines) and area/internal-api labels May 20, 2025
@sre-ci-robot sre-ci-robot requested review from aoiasd and bigsheeper May 20, 2025 03:23
@mergify mergify bot added the dco-passed (DCO check passed) and kind/enhancement (issues or changes related to enhancement) labels May 20, 2025
@weiliu1031 weiliu1031 force-pushed the optimize_shard_serviceable branch from e91be31 to cb3de2d Compare May 20, 2025 03:32
Contributor

mergify bot commented May 20, 2025

@weiliu1031 The go-sdk check failed. Comment rerun go-sdk to trigger the job again.

codecov bot commented May 20, 2025

Codecov Report

Attention: Patch coverage is 86.68122% with 61 lines in your changes missing coverage. Please review.

Project coverage is 80.44%. Comparing base (f20e085) to head (82eac09).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
internal/querycoordv2/dist/dist_handler.go 75.24% 22 Missing and 3 partials ⚠️
internal/querycoordv2/meta/channel_dist_manager.go 88.07% 10 Missing and 3 partials ⚠️
internal/querynodev2/delegator/delegator.go 52.63% 6 Missing and 3 partials ⚠️
internal/querycoordv2/task/scheduler.go 83.87% 1 Missing and 4 partials ⚠️
internal/querycoordv2/task/executor.go 66.66% 3 Missing and 1 partial ⚠️
internal/querycoordv2/checkers/segment_checker.go 91.66% 1 Missing and 1 partial ⚠️
internal/querycoordv2/observers/target_observer.go 90.47% 1 Missing and 1 partial ⚠️
internal/querycoordv2/checkers/channel_checker.go 88.88% 1 Missing ⚠️
Additional details and impacted files

@@             Coverage Diff             @@
##           master   #41937       +/-   ##
===========================================
+ Coverage   73.02%   80.44%    +7.41%     
===========================================
  Files         335     1536     +1201     
  Lines       30700   216544   +185844     
===========================================
+ Hits        22419   174200   +151781     
- Misses       8281    36064    +27783     
- Partials        0     6280     +6280     
Components Coverage Δ
Client 79.36% <ø> (∅)
Core 73.00% <ø> (-0.02%) ⬇️
Go 81.90% <86.68%> (∅)
Files with missing lines Coverage Δ
internal/querycoordv2/balance/report.go 97.77% <100.00%> (ø)
...al/querycoordv2/balance/rowcount_based_balancer.go 90.81% <100.00%> (ø)
...ernal/querycoordv2/balance/score_based_balancer.go 96.47% <100.00%> (ø)
internal/querycoordv2/checkers/balance_checker.go 96.27% <100.00%> (ø)
internal/querycoordv2/checkers/leader_checker.go 93.20% <100.00%> (ø)
internal/querycoordv2/dist/dist_controller.go 85.71% <100.00%> (ø)
internal/querycoordv2/handlers.go 75.40% <100.00%> (ø)
internal/querycoordv2/meta/dist_manager.go 81.81% <100.00%> (ø)
internal/querycoordv2/meta/replica.go 100.00% <ø> (ø)
internal/querycoordv2/meta/replica_manager.go 82.12% <ø> (ø)
... and 17 more

... and 1175 files with indirect coverage changes

Contributor

mergify bot commented May 20, 2025

@weiliu1031 The E2E Jenkins job failed. Comment /run-cpu-e2e to trigger the job again.

@weiliu1031 weiliu1031 force-pushed the optimize_shard_serviceable branch from cb3de2d to 17973bd Compare May 20, 2025 06:32
serviceable := checkDelegatorServiceable(ctx, dh, dmChannel.View)
// trigger pull next target until shard leader is ready
if !serviceable {
	dh.lastUpdateTs = 0
Collaborator

lastUpdateTs is not guarded by a lock

Contributor Author

Each querynode has its own dist_handler, and the dist_handler pulls data distribution serially, so the lock is unnecessary.
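The pattern discussed in this thread can be sketched as below. It is an illustration under stated assumptions, not the actual dist_handler: `queryNode`, `distHandler`, `pull`, `tick`, and `fullPulls` are hypothetical names, and the sketch assumes the node reports its distribution only when it is newer than the `lastUpdateTs` the handler sends, so resetting the field to 0 forces a full pull on the next round. Because one handler drives one node from a single loop, the field is only ever touched by that loop.

```go
package main

import "fmt"

// queryNode is a hypothetical stand-in for a remote query node.
type queryNode struct {
	updatedAt   int64 // timestamp of the node's latest distribution change
	serviceable bool  // whether its shard leader is ready
}

// pull reports the distribution as changed only when it is newer than the
// caller's lastUpdateTs (an assumption about how the skip logic works).
func (n *queryNode) pull(lastUpdateTs int64) (changed, serviceable bool) {
	return n.updatedAt > lastUpdateTs, n.serviceable
}

// distHandler is a hypothetical, simplified per-node handler. Each node has
// its own handler, and tick is only called from that handler's single loop,
// so lastUpdateTs needs no lock in this sketch.
type distHandler struct {
	node         *queryNode
	lastUpdateTs int64
	fullPulls    int // counts rounds that actually processed a distribution
}

// tick performs one serial pull round.
func (h *distHandler) tick() {
	changed, serviceable := h.node.pull(h.lastUpdateTs)
	if !changed {
		return // nothing newer than what we already processed
	}
	h.fullPulls++
	h.lastUpdateTs = h.node.updatedAt
	if !serviceable {
		// Keep pulling the full distribution until the shard leader is ready.
		h.lastUpdateTs = 0
	}
}

func main() {
	node := &queryNode{updatedAt: 10}
	h := &distHandler{node: node}

	h.tick()
	h.tick()
	fmt.Println(h.fullPulls, h.lastUpdateTs) // 2 0

	node.serviceable = true
	h.tick()
	h.tick() // skipped: nothing newer than lastUpdateTs
	fmt.Println(h.fullPulls, h.lastUpdateTs) // 3 10
}
```

If the field were ever read or written from a second goroutine, the reviewer's concern would apply and it would need a mutex or an atomic.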

	zap.String("channel", view.Channel),
	zap.Error(err))
if status := view.Status; status != nil {
	if !status.GetServiceable() {
Collaborator

No need to call GetServiceable twice.
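One way to act on this suggestion is to read the flag once into a local variable and reuse it. The sketch below uses hypothetical `LeaderView` and `LeaderViewStatus` types that only mimic the nil-safe getter style of protobuf-generated Go code; the helper `unserviceableChannels` is likewise an assumption made for illustration.

```go
package main

import "fmt"

// LeaderViewStatus is a hypothetical stand-in for a generated status message.
type LeaderViewStatus struct {
	Serviceable bool
}

// GetServiceable mirrors the nil-safe getters protobuf generates for Go.
func (s *LeaderViewStatus) GetServiceable() bool {
	if s == nil {
		return false
	}
	return s.Serviceable
}

// LeaderView is a hypothetical, simplified leader view.
type LeaderView struct {
	Channel string
	Status  *LeaderViewStatus
}

// unserviceableChannels reads each view's flag once and reuses the local
// value instead of calling the getter again.
func unserviceableChannels(views []*LeaderView) []string {
	var out []string
	for _, view := range views {
		status := view.Status
		if status == nil {
			continue // no status reported; nothing to check
		}
		serviceable := status.GetServiceable() // single read
		if !serviceable {
			out = append(out, view.Channel)
		}
	}
	return out
}

func main() {
	views := []*LeaderView{
		{Channel: "ch-1", Status: &LeaderViewStatus{Serviceable: true}},
		{Channel: "ch-2", Status: &LeaderViewStatus{Serviceable: false}},
		{Channel: "ch-3"}, // nil status
	}
	fmt.Println(unserviceableChannels(views)) // [ch-2]
}
```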

@weiliu1031 weiliu1031 force-pushed the optimize_shard_serviceable branch from 17973bd to 1f7d364 Compare May 21, 2025 09:17
Contributor

mergify bot commented May 21, 2025

@weiliu1031 The E2E Jenkins job failed. Comment /run-cpu-e2e to trigger the job again.

- Merge leader view and channel management into ChannelDistManager,
  allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has
  one shard leader per channel. The shard leader handles all resource
  loading and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync
  the query view to the delegator. The delegator then determines its
  serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
  mark the corresponding segment as offline and transition it to an
  unserviceable state.

Signed-off-by: Wei Liu <[email protected]>
@weiliu1031 weiliu1031 force-pushed the optimize_shard_serviceable branch from 1f7d364 to 82eac09 Compare May 21, 2025 14:21
@mergify mergify bot added the ci-passed label May 21, 2025
@xiaofan-luan
Collaborator

/lgtm
/approve

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: weiliu1031, xiaofan-luan

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit 7801026 into milvus-io:master May 22, 2025
19 of 20 checks passed
weiliu1031 added a commit to weiliu1031/milvus that referenced this pull request May 30, 2025
issue: milvus-io#42098 milvus-io#42404
related to: milvus-io#42009 milvus-io#41937

Implement new method to handle partition removal from next target without
directly modifying current target.

Changes include:
- Add RemovePartitionFromNextTarget method and deprecate RemovePartition
- Update target_observer to use new method for ReleasePartition operations
- Add unit tests and mock methods for new functionality

This ensures that all changes to the next target propagate to the
delegator's query view.

Signed-off-by: Wei Liu <[email protected]>
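The idea in this commit message can be sketched as follows. The `TargetManager` here is a hypothetical, heavily simplified model (a target is just a map from partition ID to segment IDs), and the method signature is an assumption; the real method in the referenced commit may take different parameters. `Promote` stands in for whatever step later makes the next target current and syncs it to the delegator.

```go
package main

import (
	"fmt"
	"sort"
)

// target maps partition ID -> segment IDs. Hypothetical simplification.
type target map[int64][]int64

// TargetManager is a hypothetical model holding a current and a next target.
type TargetManager struct {
	current target
	next    target
}

// RemovePartitionFromNextTarget drops partitions from the next target only;
// the current target, which is what queries are served from, is untouched.
func (m *TargetManager) RemovePartitionFromNextTarget(partitionIDs ...int64) {
	for _, id := range partitionIDs {
		delete(m.next, id)
	}
}

// Promote copies the next target into current, modelling the step at which
// the change becomes visible to the delegator's query view.
func (m *TargetManager) Promote() {
	promoted := make(target, len(m.next))
	for id, segs := range m.next {
		promoted[id] = append([]int64(nil), segs...)
	}
	m.current = promoted
}

// partitions returns the partition IDs of a target in sorted order.
func partitions(t target) []int64 {
	ids := make([]int64, 0, len(t))
	for id := range t {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	m := &TargetManager{
		current: target{100: {1, 2}, 200: {3}},
		next:    target{100: {1, 2}, 200: {3}},
	}

	m.RemovePartitionFromNextTarget(200)
	fmt.Println(partitions(m.current)) // [100 200]
	fmt.Println(partitions(m.next))    // [100]

	m.Promote()
	fmt.Println(partitions(m.current)) // [100]
}
```

Routing the removal through the next target means the released partition disappears from the serving view through the same promotion path as every other target change, instead of through a direct edit of the current target.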
weiliu1031 added a commit to weiliu1031/milvus that referenced this pull request Jun 3, 2025
Labels
approved, area/internal-api, ci-passed, dco-passed, kind/enhancement, lgtm, size/XXL
3 participants