Skip to content

chore(shard-distirbutor): reuse the state for assignephemeral shards#8020

Open
eleonoradgr wants to merge 1 commit intocadence-workflow:masterfrom
eleonoradgr:reduce-call-to-etcd
Open

chore(shard-distirbutor): reuse the state for assignephemeral shards#8020
eleonoradgr wants to merge 1 commit intocadence-workflow:masterfrom
eleonoradgr:reduce-call-to-etcd

Conversation

@eleonoradgr
Copy link
Copy Markdown
Contributor

@eleonoradgr eleonoradgr commented Apr 28, 2026

What changed?
The fetchExecutorMetadata method (which made one GetExecutor storage call per unique assigned executor after writing new shard assignments) was replaced with a pure in-memory function
buildExecutorOwners. Instead of going back to storage, it now derives the executor metadata directly from the HeartbeatState already loaded by the preceding GetState call.
part of #6862

Why?
The GetState call already fetches the full executor heartbeat state, including the metadata needed for the response. The previous code discarded that data and then re-fetched it from storage
once per executor — wasting N extra round-trips per assignEphemeralBatch invocation. This change eliminates those redundant reads: fewer storage calls means lower latency, less load on etcd.

How did you test it?
go test -race ./service/sharddistributor/handler/...

Potential risks
tested in local development, do not expect any issue

Release notes
N/A

Documentation Changes
N/A

Signed-off-by: edigregorio <edigregorio@uber.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 28, 2026

CI failed: Integration tests are failing with fatal memory-related panics (WASM out of bounds and stack overflow) in the SQLite driver, likely due to resource exhaustion under test concurrency.

Overview

Integration tests for SQLite are consistently failing due to severe memory and stack exhaustion issues within the ncruces/go-sqlite3 (WASM) driver during test execution.

Failures

SQLite Integration Test Panics (confidence: medium)

  • Type: test
  • Affected jobs: 73400494362
  • Related to change: yes
  • Root cause: The tests are encountering wasm error: out of bounds memory access and fatal error: runtime: split stack overflow. These indicate that the test environment is hitting memory or stack limits provided by the WASM runtime for SQLite, likely exacerbated by the changes to shard-distributor state management.
  • Suggested fix: Investigate if the new shard state logic introduces resource leaks or higher memory pressure. Additionally, consider limiting the concurrency of integration tests in the Makefile or CI config to reduce the cumulative load on the WASM-based SQLite driver.

Summary

  • Change-related failures: 1 (Integration tests failing under load following shard-distributor modifications)
  • Infrastructure/flaky failures: 0
  • Recommended action: Review ephemeral_assigner.go for potential memory leaks or inefficient resource usage. If the code is correct, consider reducing test parallelism for the integration suite or checking if existing test mocks need better cleanup to avoid memory bloat.
Code Review ✅ Approved

Refactors the shard distributor to reuse state for assigned ephemeral shards. No issues found.

Rules ⚠️ 1/2 requirements met

Repository Rules

PR Description Quality Standards: The 'How did you test it?' section uses a generic command instead of explicit copyable invocations, and the 'Potential risks' section is vague rather than addressing specific risk categories.
GitHub Issue Linking Requirement: The PR description includes a valid GitHub issue link from the cadence-workflow organization.

1 rule not applicable. Show all rules by commenting gitar display:verbose.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant