gazi-yestemirova commented Oct 23, 2025

What changed?
This PR implements snappy framed compression for all data written to etcd in the shard distributor store.

Implementation includes:

  • Compressing executor status, reported shards, and assigned state
  • A DataCompression enable/disable flag in the config
  • Automatic detection of compressed vs. uncompressed data via a magic header (see the sketch below)
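
A minimal sketch of the write and read paths, assuming block-format snappy plus a custom magic prefix; the helper names, the magic constant, and the exact compression framing are illustrative rather than the PR's actual code:

```go
package etcdcompress

import (
	"bytes"
	"encoding/json"
	"fmt"

	"github.com/golang/snappy"
)

// magicHeader is a hypothetical prefix that marks snappy-compressed values.
var magicHeader = []byte("SNPY1")

// marshalMaybeCompressed JSON-encodes v and, when compression is enabled,
// prefixes the snappy-compressed bytes with the magic header.
func marshalMaybeCompressed(v any, compress bool) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, fmt.Errorf("marshal: %w", err)
	}
	if !compress {
		return raw, nil
	}
	return append(append([]byte{}, magicHeader...), snappy.Encode(nil, raw)...), nil
}

// unmarshalMaybeCompressed checks for the magic header; data without it is
// treated as legacy uncompressed JSON, so existing records remain readable.
func unmarshalMaybeCompressed(data []byte, target any) error {
	if bytes.HasPrefix(data, magicHeader) {
		decompressed, err := snappy.Decode(nil, data[len(magicHeader):])
		if err != nil {
			return fmt.Errorf("decompress: %w", err)
		}
		data = decompressed
	}
	return json.Unmarshal(data, target)
}
```

With this shape the DataCompression flag only affects writes; reads handle both formats, which is what makes enabling or disabling the flag safe at any time.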

Why?
To reduce the storage footprint.

How did you test it?
Unit tests and local testing.

Potential risks

Release notes

Documentation Changes

gazi-yestemirova changed the title from "WIP: [shard-distributor]Compress data when writing to ETCD" to "feat: [shard-distributor]Compress data when writing to ETCD" on Oct 27, 2025
gazi-yestemirova and others added 13 commits October 27, 2025 09:48
Signed-off-by: Gaziza Yestemirova <[email protected]>
…ow#7357)

**What changed?**

This is a mechanical removal of clustersByRegion and a refactor (except where
otherwise remarked) as part of the cleanup of the clusterAttributes schema.

It notably:
- Migrates the regional fields into the cluster attributes schema in a
backwards-compatible way
- Catches a few minor updates that were missed earlier

Behaviour changes:

This PR modifies a few Active/Active unit tests as the refactor removes
the old implementation of ClustersByRegion. This is a prerelease feature
that has been simplified, and no known customer for it exists in the
wild. It's very unlikely that any of these changes affect anyone in any
material fashion.

But for completeness:

- DomainConfig counter: the clustersByRegion concept is implemented
slightly differently from ClusterAttributes (the DomainConfig counter is
updated for Cluster Attributes on domain failover, but not in the
ClustersByRegion implementation), so there is a slight behavioural change
here: the DomainUpdate field is incremented on ClusterAttribute change,
and therefore the update is applied to the domain in every case. I don't
see any downside to this behaviour change and it's clearer, but it's
still worth flagging.
- FailoverVersion not applied to cluster attributes: because these are
unrelated counters, cluster attributes no longer use or reference it and
act independently when domain cluster attributes are updated.

**Testing**: I manually tested it.

---------

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
)

**What changed?**

This fixes a bug where the active-cluster flag is out of sync with the
failover version for AA domains, probably leading to some undefined, or
at least very confusing, behaviour.

This fixes it to be in line with normal active/passive domains.

**Why?**

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…updated (cadence-workflow#7320)

**What changed?**

epic: cadence-workflow#6697

- Refactors the updateReplicationConfig handler a bit to try to
simplify it and make it less of a nightmare to follow
- Fixes some validation problems where it was possible to pass invalid
clusters on update

---------

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…w#7365)

**What changed?**
This does a bit more cleanup for the AA project: it removes some fields
that are no longer in use.

ClusterAttributes and CLI:
- I'm not attempting to render the cluster-attributes in the normal
domain `desc` command, since I'm not sure it's really a worthwhile thing
to render. It's already visible in the JSON output if details are needed
and, honestly, working with the JSON is a lot more tractable.

**Why?**

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

---------

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…rkflow#7371)

**What changed?**
Switches the Cluster attributes, as they're stored in the DB, to use
snappy, since they're expected to potentially be somewhat large.

I expect this to be a backwards-compatible change because the existing
records are persisted with their encoding type, and that's what local
testing showed to be true.
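
A minimal sketch of the backwards-compatible read path this relies on, assuming a blob that carries its encoding type alongside its bytes; the type and constant names are illustrative stand-ins, not Cadence's actual persistence types:

```go
package blobsketch

import (
	"fmt"

	"github.com/golang/snappy"
)

// Encoding records how a stored blob was serialized; the values are illustrative.
type Encoding string

const (
	EncodingThriftRW       Encoding = "thriftrw"
	EncodingThriftRWSnappy Encoding = "thriftrw-snappy"
)

// DataBlob is a hypothetical stand-in for a persisted payload plus its encoding tag.
type DataBlob struct {
	Encoding Encoding
	Data     []byte
}

// rawPayload returns the serialized bytes, decompressing only when the stored
// encoding says the row was written with snappy. Existing rows keep their old
// encoding tag, so reads stay compatible while new writes start compressing.
func rawPayload(blob DataBlob) ([]byte, error) {
	switch blob.Encoding {
	case EncodingThriftRWSnappy:
		decoded, err := snappy.Decode(nil, blob.Data)
		if err != nil {
			return nil, fmt.Errorf("snappy decode: %w", err)
		}
		return decoded, nil
	case EncodingThriftRW:
		return blob.Data, nil
	default:
		return nil, fmt.Errorf("unsupported encoding %q", blob.Encoding)
	}
}
```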

**Why?**

**How did you test it?**
- [X] Tested manually
- [X] Unit tests

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Our direct peer provider caches connections between hosts, maintaining
them until the host is no longer a peer. When that happens we close the
connection, which makes any open requests to that host fail with an
error including a status code of Cancelled.

Since this can happen at any time with any of our peers, we should retry
these requests. These retries are going to re-execute the entire
operation, recalculating the host and connection to use.

The other time we would see this error is during host shutdown, when we
also close the connections. This is one of the last actions taken during
shutdown, so it's unlikely for components to observe it. Even if there
are components that have shutdown handling and currently rely on this
error to stop an asynchronous process, marking this as retryable would
only delay their incorrect shutdown rather than preventing it outright.
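
A minimal sketch of the retryability check described above, assuming a helper along these lines; the function name and the codes other than Cancelled are illustrative, since the change itself only adds Cancelled to the retryable set:

```go
package rpc

import "go.uber.org/yarpc/yarpcerrors"

// isRetryableYARPCError reports whether the failed request should be retried,
// re-resolving the peer and connection on the next attempt.
func isRetryableYARPCError(err error) bool {
	if err == nil {
		return false
	}
	switch yarpcerrors.FromError(err).Code() {
	case yarpcerrors.CodeCancelled, // connection closed mid-flight, e.g. the peer left the ring
		yarpcerrors.CodeUnavailable,
		yarpcerrors.CodeDeadlineExceeded:
		return true
	default:
		return false
	}
}
```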

**What changed?**
- Treat yarpc Cancelled as retryable

**Why?**
- Retry transient failures

**How did you test it?**
- Unit and integration tests.

**Potential risks**
- Low. At most this could impact shutdown performance, delaying the
completion of badly behaved components that have no other shutdown
mechanism.

**Release notes**

**Documentation Changes**

Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
**What changed?**
Adds a README for using auth.

**Why?**
To help users onboard.

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: Gaziza Yestemirova <[email protected]>
…adence-workflow#7374)

**What changed?**
Fix IsActiveIn method for active-active domains

**Why?**
For active-active domains, we also need to check the domain-level active
cluster.

**How did you test it?**
unit tests

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: Gaziza Yestemirova <[email protected]>
…#7375)

**What changed?**

The UpdateDomain API is (optionally) undergoing changes where its
administrative functions are being protected by authorization checks and
its failover / user-facing functionality is being made available in the
new `Failover` endpoint. This, if enforced, may affect users who are not
aware of the change.

**Why?**

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
**What changed?**

Adds the domain_audit_log table.

**Why?**

This is the first step of the persistence implementation for a persisted
audit log of domain changes. Initially it will only be used for changes
to the ReplicationConfig, replacing FailoverHistory in the domain
metadata.

**How did you test it?**

Unit tests, manual POC.

**Potential risks**

Something could be wrong with the schema definition (or the primary key
definition), forcing us to modify the table later. Ideally this is
caught before it gets much further than this.

**Release notes**

N/A

**Documentation Changes**

N/A

**Detailed Description**

The domain_audit_log table has been added to the Cassandra schema. It
will not exist in SQL etc. for now, and is planned to be added early
next year.

**Impact Analysis**

N/A

**Testing Plan**

This should be covered by future integration & persistence tests, but is
not yet. They will be added in a follow-up PR.

**Rollout Plan**
- What is the rollout plan?
- Does the order of deployment matter? No
- Is it safe to rollback? Does the order of rollback matter? Yes, until
applications start using it.
- Is there a kill switch to mitigate the impact immediately? No.
---

Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
}

decompressed, err := snappy.Decode(nil, data)
if err != nil {
Contributor

I think we should return an error here and handle the fallback at the higher level.
If we fail to decompress, we could log a warning and try to unmarshal anyway in the decompressAndUnmarshal function. This way we don't fail, but we also don't hide the failure at the lower level.

Contributor Author

Yeah, I agree, thank you! Updated the PR.

Comment on lines 45 to 50
	if err != nil {
		logger.Warn(fmt.Sprintf("failed to decompress %s, proceeding with unmarshaling..", errorContext), tag.Error(err))
	}
	if err := json.Unmarshal(decompressed, target); err != nil {
		return fmt.Errorf("unmarshal %s: %w", errorContext, err)
	}
Member

This sequence looks unusual: we Unmarshal the decompressed data even when it is ... nil?
In any case, can we rely on some magic sequence instead of a decompression error?

Otherwise we need to remember that uncompressed data causes this warning message [which should be ignored]. I'd prefer not to have this. And I think we agreed we actually want to optionally compress data, and sometimes explore raw state in etcd.

Contributor Author

I am not sure I remember this: "And I think we agreed we actually want to optionally compress data, and sometimes explore raw state in etcd.".
Could you please elaborate on this?

Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>