gazi-yestemirova commented Oct 23, 2025

What changed?
This PR implements snappy framed compression for all data written to etcd in the shard distributor store.

Implementation includes:

  • Compressing executor status, reported shards, and assigned state
  • A DataCompression enable/disable flag in the config
  • Automatic detection of compressed vs. uncompressed data via a magic header (see the sketch below)
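
A minimal sketch of the write and read paths, assuming block-format snappy plus a custom magic prefix; the helper names, the magic constant, and the exact compression framing are illustrative rather than the PR's actual code:

```go
package etcdcompress

import (
	"bytes"
	"encoding/json"
	"fmt"

	"github.com/golang/snappy"
)

// magicHeader is a hypothetical prefix that marks snappy-compressed values.
var magicHeader = []byte("SNPY1")

// marshalMaybeCompressed JSON-encodes v and, when compression is enabled,
// prefixes the snappy-compressed bytes with the magic header.
func marshalMaybeCompressed(v any, compress bool) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, fmt.Errorf("marshal: %w", err)
	}
	if !compress {
		return raw, nil
	}
	return append(append([]byte{}, magicHeader...), snappy.Encode(nil, raw)...), nil
}

// unmarshalMaybeCompressed checks for the magic header; data without it is
// treated as legacy uncompressed JSON, so existing records remain readable.
func unmarshalMaybeCompressed(data []byte, target any) error {
	if bytes.HasPrefix(data, magicHeader) {
		decompressed, err := snappy.Decode(nil, data[len(magicHeader):])
		if err != nil {
			return fmt.Errorf("decompress: %w", err)
		}
		data = decompressed
	}
	return json.Unmarshal(data, target)
}
```

With this shape the DataCompression flag only affects writes; reads handle both formats, which is what makes enabling or disabling the flag safe at any time.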

Why?
To reduce the storage footprint.

How did you test it?
Unit tests and local testing.

Potential risks

Release notes

Documentation Changes

gazi-yestemirova changed the title from "WIP: [shard-distributor]Compress data when writing to ETCD" to "feat: [shard-distributor]Compress data when writing to ETCD" on Oct 27, 2025
gazi-yestemirova and others added 13 commits October 27, 2025 09:48
Signed-off-by: Gaziza Yestemirova <[email protected]>
…ow#7357)

**What changed?**

This is a mechanical removal of clustersByRegion and a refactor (except where
otherwise remarked) as part of the cleanup of the clusterAttributes schema.

It notably:
- Migrates the regional fields into the cluster attributes schema in a
backwards-compatible way
- Catches a few minor updates that were missed earlier

Behaviour changes:

This PR modifies a few Active/Active unit tests as the refactor removes
the old implementation of ClustersByRegion. This is a prerelease feature
that has been simplified, and no known customer for it exists in the
wild. It's very unlikely that any of these changes affect anyone in any
material fashion.

But for completeness:

- DomainConfig counter: the clustersByRegion concept is implemented
slightly differently from ClusterAttributes (the DomainConfig counter is
updated for Cluster Attributes on domain failover, but not in the
ClustersByRegion implementation), so there is a slight behavioural change
here: the DomainUpdate field is incremented on ClusterAttribute change,
and therefore the update is applied to the domain in every case. I don't
see any downside to this behaviour change and it's clearer, but it's
still worth flagging.
- FailoverVersion not applied to cluster attributes: because these are
unrelated counters, cluster attributes no longer use or reference it and
act independently when domain cluster attributes are updated.

**Testing**: I manually tested it.

---------

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
)

**What changed?**

This fixes a bug where the active-cluster flag is out of sync with the
failover version for AA domains, probably leading to some undefined, or
at least very confusing, behaviour.

This fixes it to be in line with normal active/passive domains.

**Why?**

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…updated (cadence-workflow#7320)

**What changed?**

epic: cadence-workflow#6697

- Refactors the updateReplicationConfig handler a bit to try to
simplify it and make it less of a nightmare to follow
- Fixes some validation problems where it was possible to pass invalid
clusters on update

---------

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…w#7365)

**What changed?**
This does a bit more cleanup for the AA project: it removes some fields
that are no longer in use.

ClusterAttributes and CLI:
- I'm not attempting to render the cluster-attributes in the normal
domain `desc` command, since I'm not sure it's really a worthwhile thing
to render. It's already visible in the JSON output if details are needed
and, honestly, working with the JSON is a lot more tractable.

**Why?**

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

---------

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…rkflow#7371)

**What changed?**
Switches the Cluster attributes, as they're stored in the DB, to use
snappy, since they're expected to potentially be somewhat large.

I expect this to be a backwards-compatible change because the existing
records are persisted with their encoding type, and that's what local
testing showed to be true.
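
A minimal sketch of the backwards-compatible read path this relies on, assuming a blob that carries its encoding type alongside its bytes; the type and constant names are illustrative stand-ins, not Cadence's actual persistence types:

```go
package blobsketch

import (
	"fmt"

	"github.com/golang/snappy"
)

// Encoding records how a stored blob was serialized; the values are illustrative.
type Encoding string

const (
	EncodingThriftRW       Encoding = "thriftrw"
	EncodingThriftRWSnappy Encoding = "thriftrw-snappy"
)

// DataBlob is a hypothetical stand-in for a persisted payload plus its encoding tag.
type DataBlob struct {
	Encoding Encoding
	Data     []byte
}

// rawPayload returns the serialized bytes, decompressing only when the stored
// encoding says the row was written with snappy. Existing rows keep their old
// encoding tag, so reads stay compatible while new writes start compressing.
func rawPayload(blob DataBlob) ([]byte, error) {
	switch blob.Encoding {
	case EncodingThriftRWSnappy:
		decoded, err := snappy.Decode(nil, blob.Data)
		if err != nil {
			return nil, fmt.Errorf("snappy decode: %w", err)
		}
		return decoded, nil
	case EncodingThriftRW:
		return blob.Data, nil
	default:
		return nil, fmt.Errorf("unsupported encoding %q", blob.Encoding)
	}
}
```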

**Why?**

**How did you test it?**
- [X] Tested manually
- [X] Unit tests

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Our direct peer provider caches connections between hosts, maintaining
them until the host is no longer a peer. When that happens we close the
connection, which makes any open requests to that host fail with an
error including a status code of Cancelled.

Since this can happen at any time with any of our peers, we should retry
these requests. These retries are going to re-execute the entire
operation, recalculating the host and connection to use.

The other time we would see this error is during host shutdown, when we
also close the connections. This is one of the last actions taken during
shutdown, so it's unlikely for components to observe it. Even if there
are components that have shutdown handling and currently rely on this
error to stop an asynchronous process, marking this as retryable would
only delay their incorrect shutdown rather than preventing it outright.
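
A minimal sketch of the retryability check described above, assuming a helper along these lines; the function name and the codes other than Cancelled are illustrative, since the change itself only adds Cancelled to the retryable set:

```go
package rpc

import "go.uber.org/yarpc/yarpcerrors"

// isRetryableYARPCError reports whether the failed request should be retried,
// re-resolving the peer and connection on the next attempt.
func isRetryableYARPCError(err error) bool {
	if err == nil {
		return false
	}
	switch yarpcerrors.FromError(err).Code() {
	case yarpcerrors.CodeCancelled, // connection closed mid-flight, e.g. the peer left the ring
		yarpcerrors.CodeUnavailable,
		yarpcerrors.CodeDeadlineExceeded:
		return true
	default:
		return false
	}
}
```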

**What changed?**
- Treat yarpc Cancelled as retryable

**Why?**
- Retry transient failures

**How did you test it?**
- Unit and integration tests.

**Potential risks**
- Low. At most this could impact shutdown performance, delaying the
completion of badly behaved components that have no other shutdown
mechanism.

**Release notes**

**Documentation Changes**

Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
**What changed?**
Adds a README for using auth.

**Why?**
To help users onboard.

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: Gaziza Yestemirova <[email protected]>
…adence-workflow#7374)

**What changed?**
Fix IsActiveIn method for active-active domains

**Why?**
For active-active domains, we also need to check the domain-level active
cluster.

**How did you test it?**
unit tests

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: Gaziza Yestemirova <[email protected]>
…#7375)

**What changed?**

The UpdateDomain API is (optionally) undergoing changes where its
administrative functions are being protected by authorization checks and
its failover / user-facing functionality is being made available in the
new `Failover` endpoint. This, if enforced, may affect users who are not
aware of the change.

**Why?**

**How did you test it?**

**Potential risks**

**Release notes**

**Documentation Changes**

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
**What changed?**

Adds the domain_audit_log table.

**Why?**

This is the first step of the persistence implementation for a persisted
audit log of domain changes. Initially it will only be used for changes
to the ReplicationConfig, replacing FailoverHistory in the domain
metadata.

**How did you test it?**

Unit tests, manual POC.

**Potential risks**

Something could be wrong with the schema definition (or the primary key
definition), forcing us to modify the table later. Ideally this is
caught before it gets much further than this.

**Release notes**

N/A

**Documentation Changes**

N/A

**Detailed Description**

The domain_audit_log table has been added to the Cassandra schema. It
will not exist in SQL etc. for now, and is planned to be added early
next year.

**Impact Analysis**

N/A

**Testing Plan**

This should be covered by future integration & persistence tests, but is
not yet. They will be added in a follow-up PR.

**Rollout Plan**
- What is the rollout plan?
- Does the order of deployment matter? No
- Is it safe to rollback? Does the order of rollback matter? Yes, until
applications start using it.
- Is there a kill switch to mitigate the impact immediately? No.
---

Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
}

decompressed, err := snappy.Decode(nil, data)
if err != nil {
Contributor

I think we should return an error here and handle the fallback at the higher level.
If we fail to decompress, we could log a warning and try to unmarshal anyway in the decompressAndUnmarshal function. This way we don't fail, but we also don't hide the failure at the lower level.

Contributor Author

Yeah, I agree, thank you! Updated the PR.

Comment on lines 45 to 50
	if err != nil {
		logger.Warn(fmt.Sprintf("failed to decompress %s, proceeding with unmarshaling..", errorContext), tag.Error(err))
	}
	if err := json.Unmarshal(decompressed, target); err != nil {
		return fmt.Errorf("unmarshal %s: %w", errorContext, err)
	}
Member

This sequence looks unusual: we Unmarshal the decompressed data even when it is ... nil?
In any case, can we rely on some magic sequence instead of a decompression error?

Otherwise we need to remember that uncompressed data causes this warning message [which should be ignored]. I'd prefer not to have this. And I think we agreed we actually want to optionally compress data, and sometimes explore raw state in etcd.

Contributor Author

I am not sure I remember this: "And I think we agreed we actually want to optionally compress data, and sometimes explore raw state in etcd.".
Could you please elaborate on this?

Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>