feat: [shard-distributor]Compress data when writing to ETCD #7366
base: master
Conversation
…ow#7357)

**What changed?** This is a mechanical removal of the clustersByRegion concept and a refactor (except where otherwise remarked) as part of the cleanup of the clusterAttributes schema. It notably:
- Chooses to backwards-compat migrate the regional fields into the cluster attributes schema
- Catches a few minor missed earlier updates

Behaviour changes: This PR modifies a few Active/Active unit tests, as the refactor removes the old implementation of ClustersByRegion. This is a prerelease feature that has been simplified, and no actual known customer for it exists in the wild. It's very unlikely that any of these changes affect anyone in any material fashion. But for completeness:
- DomainConfig counter: Because the clustersByRegion concept is implemented slightly differently from ClusterAttributes (the DomainConfig counter is updated for ClusterAttributes on domain failover, but not in the ClustersByRegion implementation), there is a slight behavioural change here: the DomainUpdate field is incremented on ClusterAttribute change, so the update is applied to the domain in every case. I don't see any downside to this behaviour change and it's clearer, but it's still worth flagging.
- FailoverVersion not applied to cluster attributes: Because these are unrelated counters, cluster attributes no longer use or reference it and act independently when domain cluster attributes are updated.

**Testing** I manually tested it.

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
)

**What changed?** This fixes a bug where the active-cluster flag is out of sync with the failover version for AA domains, probably leading to some undefined, or at least very confusing, behaviour. This brings it in line with normal active/passive domains.

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…updated (cadence-workflow#7320)

**What changed?** Epic: cadence-workflow#6697
- Refactors the updateReplicationConfig handler a bit to try and simplify it and/or make it less of a nightmare to follow
- Fixes some validation problems where it was possible to pass invalid clusters on update

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…w#7365)

**What changed?** This does a bit more cleanup of some disused fields for the AA project. These fields are not in use.

ClusterAttributes and CLI:
- I'm not attempting to render the cluster attributes in the normal domain `desc` command, since I'm not sure it's really a worthwhile thing to render. It's already visible with the JSON output if details are needed and, honestly, working with the JSON is a lot more tractable.

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…rkflow#7371)

**What changed?** Switches the cluster attributes, as they're stored in the DB, to use snappy, since they're expected to potentially be somewhat large. I expect this to be a backwards-compatible change because existing records are persisted with their encoding type, and that's what local testing showed to be true.

**How did you test it?**
- [X] Tested manually
- [X] Unit tests

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
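The backwards-compatibility pattern described in that commit (persist the encoding type alongside the payload so records written before the switch still decode) can be illustrated with a small, self-contained sketch. The blob type, field names, and encoding labels below are hypothetical and are not Cadence's actual persistence serializer:

```go
// Illustrative sketch only: record the encoding type next to the payload so
// the encoding can be switched without breaking previously written records.
package main

import (
	"encoding/json"
	"fmt"

	"github.com/golang/snappy"
)

type blob struct {
	Encoding string // e.g. "json" for old records, "snappy-json" for new writes
	Data     []byte
}

// encode always uses the new encoding; the Encoding field records that fact.
func encode(v any) (blob, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return blob{}, err
	}
	return blob{Encoding: "snappy-json", Data: snappy.Encode(nil, raw)}, nil
}

// decode dispatches on the stored encoding, so records persisted before the
// switch ("json") keep loading unchanged.
func decode(b blob, target any) error {
	raw := b.Data
	if b.Encoding == "snappy-json" {
		var err error
		if raw, err = snappy.Decode(nil, b.Data); err != nil {
			return fmt.Errorf("decompress: %w", err)
		}
	}
	return json.Unmarshal(raw, target)
}

func main() {
	b, _ := encode(map[string]string{"cluster0": "active"})
	var out map[string]string
	fmt.Println(decode(b, &out), out)
}
```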
Our direct peer provider caches connections between hosts, maintaining them until the host is no longer a peer. When that happens we close the connection, which makes any open requests to that host fail with an error whose status code is Cancelled. Since this can happen at any time with any of our peers, we should retry these requests. These retries re-execute the entire operation, recalculating the host and connection to use.

The other time we would see this error is during host shutdown, when we also close the connections. This is one of the last actions taken during shutdown, so it's unlikely for components to observe it. Even if there are components that have shutdown handling and currently rely on this error to stop an asynchronous process, marking it as retryable would only delay their incorrect shutdown rather than preventing it outright.

**What changed?**
- Treat yarpc Cancelled as retryable

**Why?**
- Retry transient failures

**How did you test it?**
- Unit and integration tests

**Potential risks**
- Low. At most this could impact shutdown performance, delaying the completion of badly behaving components that have no other shutdown mechanism.

Signed-off-by: Gaziza Yestemirova <[email protected]>
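As a hedged illustration of the change that commit describes (not the PR's actual code), classifying a yarpc Cancelled status as retryable might look like this; the isRetryable helper is invented for the example:

```go
// Hedged illustration: treat yarpc's Cancelled status code as retryable.
package main

import (
	"errors"
	"fmt"

	"go.uber.org/yarpc/yarpcerrors"
)

func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	// Cancelled is returned when a cached peer connection is closed
	// mid-request, so retrying re-resolves the peer and connection.
	return yarpcerrors.FromError(err).Code() == yarpcerrors.CodeCancelled
}

func main() {
	fmt.Println(isRetryable(yarpcerrors.CancelledErrorf("connection closed"))) // true
	fmt.Println(isRetryable(errors.New("some other failure")))                 // false
}
```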
**What changed?** Adds a README for using auth.

**Why?** To help users onboard.

Signed-off-by: Gaziza Yestemirova <[email protected]>
…adence-workflow#7374)

**What changed?** Fixes the IsActiveIn method for active-active domains.

**Why?** For active-active domains, we also need to check the domain-level active cluster.

**How did you test it?** Unit tests.

Signed-off-by: Gaziza Yestemirova <[email protected]>
…#7375)

**What changed?** The UpdateDomain API is (optionally) undergoing changes where its administrative functions are being protected by authorization checks and its failover / user-facing functionality is being made available in the new endpoint `Failover`. This, if enforced, may affect users who are not aware of the change.

Signed-off-by: David Porter <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
**What changed?** Adds the domain_audit_log table.

**Why?** This is the first step of the persistence implementation for a persisted audit log for domain changes. It will only be used for changes to the ReplicationConfig initially, replacing FailoverHistory in the domain metadata.

**How did you test it?** Unit tests, manual POC.

**Potential risks** Something is wrong with the schema definition (or the primary key definition) and we need to modify the table later. Ideally this is caught before it gets much further than this.

**Release notes** N/A

**Documentation Changes** N/A

**Detailed Description** The domain_audit_log table has been added to the Cassandra schema. It will not exist in SQL etc. for now, and is planned to be added early next year.

**Impact Analysis**
- **Backward Compatibility**: N/A
- **Forward Compatibility**: N/A

**Testing Plan** This should be covered by future integration and persistence tests, but is not yet. They will be added in a follow-up PR.

**Rollout Plan**
- Does the order of deployment matter? No.
- Is it safe to roll back? Does the order of rollback matter? Yes, until applications start using it.
- Is there a kill switch to mitigate the impact immediately? No.

Signed-off-by: Gaziza Yestemirova <[email protected]>
Force-pushed from a8740eb to 372dd4d
}

decompressed, err := snappy.Decode(nil, data)
if err != nil {
I think we should return an error here and handle the fallback at the higher level.
If we fail to decompress, we could log a warning and try to unmarshal anyway in the decompressAndUnmarshal function. This way we don't fail, but we also don't hide the failure at the lower level.
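A minimal, self-contained sketch of the fallback being suggested here, assuming the decompressAndUnmarshal helper named in this thread; Cadence's logger and tag packages are replaced with the standard library so the example runs on its own, and errorContext mirrors the parameter visible in the diff further down:

```go
// Sketch of the suggested approach: warn on decompression failure, then fall
// back to unmarshaling the raw bytes instead of failing outright.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/golang/snappy"
)

func decompressAndUnmarshal(data []byte, errorContext string, target any) error {
	decompressed, err := snappy.Decode(nil, data)
	if err != nil {
		// The value may have been written before compression was introduced:
		// warn, then treat the input as raw JSON.
		log.Printf("failed to decompress %s, proceeding with unmarshaling: %v", errorContext, err)
		decompressed = data
	}
	if err := json.Unmarshal(decompressed, target); err != nil {
		return fmt.Errorf("unmarshal %s: %w", errorContext, err)
	}
	return nil
}

func main() {
	raw, _ := json.Marshal(map[string]int{"shard": 7})
	var out map[string]int

	// Compressed input decodes normally...
	fmt.Println(decompressAndUnmarshal(snappy.Encode(nil, raw), "shard state", &out), out)

	// ...and plain JSON written before the change still loads, with a warning.
	out = nil
	fmt.Println(decompressAndUnmarshal(raw, "shard state", &out), out)
}
```

Note that this sketch explicitly falls back to the raw bytes when decompression fails; whether the raw input or the decompressor's output is passed to Unmarshal is exactly the point picked up in the comments below.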
yeah, I agree! thank you! Updated the PR
if err != nil {
	logger.Warn(fmt.Sprintf("failed to decompress %s, proceeding with unmarshaling..", errorContext), tag.Error(err))
}
if err := json.Unmarshal(decompressed, target); err != nil {
	return fmt.Errorf("unmarshal %s: %w", errorContext, err)
}
This sequence looks unusual: we Unmarshal the decompressed data even when it is ... nil?
In any case, can we rely on some magic byte sequence instead of the decompression error?
Otherwise we need to remember that non-compressed data causes this warning message [which should be ignored]. I'd prefer not to have this. And I think we agreed we actually want to optionally compress data, and sometimes explore raw state in etcd.
I am not sure I remember this: "And I think we agreed we actually want to optionally compress data, and sometimes explore raw state in etcd.".
Could you please elaborate on this?
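For illustration only, the "magic sequence" idea mentioned above could look roughly like the sketch below. The prefix bytes, helper names, and the compress flag are invented for this sketch and are not part of the PR:

```go
// Hypothetical magic-prefix scheme: tag compressed values with a marker so
// readers can distinguish compressed from raw data without relying on a
// decompression error.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"

	"github.com/golang/snappy"
)

// compressedPrefix marks values written compressed. JSON can never start with
// a 0x00 byte, so the check cannot misfire on raw payloads.
var compressedPrefix = []byte{0x00, 'S', 'N', 'P'}

func marshalMaybeCompress(v any, compress bool) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	if !compress {
		// Leaving some values uncompressed keeps the raw state readable
		// directly in etcd when debugging.
		return raw, nil
	}
	return append(append([]byte{}, compressedPrefix...), snappy.Encode(nil, raw)...), nil
}

func unmarshalMaybeDecompress(data []byte, target any) error {
	if bytes.HasPrefix(data, compressedPrefix) {
		decompressed, err := snappy.Decode(nil, data[len(compressedPrefix):])
		if err != nil {
			return fmt.Errorf("decompress: %w", err)
		}
		data = decompressed
	}
	return json.Unmarshal(data, target)
}

func main() {
	for _, compress := range []bool{true, false} {
		b, _ := marshalMaybeCompress(map[string]string{"owner": "host-1"}, compress)
		var out map[string]string
		fmt.Println(unmarshalMaybeDecompress(b, &out), out)
	}
}
```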
What changed?
This PR implements snappy framed compression for all data written to etcd in the shard distributor store.
Implementation includes:
- Compressing values with snappy before they are written to etcd
- On reads, attempting to decompress and falling back to unmarshaling the raw bytes (with a warning), so existing uncompressed data still loads
Why?
To reduce the storage footprint in etcd.
How did you test it?
Unit tests and local testing.
Potential risks
Release notes
Documentation Changes
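For reference, a rough sketch of the write path described above: the helper name, etcd key, and endpoints are hypothetical, and block-format snappy.Encode is used here to match the snappy.Decode seen in the review diff (framed compression would use snappy.NewBufferedWriter / snappy.NewReader instead).

```go
// Rough sketch of compressing data before writing it to etcd.
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/golang/snappy"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// marshalAndCompress is the write-side counterpart of the decompress helper
// discussed in the review: JSON-encode the value, then snappy-compress it.
func marshalAndCompress(v any) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	return snappy.Encode(nil, raw), nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	state := map[string]string{"shard-1": "host-a", "shard-2": "host-b"}
	compressed, err := marshalAndCompress(state)
	if err != nil {
		log.Fatal(err)
	}

	// Every value the store writes goes through compression before the Put.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, "/shard-distributor/assignments", string(compressed)); err != nil {
		log.Fatal(err)
	}
}
```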