@AndreKurait commented Oct 10, 2025

Description

This PR updates the default configuration for the Reindex from Snapshot (RFS) worker to improve migration performance and reliability:

  1. Increased default worker resources: Doubled the default CPU allocation from 2048 (2 vCPUs) to 4096 (4 vCPUs) and memory from 4096 MB to 8192 MB (8 GB) to better handle concurrent operations
  2. Increased max-connections: Raised the default from 10 to 20 connections to improve throughput during document reindexing
  3. Added initial-lease-duration: Introduced the --initial-lease-duration parameter with a 60-minute default (PT60M) to provide more stable lease management during migration operations
  4. Increased EBS throughput: Raised the default RFS EBS throughput to 250 MBps to support faster shard operations

These changes apply to the "default" worker size configuration. The "maximum" worker size remains unchanged (16 vCPUs, 32 GB RAM, 100 connections).
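
For reference, the two worker sizes described above can be summarized as a small lookup table. This is only an illustrative sketch: the type, field, and function names (`WorkerSettings`, `WORKER_SETTINGS`, `settingsFor`) are hypothetical and do not come from the PR; only the numeric values are taken from the description, and the "maximum" values assume 1024 CPU units per vCPU and 1024 MB per GB, as the "default" figures imply.

```typescript
// Hypothetical summary of the worker sizes described in this PR.
// Names are illustrative; only the numbers come from the description.
type WorkerSize = "default" | "maximum";

interface WorkerSettings {
  cpuUnits: number;       // 1024 CPU units = 1 vCPU
  memoryMB: number;       // container memory in MB
  maxConnections: number; // value passed to --max-connections
}

const WORKER_SETTINGS: Record<WorkerSize, WorkerSettings> = {
  // New defaults introduced by this PR (previously 2048 / 4096 / 10).
  default: { cpuUnits: 4096, memoryMB: 8192, maxConnections: 20 },
  // Unchanged: 16 vCPUs, 32 GB RAM, 100 connections.
  maximum: { cpuUnits: 16 * 1024, memoryMB: 32 * 1024, maxConnections: 100 },
};

function settingsFor(size: WorkerSize): WorkerSettings {
  return WORKER_SETTINGS[size];
}
```

Keeping the sizes in one table like this would make it easy to see at a glance that only the "default" row changes in this PR.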

Issues Resolved

Seeking to reduce OOM errors and improve performance for the majority of customers.

Testing

Check List

  • New functionality includes testing
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

- Increase default worker CPU from 2048 to 4096 and memory from 4096 to 8192
- Increase default max-connections from 10 to 20
- Add --initial-lease-duration PT60M parameter to RFS command
- Update all tests to reflect new default values

Signed-off-by: Andre Kurait <[email protected]>
command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--lucene-dir", `"${storagePath}/lucene"`)
command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--target-host", osClusterEndpoint)
command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--max-shard-size-bytes", `${Math.ceil(maxShardSizeBytes)}`)
command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--max-connections", props.reindexFromSnapshotWorkerSize === "maximum" ? "100" : "10")

Why is the current minimum too low? Shouldn't we advise customers with larger shards to set the size to maximum, rather than tweaking the minimum values?

  command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--max-shard-size-bytes", `${Math.ceil(maxShardSizeBytes)}`)
- command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--max-connections", props.reindexFromSnapshotWorkerSize === "maximum" ? "100" : "10")
+ command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--max-connections", props.reindexFromSnapshotWorkerSize === "maximum" ? "100" : "20")
+ command = appendArgIfNotInExtraArgs(command, extraArgsDict, "--initial-lease-duration", "PT60M")

Shouldn't we update the default in the Java code rather than in CloudFormation? Otherwise this won't be the default on the EKS deployments of RFS.
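
To make the behavior under discussion concrete, here is a minimal sketch of how a helper like `appendArgIfNotInExtraArgs` might work. The real implementation is not shown in this thread, so the signature and body below are assumptions inferred only from the call sites in the diff; `rfs-worker` is a placeholder command, not the actual entry point.

```typescript
// Assumed behavior: append "<arg> <value>" to the command unless the user
// already supplied <arg> through extra args, in which case the user wins.
function appendArgIfNotInExtraArgs(
  command: string,
  extraArgsDict: Record<string, string[]>,
  arg: string,
  value: string | null = null,
): string {
  if (Object.prototype.hasOwnProperty.call(extraArgsDict, arg)) {
    return command; // leave it to the user-supplied value
  }
  return value === null ? `${command} ${arg}` : `${command} ${arg} ${value}`;
}

// Hypothetical usage: the user overrides --max-connections but not the lease.
const userExtraArgs: Record<string, string[]> = { "--max-connections": ["50"] };
let exampleCommand = "rfs-worker"; // placeholder, not the real entry point
exampleCommand = appendArgIfNotInExtraArgs(exampleCommand, userExtraArgs, "--max-connections", "20");
exampleCommand = appendArgIfNotInExtraArgs(exampleCommand, userExtraArgs, "--initial-lease-duration", "PT60M");
console.log(exampleCommand);
```

Under these assumptions, a default applied this way exists only in the deployment code that builds the command line, which is the concern raised above: a deployment that does not go through this code path would fall back to whatever default the Java application defines.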
