Skip to content

CBG-5357: removal of fast fail retry by default#8308

Open
gregns1 wants to merge 4 commits into
mainfrom
CBG-5357
Open

CBG-5357: removal of fast fail retry by default#8308
gregns1 wants to merge 4 commits into
mainfrom
CBG-5357

Conversation

@gregns1
Copy link
Copy Markdown
Contributor

@gregns1 gregns1 commented May 28, 2026

CBG-5357

Make best effort retry default for bucket readiness and index lookups
Adds config option to make it fast fail
Parameterise relevant tests

Pre-review checklist

  • Removed debug logging (fmt.Print, log.Print, ...)
  • Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
  • Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Dependencies (if applicable)

  • Link upstream PRs
  • Update Go module dependencies when merged

Integration Tests

Copilot AI review requested due to automatic review settings May 28, 2026 11:00
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

Redocly previews

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts Sync Gateway’s GoCB retry behavior so readiness checks and index lookups no longer use fail-fast retry by default, with an unsupported.use_gocb_fast_fail_retry switch to re-enable fail-fast behavior when desired.

Changes:

  • Introduces unsupported.use_gocb_fast_fail_retry (config + flag + OpenAPI) and threads it through bootstrap/per-db connections.
  • Switches cluster/bucket WaitUntilReady and index lookup retry strategies between fail-fast and best-effort based on the new setting.
  • Updates affected tests to cover both modes and to reflect differing HTTP error classifications.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
rest/server_context.go Propagates startup unsupported.use_gocb_fast_fail_retry into per-db BucketSpec.
rest/main.go Adds bootstrap option plumbing and passes retry-mode into CouchbaseClusterSpec.
rest/database_init_manager.go Passes retry-mode into NewClusterOnlyN1QLStore for index initialization.
rest/config_startup.go Adds UseGOCBFastFailRetry to UnsupportedConfig.
rest/config_flags.go Registers --unsupported.use_gocb_fast_fail_retry startup flag.
rest/adminapitest/admin_api_test.go Updates admin API tests to validate behavior/status differences in both retry modes.
docs/api/components/schemas.yaml Documents the new startup unsupported.use_gocb_fast_fail_retry property in OpenAPI schema.
db/indextest/util.go Updates N1QL store construction to include retry-mode parameter.
db/indextest/indextest_test.go Updates N1QL store construction to include retry-mode parameter.
db/indextest/indextest_dual_metadata_test.go Updates N1QL store construction to include retry-mode parameter.
base/gocb_utils.go Adds goCBRetryStrategy helper returning fail-fast vs best-effort strategy.
base/gocb_utils_test.go Adds unit test coverage for goCBRetryStrategy.
base/collection.go Makes cluster/bucket readiness retry strategy configurable via BucketSpec.UseGOCBFastFailRetry.
base/collection_n1ql.go Threads retry-mode into N1QL index manager creation.
base/collection_n1ql_common.go Uses configurable retry strategy for index enumeration (GetAllIndexes).
base/cluster_n1ql.go Extends NewClusterOnlyN1QLStore to store retry-mode and pass into index manager.
base/bucket.go Adds UseGOCBFastFailRetry to BucketSpec.
base/bucket_gocb_test.go Updates incorrect-login test expectations for both retry modes (fast-fail vs timeout).
base/bootstrap.go Adds retry-mode to CouchbaseClusterSpec/CouchbaseCluster, applies configurable readiness retry, and maps non-auth bucket readiness failures to 502.

Comment thread base/bootstrap.go Outdated
Copy link
Copy Markdown
Collaborator

@torcolvin torcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To discuss with team, but my preference is to not make this configurable since I can't imagine people opting into this - the reason to opt in would be if you expect to make typos or credential problems and this is mostly useful when initially setting up Sync Gateway. In a production environment, you would want retry behavior to make sure that connections are robust to things like:

  1. one bad node in a connection string or in SRV records.
  2. If there is another issue like https://jira.issues.couchbase.com/browse/GOCBC-1812 where operator can surface nodes to connect to that aren't ideal.
  3. One node that is temporarily offline or flickering.

Comment thread base/bootstrap.go
err = b.WaitUntilReady(time.Second*10, &gocb.WaitUntilReadyOptions{
DesiredState: gocb.ClusterStateOnline,
RetryStrategy: &goCBv2FailFastRetryStrategy{},
RetryStrategy: goCBRetryStrategy(cc.useGOCBFastFailRetry),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit but I find the capitalization on this to be somewhat sanity inducing:

I'd prefer: gocbRetryStrategy to goCBRetryStrategy.

I think for this function, I would do something like: func (cc *CouchbaseCluster) getRetryStrategy() to make this code easier to read.

Comment thread base/bucket.go
ViewQueryTimeoutSecs *uint32 // the view query timeout in seconds (default: 75 seconds)
MaxConcurrentQueryOps *int // maximum number of concurrent query operations (default: DefaultMaxConcurrentQueryOps)
BucketOpTimeout *time.Duration // How long bucket ops should block returning "operation timed out". If nil, uses GoCB default. GoCB buckets only.
UseGOCBFastFailRetry bool // When true, gocb readiness checks and index lookups fail fast instead of using the best-effort retry strategy
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about doing something like: InitialConnectionRetryStrategy *gocb.RetryStrategy with something like "If unset, use the default retry strategy for sync gateway.

The other idea if this is confusing is to create an enum for types of strategies:

  1. DefaultRetryStrategy
  2. FastFailOnInitialConnectRetryStrategy

Comment thread base/cluster_n1ql.go
bucketName: cl.bucketName,
scopeName: scopeName,
collectionName: collectionName,
useGOCBFastFailRetry: cl.useGOCBFastFailRetry,
Copy link
Copy Markdown
Collaborator

@torcolvin torcolvin May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually don't need this, because the RetryStrategy is inherited from the cluster and/or Bucket, so we can remove the complexity of this code and only set this for the Cluster/Bucket which makes this simpler.

At this point of making index calls, we have already made a bucket connection and we don't care about surfacing an authenitcation error.

Comment thread base/collection_n1ql.go
bucketName: c.BucketName(),
collectionName: c.CollectionName(),
scopeName: c.ScopeName(),
useGOCBFastFailRetry: c.Bucket.Spec.UseGOCBFastFailRetry,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, we should rely on RetryStrategy from the cluster/bucket.

func (im *indexManager) GetAllIndexes() ([]gocb.QueryIndex, error) {
opts := &gocb.GetAllQueryIndexesOptions{
RetryStrategy: &goCBv2FailFastRetryStrategy{},
RetryStrategy: goCBRetryStrategy(im.useGOCBFastFailRetry),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably wrote this code, but we should rely on the cluster/bucket parameters and we can drop this line entirely, regardless of the state of this PR.

type: boolean
default: false
use_gocb_fast_fail_retry:
description: When true, gocb cluster/bucket readiness checks and index lookups (on both the bootstrap and per-database connections) fail on the first error instead of retrying. When false, they use the best-effort retry strategy, retrying until their timeout when Couchbase Server is unavailable or failing over.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason you want to enable this is if you expect to have authentication errors. I think it's worthwhile to call this out.

When true, errors on initial connection to Couchbase Server will fail instantaneously. Enabling this will surface authentication errors quickly, but can cause some Sync Gateway operations to shut down databases with intermittent Couchbase Server connection errors.

Given this text, it seems low value to even expose this option and I can't imagine people really wanting to enable it. If we do have a flag, I'd be inclined to make this unsupported so we aren't committed to having this option for all eternity.

The reasons that this can be useful:

  1. If using persistent configuration with a single set of credentials, Sync Gateway will fail to start up in 1sec rather than 30sec. If you are using persistent configuration however, we actually would only care about this for the very initial bootstrap connection if you are not using custom bucket/database credentials. While custom credentials are supported, I have never seen them used in the wild.
  2. If you are not using persistent configuration, the credentials are at the database level, and basically you are protecting against typos in the configuration. In this case, the potential for failure would occur at startup only.

It is possible that the RBAC user would be able to make a cluster connection but then not have bucket permissions, so it is possible to fail under each bucket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants