CBG-5357: removal of fast fail retry by default by gregns1 · Pull Request #8308 · couchbase/sync_gateway

gregns1 · 2026-05-28T11:00:32Z

Make best-effort retry the default for bucket readiness checks and index lookups
Add a config option to restore fail-fast behavior
Parameterise the relevant tests

Pre-review checklist

Removed debug logging (fmt.Print, log.Print, ...)
Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Dependencies (if applicable)

Link upstream PRs
Update Go module dependencies when merged

Integration Tests

https://jenkins.sgwdev.com/job/SyncGatewayIntegration/704/

github-actions · 2026-05-28T11:00:44Z

Redocly previews

Copilot

Pull request overview

Adjusts Sync Gateway’s GoCB retry behavior so readiness checks and index lookups no longer use fail-fast retry by default, with an unsupported.use_gocb_fast_fail_retry switch to re-enable fail-fast behavior when desired.

Changes:

Introduces unsupported.use_gocb_fast_fail_retry (config + flag + OpenAPI) and threads it through bootstrap/per-db connections.
Switches cluster/bucket WaitUntilReady and index lookup retry strategies between fail-fast and best-effort based on the new setting.
Updates affected tests to cover both modes and to reflect differing HTTP error classifications.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

File	Description
`rest/server_context.go`	Propagates startup `unsupported.use_gocb_fast_fail_retry` into per-db `BucketSpec`.
`rest/main.go`	Adds bootstrap option plumbing and passes retry-mode into `CouchbaseClusterSpec`.
`rest/database_init_manager.go`	Passes retry-mode into `NewClusterOnlyN1QLStore` for index initialization.
`rest/config_startup.go`	Adds `UseGOCBFastFailRetry` to `UnsupportedConfig`.
`rest/config_flags.go`	Registers `--unsupported.use_gocb_fast_fail_retry` startup flag.
`rest/adminapitest/admin_api_test.go`	Updates admin API tests to validate behavior/status differences in both retry modes.
`docs/api/components/schemas.yaml`	Documents the new startup `unsupported.use_gocb_fast_fail_retry` property in OpenAPI schema.
`db/indextest/util.go`	Updates N1QL store construction to include retry-mode parameter.
`db/indextest/indextest_test.go`	Updates N1QL store construction to include retry-mode parameter.
`db/indextest/indextest_dual_metadata_test.go`	Updates N1QL store construction to include retry-mode parameter.
`base/gocb_utils.go`	Adds `goCBRetryStrategy` helper returning fail-fast vs best-effort strategy.
`base/gocb_utils_test.go`	Adds unit test coverage for `goCBRetryStrategy`.
`base/collection.go`	Makes cluster/bucket readiness retry strategy configurable via `BucketSpec.UseGOCBFastFailRetry`.
`base/collection_n1ql.go`	Threads retry-mode into N1QL index manager creation.
`base/collection_n1ql_common.go`	Uses configurable retry strategy for index enumeration (`GetAllIndexes`).
`base/cluster_n1ql.go`	Extends `NewClusterOnlyN1QLStore` to store retry-mode and pass into index manager.
`base/bucket.go`	Adds `UseGOCBFastFailRetry` to `BucketSpec`.
`base/bucket_gocb_test.go`	Updates incorrect-login test expectations for both retry modes (fast-fail vs timeout).
`base/bootstrap.go`	Adds retry-mode to `CouchbaseClusterSpec`/`CouchbaseCluster`, applies configurable readiness retry, and maps non-auth bucket readiness failures to 502.

torcolvin

To discuss with the team, but my preference is not to make this configurable, since I can't imagine people opting in. The only reason to opt in is if you expect typos or credential problems, which matters mostly when initially setting up Sync Gateway. In a production environment, you want retry behavior so that connections are robust to things like:

One bad node in a connection string or in SRV records.
Another issue like https://jira.issues.couchbase.com/browse/GOCBC-1812, where the operator can surface nodes to connect to that aren't ideal.
One node that is temporarily offline or flickering.

torcolvin · 2026-05-28T14:08:01Z

 	err = b.WaitUntilReady(time.Second*10, &gocb.WaitUntilReadyOptions{
 		DesiredState:  gocb.ClusterStateOnline,
-		RetryStrategy: &goCBv2FailFastRetryStrategy{},
+		RetryStrategy: goCBRetryStrategy(cc.useGOCBFastFailRetry),


This is a nit, but I find the capitalization on this to be somewhat insanity-inducing: I'd prefer gocbRetryStrategy to goCBRetryStrategy.

For this function, I would do something like func (cc *CouchbaseCluster) getRetryStrategy() to make this code easier to read.
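A minimal sketch of the suggested method shape. To keep it runnable without the SDK, `retryStrategy`, `failFastRetryStrategy` and `bestEffortRetryStrategy` are hypothetical stand-ins for the gocb strategy types, and the `CouchbaseCluster` struct carries only the one field from the diff; the real code would return a gocb retry strategy instead.

```go
package main

import "fmt"

// retryStrategy is a hypothetical stand-in for the SDK's retry strategy
// interface, so this sketch compiles without importing gocb.
type retryStrategy interface {
	Name() string
}

// failFastRetryStrategy stands in for a strategy that never retries.
type failFastRetryStrategy struct{}

func (failFastRetryStrategy) Name() string { return "fail-fast" }

// bestEffortRetryStrategy stands in for a strategy that retries until timeout.
type bestEffortRetryStrategy struct{}

func (bestEffortRetryStrategy) Name() string { return "best-effort" }

// CouchbaseCluster mirrors only the field that matters for this sketch.
type CouchbaseCluster struct {
	useGOCBFastFailRetry bool
}

// getRetryStrategy keeps the fail-fast/best-effort decision in one place,
// so call sites read as cc.getRetryStrategy() rather than passing a flag.
func (cc *CouchbaseCluster) getRetryStrategy() retryStrategy {
	if cc.useGOCBFastFailRetry {
		return failFastRetryStrategy{}
	}
	return bestEffortRetryStrategy{}
}

func main() {
	for _, fastFail := range []bool{false, true} {
		cc := &CouchbaseCluster{useGOCBFastFailRetry: fastFail}
		fmt.Printf("useGOCBFastFailRetry=%t -> %s\n", fastFail, cc.getRetryStrategy().Name())
	}
}
```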

torcolvin · 2026-05-28T14:10:44Z

 	ViewQueryTimeoutSecs          *uint32        // the view query timeout in seconds (default: 75 seconds)
 	MaxConcurrentQueryOps         *int           // maximum number of concurrent query operations (default: DefaultMaxConcurrentQueryOps)
 	BucketOpTimeout               *time.Duration // How long bucket ops should block returning "operation timed out". If nil, uses GoCB default.  GoCB buckets only.
+	UseGOCBFastFailRetry          bool           // When true, gocb readiness checks and index lookups fail fast instead of using the best-effort retry strategy


What do you think about doing something like InitialConnectionRetryStrategy *gocb.RetryStrategy, with a comment like "If unset, use the default retry strategy for Sync Gateway"?

The other idea if this is confusing is to create an enum for types of strategies:

DefaultRetryStrategy

FastFailOnInitialConnectRetryStrategy
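The enum idea could look roughly like the sketch below. The two constant names come from the comment above; the type name `ConnectRetryMode`, the `RetryMode` field, and the trimmed-down `BucketSpec` are assumptions made for illustration only.

```go
package main

import "fmt"

// ConnectRetryMode is a hypothetical enum type; only the constant names
// below appear in the review comment.
type ConnectRetryMode int

const (
	// DefaultRetryStrategy is the zero value, so an unset field means
	// "use Sync Gateway's default retry behavior".
	DefaultRetryStrategy ConnectRetryMode = iota
	// FastFailOnInitialConnectRetryStrategy gives up on the first error
	// during the initial connection.
	FastFailOnInitialConnectRetryStrategy
)

// String makes the mode readable in logs and test failures.
func (m ConnectRetryMode) String() string {
	switch m {
	case DefaultRetryStrategy:
		return "DefaultRetryStrategy"
	case FastFailOnInitialConnectRetryStrategy:
		return "FastFailOnInitialConnectRetryStrategy"
	default:
		return fmt.Sprintf("ConnectRetryMode(%d)", int(m))
	}
}

// BucketSpec here is a cut-down illustration, not the real struct.
type BucketSpec struct {
	RetryMode ConnectRetryMode
}

func main() {
	var spec BucketSpec // zero value selects the default strategy
	fmt.Println(spec.RetryMode)

	spec.RetryMode = FastFailOnInitialConnectRetryStrategy
	fmt.Println(spec.RetryMode)
}
```

Compared with a bare bool, the zero value still means "default", and further modes can be added later without changing the field's type.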

torcolvin · 2026-05-28T14:12:14Z

+		bucketName:           cl.bucketName,
+		scopeName:            scopeName,
+		collectionName:       collectionName,
+		useGOCBFastFailRetry: cl.useGOCBFastFailRetry,


We actually don't need this: the RetryStrategy is inherited from the cluster and/or bucket, so we can set it only on the cluster/bucket and drop this extra plumbing.

By the time we make index calls, we have already made a bucket connection, so we don't care about surfacing an authentication error.
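The inheritance being relied on here can be pictured as follows. This is a sketch of the fallback rule only, assuming that a per-call option left unset falls back to the strategy configured on the cluster; `cluster`, `getAllIndexesOptions` and `effectiveRetry` are hypothetical stand-ins, not SDK types.

```go
package main

import "fmt"

// retryStrategy is a hypothetical stand-in for the SDK's strategy interface.
type retryStrategy interface {
	Name() string
}

// namedStrategy is a trivial strategy used only to tell instances apart.
type namedStrategy string

func (n namedStrategy) Name() string { return string(n) }

// cluster holds the strategy chosen once, at connection time.
type cluster struct {
	defaultRetry retryStrategy
}

// getAllIndexesOptions stands in for per-call options with an optional override.
type getAllIndexesOptions struct {
	RetryStrategy retryStrategy
}

// effectiveRetry returns the per-call override when one is set and
// otherwise inherits the cluster-level strategy.
func (c *cluster) effectiveRetry(opts *getAllIndexesOptions) retryStrategy {
	if opts != nil && opts.RetryStrategy != nil {
		return opts.RetryStrategy
	}
	return c.defaultRetry
}

func main() {
	c := &cluster{defaultRetry: namedStrategy("best-effort")}

	// No per-call strategy: the index lookup inherits the cluster setting.
	fmt.Println(c.effectiveRetry(&getAllIndexesOptions{}).Name())

	// An explicit per-call strategy wins, which is the plumbing the review
	// suggests removing.
	override := &getAllIndexesOptions{RetryStrategy: namedStrategy("fail-fast")}
	fmt.Println(c.effectiveRetry(override).Name())
}
```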

torcolvin · 2026-05-28T14:12:40Z

+		bucketName:           c.BucketName(),
+		collectionName:       c.CollectionName(),
+		scopeName:            c.ScopeName(),
+		useGOCBFastFailRetry: c.Bucket.Spec.UseGOCBFastFailRetry,


See above, we should rely on RetryStrategy from the cluster/bucket.

torcolvin · 2026-05-28T14:12:56Z

 func (im *indexManager) GetAllIndexes() ([]gocb.QueryIndex, error) {
 	opts := &gocb.GetAllQueryIndexesOptions{
-		RetryStrategy: &goCBv2FailFastRetryStrategy{},
+		RetryStrategy: goCBRetryStrategy(im.useGOCBFastFailRetry),


I probably wrote this code, but we should rely on the cluster/bucket parameters and we can drop this line entirely, regardless of the state of this PR.

torcolvin · 2026-05-28T14:22:13Z

          type: boolean
          default: false
+        use_gocb_fast_fail_retry:
+          description: When true, gocb cluster/bucket readiness checks and index lookups (on both the bootstrap and per-database connections) fail on the first error instead of retrying. When false, they use the best-effort retry strategy, retrying until their timeout when Couchbase Server is unavailable or failing over.


The only reason you want to enable this is if you expect to have authentication errors. I think it's worthwhile to call this out.

When true, errors on initial connection to Couchbase Server will fail instantaneously. Enabling this will surface authentication errors quickly, but can cause some Sync Gateway operations to shut down databases with intermittent Couchbase Server connection errors.

Given this text, it seems low value to even expose this option and I can't imagine people really wanting to enable it. If we do have a flag, I'd be inclined to make this unsupported so we aren't committed to having this option for all eternity.

The reasons that this can be useful:

If using persistent configuration with a single set of credentials, Sync Gateway will fail to start up in 1 sec rather than 30 sec. With persistent configuration and no custom bucket/database credentials, however, this only matters for the very first bootstrap connection. While custom credentials are supported, I have never seen them used in the wild.

If you are not using persistent configuration, the credentials are at the database level, and basically you are protecting against typos in the configuration. In this case, the potential for failure would occur at startup only.

It is possible that the RBAC user would be able to make a cluster connection but then not have bucket permissions, so it is possible to fail under each bucket.
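The trade-off discussed in this thread (quick failure on bad credentials versus robustness to a flickering node) can be illustrated with a small readiness loop. This is a self-contained simulation, not the gocb implementation: `waitUntilReady`, `flickering`, `runDemo` and the timings are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnavailable = errors.New("node unavailable")

// waitUntilReady calls probe until it succeeds, the timeout would be
// exceeded, or (when failFast is set) the first error is returned.
func waitUntilReady(probe func() error, timeout, interval time.Duration, failFast bool) (int, error) {
	deadline := time.Now().Add(timeout)
	attempts := 0
	for {
		attempts++
		err := probe()
		if err == nil || failFast {
			return attempts, err
		}
		// Stop if another sleep would take us past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return attempts, fmt.Errorf("not ready after %d attempts: %w", attempts, err)
		}
		time.Sleep(interval)
	}
}

// flickering returns a probe that fails its first n calls, then succeeds.
// A negative n makes it fail forever, like a persistent credential error.
func flickering(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if n < 0 || calls <= n {
			return errUnavailable
		}
		return nil
	}
}

// runDemo runs one readiness wait with short, illustrative timings.
func runDemo(failures int, failFast bool) (int, error) {
	return waitUntilReady(flickering(failures), 200*time.Millisecond, 10*time.Millisecond, failFast)
}

func main() {
	attempts, err := runDemo(2, true)
	fmt.Println("fail-fast, flickering node:", attempts, err)

	attempts, err = runDemo(2, false)
	fmt.Println("best-effort, flickering node:", attempts, err)

	attempts, err = runDemo(-1, false)
	fmt.Println("best-effort, persistent error:", attempts, err)
}
```

With fail-fast, the flickering node surfaces as an error on the first attempt. With best-effort, the same node is tolerated, while a persistent error is only reported once the timeout is reached.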

gregns1 added 3 commits May 27, 2026 11:56

CBG-5357: removal of fast fail retry by default

cb9daaf

open api specs

691eca4

fix some tests on new behaviour

f9933cf

Copilot AI review requested due to automatic review settings May 28, 2026 11:00

Copilot started reviewing on behalf of gregns1 May 28, 2026 11:00

Copilot AI reviewed May 28, 2026

Comment thread base/bootstrap.go Outdated

fix return on connectToBucket

90c6e9f

torcolvin reviewed May 28, 2026

gregns1 wants to merge 4 commits into main from CBG-5357
