Conversation
Redocly previews |
There was a problem hiding this comment.
Pull request overview
Adjusts Sync Gateway’s GoCB retry behavior so readiness checks and index lookups no longer use fail-fast retry by default, with an unsupported.use_gocb_fast_fail_retry switch to re-enable fail-fast behavior when desired.
Changes:
- Introduces
unsupported.use_gocb_fast_fail_retry(config + flag + OpenAPI) and threads it through bootstrap/per-db connections. - Switches cluster/bucket
WaitUntilReadyand index lookup retry strategies between fail-fast and best-effort based on the new setting. - Updates affected tests to cover both modes and to reflect differing HTTP error classifications.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
rest/server_context.go |
Propagates startup unsupported.use_gocb_fast_fail_retry into per-db BucketSpec. |
rest/main.go |
Adds bootstrap option plumbing and passes retry-mode into CouchbaseClusterSpec. |
rest/database_init_manager.go |
Passes retry-mode into NewClusterOnlyN1QLStore for index initialization. |
rest/config_startup.go |
Adds UseGOCBFastFailRetry to UnsupportedConfig. |
rest/config_flags.go |
Registers --unsupported.use_gocb_fast_fail_retry startup flag. |
rest/adminapitest/admin_api_test.go |
Updates admin API tests to validate behavior/status differences in both retry modes. |
docs/api/components/schemas.yaml |
Documents the new startup unsupported.use_gocb_fast_fail_retry property in OpenAPI schema. |
db/indextest/util.go |
Updates N1QL store construction to include retry-mode parameter. |
db/indextest/indextest_test.go |
Updates N1QL store construction to include retry-mode parameter. |
db/indextest/indextest_dual_metadata_test.go |
Updates N1QL store construction to include retry-mode parameter. |
base/gocb_utils.go |
Adds goCBRetryStrategy helper returning fail-fast vs best-effort strategy. |
base/gocb_utils_test.go |
Adds unit test coverage for goCBRetryStrategy. |
base/collection.go |
Makes cluster/bucket readiness retry strategy configurable via BucketSpec.UseGOCBFastFailRetry. |
base/collection_n1ql.go |
Threads retry-mode into N1QL index manager creation. |
base/collection_n1ql_common.go |
Uses configurable retry strategy for index enumeration (GetAllIndexes). |
base/cluster_n1ql.go |
Extends NewClusterOnlyN1QLStore to store retry-mode and pass into index manager. |
base/bucket.go |
Adds UseGOCBFastFailRetry to BucketSpec. |
base/bucket_gocb_test.go |
Updates incorrect-login test expectations for both retry modes (fast-fail vs timeout). |
base/bootstrap.go |
Adds retry-mode to CouchbaseClusterSpec/CouchbaseCluster, applies configurable readiness retry, and maps non-auth bucket readiness failures to 502. |
torcolvin
left a comment
There was a problem hiding this comment.
To discuss with team, but my preference is to not make this configurable since I can't imagine people opting into this - the reason to opt in would be if you expect to make typos or credential problems and this is mostly useful when initially setting up Sync Gateway. In a production environment, you would want retry behavior to make sure that connections are robust to things like:
- one bad node in a connection string or in SRV records.
- If there is another issue like https://jira.issues.couchbase.com/browse/GOCBC-1812 where operator can surface nodes to connect to that aren't ideal.
- One node that is temporarily offline or flickering.
| err = b.WaitUntilReady(time.Second*10, &gocb.WaitUntilReadyOptions{ | ||
| DesiredState: gocb.ClusterStateOnline, | ||
| RetryStrategy: &goCBv2FailFastRetryStrategy{}, | ||
| RetryStrategy: goCBRetryStrategy(cc.useGOCBFastFailRetry), |
There was a problem hiding this comment.
This is a nit but I find the capitalization on this to be somewhat sanity inducing:
I'd prefer: gocbRetryStrategy to goCBRetryStrategy.
I think for this function, I would do something like: func (cc *CouchbaseCluster) getRetryStrategy() to make this code easier to read.
| ViewQueryTimeoutSecs *uint32 // the view query timeout in seconds (default: 75 seconds) | ||
| MaxConcurrentQueryOps *int // maximum number of concurrent query operations (default: DefaultMaxConcurrentQueryOps) | ||
| BucketOpTimeout *time.Duration // How long bucket ops should block returning "operation timed out". If nil, uses GoCB default. GoCB buckets only. | ||
| UseGOCBFastFailRetry bool // When true, gocb readiness checks and index lookups fail fast instead of using the best-effort retry strategy |
There was a problem hiding this comment.
What do you think about doing something like: InitialConnectionRetryStrategy *gocb.RetryStrategy with something like "If unset, use the default retry strategy for sync gateway.
The other idea if this is confusing is to create an enum for types of strategies:
- DefaultRetryStrategy
- FastFailOnInitialConnectRetryStrategy
| bucketName: cl.bucketName, | ||
| scopeName: scopeName, | ||
| collectionName: collectionName, | ||
| useGOCBFastFailRetry: cl.useGOCBFastFailRetry, |
There was a problem hiding this comment.
We actually don't need this, because the RetryStrategy is inherited from the cluster and/or Bucket, so we can remove the complexity of this code and only set this for the Cluster/Bucket which makes this simpler.
At this point of making index calls, we have already made a bucket connection and we don't care about surfacing an authenitcation error.
| bucketName: c.BucketName(), | ||
| collectionName: c.CollectionName(), | ||
| scopeName: c.ScopeName(), | ||
| useGOCBFastFailRetry: c.Bucket.Spec.UseGOCBFastFailRetry, |
There was a problem hiding this comment.
See above, we should rely on RetryStrategy from the cluster/bucket.
| func (im *indexManager) GetAllIndexes() ([]gocb.QueryIndex, error) { | ||
| opts := &gocb.GetAllQueryIndexesOptions{ | ||
| RetryStrategy: &goCBv2FailFastRetryStrategy{}, | ||
| RetryStrategy: goCBRetryStrategy(im.useGOCBFastFailRetry), |
There was a problem hiding this comment.
I probably wrote this code, but we should rely on the cluster/bucket parameters and we can drop this line entirely, regardless of the state of this PR.
| type: boolean | ||
| default: false | ||
| use_gocb_fast_fail_retry: | ||
| description: When true, gocb cluster/bucket readiness checks and index lookups (on both the bootstrap and per-database connections) fail on the first error instead of retrying. When false, they use the best-effort retry strategy, retrying until their timeout when Couchbase Server is unavailable or failing over. |
There was a problem hiding this comment.
The only reason you want to enable this is if you expect to have authentication errors. I think it's worthwhile to call this out.
When true, errors on initial connection to Couchbase Server will fail instantaneously. Enabling this will surface authentication errors quickly, but can cause some Sync Gateway operations to shut down databases with intermittent Couchbase Server connection errors.
Given this text, it seems low value to even expose this option and I can't imagine people really wanting to enable it. If we do have a flag, I'd be inclined to make this unsupported so we aren't committed to having this option for all eternity.
The reasons that this can be useful:
- If using persistent configuration with a single set of credentials, Sync Gateway will fail to start up in 1sec rather than 30sec. If you are using persistent configuration however, we actually would only care about this for the very initial bootstrap connection if you are not using custom bucket/database credentials. While custom credentials are supported, I have never seen them used in the wild.
- If you are not using persistent configuration, the credentials are at the database level, and basically you are protecting against typos in the configuration. In this case, the potential for failure would occur at startup only.
It is possible that the RBAC user would be able to make a cluster connection but then not have bucket permissions, so it is possible to fail under each bucket.
CBG-5357
Make best effort retry default for bucket readiness and index lookups
Adds config option to make it fast fail
Parameterise relevant tests
Pre-review checklist
fmt.Print,log.Print, ...)base.UD(docID),base.MD(dbName))docs/apiDependencies (if applicable)
Integration Tests