docs: fatal codes, re-init, and retry policy #1818
Open: toddbaert wants to merge 8 commits into main from docs/provider-spec-updates
Commits
b4cc836 docs: fatal codes, re-init, and retry policy (toddbaert)
8a0b6f1 fixup: json (toddbaert)
f749674 Update docs/reference/specifications/providers.md (toddbaert)
18363a9 fixup: typo (toddbaert)
48a46ea Update docs/reference/specifications/providers.md (toddbaert)
393eaf0 Apply suggestion from @alexandraoberaigner (toddbaert)
0a8fd9c fixup: pr review changes (toddbaert)
7cb3e07 Merge branch 'main' into docs/provider-spec-updates (aepfli)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -64,18 +64,21 @@ stateDiagram-v2 | |
| NOT_READY --> ERROR: initialize | ||
| READY --> ERROR: disconnected, disconnected period == 0 | ||
| READY --> STALE: disconnected, disconnect period < retry grace period | ||
| READY --> NOT_READY: shutdown | ||
| STALE --> ERROR: disconnect period >= retry grace period | ||
| STALE --> NOT_READY: shutdown | ||
| ERROR --> READY: reconnected | ||
| ERROR --> [*]: shutdown | ||
| ERROR --> NOT_READY: shutdown | ||
| ERROR --> [*]: Error code == PROVIDER_FATAL | ||
|
|
||
| note right of STALE | ||
| note left of STALE | ||
| stream disconnected, attempting to reconnect, | ||
| resolve from cache* | ||
| resolve from flag set rules** | ||
| STALE emitted | ||
| end note | ||
|
|
||
| note right of READY | ||
| note left of READY | ||
| stream connected, | ||
| evaluation cache active*, | ||
| flag set rules stored**, | ||
|
|
@@ -84,7 +87,7 @@ stateDiagram-v2 | |
| CHANGE emitted with stream messages | ||
| end note | ||
|
|
||
| note right of ERROR | ||
| note left of ERROR | ||
| stream disconnected, attempting to reconnect, | ||
| evaluation cache purged*, | ||
| ERROR emitted | ||
|
|
@@ -101,25 +104,51 @@ stateDiagram-v2 | |
|
|
||
| ### Stream Reconnection | ||
|
|
||
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off. | ||
| We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream. | ||
| We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this) | ||
| When either stream (sync or event) fails or completes, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream. | ||
| Both the RPC and sync streams will forever attempt to be re-established unless the stream response indicates a [fatal status code](#fatal-status-codes). | ||
| This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors. | ||
| It's also distinct from the [gRPC layer 4 reconnection mechanism](https://grpc.github.io/grpc/core/md_doc_connection-backoff.html) which only reconnects the TCP connection, but not any streams. | ||
| When the stream is reconnecting, providers transition to the [STALE](https://openfeature.dev/docs/reference/concepts/events/#provider_stale) state, and after `retryGracePeriod`, transition to the ERROR state, emitting the respective events during these transitions. | ||
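As a rough illustration of the grace-period rule, the following Python sketch maps the time since disconnection to a provider state. The enum and function names are hypothetical and not taken from any flagd provider; only `retryGracePeriod` and its default of 5 seconds come from the spec.

```python
from enum import Enum


class ProviderState(Enum):
    """States named in the spec's state diagram."""
    NOT_READY = "NOT_READY"
    READY = "READY"
    STALE = "STALE"
    ERROR = "ERROR"


def state_while_disconnected(seconds_disconnected: float,
                             retry_grace_period: float = 5.0) -> ProviderState:
    """Return the state a disconnected provider is expected to be in.

    Hypothetical helper: STALE while the disconnect period is below the
    grace period, ERROR once it reaches or exceeds it.
    """
    if seconds_disconnected >= retry_grace_period:
        return ProviderState.ERROR
    return ProviderState.STALE
```

With a grace period of 0, the sketch returns ERROR immediately, which matches the diagram's direct READY to ERROR transition.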
|
|
||
| | language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier | | ||
| |-------------------|-----------------------------------|--------------------------|--------------------------|--------|------------| | ||
| | GRPC property | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2 | 1.6 | | ||
| | Flagd property | deadlineMs | retryBackoffMaxMs | retryBackoffMs | 0.2 | 1.6 | | ||
| | --- | --- | --- | --- | --- | --- | | ||
| | default [^1] | ✅ | ✅ | ✅ | 0.2 | 1.6 | | ||
| | js | ✅ | ✅ | ❌ | 0.2 | 1.6 | | ||
| | java | ❌ | ❌ | ❌ | 0.2 | 1.6 | | ||
| ## gRPC Retry Policy | ||
|
|
||
| [^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated) | ||
| flagd leverages gRPC built-in retry mechanism for all RPCs. | ||
| In short, the retry policy attempts to retry all RPCs which return `UNAVAILABLE` or `UNKNOWN` status codes 3 times, with a 1s, 2s, 4s, backoff respectively. | ||
| No other status codes are retried. | ||
| The flagd gRPC retry policy is specified below: | ||
|
|
||
| When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects. | ||
toddbaert marked this conversation as resolved.
|
||
| While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode. | ||
| When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`. | ||
| The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`. | ||
| ```json | ||
|
Review comment (Member, Author): This is a standard retryPolicy, accepted in this JSON format by most gRPC implementations.
||
| { | ||
| "methodConfig": [ | ||
| { | ||
| "name": [ | ||
| { | ||
| "service": "flagd.evaluation.v1.Service" | ||
| }, | ||
| { | ||
| "service": "flagd.sync.v1.FlagSyncService" | ||
| } | ||
| ], | ||
| "retryPolicy": { | ||
| "MaxAttempts": 4, | ||
| "InitialBackoff": "1s", | ||
| "MaxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options | ||
| "BackoffMultiplier": 2.0, | ||
| "RetryableStatusCodes": [ | ||
| "UNAVAILABLE", | ||
| "UNKNOWN" | ||
| ] | ||
| } | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
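To make the arithmetic of this policy concrete, here is a small Python sketch that computes the nominal delay before each retry. The function name and parameter names are hypothetical; the values mirror the policy above (4 attempts, 1s initial backoff, multiplier 2.0) and the default `retryBackoffMaxMs` of 120000. Real gRPC clients randomize the actual delays, which this sketch deliberately ignores.

```python
from typing import List


def nominal_backoffs(max_attempts: int = 4,
                     initial_backoff_s: float = 1.0,
                     multiplier: float = 2.0,
                     max_backoff_s: float = 120.0) -> List[float]:
    """Return the un-jittered delay (in seconds) before each retry.

    max_attempts counts the original call, so there are
    max_attempts - 1 retries in total.
    """
    retries = max_attempts - 1
    return [
        min(initial_backoff_s * multiplier ** n, max_backoff_s)
        for n in range(retries)
    ]


print(nominal_backoffs())  # [1.0, 2.0, 4.0]
```

Lowering `max_backoff_s` caps the later delays, for example `nominal_backoffs(max_backoff_s=3.0)` yields `[1.0, 2.0, 3.0]`.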
|
|
||
| ## Fatal Status Codes | ||
|
|
||
| Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state. | ||
| This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient). | ||
| Examples for this include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`. | ||
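The decision a provider makes when a stream ends could be sketched as follows. This is a simplified simulation, not a real gRPC client: status codes are plain strings, and `classify_stream_status` and `consume` are hypothetical names introduced only for illustration.

```python
from typing import Iterable, Set, Tuple


def classify_stream_status(status_code: str, fatal_status_codes: Set[str]) -> str:
    """Return 'FATAL' to give up, or 'RECONNECT' to re-establish the stream."""
    return "FATAL" if status_code in fatal_status_codes else "RECONNECT"


def consume(statuses: Iterable[str], fatal_status_codes: Set[str]) -> Tuple[str, int]:
    """Walk through the status codes of successive stream terminations.

    Stops at the first fatal code; otherwise every termination leads to
    another reconnect attempt. Returns the outcome and the number of
    reconnects performed.
    """
    reconnects = 0
    for status in statuses:
        if classify_stream_status(status, fatal_status_codes) == "FATAL":
            return "FATAL", reconnects
        reconnects += 1
    return "RECONNECTING", reconnects
```

With an empty set of fatal codes (the default), the sketch never gives up, which corresponds to the indefinite reconnection behavior described under Stream Reconnection.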
|
|
||
| ## RPC Resolver | ||
|
|
||
|
|
@@ -262,28 +291,29 @@ precedence. | |
|
|
||
| Below are the supported configuration parameters (note that not all apply to both resolver modes): | ||
|
|
||
| | Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver | | ||
| | --------------------- | ------------------------------ | ---------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- | | ||
| | resolver | FLAGD_RESOLVER | mode of operation | String - `rpc`, `in-process` | rpc | rpc & in-process | | ||
| | host | FLAGD_HOST | remote host | String | localhost | rpc & in-process | | ||
| | port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process | | ||
| | targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process | | ||
| | tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process | | ||
| | socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | String | null | rpc & in-process | | ||
| | certPath | FLAGD_SERVER_CERT_PATH | tls cert path | String | null | rpc & in-process | | ||
| | deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file | | ||
| | streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process | | ||
| | retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process | | ||
| | retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process | | ||
| | retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file | | ||
| | keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process | | ||
| | cache | FLAGD_CACHE | enable cache of static flags | String - `lru`, `disabled` | lru | rpc | | ||
| | maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc | | ||
| | selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process | | ||
| | providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process | | ||
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||
| | Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver | | ||
| | --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- | | ||
| | resolver | FLAGD_RESOLVER | mode of operation | string - `rpc`, `in-process` | rpc | rpc & in-process | | ||
| | host | FLAGD_HOST | remote host | string | localhost | rpc & in-process | | ||
| | port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process | | ||
| | targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process | | ||
| | tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process | | ||
| | socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | string | null | rpc & in-process | | ||
| | certPath | FLAGD_SERVER_CERT_PATH | tls cert path | string | null | rpc & in-process | | ||
| | deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file | | ||
| | streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process | | ||
| | retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process | | ||
| | retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process | | ||
| | retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file | | ||
| | keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process | | ||
| | cache | FLAGD_CACHE | enable cache of static flags | string - `lru`, `disabled` | lru | rpc | | ||
| | maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc | | ||
| | selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process | | ||
| | providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process | | ||
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||
| | fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | | ||
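A minimal sketch of reading the retry-related options from the environment, using the defaults from the table above. The `RetryOptions` class and `options_from_env` function are hypothetical. The table only gives the type of `FLAGD_FATAL_STATUS_CODES` as an array; parsing it as a comma-separated list is an assumption made here for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class RetryOptions:
    """Subset of provider options, with defaults taken from the table."""
    retry_backoff_ms: int = 1000
    retry_backoff_max_ms: int = 120000
    retry_grace_period: int = 5  # seconds
    fatal_status_codes: Tuple[str, ...] = ()


def options_from_env(env: Optional[Mapping[str, str]] = None) -> RetryOptions:
    """Build RetryOptions from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    # Assumption: the list is encoded as comma-separated status code names.
    raw_codes = env.get("FLAGD_FATAL_STATUS_CODES", "")
    codes = tuple(c.strip().upper() for c in raw_codes.split(",") if c.strip())
    return RetryOptions(
        retry_backoff_ms=int(env.get("FLAGD_RETRY_BACKOFF_MS", 1000)),
        retry_backoff_max_ms=int(env.get("FLAGD_RETRY_BACKOFF_MAX_MS", 120000)),
        retry_grace_period=int(env.get("FLAGD_RETRY_GRACE_PERIOD", 5)),
        fatal_status_codes=codes,
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to test.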
|
|
||
| ### Custom Name Resolution | ||
|
|
||
|
|
||
old:
new:
The main difference is that we make it clear that transitions are possible from a non-fatal `ERROR` back to `NOT_READY`... many implementations already support this, but not all. I think it makes sense to specify this so we can be consistent.
I had the impression that `PROVIDER_FATAL` can only happen during initialization, where the error can be surfaced and handled by the caller.
With the current proposal, `PROVIDER_FATAL` can be a result of a failing sync. As a user, it seems that I'll get the default value and an error. Am I supposed to handle this error and exit the program?
@tangenti I need to make some updates to reflect the discussion here.
We decided the best path forward is to provide an option to enumerate the status codes that a user considers FATAL. If those are received, whether on the initial connection or not, the program can exit (or rebuild a new provider). We believed this was the best trade-off between usability and complexity, and it's easy to understand: select what you want to consider FATAL, and take the action you want when those codes are received; by marking a code as FATAL you are telling the provider that this code represents a non-transient error state.
I will make the related updates.
I've included this.