docs: fatal codes, re-init, and retry policy #1818
base: main
Changes from 1 commit
b4cc836
8a0b6f1
f749674
18363a9
48a46ea
393eaf0
0a8fd9c
7cb3e07
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -64,18 +64,21 @@ stateDiagram-v2 | |
| NOT_READY --> ERROR: initialize | ||
| READY --> ERROR: disconnected, disconnected period == 0 | ||
| READY --> STALE: disconnected, disconnect period < retry grace period | ||
| READY --> NOT_READY: shutdown | ||
| STALE --> ERROR: disconnect period >= retry grace period | ||
| STALE --> NOT_READY: shutdown | ||
| ERROR --> READY: reconnected | ||
| ERROR --> [*]: shutdown | ||
| ERROR --> NOT_READY: shutdown | ||
| ERROR --> [*]: Error code == PROVIDER_FATAL | ||
|
|
||
| note right of STALE | ||
| note left of STALE | ||
| stream disconnected, attempting to reconnect, | ||
| resolve from cache* | ||
| resolve from flag set rules** | ||
| STALE emitted | ||
| end note | ||
|
|
||
| note right of READY | ||
| note left of READY | ||
| stream connected, | ||
| evaluation cache active*, | ||
| flag set rules stored**, | ||
|
|
@@ -84,7 +87,7 @@ stateDiagram-v2 | |
| CHANGE emitted with stream messages | ||
| end note | ||
|
|
||
| note right of ERROR | ||
| note left of ERROR | ||
| stream disconnected, attempting to reconnect, | ||
| evaluation cache purged*, | ||
| ERROR emitted | ||
|
|
@@ -101,25 +104,47 @@ stateDiagram-v2 | |
|
|
||
| ### Stream Reconnection | ||
|
|
||
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off. | ||
| We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream. | ||
| We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this) | ||
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately. | ||
| Both the RPC and sync streams will forever attempt to reconnect unless the stream response indicates a [fatal status code](#fatal-status-codes). | ||
| This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors. | ||
|
|
||
| | language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier | | ||
| |-------------------|-----------------------------------|--------------------------|--------------------------|--------|------------| | ||
| | GRPC property | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2 | 1.6 | | ||
| | Flagd property | deadlineMs | retryBackoffMaxMs | retryBackoffMs | 0.2 | 1.6 | | ||
| | --- | --- | --- | --- | --- | --- | | ||
| | default [^1] | ✅ | ✅ | ✅ | 0.2 | 1.6 | | ||
| | js | ✅ | ✅ | ❌ | 0.2 | 1.6 | | ||
| | java | ❌ | ❌ | ❌ | 0.2 | 1.6 | | ||
| ## gRPC Retry Policy | ||
|
|
||
| [^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated) | ||
| flagd leverages gRPC built-in retry mechanism for all RPCs. | ||
| In short, the retry policy attempts to retry all RPCs which return `UNAVAILABLE` or `UNKNOWN` status codes 3 times, with a 1s, 2s, 4s, backoff respectively. | ||
| No other status codes are retried. | ||
| The flagd gRPC retry policy is specified below: | ||
|
|
||
| When a stream disconnects, the provider emits `STALE` as long as the time since disconnection is less than `retryGracePeriod`. | ||
| While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode. | ||
| When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`. | ||
| The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`. | ||
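The grace-period behaviour described above can be sketched as a small pure function. This is only an illustration, not any provider's actual implementation: `ProviderState` and `state_while_disconnected` are hypothetical names, and the comparison follows the state diagram (`>=` moves to `ERROR`), so a grace period of 0 goes straight to `ERROR`.

```python
from enum import Enum


class ProviderState(Enum):
    """Hypothetical subset of provider states relevant while disconnected."""

    STALE = "STALE"
    ERROR = "ERROR"


def state_while_disconnected(seconds_disconnected: float, retry_grace_period: float) -> ProviderState:
    """Map the time since disconnection to the state the provider should report.

    Assumption: both arguments are in seconds, matching the retryGracePeriod option.
    """
    if seconds_disconnected < retry_grace_period:
        # Still inside the grace period: serve cached values or stored rules.
        return ProviderState.STALE
    # Grace period used up (or configured as 0): report ERROR.
    return ProviderState.ERROR


print(state_while_disconnected(2, 5))  # ProviderState.STALE
print(state_while_disconnected(5, 5))  # ProviderState.ERROR
```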
| ```json | ||
|
Review comment from the PR author: This is the standard `retryPolicy`, accepted in this JSON format by most gRPC implementations. |
||
| { | ||
| "methodConfig": [ | ||
| { | ||
| "name": [ | ||
| { "service": "flagd.sync.v1.FlagSyncService" }, | ||
| { "service": "flagd.evaluation.v1.Service" } | ||
| ], | ||
| "retryPolicy": { | ||
| "maxAttempts": 3, | ||
| "initialBackoff": "1s", | ||
| "maxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options | ||
| "backoffMultiplier": 2.0, | ||
| "retryableStatusCodes": [ | ||
| "UNAVAILABLE", | ||
| "UNKNOWN" | ||
| ] | ||
| } | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
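As a rough illustration of the policy above, the following sketch computes the nominal backoff ceilings and renders the service config as JSON. The helper names (`backoff_schedule`, `build_service_config`) are hypothetical, jitter is ignored, and the field names use the lowerCamelCase spelling from the gRPC service config documentation. How many retries actually happen depends on the gRPC implementation; in the gRPC retry design, `maxAttempts` counts the original attempt as well.

```python
import json


def backoff_schedule(initial_s: float, multiplier: float, max_s: float, retries: int) -> list[float]:
    """Nominal (jitter-free) backoff ceiling, in seconds, before each retry."""
    return [min(initial_s * multiplier**i, max_s) for i in range(retries)]


def build_service_config(max_backoff_ms: int) -> str:
    """Render a retry service config as JSON.

    Assumption: max_backoff_ms mirrors the retryBackoffMaxMs provider option and
    is converted to a duration string such as "120s".
    """
    config = {
        "methodConfig": [
            {
                "name": [
                    {"service": "flagd.sync.v1.FlagSyncService"},
                    {"service": "flagd.evaluation.v1.Service"},
                ],
                "retryPolicy": {
                    "maxAttempts": 3,
                    "initialBackoff": "1s",
                    "maxBackoff": f"{max_backoff_ms / 1000:g}s",
                    "backoffMultiplier": 2.0,
                    "retryableStatusCodes": ["UNAVAILABLE", "UNKNOWN"],
                },
            }
        ]
    }
    return json.dumps(config, indent=2)


print(backoff_schedule(1.0, 2.0, 120.0, 3))  # [1.0, 2.0, 4.0]
print(build_service_config(120000))
```

Most gRPC client libraries accept a JSON string like this as a channel-level default service config, but the exact option name differs per language.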
|
|
||
| ## Fatal Status Codes | ||
|
|
||
| Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state. | ||
| This configuration is useful when these codes indicate to a client that its configuration is invalid and must be changed (i.e., the error is non-transient). | ||
| Examples include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`. | ||
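A minimal sketch of how a stream handler might use such an option is shown below. It works on plain status-code names rather than a real gRPC library, and `classify_stream_status` is a hypothetical helper, not part of any provider API: codes listed as fatal stop reconnection, everything else keeps the reconnect loop going.

```python
from typing import Iterable


def classify_stream_status(status_code: str, fatal_status_codes: Iterable[str]) -> str:
    """Return "fatal" when the code is configured as fatal, otherwise "reconnect".

    Assumption: codes are compared by name, case-insensitively.
    """
    fatal = {code.upper() for code in fatal_status_codes}
    return "fatal" if status_code.upper() in fatal else "reconnect"


# Example configuration, using the codes mentioned above.
fatal_codes = ["UNAUTHENTICATED", "PERMISSION_DENIED"]

print(classify_stream_status("UNAVAILABLE", fatal_codes))        # reconnect
print(classify_stream_status("PERMISSION_DENIED", fatal_codes))  # fatal
```

With the default empty list, no code is treated as fatal and the streams keep reconnecting indefinitely.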
|
|
||
| ## RPC Resolver | ||
|
|
||
|
|
@@ -262,28 +287,29 @@ precedence. | |
|
|
||
| Below are the supported configuration parameters (note that not all apply to both resolver modes): | ||
|
|
||
| | Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver | | ||
| | --------------------- | ------------------------------ | ---------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- | | ||
| | resolver | FLAGD_RESOLVER | mode of operation | String - `rpc`, `in-process` | rpc | rpc & in-process | | ||
| | host | FLAGD_HOST | remote host | String | localhost | rpc & in-process | | ||
| | port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process | | ||
| | targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process | | ||
| | tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process | | ||
| | socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | String | null | rpc & in-process | | ||
| | certPath | FLAGD_SERVER_CERT_PATH | tls cert path | String | null | rpc & in-process | | ||
| | deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file | | ||
| | streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process | | ||
| | retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process | | ||
| | retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process | | ||
| | retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file | | ||
| | keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process | | ||
| | cache | FLAGD_CACHE | enable cache of static flags | String - `lru`, `disabled` | lru | rpc | | ||
| | maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc | | ||
| | selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process | | ||
| | providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process | | ||
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||
| | Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver | | ||
| | --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- | | ||
| | resolver | FLAGD_RESOLVER | mode of operation | string - `rpc`, `in-process` | rpc | rpc & in-process | | ||
| | host | FLAGD_HOST | remote host | string | localhost | rpc & in-process | | ||
| | port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process | | ||
| | targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process | | ||
| | tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process | | ||
| | socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | string | null | rpc & in-process | | ||
| | certPath | FLAGD_SERVER_CERT_PATH | tls cert path | string | null | rpc & in-process | | ||
| | deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file | | ||
| | streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process | | ||
| | retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process | | ||
| | retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process | | ||
| | retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file | | ||
| | keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process | | ||
| | cache | FLAGD_CACHE | enable cache of static flags | string - `lru`, `disabled` | lru | rpc | | ||
| | maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc | | ||
| | selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process | | ||
| | providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process | | ||
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||
| | fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | | ||
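To make the table concrete, here is a hedged sketch of resolving a few of these options from environment variables with the documented defaults. `FlagdConfig` and `load_config` are hypothetical names, only a subset of options is covered, and precedence of explicitly passed options over environment variables is not modelled.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FlagdConfig:
    """Hypothetical container for a subset of the options in the table."""

    resolver: str
    host: str
    port: int
    deadline_ms: int
    retry_backoff_ms: int
    retry_backoff_max_ms: int
    retry_grace_period: int


def load_config(env: Mapping[str, str]) -> FlagdConfig:
    """Read options from an environment-like mapping, falling back to the table defaults."""
    resolver = env.get("FLAGD_RESOLVER", "rpc")
    # The default port depends on the resolver mode: 8013 (rpc), 8015 (in-process).
    default_port = 8015 if resolver == "in-process" else 8013
    return FlagdConfig(
        resolver=resolver,
        host=env.get("FLAGD_HOST", "localhost"),
        port=int(env.get("FLAGD_PORT", default_port)),
        deadline_ms=int(env.get("FLAGD_DEADLINE_MS", 500)),
        retry_backoff_ms=int(env.get("FLAGD_RETRY_BACKOFF_MS", 1000)),
        retry_backoff_max_ms=int(env.get("FLAGD_RETRY_BACKOFF_MAX_MS", 120000)),
        retry_grace_period=int(env.get("FLAGD_RETRY_GRACE_PERIOD", 5)),
    )


print(load_config(os.environ))
```

Note that `fatalStatusCodes` has no environment variable in the table, so in this sketch it would have to be passed as an explicit option.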
|
||
|
|
||
| ### Custom Name Resolution | ||
|
|
||
|
|
||
old:
new:
The main difference is that we make it clear that transitions are possible from a non-fatal `ERROR` back to `NOT_READY`. Many implementations already support this, but not all.
I think it makes sense to specify this so we can be consistent.
I had the impression that `PROVIDER_FATAL` can only happen during initialization, where the error can be surfaced and handled by the caller.
With the current proposal, `PROVIDER_FATAL` can be a result of a failing sync. As a user, it seems that I'll get the default value and an error. Am I supposed to handle this error and exit the program?
@tangenti I need to make some updates to reflect the discussion here.
We decided the best path forward is to provide an option to enumerate the status codes that a user considers FATAL. If those are received, whether on the initial connection or not, the program can exit (or build a new provider). We believed this was the best trade-off between usability and complexity, and it's easy to understand: select what you want to consider FATAL, and take the action you want when those codes are received; by marking a code as FATAL you are telling the provider that this code represents a non-transient error state.
I will make the related updates.
I've included this.