docs: fatal codes, re-init, and retry policy

toddbaert · toddbaert · commit b4cc836ba4b1 · 2025-10-30T12:19:01.000-04:00
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
diff --git a/docs/reference/specifications/providers.md b/docs/reference/specifications/providers.md
@@ -64,18 +64,21 @@ stateDiagram-v2
     NOT_READY --> ERROR: initialize
     READY --> ERROR: disconnected, disconnected period == 0
     READY --> STALE: disconnected, disconnect period < retry grace period
+    READY --> NOT_READY: shutdown
     STALE --> ERROR: disconnect period >= retry grace period
+    STALE --> NOT_READY: shutdown
     ERROR --> READY: reconnected
-    ERROR --> [*]: shutdown
+    ERROR --> NOT_READY: shutdown
+    ERROR --> [*]: Error code == PROVIDER_FATAL
 
-    note right of STALE
+    note left of STALE
         stream disconnected, attempting to reconnect,
         resolve from cache*
         resolve from flag set rules**
         STALE emitted
     end note
 
-    note right of READY
+    note left of READY
         stream connected,
         evaluation cache active*,
         flag set rules stored**,
@@ -84,7 +87,7 @@ stateDiagram-v2
         CHANGE emitted with stream messages
     end note
 
-    note right of ERROR
+    note left of ERROR
         stream disconnected, attempting to reconnect,
         evaluation cache purged*,
         ERROR emitted
@@ -101,25 +104,47 @@ stateDiagram-v2
 
 ### Stream Reconnection
 
-When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
-We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
-We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
+When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately.
+Both the RPC and sync streams attempt to reconnect indefinitely unless the stream response indicates a [fatal status code](#fatal-status-codes).
+This is distinct from the [gRPC retry policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors.
 
-| language/property | min connect timeout               | max backoff              | initial backoff          | jitter | multiplier |
-|-------------------|-----------------------------------|--------------------------|--------------------------|--------|------------|
-| GRPC property     | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2    | 1.6        |
-| Flagd property    | deadlineMs                        | retryBackoffMaxMs        | retryBackoffMs           | 0.2    | 1.6        |
-| ---               | ---                               | ---                      | ---                      | ---    | ---        |
-| default [^1]      | ✅                                 | ✅                        | ✅                        | 0.2    | 1.6        |
-| js                | ✅                                 | ✅                        | ❌                        | 0.2    | 1.6        |
-| java              | ❌                                 | ❌                        | ❌                        | 0.2    | 1.6        |
+## gRPC Retry Policy
 
-[^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated)
+flagd leverages gRPC's built-in retry mechanism for all RPCs.
+In short, RPCs which return `UNAVAILABLE` or `UNKNOWN` status codes are attempted up to 3 times (in gRPC, this limit includes the original call), with a backoff that starts at 1s and doubles on each retry.
+No other status codes are retried.
+The flagd gRPC retry policy is specified below:
 
-When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.
-While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
-When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
-The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
+```json
+{
+    "methodConfig": [
+        {
+            "name": [
+                { "service": "flagd.sync.v1.FlagSyncService" },
+                { "service": "flagd.evaluation.v1.Service" }
+            ],
+            "retryPolicy": {
+                "maxAttempts": 3,
+                "initialBackoff": "1s",
+                "maxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options
+                "backoffMultiplier": 2.0,
+                "retryableStatusCodes": [
+                    "UNAVAILABLE",
+                    "UNKNOWN"
+                ]
+            }
+        }
+    ]
+}
+```
+
+## Fatal Status Codes
+
+Providers accept an option (`fatalStatusCodes`) for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the `PROVIDER_FATAL` state.
+This configuration is useful where these codes indicate to a client that its configuration is invalid and must be changed (i.e., the error is non-transient).
+Examples include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`.
 
 ## RPC Resolver
 
@@ -262,28 +287,29 @@ precedence.
 
 Below are the supported configuration parameters (note that not all apply to both resolver modes):
 
-| Option name           | Environment variable name      | Explanation                                                            | Type & Values                | Default                       | Compatible resolver     |
-| --------------------- | ------------------------------ | ---------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
-| resolver              | FLAGD_RESOLVER                 | mode of operation                                                      | String - `rpc`, `in-process` | rpc                           | rpc & in-process        |
-| host                  | FLAGD_HOST                     | remote host                                                            | String                       | localhost                     | rpc & in-process        |
-| port                  | FLAGD_PORT                     | remote port                                                            | int                          | 8013 (rpc), 8015 (in-process) | rpc & in-process        |
-| targetUri             | FLAGD_TARGET_URI               | alternative to host/port, supporting custom name resolution            | string                       | null                          | rpc & in-process        |
-| tls                   | FLAGD_TLS                      | connection encryption                                                  | boolean                      | false                         | rpc & in-process        |
-| socketPath            | FLAGD_SOCKET_PATH              | alternative to host port, unix socket                                  | String                       | null                          | rpc & in-process        |
-| certPath              | FLAGD_SERVER_CERT_PATH         | tls cert path                                                          | String                       | null                          | rpc & in-process        |
-| deadlineMs            | FLAGD_DEADLINE_MS              | deadline for unary calls, and timeout for initialization               | int                          | 500                           | rpc & in-process & file |
-| streamDeadlineMs      | FLAGD_STREAM_DEADLINE_MS       | deadline for streaming calls, useful as an application-layer keepalive | int                          | 600000                        | rpc & in-process        |
-| retryBackoffMs        | FLAGD_RETRY_BACKOFF_MS         | initial backoff for stream retry                                       | int                          | 1000                          | rpc & in-process        |
-| retryBackoffMaxMs     | FLAGD_RETRY_BACKOFF_MAX_MS     | maximum backoff for stream retry                                       | int                          | 120000                        | rpc & in-process        |
-| retryGracePeriod      | FLAGD_RETRY_GRACE_PERIOD       | period in seconds before provider moves from STALE to ERROR state      | int                          | 5                             | rpc & in-process & file |
-| keepAliveTime         | FLAGD_KEEP_ALIVE_TIME_MS       | http 2 keepalive                                                       | long                         | 0                             | rpc & in-process        |
-| cache                 | FLAGD_CACHE                    | enable cache of static flags                                           | String - `lru`, `disabled`   | lru                           | rpc                     |
-| maxCacheSize          | FLAGD_MAX_CACHE_SIZE           | max size of static flag cache                                          | int                          | 1000                          | rpc                     |
-| selector              | FLAGD_SOURCE_SELECTOR          | selects a single sync source to retrieve flags from only that source   | string                       | null                          | in-process              |
-| providerId            | FLAGD_PROVIDER_ID              | A unique identifier for flagd(grpc client) initiating the request.     | string                       | null                          | in-process              |
-| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri    | string                       | null                          | file                    |
-| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS          | poll interval for reading offlineFlagSourcePath                        | int                          | 5000                          | file                    |
-| contextEnricher       | -                              | sync-metadata to evaluation context mapping function                   | function                     | identity function             | in-process              |
+| Option name           | Environment variable name      | Explanation                                                                                                     | Type & Values                | Default                       | Compatible resolver     |
+| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
+| resolver              | FLAGD_RESOLVER                 | mode of operation                                                                                               | string - `rpc`, `in-process` | rpc                           | rpc & in-process        |
+| host                  | FLAGD_HOST                     | remote host                                                                                                     | string                       | localhost                     | rpc & in-process        |
+| port                  | FLAGD_PORT                     | remote port                                                                                                     | int                          | 8013 (rpc), 8015 (in-process) | rpc & in-process        |
+| targetUri             | FLAGD_TARGET_URI               | alternative to host/port, supporting custom name resolution                                                     | string                       | null                          | rpc & in-process        |
+| tls                   | FLAGD_TLS                      | connection encryption                                                                                           | boolean                      | false                         | rpc & in-process        |
+| socketPath            | FLAGD_SOCKET_PATH              | alternative to host/port, unix socket                                                                           | string                       | null                          | rpc & in-process        |
+| certPath              | FLAGD_SERVER_CERT_PATH         | TLS cert path                                                                                                   | string                       | null                          | rpc & in-process        |
+| deadlineMs            | FLAGD_DEADLINE_MS              | deadline for unary calls, and timeout for initialization                                                        | int                          | 500                           | rpc & in-process & file |
+| streamDeadlineMs      | FLAGD_STREAM_DEADLINE_MS       | deadline for streaming calls, useful as an application-layer keepalive                                          | int                          | 600000                        | rpc & in-process        |
+| retryBackoffMs        | FLAGD_RETRY_BACKOFF_MS         | initial backoff for stream retry                                                                                | int                          | 1000                          | rpc & in-process        |
+| retryBackoffMaxMs     | FLAGD_RETRY_BACKOFF_MAX_MS     | maximum backoff for stream retry                                                                                | int                          | 120000                        | rpc & in-process        |
+| retryGracePeriod      | FLAGD_RETRY_GRACE_PERIOD       | period in seconds before provider moves from STALE to ERROR state                                               | int                          | 5                             | rpc & in-process & file |
+| keepAliveTime         | FLAGD_KEEP_ALIVE_TIME_MS       | HTTP/2 keepalive                                                                                                | long                         | 0                             | rpc & in-process        |
+| cache                 | FLAGD_CACHE                    | enable cache of static flags                                                                                    | string - `lru`, `disabled`   | lru                           | rpc                     |
+| maxCacheSize          | FLAGD_MAX_CACHE_SIZE           | max size of static flag cache                                                                                   | int                          | 1000                          | rpc                     |
+| selector              | FLAGD_SOURCE_SELECTOR          | selects a single sync source to retrieve flags from only that source                                            | string                       | null                          | in-process              |
+| providerId            | FLAGD_PROVIDER_ID              | unique identifier for the flagd gRPC client initiating the request                                              | string                       | null                          | in-process              |
+| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri                                             | string                       | null                          | file                    |
+| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS          | poll interval for reading offlineFlagSourcePath                                                                 | int                          | 5000                          | file                    |
+| contextEnricher       | -                              | sync-metadata to evaluation context mapping function                                                            | function                     | identity function             | in-process              |
+| fatalStatusCodes      | -                              | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array                        | []                            | rpc & in-process        |
 
 ### Custom Name Resolution