Skip to content
Open
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
112 changes: 70 additions & 42 deletions docs/reference/specifications/providers.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,18 +64,21 @@ stateDiagram-v2
NOT_READY --> ERROR: initialize
READY --> ERROR: disconnected, disconnected period == 0
READY --> STALE: disconnected, disconnect period < retry grace period
READY --> NOT_READY: shutdown
STALE --> ERROR: disconnect period >= retry grace period
STALE --> NOT_READY: shutdown
ERROR --> READY: reconnected
ERROR --> [*]: shutdown
ERROR --> NOT_READY: shutdown
ERROR --> [*]: Error code == PROVIDER_FATAL

note right of STALE
note left of STALE
Comment on lines +69 to +74
Copy link
Member Author

@toddbaert toddbaert Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old:

Image

new:

Image

The main different is we make it clear transitions are possible from non-fatal ERROR, back to NOT_READY... many implementations already support this, but not all.
I think it makes sense to specify this so we can be consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the impression that PROVIDER_FATAL can only happen during initialization, where the error can be surfaced and handled by the caller.

With the current proposal, PROVIDER_FATAL can be a result of a failing sync. As a user, it seems that I'll get the default value and an error. Am I supposed to handle this error and exit the program?

Copy link
Member Author

@toddbaert toddbaert Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tangenti I need to make some updates to reflect the discussion here.

We decided the best path forward is to provide an option to enumerate the status codes that a user considers FATAL. In the case those are received, whether it's the initial connection or not, the program can exit (or rebuild a new provider). We believed this was the best trade-off between usability and complexity, and it's easy to understand: select what you want to consider FATAL, and take the action you want when those codes are received; by marking a code is FATAL you are telling the provider that this code represents a non-transient error state.

I will make the related updates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've included this.

stream disconnected, attempting to reconnect,
resolve from cache*
resolve from flag set rules**
STALE emitted
end note

note right of READY
note left of READY
stream connected,
evaluation cache active*,
flag set rules stored**,
Expand All @@ -84,7 +87,7 @@ stateDiagram-v2
CHANGE emitted with stream messages
end note

note right of ERROR
note left of ERROR
stream disconnected, attempting to reconnect,
evaluation cache purged*,
ERROR emitted
Expand All @@ -101,25 +104,49 @@ stateDiagram-v2

### Stream Reconnection

When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately.
Both the RPC and sync streams will forever attempt to reconnect unless the stream response indicates a [fatal status code](#fatal-status-codes).
This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors.

| language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier |
|-------------------|-----------------------------------|--------------------------|--------------------------|--------|------------|
| GRPC property | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2 | 1.6 |
| Flagd property | deadlineMs | retryBackoffMaxMs | retryBackoffMs | 0.2 | 1.6 |
| --- | --- | --- | --- | --- | --- |
| default [^1] | ✅ | ✅ | ✅ | 0.2 | 1.6 |
| js | ✅ | ✅ | ❌ | 0.2 | 1.6 |
| java | ❌ | ❌ | ❌ | 0.2 | 1.6 |
## gRPC Retry Policy

[^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated)
flagd leverages gRPC built-in retry mechanism for all RPCs.
In short, the retry policy attempts to retry all RPCs which return `UNAVAILABLE` or `UNKNOWN` status codes 3 times, with a 1s, 2s, 4s, backoff respectively.
No other status codes are retried.
The flagd gRPC retry policy is specified below:

When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.
While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
```json
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is standard retryPolicy, accepted in this JSON format by most gRPC implementations.

{
"methodConfig": [
{
"name": [
{
"service": "flagd.evaluation.v1.Service"
},
{
"service": "flagd.sync.v1.FlagSyncService"
}
],
"retryPolicy": {
"MaxAttempts": 4,
"InitialBackoff": "1s",
"MaxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options
"BackoffMultiplier": 2.0,
"RetryableStatusCodes": [
"UNAVAILABLE",
"UNKNOWN"
]
}
}
]
}
```

## Fatal Status Codes

Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state.
This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient).
Examples for this include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`.

## RPC Resolver

Expand Down Expand Up @@ -262,28 +289,29 @@ precedence.

Below are the supported configuration parameters (note that not all apply to both resolver modes):

| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
| --------------------- | ------------------------------ | ---------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
| resolver | FLAGD_RESOLVER | mode of operation | String - `rpc`, `in-process` | rpc | rpc & in-process |
| host | FLAGD_HOST | remote host | String | localhost | rpc & in-process |
| port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process |
| targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process |
| tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process |
| socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | String | null | rpc & in-process |
| certPath | FLAGD_SERVER_CERT_PATH | tls cert path | String | null | rpc & in-process |
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
| keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process |
| cache | FLAGD_CACHE | enable cache of static flags | String - `lru`, `disabled` | lru | rpc |
| maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc |
| selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process |
| providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process |
| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
| resolver | FLAGD_RESOLVER | mode of operation | string - `rpc`, `in-process` | rpc | rpc & in-process |
| host | FLAGD_HOST | remote host | string | localhost | rpc & in-process |
| port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process |
| targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process |
| tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process |
| socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | string | null | rpc & in-process |
| certPath | FLAGD_SERVER_CERT_PATH | tls cert path | string | null | rpc & in-process |
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
| keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process |
| cache | FLAGD_CACHE | enable cache of static flags | string - `lru`, `disabled` | lru | rpc |
| maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc |
| selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process |
| providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process |
| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only new option - the other changes are just whitespace.


### Custom Name Resolution

Expand Down