Skip to content

fix: retry refresh token updates on serialization failures (SQL & ent)#4830

Open
RonnanSouza wants to merge 1 commit into
dexidp:masterfrom
RonnanSouza:fix/sql-retry-serialization-failure
Open

fix: retry refresh token updates on serialization failures (SQL & ent)#4830
RonnanSouza wants to merge 1 commit into
dexidp:masterfrom
RonnanSouza:fix/sql-retry-serialization-failure

Conversation

@RonnanSouza

@RonnanSouza RonnanSouza commented Jun 11, 2026

Copy link
Copy Markdown

Overview

When several clients call UpdateRefreshToken — which performs a read-modify-write inside a SERIALIZABLE transaction — concurrent rotation of the same token makes the database abort one transaction with a serialization failure, e.g.:

time=2026-06-10T09:34:44.629Z level=ERROR msg="failed to update refresh token" err="update refresh token: pq: could not serialize access due to concurrent update" ​

Neither SQL storage backend retried, so the abort surfaced to clients as a 500.

What this PR does / why we need it

Add a bounded, jittered retry around the refresh-token update in both SQL backends (storage/sql and storage/ent):

  • Up to 8 retries with jittered backoff; each attempt re-runs the transaction in a fresh read-modify-write, so it's safe.
  • Per-driver detection of transient failures via errors.As: Postgres 40001 / 40P01, MySQL 1213 / 1205.
  • The retry is an application-level wrapper in both backends, since neither Go's database/sql nor ent retries transactions natively.
  • Re-enables the refresh-token concurrency conformance tests for Postgres / MySQL (and MySQL 8 for ent).

Special notes for your reviewer

  • I've decided to set reasonable hard coded values for the retries and backoff, but we can expose to be configurable if you folks think it is necessary
  • Also, I've done this fix only for update refresh token as it is where I faced the issue and I think where the API is more vulnerable, but I can port to cover all the endpoints

@RonnanSouza RonnanSouza force-pushed the fix/sql-retry-serialization-failure branch 3 times, most recently from cf70e72 to 1f86e6a Compare June 12, 2026 13:21
UpdateRefreshToken runs a read-modify-write inside a SERIALIZABLE
transaction. Under concurrent refresh-token rotation of the same token
(e.g. multiple kube clients refreshing at once), Postgres aborts the
conflicting transaction with SQLSTATE 40001 and the error surfaced to
clients as HTTP 500.

Add a bounded, jittered retry around the refresh-token update that
re-runs the transaction on transient serialization/deadlock failures,
detected per-driver (Postgres 40001/40P01, MySQL 1213/1205). Error
wraps in the SQL storage now use %w so the driver error can be matched
with errors.As. SQLite serializes writes and opts out (nil check).

Enables the previously-disabled refresh-token concurrency conformance
tests for Postgres and MySQL.

Signed-off-by: Ronan <ronanpalmeiras@gmail.com>

fix(storage/ent): retry refresh token update on serialization failures

Mirror the SQL backend fix in the ent storage. UpdateRefreshToken runs
in a SERIALIZABLE transaction and aborts with a serialization failure
under concurrent rotation of the same token, surfacing as HTTP 500.

Wrap the refresh-token update in a bounded, jittered retry that re-runs
the transaction on transient serialization/deadlock failures, detected
per-driver (Postgres 40001/40P01, MySQL 1213/1205) via errors.As over
the ent-wrapped driver error.

Enables the previously-disabled refresh-token concurrency conformance
tests for Postgres, MySQL and MySQL 8.

Signed-off-by: Ronan <ronanpalmeiras@gmail.com>

refactor(storage): extract shared SQL retry policy into sqlretry

refactor(storage): extract shared SQL retry policy into sqlretry

Signed-off-by: Ronan <ronanpalmeiras@gmail.com>
@RonnanSouza RonnanSouza force-pushed the fix/sql-retry-serialization-failure branch from 1f86e6a to a016e56 Compare June 12, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant