Skip to content

DB performance issues after upgrading to v25.4.0 #4520

@Pedr0Rocha

Description

@Pedr0Rocha

Preflight checklist

Ory Network Project

No response

Describe the bug

Hi,
we upgraded our self-hosted Kratos version from v1.3.0 to v25.4.0 yesterday and we've been having DB performance issues since then.

I tried to pinpoint what's going wrong by looking at the query plans from before/after. The new ones indeed look better in paper, but this is what I actually see in the cpu resources:

This is a period of 2 hours, ~2M calls to this query

Before:

SELECT
  COALESCE(identity_credential_identifiers.identifier, $3) cred_identifier,
  ict.id cred_type_id,
  ict.name cred_type,
  identity_credentials.config cred_config,
  identity_credentials.created_at created_at,
  identity_credentials.id cred_id,
  identity_credentials.identity_id identity_id,
  identity_credentials.nid nid,
  identity_credentials.updated_at updated_at,
  identity_credentials.version cred_version
FROM
  identity_credentials AS identity_credentials
INNER JOIN
  identity_credential_types ict
ON
  (identity_credentials.identity_credential_type_id = ict.id)
LEFT JOIN
  identity_credential_identifiers
ON
  identity_credential_identifiers.identity_credential_id = identity_credentials.id
  AND identity_credential_identifiers.nid = identity_credentials.nid
WHERE
  ((identity_credentials.identity_id = $1
      AND identity_credentials.nid = $2))
Image

After:

SELECT
  COALESCE(identity_credential_identifiers.identifier, $3) cred_identifier,
  identity_credentials.config,
  identity_credentials.created_at,
  identity_credentials.id,
  identity_credentials.identity_credential_type_id,
  identity_credentials.identity_id,
  identity_credentials.nid,
  identity_credentials.updated_at,
  identity_credentials.version
FROM
  identity_credentials AS identity_credentials
LEFT JOIN
  identity_credential_identifiers
ON
  identity_credential_identifiers.identity_credential_id = identity_credentials.id
  AND identity_credential_identifiers.nid = identity_credentials.nid
WHERE
  (identity_credentials.identity_id = $1)
  AND (identity_credentials.nid = $2)
ORDER BY
  identity_credential_identifiers.identifier ASC
Image

Important

Notice how the 50th percentile is almost the same, but the 95th, 99th is way higher. So it could be some kind of edge case or another query/operation affecting it. The database load went up by a lot for this query as a consequence.


Other metrics

These are other metrics that spiked quite a lot after the upgrade (~14h in the graphs) and I don't understand why yet:

The number of new connections is a lot higher, same amount of pods are connecting to it 4~6 instances of Kratos. Is there any change around the connection pool that would explain this?

Image

Amount of fetched/returned rows used to be 7x less, but I cannot find where this is happening. If you could point me into the right direction I could investigate which query could be fetching/returning more.

Image

And of course, the CPU usage

Image

I'm not super familiar with Kratos internals so maybe I'm looking at the wrong query or plans. Please let me know which other information you need to make this clearer.

Reproducing the bug

The only change we did was to upgrade kratos from v1.3.0 to v25.4.0
This is in a Postgres DB

Relevant log output

Relevant configuration

Version

v25.4.0

On which operating system are you observing this issue?

None

In which environment are you deploying?

None

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething is not working.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions