You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi,
we upgraded our self-hosted Kratos version from v1.3.0 to v25.4.0 yesterday and we've been having DB performance issues since then.
I tried to pinpoint what's going wrong by looking at the query plans from before/after. The new ones indeed look better in paper, but this is what I actually see in the cpu resources:
This is a period of 2 hours, ~2M calls to this query
Before:
SELECT
COALESCE(identity_credential_identifiers.identifier, $3) cred_identifier,
ict.id cred_type_id,
ict.name cred_type,
identity_credentials.config cred_config,
identity_credentials.created_at created_at,
identity_credentials.id cred_id,
identity_credentials.identity_id identity_id,
identity_credentials.nid nid,
identity_credentials.updated_at updated_at,
identity_credentials.version cred_version
FROM
identity_credentials AS identity_credentials
INNER JOIN
identity_credential_types ict
ON
(identity_credentials.identity_credential_type_id=ict.id)
LEFT JOIN
identity_credential_identifiers
ONidentity_credential_identifiers.identity_credential_id=identity_credentials.idANDidentity_credential_identifiers.nid=identity_credentials.nidWHERE
((identity_credentials.identity_id= $1ANDidentity_credentials.nid= $2))
After:
SELECT
COALESCE(identity_credential_identifiers.identifier, $3) cred_identifier,
identity_credentials.config,
identity_credentials.created_at,
identity_credentials.id,
identity_credentials.identity_credential_type_id,
identity_credentials.identity_id,
identity_credentials.nid,
identity_credentials.updated_at,
identity_credentials.versionFROM
identity_credentials AS identity_credentials
LEFT JOIN
identity_credential_identifiers
ONidentity_credential_identifiers.identity_credential_id=identity_credentials.idANDidentity_credential_identifiers.nid=identity_credentials.nidWHERE
(identity_credentials.identity_id= $1)
AND (identity_credentials.nid= $2)
ORDER BYidentity_credential_identifiers.identifierASC
Important
Notice how the 50th percentile is almost the same, but the 95th, 99th is way higher. So it could be some kind of edge case or another query/operation affecting it. The database load went up by a lot for this query as a consequence.
Other metrics
These are other metrics that spiked quite a lot after the upgrade (~14h in the graphs) and I don't understand why yet:
The number of new connections is a lot higher, same amount of pods are connecting to it 4~6 instances of Kratos. Is there any change around the connection pool that would explain this?
Amount of fetched/returned rows used to be 7x less, but I cannot find where this is happening. If you could point me into the right direction I could investigate which query could be fetching/returning more.
And of course, the CPU usage
I'm not super familiar with Kratos internals so maybe I'm looking at the wrong query or plans. Please let me know which other information you need to make this clearer.
Reproducing the bug
The only change we did was to upgrade kratos from v1.3.0 to v25.4.0
This is in a Postgres DB
Relevant log output
Relevant configuration
Version
v25.4.0
On which operating system are you observing this issue?
Preflight checklist
Ory Network Project
No response
Describe the bug
Hi,
we upgraded our self-hosted Kratos version from
v1.3.0tov25.4.0yesterday and we've been having DB performance issues since then.I tried to pinpoint what's going wrong by looking at the query plans from before/after. The new ones indeed look better in paper, but this is what I actually see in the cpu resources:
This is a period of 2 hours, ~2M calls to this query
Before:
After:
Important
Notice how the 50th percentile is almost the same, but the 95th, 99th is way higher. So it could be some kind of edge case or another query/operation affecting it. The database load went up by a lot for this query as a consequence.
Other metrics
These are other metrics that spiked quite a lot after the upgrade (~14h in the graphs) and I don't understand why yet:
The number of new connections is a lot higher, same amount of pods are connecting to it 4~6 instances of Kratos. Is there any change around the connection pool that would explain this?
Amount of fetched/returned rows used to be 7x less, but I cannot find where this is happening. If you could point me into the right direction I could investigate which query could be fetching/returning more.
And of course, the CPU usage
I'm not super familiar with Kratos internals so maybe I'm looking at the wrong query or plans. Please let me know which other information you need to make this clearer.
Reproducing the bug
The only change we did was to upgrade kratos from v1.3.0 to v25.4.0
This is in a Postgres DB
Relevant log output
Relevant configuration
Version
v25.4.0
On which operating system are you observing this issue?
None
In which environment are you deploying?
None
Additional Context
No response