Fix: session leak, unbounded limits, missing config passthrough #35
devdgna wants to merge 1 commit into
Pull request overview
This PR addresses excessive DB/CPU/memory usage in the multi-tenant Credo controller by ensuring tenant DB sessions are released, adding bounded defaults for tenant/session/cache/pool settings, passing through walletScheme, and making credential event enrichment resilient to query failures.
Changes:
- Ensure tenant sessions are ended in `ProofEvents` via `try/finally`.
- Add bounded defaults for tenant session limits/timeouts and in-memory LRU cache sizing.
- Add sane wallet pool defaults and pass `walletScheme` through the CLI config; guard credential enrichment queries.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/events/ProofEvents.ts` | Ensures tenant session cleanup after proof format data enrichment. |
| `src/events/CredentialEvents.ts` | Wraps credential event enrichment in error handling to avoid handler crashes. |
| `src/cliAgent.ts` | Sets bounded defaults for LRU cache size and tenant session limits/timeouts. |
| `src/cli.ts` | Adds wallet pool defaults and passes `walletScheme` into the `runRestAgent()` config. |

      const data = await tenantAgent.proofs.getFormatData(record.id)
      body.proofData = data
    } finally {
      await tenantAgent.endSession()

endSession() is awaited inside finally, but if endSession() rejects it will override any error from getFormatData() and will also prevent the webhook/websocket emission that follows. Consider wrapping tenantAgent.endSession() in its own try/catch (log and continue) so session cleanup failures don’t break event delivery or mask the original error.
Suggested change:

    - await tenantAgent.endSession()
    + try {
    +   await tenantAgent.endSession()
    + } catch (error) {
    +   agent.config.logger.error('Failed to end tenant session', { error })
    + }

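The masking behaviour described in this comment can be reproduced in isolation. The sketch below is a minimal stand-in, not the controller's code: `TenantAgentLike`, `enrichUnguarded`, `enrichGuarded`, and `failingAgent` are hypothetical names introduced only for illustration, and the log is a plain string array rather than the agent logger.

```typescript
// Stand-in for a tenant agent; only the two calls discussed above are modelled.
interface TenantAgentLike {
  getFormatData(): Promise<string>
  endSession(): Promise<void>
}

// Unguarded: a rejection from endSession() replaces whatever the try block threw.
async function enrichUnguarded(tenantAgent: TenantAgentLike): Promise<string> {
  try {
    return await tenantAgent.getFormatData()
  } finally {
    await tenantAgent.endSession()
  }
}

// Guarded: cleanup failures are recorded and swallowed, so the original outcome survives.
async function enrichGuarded(tenantAgent: TenantAgentLike, log: string[]): Promise<string> {
  try {
    return await tenantAgent.getFormatData()
  } finally {
    try {
      await tenantAgent.endSession()
    } catch (error) {
      log.push(`Failed to end tenant session: ${String(error)}`)
    }
  }
}

// Hypothetical agent whose query and cleanup both fail.
const failingAgent: TenantAgentLike = {
  getFormatData: async () => {
    throw new Error('query failed')
  },
  endSession: async () => {
    throw new Error('cleanup failed')
  },
}
```

With `failingAgent`, the unguarded version rejects with the cleanup error and the query error is lost, while the guarded version rejects with the query error and records the cleanup failure separately.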

      const data = await agent.credentials.getFormatData(record.id)
      body.credentialData = data
    } catch (error) {
      agent.config.logger.warn('Failed to enrich credential event data', { error })

The single try/catch wraps both connections.findById() and credentials.getFormatData(). If the connection lookup throws (e.g., record not found / storage error), the code skips fetching credentialData, even though it may still be available. To match the “send event with the data available” behavior, handle these enrichments independently (separate try/catch blocks) so one failing query doesn’t prevent the other enrichment.
Suggested change:

    -   const data = await agent.credentials.getFormatData(record.id)
    -   body.credentialData = data
    - } catch (error) {
    -   agent.config.logger.warn('Failed to enrich credential event data', { error })
    + } catch (error) {
    +   agent.config.logger.warn('Failed to enrich credential event data with connection information', { error })
    + }
    + try {
    +   const data = await agent.credentials.getFormatData(record.id)
    +   body.credentialData = data
    + } catch (error) {
    +   agent.config.logger.warn('Failed to enrich credential event data with credential format data', { error })

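One way to keep the two enrichments independent without repeating the `try/catch` is a small wrapper. This is only a sketch under assumptions: `tryEnrich`, `buildBody`, the `Warn` type, and the `CredentialEventBody` shape (including the `connection` field name) are hypothetical and do not come from the PR.

```typescript
// Hypothetical logger signature, loosely modelled on logger.warn(message, context).
type Warn = (message: string, context: { error: unknown }) => void

// Runs one enrichment step; on failure it logs a warning and returns undefined
// so the caller can carry on with whatever other data is available.
async function tryEnrich<T>(label: string, step: () => Promise<T>, warn: Warn): Promise<T | undefined> {
  try {
    return await step()
  } catch (error) {
    warn(`Failed to enrich credential event data with ${label}`, { error })
    return undefined
  }
}

// Hypothetical event body; the real field names may differ.
interface CredentialEventBody {
  connection?: string | undefined
  credentialData?: string | undefined
}

// Each step is attempted on its own, so one failing query does not skip the other.
async function buildBody(
  findConnection: () => Promise<string>,
  getFormatData: () => Promise<string>,
  warn: Warn
): Promise<CredentialEventBody> {
  const body: CredentialEventBody = {}
  body.connection = await tryEnrich('connection information', findConnection, warn)
  body.credentialData = await tryEnrich('credential format data', getFormatData, warn)
  return body
}
```

If the connection lookup rejects, `credentialData` is still populated and exactly one warning is emitted.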

      cache: new InMemoryLruCache({ limit: Number(process.env.INMEMORY_LRU_CACHE_LIMIT) || 500 }),
    }),

Using Number(process.env.INMEMORY_LRU_CACHE_LIMIT) || 500 means an explicit value of 0 (or any falsy number) will be ignored and replaced with the default, and non-numeric values will be silently treated as “unset”. If 0 is intended to disable caching (or if you want clearer misconfig handling), prefer explicit parsing/validation (e.g., nullish coalescing plus Number.isFinite checks) rather than ||.

      sessionAcquireTimeout: Number(process.env.SESSION_ACQUIRE_TIMEOUT) || 30000,
      sessionLimit: Number(process.env.SESSION_LIMIT) || 100,
    }),

Number(process.env.SESSION_ACQUIRE_TIMEOUT) || 30000 and Number(process.env.SESSION_LIMIT) || 100 treat 0 as “unset” and silently fall back to defaults. If you want to support 0 (or fail fast on invalid values), switch to explicit parsing/validation (e.g., const v = Number(env); if (!Number.isFinite(v) || v < 1) ...). This also avoids silently accepting negative numbers.
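The explicit parsing suggested here could look roughly like the helper below. It is a sketch, not code from the PR: `readPositiveInt` is a hypothetical name, and the choice to throw on invalid input (rather than log and fall back) is an assumption about the desired fail-fast behaviour.

```typescript
// Hypothetical helper: returns the default only when the variable is unset or blank,
// and rejects anything that is not a positive integer (0, negatives, NaN, fractions).
function readPositiveInt(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number
): number {
  const raw = env[name]
  if (raw === undefined || raw.trim() === '') return fallback
  const value = Number(raw)
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`)
  }
  return value
}
```

It would be called as, for example, `readPositiveInt(process.env, 'SESSION_LIMIT', 100)`, so a misconfigured value surfaces at startup instead of being silently replaced by the default.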

    connectTimeout: parsed['wallet-connect-timeout'] || Number(process.env.CONNECT_TIMEOUT) || 10000,
    maxConnections: parsed['wallet-max-connections'] || Number(process.env.MAX_CONNECTIONS) || 25,
    idleTimeout: parsed['wallet-idle-timeout'] || Number(process.env.IDLE_TIMEOUT) || 30000,

These numeric config fallbacks use ||, so an explicit CLI/env value of 0 will be ignored and replaced with the next fallback. If 0 is a meaningful value (e.g., disable timeout), use nullish coalescing (??) and validate with Number.isFinite/range checks instead of relying on truthiness.
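A sketch of the `??`-based fallback chain, assuming that `0` should be accepted for these settings (whether Askar treats `0` as "disabled" is not confirmed here). `resolveNumber` is a hypothetical helper, not part of the CLI.

```typescript
// Hypothetical resolver: the CLI value wins, then the environment variable, then the default.
// Unlike `||`, the `??` operator only skips null and undefined, so an explicit 0 is kept.
function resolveNumber(
  cliValue: number | undefined,
  envValue: string | undefined,
  fallback: number
): number {
  // Treat a missing or blank variable as unset; Number('') would otherwise yield 0.
  const fromEnv = envValue === undefined || envValue.trim() === '' ? undefined : Number(envValue)
  const value = cliValue ?? fromEnv ?? fallback
  if (!Number.isFinite(value) || value < 0) {
    throw new Error(`Expected a non-negative finite number, got ${value}`)
  }
  return value
}
```

For example, `resolveNumber(parsed['wallet-idle-timeout'], process.env.IDLE_TIMEOUT, 30000)` would keep an explicit `0` from either source and reject a non-numeric `IDLE_TIMEOUT`.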
Issue
The RDS instance backing Credo has to run on some of the largest instance sizes available in cloud environments because of excessive CPU and RAM consumption. DB insights shows that the top SQL queries are all Askar `SELECT ... ARRAY_AGG` queries on the shared `items`/`items_tags` tables, with massive DB load utilization.

Root causes
Tenant session leak in `ProofEvents`
File: `ProofEvents.ts`
Every proof state change event for a multi-tenant agent calls `getTenantAgent()`, which acquires a database session (PostgreSQL connection + wallet profile lock), but never calls `endSession()`. These sessions accumulate indefinitely over the lifetime of the process. With continuous proof exchanges across tenants, this causes hundreds of orphaned database connections consuming DB memory.

Unbounded tenant session limits
File: `cliAgent.ts`
`sessionLimit` and `sessionAcquireTimeout` both defaulted to `Infinity`. This means every concurrent tenant API request opens a simultaneous DB session with no backpressure; if 500 requests arrive at once, 500 sessions are opened against PostgreSQL.

No DB connection pool defaults
File: `cli.ts`
`maxConnections`, `connectTimeout`, and `idleTimeout` resolved to `NaN` when neither CLI args nor environment variables were set (since `Number(undefined)` returns `NaN`). Askar was left to open connections without any pool ceiling.

Unbounded in-memory LRU cache
File: `cliAgent.ts`
The `InMemoryLruCache` defaulted to `limit: Infinity`, meaning cached entries (resolved DIDs, schemas, credential definitions) were never evicted. This caused unbounded memory growth in the Node.js process and GC pressure.

`walletScheme` config never passed through
File: `cli.ts`
The `wallet-scheme` CLI option was parsed but never included in the object passed to `runRestAgent()`. The `walletScheme` parameter always arrived as `undefined`, so the fallback `AskarMultiWalletDatabaseScheme.ProfilePerWallet` was always used regardless of what was configured.

Unguarded DB queries in `CredentialEvents`
File: `CredentialEvents.ts`
`connections.findById()` and `credentials.getFormatData()` were called without error handling on every credential state change event. A failure in either query could crash the event handler and leave connections in an indeterminate state.

Changes
`ProofEvents.ts`
- Wrap `tenantAgent.proofs.getFormatData()` in `try/finally`
- Call `tenantAgent.endSession()` in the `finally` block to ensure the DB session is always released

`cliAgent.ts`
- Change `sessionLimit` from `Infinity` to `100` (configurable via `SESSION_LIMIT` env var)
- Change `sessionAcquireTimeout` from `Infinity` to `30000ms` (configurable via `SESSION_ACQUIRE_TIMEOUT` env var)
- Change `InMemoryLruCache` limit from `Infinity` to `500` (configurable via `INMEMORY_LRU_CACHE_LIMIT` env var)

`cli.ts`
- Set defaults `maxConnections: 25`, `connectTimeout: 10000ms`, `idleTimeout: 30000ms`
- Add `walletScheme: parsed['wallet-scheme']` to the config object passed to `runRestAgent()`

`CredentialEvents.ts`
- Wrap `connections.findById()` and `credentials.getFormatData()` in `try/catch`

RECOMMENDED values:
- `SESSION_ACQUIRE_TIMEOUT=30000` (30s) - fail fast instead of hanging!
- `SESSION_LIMIT=500`
- `INMEMORY_LRU_CACHE_LIMIT=2000` - Covers all tenants + schemas + DIDs with room
- `windowMs=1000`
- `maxRateLimit=800`
- `CONNECT_TIMEOUT=10000`
- `MAX_CONNECTIONS=100`
- `IDLE_TIMEOUT=30000`
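To illustrate what a finite session limit combined with an acquire timeout buys ("fail fast instead of hanging"), here is a minimal counting gate. It is purely illustrative and is not how Credo or Askar implement session limits; `SessionGate` and its method names are hypothetical.

```typescript
// Illustrative gate: at most `limit` holders at once; extra callers wait up to
// `acquireTimeoutMs` and are then rejected instead of piling up indefinitely.
class SessionGate {
  private active = 0
  private readonly waiters: Array<() => void> = []
  private readonly limit: number
  private readonly acquireTimeoutMs: number

  constructor(limit: number, acquireTimeoutMs: number) {
    this.limit = limit
    this.acquireTimeoutMs = acquireTimeoutMs
  }

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++
      return
    }
    await new Promise<void>((resolve, reject) => {
      const onFree = () => {
        clearTimeout(timer)
        resolve()
      }
      const timer = setTimeout(() => {
        // Give up: leave the queue so a later release() does not hand us a slot.
        const index = this.waiters.indexOf(onFree)
        if (index >= 0) this.waiters.splice(index, 1)
        reject(new Error('Timed out waiting for a session'))
      }, this.acquireTimeoutMs)
      this.waiters.push(onFree)
    })
  }

  release(): void {
    const next = this.waiters.shift()
    if (next) {
      next() // hand the slot straight to the next waiter; the active count is unchanged
    } else {
      this.active--
    }
  }
}
```

With `Infinity` for both settings, the equivalent of `acquire()` never waits and never rejects, which is the no-backpressure situation described under the root causes; a caller that holds a slot should still pair `acquire()` with `release()` in a `finally` block.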