Skip to content

Fix: session leak, unbounded limits, missing config passthrough#35

Open
devdgna wants to merge 1 commit into
mainfrom
fix/optimisation
Open

Fix: session leak, unbounded limits, missing config passthrough#35
devdgna wants to merge 1 commit into
mainfrom
fix/optimisation

Conversation

@devdgna
Copy link
Copy Markdown

@devdgna devdgna commented Mar 10, 2026

Issue
The credo's RDS instance requires a some of the largest instances in cloud environments to run due to excessive CPU and RAM consumption. DB insights shows the top SQL queries are all Askar SELECT ... ARRAY_AGG queries on the shared items/items_tags tables, with massive DB load utilization.

Root causes

  1. Tenant session leak in ProofEvents
    File: ProofEvents.ts
    Every proof state change event for a multi-tenant agent calls getTenantAgent() which acquires a database session (PostgreSQL connection + wallet profile lock), but never calls endSession(). These sessions accumulate indefinitely over the lifetime of the process. With continuous proof exchanges across tenants, this causes hundreds of orphaned database connections consuming DB memory.

  2. Unbounded tenant session limits
    File: cliAgent.ts
    sessionLimit and sessionAcquireTimeout both defaulted to Infinity. This means every concurrent tenant API request opens a simultaneous DB session with no backpressure; if 500 requests arrive at once, 500 sessions are opened against PostgreSQL.

  3. No DB connection pool defaults
    File: cli.ts
    maxConnections, connectTimeout, and idleTimeout resolved to NaN when neither CLI args nor environment variables were set (since Number(undefined) returns NaN). Askar was left to open connections without any pool ceiling.

  4. Unbounded in-memory LRU cache
    File: cliAgent.ts
    The InMemoryLruCache defaulted to limit: Infinity, meaning cached entries (resolved DIDs, schemas, credential definitions) were never evicted. This caused unbounded memory growth in the Node.js process and GC pressure.

  5. walletScheme config never passed through
    File: cli.ts
    The wallet-scheme CLI option was parsed but never included in the object passed to runRestAgent(). The walletScheme parameter always arrived as undefined, making the fallback AskarMultiWalletDatabaseScheme.ProfilePerWallet always activate regardless of what was configured.

  6. Unguarded DB queries in CredentialEvents
    File: CredentialEvents.ts
    connections.findById() and credentials.getFormatData() were called without error handling on every credential state change event. A failure in either query could crash the event handler and leave connections in an indeterminate state.

Changes

  1. Fix session leak: ProofEvents.ts
  • Wrapped tenantAgent.proofs.getFormatData() in try/finally
  • Added await tenantAgent.endSession() in the finally block to ensure the DB session is always released
  1. Set bounded defaults: cliAgent.ts
  • Changed sessionLimit from Infinity to 100 (configurable via SESSION_LIMIT env var)
  • Changed sessionAcquireTimeout from Infinity to 30000ms (configurable via SESSION_ACQUIRE_TIMEOUT env var)
  • Changed InMemoryLruCache limit from Infinity to 500 (configurable via INMEMORY_LRU_CACHE_LIMIT env var)
  1. Set connection pool defaults and fix config passthrough: cli.ts
  • Added fallback defaults: maxConnections: 25, connectTimeout: 10000 ms, idleTimeout: 30000 ms
  • Added missing walletScheme: parsed['wallet-scheme'] to the config object passed to runRestAgent()
  1. Guard DB queries: CredentialEvents.ts
  • Wrapped connections.findById() and credentials.getFormatData() in try/catch
  • On failure, logs a warning and still sends the webhook/websocket event with the data available (graceful degradation)

RECOMMENDED values:

SESSION_ACQUIRE_TIMEOUT=30000 (30s) - fail fast instead of hanging!
SESSION_LIMIT=500
INMEMORY_LRU_CACHE_LIMIT=2000 - Covers all tenants + schemas + DIDs with room

windowMs=1000
maxRateLimit=800

CONNECT_TIMEOUT=10000
MAX_CONNECTIONS=100
IDLE_TIMEOUT=30000

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses excessive DB/CPU/memory usage in the multi-tenant Credo controller by ensuring tenant DB sessions are released, adding bounded defaults for tenant/session/cache/pool settings, passing through walletScheme, and making credential event enrichment resilient to query failures.

Changes:

  • Ensure tenant sessions are ended in ProofEvents via try/finally.
  • Add bounded defaults for tenant session limits/timeouts and in-memory LRU cache sizing.
  • Add sane wallet pool defaults and pass walletScheme through the CLI config; guard credential enrichment queries.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/events/ProofEvents.ts Ensures tenant session cleanup after proof format data enrichment.
src/events/CredentialEvents.ts Wrapes credential event enrichment in error handling to avoid handler crashes.
src/cliAgent.ts Sets bounded defaults for LRU cache size and tenant session limits/timeouts.
src/cli.ts Adds wallet pool defaults and passes walletScheme into runRestAgent() config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/events/ProofEvents.ts
const data = await tenantAgent.proofs.getFormatData(record.id)
body.proofData = data
} finally {
await tenantAgent.endSession()
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endSession() is awaited inside finally, but if endSession() rejects it will override any error from getFormatData() and will also prevent the webhook/websocket emission that follows. Consider wrapping tenantAgent.endSession() in its own try/catch (log and continue) so session cleanup failures don’t break event delivery or mask the original error.

Suggested change
await tenantAgent.endSession()
try {
await tenantAgent.endSession()
} catch (error) {
agent.config.logger.error('Failed to end tenant session', { error })
}

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +30

const data = await agent.credentials.getFormatData(record.id)
body.credentialData = data
} catch (error) {
agent.config.logger.warn('Failed to enrich credential event data', { error })
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single try/catch wraps both connections.findById() and credentials.getFormatData(). If the connection lookup throws (e.g., record not found / storage error), the code skips fetching credentialData, even though it may still be available. To match the “send event with the data available” behavior, handle these enrichments independently (separate try/catch blocks) so one failing query doesn’t prevent the other enrichment.

Suggested change
const data = await agent.credentials.getFormatData(record.id)
body.credentialData = data
} catch (error) {
agent.config.logger.warn('Failed to enrich credential event data', { error })
} catch (error) {
agent.config.logger.warn('Failed to enrich credential event data with connection information', { error })
}
try {
const data = await agent.credentials.getFormatData(record.id)
body.credentialData = data
} catch (error) {
agent.config.logger.warn('Failed to enrich credential event data with credential format data', { error })

Copilot uses AI. Check for mistakes.
Comment thread src/cliAgent.ts
Comment on lines +194 to 195
cache: new InMemoryLruCache({ limit: Number(process.env.INMEMORY_LRU_CACHE_LIMIT) || 500 }),
}),
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Number(process.env.INMEMORY_LRU_CACHE_LIMIT) || 500 means an explicit value of 0 (or any falsy number) will be ignored and replaced with the default, and non-numeric values will be silently treated as “unset”. If 0 is intended to disable caching (or if you want clearer misconfig handling), prefer explicit parsing/validation (e.g., nullish coalescing plus Number.isFinite checks) rather than ||.

Copilot uses AI. Check for mistakes.
Comment thread src/cliAgent.ts
Comment on lines +238 to 240
sessionAcquireTimeout: Number(process.env.SESSION_ACQUIRE_TIMEOUT) || 30000,
sessionLimit: Number(process.env.SESSION_LIMIT) || 100,
}),
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number(process.env.SESSION_ACQUIRE_TIMEOUT) || 30000 and Number(process.env.SESSION_LIMIT) || 100 treat 0 as “unset” and silently fall back to defaults. If you want to support 0 (or fail fast on invalid values), switch to explicit parsing/validation (e.g., const v = Number(env); if (!Number.isFinite(v) || v < 1) ...). This also avoids silently accepting negative numbers.

Copilot uses AI. Check for mistakes.
Comment thread src/cli.ts
Comment on lines +183 to +185
connectTimeout: parsed['wallet-connect-timeout'] || Number(process.env.CONNECT_TIMEOUT) || 10000,
maxConnections: parsed['wallet-max-connections'] || Number(process.env.MAX_CONNECTIONS) || 25,
idleTimeout: parsed['wallet-idle-timeout'] || Number(process.env.IDLE_TIMEOUT) || 30000,
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These numeric config fallbacks use ||, so an explicit CLI/env value of 0 will be ignored and replaced with the next fallback. If 0 is a meaningful value (e.g., disable timeout), use nullish coalescing (??) and validate with Number.isFinite/range checks instead of relying on truthiness.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants