feat(auth-session-mgmt): align with blog — connector OCC retry, async-job wait, housekeeping, conn lifetime by charycha · Pull Request #1287 · aws-samples/aurora-dsql-samples

charycha · 2026-06-11T19:47:52Z

Follow-up to #1216 (already merged) to keep the sample in lockstep with the
latest blog post content. The blog teaches several techniques that aren't
yet present in the merged code; this PR adds them so readers who clone
the repo see the same patterns the blog references.

What this adds

maxLifetimeSeconds: 3300 on the connection pool so connections
retire at 55 minutes, ahead of Aurora DSQL's 1-hour hard cap.
Connector-built-in OCC retry via AuroraDSQLPool.transaction().
Repositories now route writes through it in production; tests fall back
to a manual BEGIN/COMMIT + retryWithBackoff when the mock pool doesn't
expose transaction().
sys.wait_for_job wait in the migrator. CREATE INDEX ASYNC now
blocks until the index reaches VALID before the migration returns, so
the app doesn't start serving traffic against a sequential scan. Opt-out
flag for callers that want non-blocking migrations.
src/scripts/housekeeping.ts: batched, idempotent DELETE loop for
expired and long-revoked sessions, respecting Aurora DSQL's 3000-row
per-transaction cap. Wired up as npm run housekeeping. Configurable
retention via SESSION_RETENTION_DAYS (default 30 days).

Validation

196/196 tests passing (added 2 new for the async-job wait + skip flag).
End-to-end run against a live DSQL cluster before pushing — register /
login / profile / list / revoke flow worked, the IDOR fix from feat: add Amazon Aurora DSQL auth and session management sample #1216
still rejects cross-user revocation, and sessions_token_hash_key is
auto-created from the UNIQUE constraint as expected.

Context

@amaksimo — your review on #1216 surfaced these as comments #4, #5, #7,
and #9 of your June 11 pass. They were addressed in the blog post
already; this PR brings the code in line. Happy to split into smaller
commits or rework if you'd prefer a different shape.

…-job wait, housekeeping, conn lifetime Follow-up to PR aws-samples#1216 to keep the sample in lockstep with the latest blog post content. connection.ts - Add maxLifetimeSeconds: 3300 to the pool config so connections retire at 55 minutes, ahead of DSQL's 1-hour hard cap. repositories (sessionRepository.ts, userRepository.ts) - Extend PoolLike with optional transaction(callback) matching AuroraDSQLPool.transaction from @aws/aurora-dsql-node-postgres-connector. - New runInTransaction helper: when the pool exposes transaction() (production with the connector) it routes through the connector's OCC retry helper, which catches both OC000 (data) and OC001 (schema) conflicts under SQLSTATE 40001. Falls back to manual BEGIN/COMMIT wrapped in retryWithBackoff for unit-test mocks. - All write paths go through runInTransaction. No behavioural change for callers; eliminates duplicated rollback/retry plumbing in every method. migrate.ts - DDL_STATEMENTS is now { sql, isAsync } so the runner can distinguish CREATE INDEX ASYNC from synchronous DDL. - After a CREATE INDEX ASYNC statement, capture the returned job_id and SELECT sys.wait_for_job($1) on a fresh client outside the scheduling transaction. Migrations now return only after async indexes reach VALID, so the application doesn't start serving traffic against a sequential scan. - Opt-out via runMigrations(pool, { waitForAsyncJobs: false }) for callers that want non-blocking migrations. - Widened ClientLike.query return type so the runner can read the job_id row without an unsafe cast. src/scripts/housekeeping.ts (new) - Periodic purge job for expired and long-revoked sessions. Batched DELETE loop with LIMIT 3000 to respect DSQL's per-transaction cap; idempotent so partial completion is safely resumable. Configurable retention via SESSION_RETENTION_DAYS env (default 30 days). - New 'npm run housekeeping' script for cron / EventBridge-triggered Lambda / ECS scheduled task invocation. tests - migrate.test.ts: mock client now returns a job_id row for CREATE INDEX ASYNC. Two new tests assert sys.wait_for_job runs after the async index and that waitForAsyncJobs: false skips the wait. - connection.test.ts: pool config assertion updated to include maxLifetimeSeconds: 3300. - 196 tests passing.

amaksimo

Looks good overall, the architecture is right and the SQL is clean. A few asks before merging:

CI hasn't run on this branch (gh pr checks reports nothing). Can we get it green before merge? "196/196 passing" in the description is fine but it's self-reported.
The blog's V4 production-deployment section walks readers through a custom app_runtime role and dsql:DbConnect. None of that lands here — connection.ts still uses admin, no setup script, no README pointer. Either add a one-shot setup-runtime-role.ts, or update the README so readers cloning the repo aren't left guessing.
Smaller items inline.

amaksimo · 2026-06-11T23:02:52Z

+    // Aurora DSQL closes any single connection at 1 hour. Recycle each
+    // connection at 55 minutes so a request never lands on a connection
+    // that the cluster is about to close.
+    maxLifetimeSeconds: 3300,


FYI the connector's parsePgConfig already sets maxLifetimeSeconds: 3300 as a default (grep the published 0.1.9 build). Setting it explicitly here is harmless but redundant. I'd either drop it, or add a comment saying "matches connector default, kept for visibility" so the next person doesn't think it's load-bearing.

amaksimo · 2026-06-11T23:02:52Z

+  pool: PoolLike,
+  callback: (client: ClientLike) => Promise<T>,
+): Promise<T> {
+  if (pool.transaction) {


Nice abstraction for keeping tests cheap. One thing — in production getPool() always returns an AuroraDSQLPool so pool.transaction is always defined, which means the fallback path is only ever exercised by tests. If a future refactor accidentally returns a plain pg.Pool (or wraps it in something that drops transaction), the app silently downgrades to the manual loop and we lose the connector's retry without any signal in logs.

I'd add an assertion in getPool() that typeof pool.transaction === 'function'. Tests don't go through getPool so this wouldn't break them, and it makes the prod guarantee explicit.

amaksimo · 2026-06-11T23:02:52Z

+
+    // The 4th client ran the wait_for_job for the async index.
+    expect(clients[3].queries).toHaveLength(1);
+    expect(clients[3].queries[0]).toBe('SELECT sys.wait_for_job($1)');


The mock client's query is vi.fn(async (sql: string) => ...), only sql gets recorded. So a regression where the migrator passes undefined (or the wrong variable) for $1 would still pass this test. Cheap fix:

expect(clients[3].query).toHaveBeenCalledWith( 'SELECT sys.wait_for_job($1)', ['fake-job-id'], );

amaksimo · 2026-06-11T23:02:52Z

+      // Each batch is its own transaction. We don't need OCC retry here
+      // because the housekeeping job is the only writer touching old rows
+      // in practice; if it ever conflicts with concurrent revocation, the
+      // job is idempotent and can simply be re-run.


The reasoning is fine at the SQL level (the DELETE is idempotent), but "can simply be re-run" assumes the orchestrator retries on non-zero exit. cron doesn't. ECS scheduled tasks don't. EventBridge Scheduler retries are configurable but off by default for many setups.

I'd just wrap purgeOldSessions in retryWithBackoff (six lines, no behavior change in the happy path), or call out the orchestrator-retry expectation in the file header so a deployer doesn't get bitten on the first OCC conflict that escapes.

amaksimo · 2026-06-11T23:02:52Z

+    try {
+      return JSON.parse(value) as ClientMetadata;
+    } catch (error) {
+      console.warn(


If this fires in prod the operator knows some row failed to parse but not which one. Add the session ID (caller has it) or at minimum the value's length so the warning is actionable.

Also JSON.parse(value) as ClientMetadata is a cast not a check, a stored [] or a stringified field where an object is expected will pass through. Fine for a sample, just worth a one-liner comment that the runtime shape isn't validated.

Six items from the latest review pass. Top-level: setup-runtime-role.ts (custom DB role for production) - New src/scripts/setup-runtime-role.ts that runs the CREATE ROLE + AWS IAM GRANT + GRANT SQL the blog walks through, packaged so a reader cloning the repo doesn't have to copy-paste from the post. - Validates APP_ROLE_NAME against the PostgreSQL identifier grammar before interpolating it (env-supplied identifiers can't be parameterised, but they CAN be sanitised). - New 'npm run setup-runtime-role' script. - README 'Production hardening checklist' now starts with the runtime role item, with the full one-liner invocation. #1 maxLifetimeSeconds (connection.ts) - Kept the explicit setting but added a comment noting it matches the connector default in v0.1.9, so the next reader doesn't think it's load-bearing. #2 typeof transaction assertion (connection.ts) - New runtime check in getPool() that throws if pool.transaction is not a function. Production always has it (AuroraDSQLPool exposes it); tests don't go through getPool. This makes the production guarantee explicit and prevents a silent downgrade to the manual fallback path if a future refactor returns a plain pg.Pool. #3 wait_for_job test now asserts on params (migrate.test.ts) - Mock client now records params alongside sql. - Test uses toHaveBeenCalledWith('SELECT sys.wait_for_job($1)', ['fake-job-id']) so a regression that passes undefined for $1 would actually fail. #4 housekeeping.ts retry - Each batch now runs inside retryWithBackoff so transient OCC conflicts during cleanup don't fail the whole run. - Header comment updated to call out that orchestrators (cron, ECS scheduled tasks, EventBridge default) don't retry on non-zero exit, so the in-process retry is load-bearing. #5 parseClientMetadata logging (sessionRepository.ts) - Warning now includes the session id and the failed value's length so an operator can locate and inspect the offending row. - Doc comment now explicitly states the runtime shape isn't validated; the cast is structural. Tests: 196/196 passing.

charycha · 2026-06-12T15:45:24Z

Thanks for the second pass - pushed commit 83d567c addressing all six items.

Top-level

CI didn't run. Per CONTRIBUTING.md, integration tests don't run on
fork PRs because the workflows can't access repo secrets. Would
appreciate a maintainer triggering the workflows on this branch (or
pushing it into a feature branch in the upstream repo) so the gate
reports green before merge. 196/196 unit tests pass locally and the
full register/login/profile/list/revoke flow was validated end-to-end
against a live cluster before push.
app_runtime role / dsql:DbConnect. Added
src/scripts/setup-runtime-role.ts packaged as npm run setup-runtime-role,
validates APP_ROLE_NAME as a PostgreSQL identifier before
interpolation. README "Production hardening checklist" now leads with
the runtime-role step and the full invocation.

Inline

maxLifetimeSeconds redundant — kept explicit with a comment
noting it matches the connector default in v0.1.9, so future
readers don't think it's load-bearing.
typeof transaction assertion — added in getPool(). Tests
don't go through getPool so they keep working; production now
fails fast if a refactor returns a plain pg.Pool.
wait_for_job test weak — mock client records params now, and
the assertion uses toHaveBeenCalledWith('SELECT sys.wait_for_job($1)', ['fake-job-id']).
housekeeping orchestrator retry — each batch now runs inside
retryWithBackoff. Header comment updated to spell out why
(cron, ECS scheduled tasks, EventBridge-default don't retry on
non-zero exit).
parseClientMetadata logging — warn now includes session id
and value length; doc comment notes the runtime shape isn't
validated.

Tests: 196/196 passing.

amaksimo

Pulled the cluster down and ran setup-runtime-role.ts end-to-end against a live DSQL cluster. One real bug, two smaller real issues, and a walk-back from my prior pass:

Script doesn't run. AWS IAM GRANT <role> TO $1 throws 42601 syntax error at or near "$1" against a real cluster, every time. Literal-ARN form works fine. Unit tests didn't catch it because they mock query(). Inline.
Walking back my earlier GRANT USAGE ON SCHEMA public ask — verified live, DSQL grants it by default to new roles, and explicit GRANT USAGE on public is itself rejected (0A000 feature not supported on system entity). Don't add it.
Two smaller real issues inline (table-existence ordering, runtime DELETE diverging from the blog).
README undo-path note inline too — verified live, you need AWS IAM REVOKE before DROP ROLE or it fails with 2BP01.

On CI: there's no workflow for this sample at all — ci-gate.yml doesn't reference it and no per-sample workflow exists. The 196/196 are local-only. Two options: add an integ workflow that runs npm test + npm run setup-runtime-role against a test cluster (the existing per-sample integ workflows are a fine template), or run npm run setup-runtime-role against a real cluster locally and paste the output here. Up to you, but we shouldn't ship a script that's only ever been exercised by mocks.

Request-changes pending #1. Rest are nits.

amaksimo · 2026-06-12T21:12:19Z

+    // AWS IAM GRANT cannot use a parameterized role identifier; the role
+    // name has been validated above to ensure it's a safe identifier.
+    await client.query(
+      `AWS IAM GRANT ${roleName} TO $1`,


Verified against a live cluster — this throws 42601 syntax error at or near "$1". DSQL's parser doesn't accept $1 here; AWS IAM GRANT … TO only takes a literal string. The script as-written always fails on this statement.

Fix: validate the ARN with a regex (mirror the roleName validation on line 60) and string-interpolate the validated value. Something like:

if (!/^arn:aws[a-z-]*:iam::\d{12}:role\/[\w+=,.@-]+$/.test(taskRoleArn)) { console.error(`[setup-runtime-role] APP_TASK_ROLE_ARN is not a valid IAM role ARN.`); process.exit(1); } await client.query(`AWS IAM GRANT ${roleName} TO '${taskRoleArn}'`);

amaksimo · 2026-06-12T21:12:19Z

+
+    console.log(`[setup-runtime-role] granting privileges on users ...`);
+    await client.query(
+      `GRANT SELECT, INSERT, UPDATE ON users TO ${roleName}`,


If someone runs setup-runtime-role before migrate, this GRANT fails with 42P01 relation "users" does not exist (verified live). Either add a precondition check (e.g. a SELECT 1 FROM users LIMIT 0 at the top of the script) or call out the ordering in the script header + README hardening section. Right now nothing tells the reader migrations must run first.

amaksimo · 2026-06-12T21:12:19Z

+
+    console.log(`[setup-runtime-role] granting privileges on sessions ...`);
+    await client.query(
+      `GRANT SELECT, INSERT, UPDATE, DELETE ON sessions TO ${roleName}`,


Blog V5's prod-deployment snippet grants S/I/U on sessions; this adds DELETE so housekeeping can run under the same role. Worth either splitting into a separate housekeeping DB role with DELETE-only privs (cleaner separation, IMO), or noting in the script header why the runtime is broader than the blog shows. Right now they disagree silently.

amaksimo · 2026-06-12T21:12:19Z

+  npm run setup-runtime-role
+  ```
+
+  Then change `connection.ts` to connect as `app_runtime` instead of `admin`, and attach an IAM task-role policy that grants only `dsql:DbConnect` (not `dsql:DbConnectAdmin`). Keep `admin` for one-off setup steps such as creating the role itself or running migrations. This follows Aurora DSQL's [Database roles and IAM authentication](https://docs.aws.amazon.com/aurora-dsql/latest/userguide/working-with-database-roles.html) guidance.


Quick note for the undo path — verified live: DROP ROLE app_runtime fails with 2BP01 cannot be dropped because some objects depend on it if you don't first AWS IAM REVOKE app_runtime FROM '<arn>'. Worth one line at the end of this bullet so readers can roll the role back cleanly.

@amaksimo

…ral-ARN form, table precondition, undo path Addresses the four findings from @amaksimo's CHANGES_REQUESTED review. The first finding was a real bug that only manifested against a live cluster; the rest are tightening based on his end-to-end run. #1 AWS IAM GRANT \$1 syntax error (release-blocker, verified live) - DSQL's parser rejects '' in AWS IAM GRANT … TO \$1 with 42601 syntax error. Unit tests didn't catch it because they mock query(). - Switched to literal-ARN string interpolation, with a strict regex validator on APP_TASK_ROLE_ARN guarding the input before interpolation. The regex matches arn:aws[partition]:iam::<12-digit>:role/<name> with the IAM-allowed punctuation. #2 Table-existence ordering (verified live: 42P01) - If a deployer runs setup-runtime-role before migrations, the GRANT … ON users/sessions statements fail partway through with '42P01 relation "users" does not exist', leaving a dangling role. - Added a precondition check (SELECT 1 FROM users LIMIT 0; same for sessions) at the top of the script. On miss it exits with a clear message pointing to migrations. #3 DELETE on sessions diverges from blog - Blog V5's prod-deployment snippet shows S/I/U on sessions; the script grants SIU + DELETE so housekeeping can run under the same role. - Added a header note explaining why the runtime is broader and pointing readers to the split-role alternative if they prefer stricter separation. #4 README undo path (verified live: 2BP01) - DROP ROLE app_runtime fails with '2BP01 cannot be dropped because some objects depend on it' if the IAM mapping isn't revoked first. - Added a one-paragraph undo block to the runtime-role bullet in the Production hardening checklist showing AWS IAM REVOKE then DROP ROLE. Live validation - npm run setup-runtime-role end-to-end against a live us-east-1 DSQL cluster: CREATE ROLE -> AWS IAM GRANT (literal ARN) -> GRANT on users -> GRANT on sessions, all succeeded with exit code 0. - Precondition check verified by dropping users + sessions and re- running: script exits with the documented message. - Tests: 196/196 passing.

charycha · 2026-06-12T21:47:31Z

Thanks for catching this with the live run, all four addressed in
commit 3ed9457. Verified end-to-end against a fresh us-east-1 DSQL
cluster before pushing.

1. AWS IAM GRANT $1 syntax error (release-blocker) — switched to
literal-ARN string interpolation, with a strict ARN regex validator on
APP_TASK_ROLE_ARN (mirrors the existing APP_ROLE_NAME validation):

const IAM_ROLE_ARN_REGEX = /^arn:aws[a-z-]*:iam::\d{12}:role\/[\w+=,.@-]+$/;
// ...
await client.query(`AWS IAM GRANT ${roleName} TO '${taskRoleArn}'`);

amaksimo

Re-ran the whole setup-runtime-role.ts flow against a live cluster — #1 (literal-ARN GRANT), #2 (table precondition), and #3 (DELETE divergence note) all check out. Thanks for the thorough turnaround.

One thing left on the README undo path — it's still incomplete. Verified it with isolated test roles on the cluster: the IAM REVOKE you added is necessary but not sufficient. The table grants are a second, independent dependency, so DROP ROLE still fails with the same 2BP01 even after the IAM REVOKE. Detail + corrected block inline.

Fix that and I'm happy to approve.

amaksimo · 2026-06-12T22:40:34Z

+
+  ```sql
+  AWS IAM REVOKE app_runtime FROM 'arn:aws:iam::111122223333:role/auth-service-task-role';
+  DROP ROLE app_runtime;


This two-step undo still fails. Tested it live with isolated roles — after AWS IAM REVOKE, DROP ROLE errors with:

ERROR: role "app_runtime" cannot be dropped because some objects depend on it DETAIL: privileges for table users, privileges for table sessions

The table grants from this script are a separate dependency from the IAM mapping, and both independently block DROP ROLE (confirmed: revoking only the table grants but not the IAM mapping fails too, with DETAIL: privileges for trust relationship …). So the IAM REVOKE you added is needed but not enough on its own.

Full sequence that actually drops the role cleanly (verified end-to-end, every step exit 0):

REVOKE ALL ON users FROM app_runtime; REVOKE ALL ON sessions FROM app_runtime; AWS IAM REVOKE app_runtime FROM 'arn:aws:iam::111122223333:role/auth-service-task-role'; DROP ROLE app_runtime;

The three REVOKEs can be in any order; all three just have to come before DROP ROLE. Worth swapping the block above for this one (and the surrounding sentence, since it's not just the IAM mapping).

amaksimo reviewed Jun 11, 2026

View reviewed changes

amaksimo requested changes Jun 12, 2026

View reviewed changes

amaksimo reviewed Jun 12, 2026

View reviewed changes

Conversation

charycha commented Jun 11, 2026

What this adds

Validation

Context

Uh oh!

amaksimo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charycha commented Jun 12, 2026

Uh oh!

amaksimo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charycha commented Jun 12, 2026

Uh oh!

amaksimo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants