feat(auth-session-mgmt): align with blog — connector OCC retry, async-job wait, housekeeping, conn lifetime#1287
Conversation
…-job wait, housekeeping, conn lifetime Follow-up to PR aws-samples#1216 to keep the sample in lockstep with the latest blog post content. connection.ts - Add maxLifetimeSeconds: 3300 to the pool config so connections retire at 55 minutes, ahead of DSQL's 1-hour hard cap. repositories (sessionRepository.ts, userRepository.ts) - Extend PoolLike with optional transaction(callback) matching AuroraDSQLPool.transaction from @aws/aurora-dsql-node-postgres-connector. - New runInTransaction helper: when the pool exposes transaction() (production with the connector) it routes through the connector's OCC retry helper, which catches both OC000 (data) and OC001 (schema) conflicts under SQLSTATE 40001. Falls back to manual BEGIN/COMMIT wrapped in retryWithBackoff for unit-test mocks. - All write paths go through runInTransaction. No behavioural change for callers; eliminates duplicated rollback/retry plumbing in every method. migrate.ts - DDL_STATEMENTS is now { sql, isAsync } so the runner can distinguish CREATE INDEX ASYNC from synchronous DDL. - After a CREATE INDEX ASYNC statement, capture the returned job_id and SELECT sys.wait_for_job($1) on a fresh client outside the scheduling transaction. Migrations now return only after async indexes reach VALID, so the application doesn't start serving traffic against a sequential scan. - Opt-out via runMigrations(pool, { waitForAsyncJobs: false }) for callers that want non-blocking migrations. - Widened ClientLike.query return type so the runner can read the job_id row without an unsafe cast. src/scripts/housekeeping.ts (new) - Periodic purge job for expired and long-revoked sessions. Batched DELETE loop with LIMIT 3000 to respect DSQL's per-transaction cap; idempotent so partial completion is safely resumable. Configurable retention via SESSION_RETENTION_DAYS env (default 30 days). - New 'npm run housekeeping' script for cron / EventBridge-triggered Lambda / ECS scheduled task invocation. tests - migrate.test.ts: mock client now returns a job_id row for CREATE INDEX ASYNC. Two new tests assert sys.wait_for_job runs after the async index and that waitForAsyncJobs: false skips the wait. - connection.test.ts: pool config assertion updated to include maxLifetimeSeconds: 3300. - 196 tests passing.
amaksimo
left a comment
There was a problem hiding this comment.
Looks good overall, the architecture is right and the SQL is clean. A few asks before merging:
- CI hasn't run on this branch (
gh pr checksreports nothing). Can we get it green before merge? "196/196 passing" in the description is fine but it's self-reported. - The blog's V4 production-deployment section walks readers through a custom
app_runtimerole anddsql:DbConnect. None of that lands here —connection.tsstill usesadmin, no setup script, no README pointer. Either add a one-shotsetup-runtime-role.ts, or update the README so readers cloning the repo aren't left guessing. - Smaller items inline.
| // Aurora DSQL closes any single connection at 1 hour. Recycle each | ||
| // connection at 55 minutes so a request never lands on a connection | ||
| // that the cluster is about to close. | ||
| maxLifetimeSeconds: 3300, |
There was a problem hiding this comment.
FYI the connector's parsePgConfig already sets maxLifetimeSeconds: 3300 as a default (grep the published 0.1.9 build). Setting it explicitly here is harmless but redundant. I'd either drop it, or add a comment saying "matches connector default, kept for visibility" so the next person doesn't think it's load-bearing.
| pool: PoolLike, | ||
| callback: (client: ClientLike) => Promise<T>, | ||
| ): Promise<T> { | ||
| if (pool.transaction) { |
There was a problem hiding this comment.
Nice abstraction for keeping tests cheap. One thing — in production getPool() always returns an AuroraDSQLPool so pool.transaction is always defined, which means the fallback path is only ever exercised by tests. If a future refactor accidentally returns a plain pg.Pool (or wraps it in something that drops transaction), the app silently downgrades to the manual loop and we lose the connector's retry without any signal in logs.
I'd add an assertion in getPool() that typeof pool.transaction === 'function'. Tests don't go through getPool so this wouldn't break them, and it makes the prod guarantee explicit.
|
|
||
| // The 4th client ran the wait_for_job for the async index. | ||
| expect(clients[3].queries).toHaveLength(1); | ||
| expect(clients[3].queries[0]).toBe('SELECT sys.wait_for_job($1)'); |
There was a problem hiding this comment.
The mock client's query is vi.fn(async (sql: string) => ...), only sql gets recorded. So a regression where the migrator passes undefined (or the wrong variable) for $1 would still pass this test. Cheap fix:
expect(clients[3].query).toHaveBeenCalledWith(
'SELECT sys.wait_for_job($1)',
['fake-job-id'],
);| // Each batch is its own transaction. We don't need OCC retry here | ||
| // because the housekeeping job is the only writer touching old rows | ||
| // in practice; if it ever conflicts with concurrent revocation, the | ||
| // job is idempotent and can simply be re-run. |
There was a problem hiding this comment.
The reasoning is fine at the SQL level (the DELETE is idempotent), but "can simply be re-run" assumes the orchestrator retries on non-zero exit. cron doesn't. ECS scheduled tasks don't. EventBridge Scheduler retries are configurable but off by default for many setups.
I'd just wrap purgeOldSessions in retryWithBackoff (six lines, no behavior change in the happy path), or call out the orchestrator-retry expectation in the file header so a deployer doesn't get bitten on the first OCC conflict that escapes.
| try { | ||
| return JSON.parse(value) as ClientMetadata; | ||
| } catch (error) { | ||
| console.warn( |
There was a problem hiding this comment.
If this fires in prod the operator knows some row failed to parse but not which one. Add the session ID (caller has it) or at minimum the value's length so the warning is actionable.
Also JSON.parse(value) as ClientMetadata is a cast not a check, a stored [] or a stringified field where an object is expected will pass through. Fine for a sample, just worth a one-liner comment that the runtime shape isn't validated.
Six items from the latest review pass.
Top-level: setup-runtime-role.ts (custom DB role for production)
- New src/scripts/setup-runtime-role.ts that runs the CREATE ROLE +
AWS IAM GRANT + GRANT SQL the blog walks through, packaged so a
reader cloning the repo doesn't have to copy-paste from the post.
- Validates APP_ROLE_NAME against the PostgreSQL identifier grammar
before interpolating it (env-supplied identifiers can't be
parameterised, but they CAN be sanitised).
- New 'npm run setup-runtime-role' script.
- README 'Production hardening checklist' now starts with the runtime
role item, with the full one-liner invocation.
#1 maxLifetimeSeconds (connection.ts)
- Kept the explicit setting but added a comment noting it matches the
connector default in v0.1.9, so the next reader doesn't think it's
load-bearing.
#2 typeof transaction assertion (connection.ts)
- New runtime check in getPool() that throws if pool.transaction is
not a function. Production always has it (AuroraDSQLPool exposes
it); tests don't go through getPool. This makes the production
guarantee explicit and prevents a silent downgrade to the manual
fallback path if a future refactor returns a plain pg.Pool.
#3 wait_for_job test now asserts on params (migrate.test.ts)
- Mock client now records params alongside sql.
- Test uses toHaveBeenCalledWith('SELECT sys.wait_for_job($1)',
['fake-job-id']) so a regression that passes undefined for $1
would actually fail.
#4 housekeeping.ts retry
- Each batch now runs inside retryWithBackoff so transient OCC
conflicts during cleanup don't fail the whole run.
- Header comment updated to call out that orchestrators (cron, ECS
scheduled tasks, EventBridge default) don't retry on non-zero
exit, so the in-process retry is load-bearing.
#5 parseClientMetadata logging (sessionRepository.ts)
- Warning now includes the session id and the failed value's length
so an operator can locate and inspect the offending row.
- Doc comment now explicitly states the runtime shape isn't
validated; the cast is structural.
Tests: 196/196 passing.
|
Thanks for the second pass - pushed commit 83d567c addressing all six items. Top-level
Inline
Tests: 196/196 passing. |
amaksimo
left a comment
There was a problem hiding this comment.
Pulled the cluster down and ran setup-runtime-role.ts end-to-end against a live DSQL cluster. One real bug, two smaller real issues, and a walk-back from my prior pass:
- Script doesn't run.
AWS IAM GRANT <role> TO $1throws42601 syntax error at or near "$1"against a real cluster, every time. Literal-ARN form works fine. Unit tests didn't catch it because they mockquery(). Inline. - Walking back my earlier
GRANT USAGE ON SCHEMA publicask — verified live, DSQL grants it by default to new roles, and explicitGRANT USAGEonpublicis itself rejected (0A000 feature not supported on system entity). Don't add it. - Two smaller real issues inline (table-existence ordering, runtime DELETE diverging from the blog).
- README undo-path note inline too — verified live, you need
AWS IAM REVOKEbeforeDROP ROLEor it fails with2BP01.
On CI: there's no workflow for this sample at all — ci-gate.yml doesn't reference it and no per-sample workflow exists. The 196/196 are local-only. Two options: add an integ workflow that runs npm test + npm run setup-runtime-role against a test cluster (the existing per-sample integ workflows are a fine template), or run npm run setup-runtime-role against a real cluster locally and paste the output here. Up to you, but we shouldn't ship a script that's only ever been exercised by mocks.
Request-changes pending #1. Rest are nits.
| // AWS IAM GRANT cannot use a parameterized role identifier; the role | ||
| // name has been validated above to ensure it's a safe identifier. | ||
| await client.query( | ||
| `AWS IAM GRANT ${roleName} TO $1`, |
There was a problem hiding this comment.
Verified against a live cluster — this throws 42601 syntax error at or near "$1". DSQL's parser doesn't accept $1 here; AWS IAM GRANT … TO only takes a literal string. The script as-written always fails on this statement.
Fix: validate the ARN with a regex (mirror the roleName validation on line 60) and string-interpolate the validated value. Something like:
if (!/^arn:aws[a-z-]*:iam::\d{12}:role\/[\w+=,.@-]+$/.test(taskRoleArn)) {
console.error(`[setup-runtime-role] APP_TASK_ROLE_ARN is not a valid IAM role ARN.`);
process.exit(1);
}
await client.query(`AWS IAM GRANT ${roleName} TO '${taskRoleArn}'`);|
|
||
| console.log(`[setup-runtime-role] granting privileges on users ...`); | ||
| await client.query( | ||
| `GRANT SELECT, INSERT, UPDATE ON users TO ${roleName}`, |
There was a problem hiding this comment.
If someone runs setup-runtime-role before migrate, this GRANT fails with 42P01 relation "users" does not exist (verified live). Either add a precondition check (e.g. a SELECT 1 FROM users LIMIT 0 at the top of the script) or call out the ordering in the script header + README hardening section. Right now nothing tells the reader migrations must run first.
|
|
||
| console.log(`[setup-runtime-role] granting privileges on sessions ...`); | ||
| await client.query( | ||
| `GRANT SELECT, INSERT, UPDATE, DELETE ON sessions TO ${roleName}`, |
There was a problem hiding this comment.
Blog V5's prod-deployment snippet grants S/I/U on sessions; this adds DELETE so housekeeping can run under the same role. Worth either splitting into a separate housekeeping DB role with DELETE-only privs (cleaner separation, IMO), or noting in the script header why the runtime is broader than the blog shows. Right now they disagree silently.
| npm run setup-runtime-role | ||
| ``` | ||
|
|
||
| Then change `connection.ts` to connect as `app_runtime` instead of `admin`, and attach an IAM task-role policy that grants only `dsql:DbConnect` (not `dsql:DbConnectAdmin`). Keep `admin` for one-off setup steps such as creating the role itself or running migrations. This follows Aurora DSQL's [Database roles and IAM authentication](https://docs.aws.amazon.com/aurora-dsql/latest/userguide/working-with-database-roles.html) guidance. |
There was a problem hiding this comment.
Quick note for the undo path — verified live: DROP ROLE app_runtime fails with 2BP01 cannot be dropped because some objects depend on it if you don't first AWS IAM REVOKE app_runtime FROM '<arn>'. Worth one line at the end of this bullet so readers can roll the role back cleanly.
…ral-ARN form, table precondition, undo path Addresses the four findings from @amaksimo's CHANGES_REQUESTED review. The first finding was a real bug that only manifested against a live cluster; the rest are tightening based on his end-to-end run. #1 AWS IAM GRANT \$1 syntax error (release-blocker, verified live) - DSQL's parser rejects '' in AWS IAM GRANT … TO \$1 with 42601 syntax error. Unit tests didn't catch it because they mock query(). - Switched to literal-ARN string interpolation, with a strict regex validator on APP_TASK_ROLE_ARN guarding the input before interpolation. The regex matches arn:aws[partition]:iam::<12-digit>:role/<name> with the IAM-allowed punctuation. #2 Table-existence ordering (verified live: 42P01) - If a deployer runs setup-runtime-role before migrations, the GRANT … ON users/sessions statements fail partway through with '42P01 relation "users" does not exist', leaving a dangling role. - Added a precondition check (SELECT 1 FROM users LIMIT 0; same for sessions) at the top of the script. On miss it exits with a clear message pointing to migrations. #3 DELETE on sessions diverges from blog - Blog V5's prod-deployment snippet shows S/I/U on sessions; the script grants SIU + DELETE so housekeeping can run under the same role. - Added a header note explaining why the runtime is broader and pointing readers to the split-role alternative if they prefer stricter separation. #4 README undo path (verified live: 2BP01) - DROP ROLE app_runtime fails with '2BP01 cannot be dropped because some objects depend on it' if the IAM mapping isn't revoked first. - Added a one-paragraph undo block to the runtime-role bullet in the Production hardening checklist showing AWS IAM REVOKE then DROP ROLE. Live validation - npm run setup-runtime-role end-to-end against a live us-east-1 DSQL cluster: CREATE ROLE -> AWS IAM GRANT (literal ARN) -> GRANT on users -> GRANT on sessions, all succeeded with exit code 0. - Precondition check verified by dropping users + sessions and re- running: script exits with the documented message. - Tests: 196/196 passing.
|
Thanks for catching this with the live run, all four addressed in 1. const IAM_ROLE_ARN_REGEX = /^arn:aws[a-z-]*:iam::\d{12}:role\/[\w+=,.@-]+$/;
// ...
await client.query(`AWS IAM GRANT ${roleName} TO '${taskRoleArn}'`); |
amaksimo
left a comment
There was a problem hiding this comment.
Re-ran the whole setup-runtime-role.ts flow against a live cluster — #1 (literal-ARN GRANT), #2 (table precondition), and #3 (DELETE divergence note) all check out. Thanks for the thorough turnaround.
One thing left on the README undo path — it's still incomplete. Verified it with isolated test roles on the cluster: the IAM REVOKE you added is necessary but not sufficient. The table grants are a second, independent dependency, so DROP ROLE still fails with the same 2BP01 even after the IAM REVOKE. Detail + corrected block inline.
Fix that and I'm happy to approve.
|
|
||
| ```sql | ||
| AWS IAM REVOKE app_runtime FROM 'arn:aws:iam::111122223333:role/auth-service-task-role'; | ||
| DROP ROLE app_runtime; |
There was a problem hiding this comment.
This two-step undo still fails. Tested it live with isolated roles — after AWS IAM REVOKE, DROP ROLE errors with:
ERROR: role "app_runtime" cannot be dropped because some objects depend on it
DETAIL: privileges for table users, privileges for table sessions
The table grants from this script are a separate dependency from the IAM mapping, and both independently block DROP ROLE (confirmed: revoking only the table grants but not the IAM mapping fails too, with DETAIL: privileges for trust relationship …). So the IAM REVOKE you added is needed but not enough on its own.
Full sequence that actually drops the role cleanly (verified end-to-end, every step exit 0):
REVOKE ALL ON users FROM app_runtime;
REVOKE ALL ON sessions FROM app_runtime;
AWS IAM REVOKE app_runtime FROM 'arn:aws:iam::111122223333:role/auth-service-task-role';
DROP ROLE app_runtime;The three REVOKEs can be in any order; all three just have to come before DROP ROLE. Worth swapping the block above for this one (and the surrounding sentence, since it's not just the IAM mapping).
Follow-up to #1216 (already merged) to keep the sample in lockstep with the
latest blog post content. The blog teaches several techniques that aren't
yet present in the merged code; this PR adds them so readers who clone
the repo see the same patterns the blog references.
What this adds
maxLifetimeSeconds: 3300on the connection pool so connectionsretire at 55 minutes, ahead of Aurora DSQL's 1-hour hard cap.
AuroraDSQLPool.transaction().Repositories now route writes through it in production; tests fall back
to a manual BEGIN/COMMIT +
retryWithBackoffwhen the mock pool doesn'texpose
transaction().sys.wait_for_jobwait in the migrator.CREATE INDEX ASYNCnowblocks until the index reaches VALID before the migration returns, so
the app doesn't start serving traffic against a sequential scan. Opt-out
flag for callers that want non-blocking migrations.
src/scripts/housekeeping.ts: batched, idempotent DELETE loop forexpired and long-revoked sessions, respecting Aurora DSQL's 3000-row
per-transaction cap. Wired up as
npm run housekeeping. Configurableretention via
SESSION_RETENTION_DAYS(default 30 days).Validation
login / profile / list / revoke flow worked, the IDOR fix from feat: add Amazon Aurora DSQL auth and session management sample #1216
still rejects cross-user revocation, and
sessions_token_hash_keyisauto-created from the UNIQUE constraint as expected.
Context
@amaksimo — your review on #1216 surfaced these as comments #4, #5, #7,
and #9 of your June 11 pass. They were addressed in the blog post
already; this PR brings the code in line. Happy to split into smaller
commits or rework if you'd prefer a different shape.