Skip to content

feat(auth-session-mgmt): align with blog — connector OCC retry, async-job wait, housekeeping, conn lifetime#1287

Open
charycha wants to merge 3 commits into
aws-samples:mainfrom
charycha:feat/auth-session-mgmt-followups
Open

feat(auth-session-mgmt): align with blog — connector OCC retry, async-job wait, housekeeping, conn lifetime#1287
charycha wants to merge 3 commits into
aws-samples:mainfrom
charycha:feat/auth-session-mgmt-followups

Conversation

@charycha

Copy link
Copy Markdown
Contributor

Follow-up to #1216 (already merged) to keep the sample in lockstep with the
latest blog post content. The blog teaches several techniques that aren't
yet present in the merged code; this PR adds them so readers who clone
the repo see the same patterns the blog references.

What this adds

  • maxLifetimeSeconds: 3300 on the connection pool so connections
    retire at 55 minutes, ahead of Aurora DSQL's 1-hour hard cap.
  • Connector-built-in OCC retry via AuroraDSQLPool.transaction().
    Repositories now route writes through it in production; tests fall back
    to a manual BEGIN/COMMIT + retryWithBackoff when the mock pool doesn't
    expose transaction().
  • sys.wait_for_job wait in the migrator. CREATE INDEX ASYNC now
    blocks until the index reaches VALID before the migration returns, so
    the app doesn't start serving traffic against a sequential scan. Opt-out
    flag for callers that want non-blocking migrations.
  • src/scripts/housekeeping.ts: batched, idempotent DELETE loop for
    expired and long-revoked sessions, respecting Aurora DSQL's 3000-row
    per-transaction cap. Wired up as npm run housekeeping. Configurable
    retention via SESSION_RETENTION_DAYS (default 30 days).

Validation

  • 196/196 tests passing (added 2 new for the async-job wait + skip flag).
  • End-to-end run against a live DSQL cluster before pushing — register /
    login / profile / list / revoke flow worked, the IDOR fix from feat: add Amazon Aurora DSQL auth and session management sample #1216
    still rejects cross-user revocation, and sessions_token_hash_key is
    auto-created from the UNIQUE constraint as expected.

Context

@amaksimo — your review on #1216 surfaced these as comments #4, #5, #7,
and #9 of your June 11 pass. They were addressed in the blog post
already; this PR brings the code in line. Happy to split into smaller
commits or rework if you'd prefer a different shape.

…-job wait, housekeeping, conn lifetime

Follow-up to PR aws-samples#1216 to keep the sample in lockstep with the latest blog
post content.

connection.ts
- Add maxLifetimeSeconds: 3300 to the pool config so connections retire at
  55 minutes, ahead of DSQL's 1-hour hard cap.

repositories (sessionRepository.ts, userRepository.ts)
- Extend PoolLike with optional transaction(callback) matching
  AuroraDSQLPool.transaction from @aws/aurora-dsql-node-postgres-connector.
- New runInTransaction helper: when the pool exposes transaction()
  (production with the connector) it routes through the connector's OCC
  retry helper, which catches both OC000 (data) and OC001 (schema)
  conflicts under SQLSTATE 40001. Falls back to manual BEGIN/COMMIT
  wrapped in retryWithBackoff for unit-test mocks.
- All write paths go through runInTransaction. No behavioural change for
  callers; eliminates duplicated rollback/retry plumbing in every method.

migrate.ts
- DDL_STATEMENTS is now { sql, isAsync } so the runner can distinguish
  CREATE INDEX ASYNC from synchronous DDL.
- After a CREATE INDEX ASYNC statement, capture the returned job_id and
  SELECT sys.wait_for_job($1) on a fresh client outside the scheduling
  transaction. Migrations now return only after async indexes reach VALID,
  so the application doesn't start serving traffic against a sequential
  scan.
- Opt-out via runMigrations(pool, { waitForAsyncJobs: false }) for callers
  that want non-blocking migrations.
- Widened ClientLike.query return type so the runner can read the job_id
  row without an unsafe cast.

src/scripts/housekeeping.ts (new)
- Periodic purge job for expired and long-revoked sessions. Batched DELETE
  loop with LIMIT 3000 to respect DSQL's per-transaction cap; idempotent
  so partial completion is safely resumable. Configurable retention via
  SESSION_RETENTION_DAYS env (default 30 days).
- New 'npm run housekeeping' script for cron / EventBridge-triggered
  Lambda / ECS scheduled task invocation.

tests
- migrate.test.ts: mock client now returns a job_id row for CREATE INDEX
  ASYNC. Two new tests assert sys.wait_for_job runs after the async index
  and that waitForAsyncJobs: false skips the wait.
- connection.test.ts: pool config assertion updated to include
  maxLifetimeSeconds: 3300.
- 196 tests passing.

@amaksimo amaksimo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, the architecture is right and the SQL is clean. A few asks before merging:

  • CI hasn't run on this branch (gh pr checks reports nothing). Can we get it green before merge? "196/196 passing" in the description is fine but it's self-reported.
  • The blog's V4 production-deployment section walks readers through a custom app_runtime role and dsql:DbConnect. None of that lands here — connection.ts still uses admin, no setup script, no README pointer. Either add a one-shot setup-runtime-role.ts, or update the README so readers cloning the repo aren't left guessing.
  • Smaller items inline.

// Aurora DSQL closes any single connection at 1 hour. Recycle each
// connection at 55 minutes so a request never lands on a connection
// that the cluster is about to close.
maxLifetimeSeconds: 3300,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the connector's parsePgConfig already sets maxLifetimeSeconds: 3300 as a default (grep the published 0.1.9 build). Setting it explicitly here is harmless but redundant. I'd either drop it, or add a comment saying "matches connector default, kept for visibility" so the next person doesn't think it's load-bearing.

pool: PoolLike,
callback: (client: ClientLike) => Promise<T>,
): Promise<T> {
if (pool.transaction) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice abstraction for keeping tests cheap. One thing — in production getPool() always returns an AuroraDSQLPool so pool.transaction is always defined, which means the fallback path is only ever exercised by tests. If a future refactor accidentally returns a plain pg.Pool (or wraps it in something that drops transaction), the app silently downgrades to the manual loop and we lose the connector's retry without any signal in logs.

I'd add an assertion in getPool() that typeof pool.transaction === 'function'. Tests don't go through getPool so this wouldn't break them, and it makes the prod guarantee explicit.


// The 4th client ran the wait_for_job for the async index.
expect(clients[3].queries).toHaveLength(1);
expect(clients[3].queries[0]).toBe('SELECT sys.wait_for_job($1)');

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock client's query is vi.fn(async (sql: string) => ...), only sql gets recorded. So a regression where the migrator passes undefined (or the wrong variable) for $1 would still pass this test. Cheap fix:

expect(clients[3].query).toHaveBeenCalledWith(
  'SELECT sys.wait_for_job($1)',
  ['fake-job-id'],
);

// Each batch is its own transaction. We don't need OCC retry here
// because the housekeeping job is the only writer touching old rows
// in practice; if it ever conflicts with concurrent revocation, the
// job is idempotent and can simply be re-run.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning is fine at the SQL level (the DELETE is idempotent), but "can simply be re-run" assumes the orchestrator retries on non-zero exit. cron doesn't. ECS scheduled tasks don't. EventBridge Scheduler retries are configurable but off by default for many setups.

I'd just wrap purgeOldSessions in retryWithBackoff (six lines, no behavior change in the happy path), or call out the orchestrator-retry expectation in the file header so a deployer doesn't get bitten on the first OCC conflict that escapes.

try {
return JSON.parse(value) as ClientMetadata;
} catch (error) {
console.warn(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fires in prod the operator knows some row failed to parse but not which one. Add the session ID (caller has it) or at minimum the value's length so the warning is actionable.

Also JSON.parse(value) as ClientMetadata is a cast not a check, a stored [] or a stringified field where an object is expected will pass through. Fine for a sample, just worth a one-liner comment that the runtime shape isn't validated.

Six items from the latest review pass.

Top-level: setup-runtime-role.ts (custom DB role for production)
- New src/scripts/setup-runtime-role.ts that runs the CREATE ROLE +
  AWS IAM GRANT + GRANT SQL the blog walks through, packaged so a
  reader cloning the repo doesn't have to copy-paste from the post.
- Validates APP_ROLE_NAME against the PostgreSQL identifier grammar
  before interpolating it (env-supplied identifiers can't be
  parameterised, but they CAN be sanitised).
- New 'npm run setup-runtime-role' script.
- README 'Production hardening checklist' now starts with the runtime
  role item, with the full one-liner invocation.

#1 maxLifetimeSeconds (connection.ts)
- Kept the explicit setting but added a comment noting it matches the
  connector default in v0.1.9, so the next reader doesn't think it's
  load-bearing.

#2 typeof transaction assertion (connection.ts)
- New runtime check in getPool() that throws if pool.transaction is
  not a function. Production always has it (AuroraDSQLPool exposes
  it); tests don't go through getPool. This makes the production
  guarantee explicit and prevents a silent downgrade to the manual
  fallback path if a future refactor returns a plain pg.Pool.

#3 wait_for_job test now asserts on params (migrate.test.ts)
- Mock client now records params alongside sql.
- Test uses toHaveBeenCalledWith('SELECT sys.wait_for_job($1)',
  ['fake-job-id']) so a regression that passes undefined for $1
  would actually fail.

#4 housekeeping.ts retry
- Each batch now runs inside retryWithBackoff so transient OCC
  conflicts during cleanup don't fail the whole run.
- Header comment updated to call out that orchestrators (cron, ECS
  scheduled tasks, EventBridge default) don't retry on non-zero
  exit, so the in-process retry is load-bearing.

#5 parseClientMetadata logging (sessionRepository.ts)
- Warning now includes the session id and the failed value's length
  so an operator can locate and inspect the offending row.
- Doc comment now explicitly states the runtime shape isn't
  validated; the cast is structural.

Tests: 196/196 passing.
@charycha

Copy link
Copy Markdown
Contributor Author

Thanks for the second pass - pushed commit 83d567c addressing all six items.

Top-level

  • CI didn't run. Per CONTRIBUTING.md, integration tests don't run on
    fork PRs because the workflows can't access repo secrets. Would
    appreciate a maintainer triggering the workflows on this branch (or
    pushing it into a feature branch in the upstream repo) so the gate
    reports green before merge. 196/196 unit tests pass locally and the
    full register/login/profile/list/revoke flow was validated end-to-end
    against a live cluster before push.
  • app_runtime role / dsql:DbConnect. Added
    src/scripts/setup-runtime-role.ts packaged as npm run setup-runtime-role,
    validates APP_ROLE_NAME as a PostgreSQL identifier before
    interpolation. README "Production hardening checklist" now leads with
    the runtime-role step and the full invocation.

Inline

  1. maxLifetimeSeconds redundant — kept explicit with a comment
    noting it matches the connector default in v0.1.9, so future
    readers don't think it's load-bearing.
  2. typeof transaction assertion — added in getPool(). Tests
    don't go through getPool so they keep working; production now
    fails fast if a refactor returns a plain pg.Pool.
  3. wait_for_job test weak — mock client records params now, and
    the assertion uses toHaveBeenCalledWith('SELECT sys.wait_for_job($1)', ['fake-job-id']).
  4. housekeeping orchestrator retry — each batch now runs inside
    retryWithBackoff. Header comment updated to spell out why
    (cron, ECS scheduled tasks, EventBridge-default don't retry on
    non-zero exit).
  5. parseClientMetadata logging — warn now includes session id
    and value length; doc comment notes the runtime shape isn't
    validated.

Tests: 196/196 passing.

@amaksimo amaksimo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled the cluster down and ran setup-runtime-role.ts end-to-end against a live DSQL cluster. One real bug, two smaller real issues, and a walk-back from my prior pass:

  1. Script doesn't run. AWS IAM GRANT <role> TO $1 throws 42601 syntax error at or near "$1" against a real cluster, every time. Literal-ARN form works fine. Unit tests didn't catch it because they mock query(). Inline.
  2. Walking back my earlier GRANT USAGE ON SCHEMA public ask — verified live, DSQL grants it by default to new roles, and explicit GRANT USAGE on public is itself rejected (0A000 feature not supported on system entity). Don't add it.
  3. Two smaller real issues inline (table-existence ordering, runtime DELETE diverging from the blog).
  4. README undo-path note inline too — verified live, you need AWS IAM REVOKE before DROP ROLE or it fails with 2BP01.

On CI: there's no workflow for this sample at all — ci-gate.yml doesn't reference it and no per-sample workflow exists. The 196/196 are local-only. Two options: add an integ workflow that runs npm test + npm run setup-runtime-role against a test cluster (the existing per-sample integ workflows are a fine template), or run npm run setup-runtime-role against a real cluster locally and paste the output here. Up to you, but we shouldn't ship a script that's only ever been exercised by mocks.

Request-changes pending #1. Rest are nits.

// AWS IAM GRANT cannot use a parameterized role identifier; the role
// name has been validated above to ensure it's a safe identifier.
await client.query(
`AWS IAM GRANT ${roleName} TO $1`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified against a live cluster — this throws 42601 syntax error at or near "$1". DSQL's parser doesn't accept $1 here; AWS IAM GRANT … TO only takes a literal string. The script as-written always fails on this statement.

Fix: validate the ARN with a regex (mirror the roleName validation on line 60) and string-interpolate the validated value. Something like:

if (!/^arn:aws[a-z-]*:iam::\d{12}:role\/[\w+=,.@-]+$/.test(taskRoleArn)) {
  console.error(`[setup-runtime-role] APP_TASK_ROLE_ARN is not a valid IAM role ARN.`);
  process.exit(1);
}
await client.query(`AWS IAM GRANT ${roleName} TO '${taskRoleArn}'`);


console.log(`[setup-runtime-role] granting privileges on users ...`);
await client.query(
`GRANT SELECT, INSERT, UPDATE ON users TO ${roleName}`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone runs setup-runtime-role before migrate, this GRANT fails with 42P01 relation "users" does not exist (verified live). Either add a precondition check (e.g. a SELECT 1 FROM users LIMIT 0 at the top of the script) or call out the ordering in the script header + README hardening section. Right now nothing tells the reader migrations must run first.


console.log(`[setup-runtime-role] granting privileges on sessions ...`);
await client.query(
`GRANT SELECT, INSERT, UPDATE, DELETE ON sessions TO ${roleName}`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blog V5's prod-deployment snippet grants S/I/U on sessions; this adds DELETE so housekeeping can run under the same role. Worth either splitting into a separate housekeeping DB role with DELETE-only privs (cleaner separation, IMO), or noting in the script header why the runtime is broader than the blog shows. Right now they disagree silently.

npm run setup-runtime-role
```

Then change `connection.ts` to connect as `app_runtime` instead of `admin`, and attach an IAM task-role policy that grants only `dsql:DbConnect` (not `dsql:DbConnectAdmin`). Keep `admin` for one-off setup steps such as creating the role itself or running migrations. This follows Aurora DSQL's [Database roles and IAM authentication](https://docs.aws.amazon.com/aurora-dsql/latest/userguide/working-with-database-roles.html) guidance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note for the undo path — verified live: DROP ROLE app_runtime fails with 2BP01 cannot be dropped because some objects depend on it if you don't first AWS IAM REVOKE app_runtime FROM '<arn>'. Worth one line at the end of this bullet so readers can roll the role back cleanly.

…ral-ARN form, table precondition, undo path

Addresses the four findings from @amaksimo's CHANGES_REQUESTED review.
The first finding was a real bug that only manifested against a live
cluster; the rest are tightening based on his end-to-end run.

#1 AWS IAM GRANT \$1 syntax error (release-blocker, verified live)
- DSQL's parser rejects '' in AWS IAM GRANT … TO \$1 with 42601
  syntax error. Unit tests didn't catch it because they mock query().
- Switched to literal-ARN string interpolation, with a strict regex
  validator on APP_TASK_ROLE_ARN guarding the input before
  interpolation. The regex matches arn:aws[partition]:iam::<12-digit>:role/<name>
  with the IAM-allowed punctuation.

#2 Table-existence ordering (verified live: 42P01)
- If a deployer runs setup-runtime-role before migrations, the
  GRANT … ON users/sessions statements fail partway through with
  '42P01 relation "users" does not exist', leaving a dangling role.
- Added a precondition check (SELECT 1 FROM users LIMIT 0; same for
  sessions) at the top of the script. On miss it exits with a clear
  message pointing to migrations.

#3 DELETE on sessions diverges from blog
- Blog V5's prod-deployment snippet shows S/I/U on sessions; the
  script grants SIU + DELETE so housekeeping can run under the same
  role.
- Added a header note explaining why the runtime is broader and
  pointing readers to the split-role alternative if they prefer
  stricter separation.

#4 README undo path (verified live: 2BP01)
- DROP ROLE app_runtime fails with '2BP01 cannot be dropped because
  some objects depend on it' if the IAM mapping isn't revoked first.
- Added a one-paragraph undo block to the runtime-role bullet in the
  Production hardening checklist showing AWS IAM REVOKE then DROP ROLE.

Live validation
- npm run setup-runtime-role end-to-end against a live us-east-1
  DSQL cluster: CREATE ROLE -> AWS IAM GRANT (literal ARN) -> GRANT
  on users -> GRANT on sessions, all succeeded with exit code 0.
- Precondition check verified by dropping users + sessions and re-
  running: script exits with the documented message.
- Tests: 196/196 passing.
@charycha

Copy link
Copy Markdown
Contributor Author

Thanks for catching this with the live run, all four addressed in
commit 3ed9457. Verified end-to-end against a fresh us-east-1 DSQL
cluster before pushing.

1. AWS IAM GRANT $1 syntax error (release-blocker) — switched to
literal-ARN string interpolation, with a strict ARN regex validator on
APP_TASK_ROLE_ARN (mirrors the existing APP_ROLE_NAME validation):

const IAM_ROLE_ARN_REGEX = /^arn:aws[a-z-]*:iam::\d{12}:role\/[\w+=,.@-]+$/;
// ...
await client.query(`AWS IAM GRANT ${roleName} TO '${taskRoleArn}'`);

@amaksimo amaksimo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-ran the whole setup-runtime-role.ts flow against a live cluster — #1 (literal-ARN GRANT), #2 (table precondition), and #3 (DELETE divergence note) all check out. Thanks for the thorough turnaround.

One thing left on the README undo path — it's still incomplete. Verified it with isolated test roles on the cluster: the IAM REVOKE you added is necessary but not sufficient. The table grants are a second, independent dependency, so DROP ROLE still fails with the same 2BP01 even after the IAM REVOKE. Detail + corrected block inline.

Fix that and I'm happy to approve.


```sql
AWS IAM REVOKE app_runtime FROM 'arn:aws:iam::111122223333:role/auth-service-task-role';
DROP ROLE app_runtime;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This two-step undo still fails. Tested it live with isolated roles — after AWS IAM REVOKE, DROP ROLE errors with:

ERROR: role "app_runtime" cannot be dropped because some objects depend on it
DETAIL: privileges for table users, privileges for table sessions

The table grants from this script are a separate dependency from the IAM mapping, and both independently block DROP ROLE (confirmed: revoking only the table grants but not the IAM mapping fails too, with DETAIL: privileges for trust relationship …). So the IAM REVOKE you added is needed but not enough on its own.

Full sequence that actually drops the role cleanly (verified end-to-end, every step exit 0):

REVOKE ALL ON users    FROM app_runtime;
REVOKE ALL ON sessions FROM app_runtime;
AWS IAM REVOKE app_runtime FROM 'arn:aws:iam::111122223333:role/auth-service-task-role';
DROP ROLE app_runtime;

The three REVOKEs can be in any order; all three just have to come before DROP ROLE. Worth swapping the block above for this one (and the surrounding sentence, since it's not just the IAM mapping).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants