Skip to content

fix(oauth): skip audience enforcement for virtual servers when resource is not configured#4410

Merged
jonpspri merged 5 commits intomainfrom
bugfix/virtual-server-oauth-audience
Apr 28, 2026
Merged

fix(oauth): skip audience enforcement for virtual servers when resource is not configured#4410
jonpspri merged 5 commits intomainfrom
bugfix/virtual-server-oauth-audience

Conversation

@madhav165
Copy link
Copy Markdown
Collaborator

@madhav165 madhav165 commented Apr 23, 2026

🐛 Bug-fix PR

📌 Summary

OAuth audience verification always fails for virtual servers when the IdP does not support RFC 8707 Resource Indicators (e.g. Authentik, some Keycloak configurations). These IdPs set the token aud claim to an abstract identifier (typically the OAuth client_id) rather than the resource URL the client requested. Since virtual servers only have authorization_servers in their oauth_config (no resource), the only expected audience is the canonical resource URL derived from APP_DOMAIN — which never matches.

This is the virtual-server counterpart to #4404, which applies the same auto-learn pattern to the gateway (OAuth callback) path.

🔁 Reproduction Steps

Closes #4171

  1. Create a virtual server with oauth_enabled=true and an authorization server pointing to an IdP without RFC 8707 support (e.g. Authentik)
  2. Set APP_DOMAIN to the gateway's internal domain
  3. Have an MCP client connect to /servers/{server_id}/mcp — the client completes the OAuth flow with the IdP
  4. The IdP issues a token with aud: "my-oauth-client-id" (ignoring the resource parameter)
  5. ContextForge builds expected_audiences = ["{APP_DOMAIN}/servers/{server_id}/mcp"]
  6. Audience mismatch → 401 on every request

🐞 Root Cause

_try_oauth_access_token in streamablehttp_transport.py always enforced audience verification using the canonical resource URL from _build_server_resource_url(). When oauth_config has no resource (which is the case for all virtual servers created via the admin UI), the expected audience list contained only this URL. Non-RFC-8707 IdPs set aud to an abstract identifier instead, so every token failed audience verification — even though signature and issuer were valid.

The admin UI for virtual servers does not expose resource fields, so there was no way to configure the expected audience without direct DB manipulation.

💡 Fix Description

Replace the old audience-list-building logic with a two-branch strategy in _try_oauth_access_token:

  1. When resource IS configured (learned or manually set): Pass it directly to verify_oauth_access_token as the expected audience. No canonical resource URL, no list building — the configured value is used as-is (string or list).

  2. When resource is NOT configured: Skip audience enforcement (expected_audience=None). Signature + issuer are still verified. After successful verification, extract the aud claim from the verified token and persist it as resource in server.oauth_config via the new _persist_learned_server_audience() helper. All subsequent requests then get strict audience enforcement.

The old client_id fallback and canonical resource URL derivation (_build_server_resource_url) are removed from the audience path entirely — they were the source of the bug for non-RFC-8707 IdPs.

This is not a security regression: without this change, every non-RFC-8707 IdP token is rejected outright (the feature is completely broken). The fix maintains signature + issuer verification on the first request and adds strict audience enforcement from the second request onward.

Why auto-learn instead of a UI toggle (as suggested in #4171)?

The issue proposed a per-server toggle or UI field to manually configure the expected audience. But the operator only controls which IdP to use (authorization_servers) — the aud claim value is determined entirely by the IdP's behavior, which varies across providers (it could be the client_id, an application URI, or something else entirely). Asking the operator to pre-configure it would require them to know what the IdP will put in aud before any token has been issued. Auto-learning from the first verified token eliminates this guesswork and works correctly regardless of the IdP's audience strategy.

Key implementation details:

  • _persist_learned_server_audience() is best-effort — DB failures are caught both internally and at the call site, never breaking the auth flow
  • The aud claim is stored as-is: string values stay strings, list values stay lists
  • Persist is called on every successful auth so that IdP-side changes (e.g. client_id rotation) are picked up automatically; a no-op early return skips the write when the stored value already matches
  • SQLAlchemy mutation tracking is handled correctly (new dict object assigned)

🧪 Verification

Check Command Status
Unit tests (test_auth.py) uv run pytest tests/unit/mcpgateway/test_auth.py ✅ 237 passed
Related OAuth tests uv run pytest tests/unit/mcpgateway/test_auth.py tests/unit/mcpgateway/routers/test_oauth_router.py tests/unit/mcpgateway/services/test_token_validation_service.py ✅ 362 passed
Doctests uv run pytest --doctest-modules mcpgateway/transports/streamablehttp_transport.py ✅ 22 passed
Ruff make ruff ✅ All checks passed
Formatting make autoflake isort black ✅ Clean

📐 MCP Compliance (if relevant)

  • Matches current MCP spec
  • No breaking change to MCP clients

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • No secrets/credentials committed

@madhav165 madhav165 force-pushed the bugfix/virtual-server-oauth-audience branch 6 times, most recently from e363e1d to 49fb4fc Compare April 23, 2026 10:24
@madhav165
Copy link
Copy Markdown
Collaborator Author

Self-Review Notes

Reviewed the following areas that are likely to come up in automated or human review — documenting findings here to avoid re-investigation:

1. Is removing the client_id fallback a regression?
No. The old code used oauth_config.get("resource") or oauth_config.get("client_id") as a static fallback for audience enforcement. This PR replaces that with a dynamic learning mechanism: if no resource is configured, skip audience enforcement on the first request (signature + issuer still verified), learn the IdP's actual aud, and persist it as resource. The client_id fallback was a workaround for the exact problem this PR solves properly. Existing deployments with client_id set will auto-learn the correct audience on the next verified token.

2. Is removing the canonical resource URL (_build_server_resource_url) from audience checks a security regression?
No. The canonical URL derivation was the root cause of the bug — non-RFC-8707 IdPs never set aud to the resource URL, so every token failed. Removing it from the audience path is the fix. After the first successful verification, the learned aud is persisted and enforced strictly on all subsequent requests. This mirrors the approach in #4404 for the gateway path.

3. Does the always-overwrite behavior in _persist_learned_server_audience allow audience drift?
This is a deliberate design choice, consistent with #4404. The persist function overwrites resource when the token's aud differs from the stored value, so that IdP-side changes (client_id rotation, application migration) are picked up automatically. A no-op early return skips the write when the stored value already matches. The threat scenario (attacker with a validly-signed token from the same IdP but different aud) requires the attacker to already have a valid token from the configured authorization server.

4. Is the double DB read for the same server a performance concern?
The first session (line 5067) is intentionally short-lived — it closes before verify_oauth_access_token, which performs async network I/O (JWKS fetch). Holding a DB connection open across that call risks connection pool starvation. The second read inside _persist_learned_server_audience (line 943) needs an active session for the write. This is structurally necessary given the current architecture. The persist path is also effectively a no-op after the first request (early return when resource == raw_aud), so the second read only does real work once per server.

5. Does _persist_learned_server_audience need locking or SELECT ... FOR UPDATE?
No. Concurrent requests that both read resource=None and both write the same learned aud produce the same result — the write is idempotent. There is no TOCTOU risk because the persisted value is the same regardless of which request wins.

@madhav165 madhav165 added bug Something isn't working api REST API Related item labels Apr 23, 2026
@jonpspri jonpspri added the release-fix Critical bugfix required for the release label Apr 23, 2026
@brian-hussey brian-hussey added the SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release label Apr 23, 2026
Copy link
Copy Markdown
Member

@brian-hussey brian-hussey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Madhav, thank you for the submission.
There are a few issues with this PR as commented below.
Also please consider updating the docs/docs/architecture/oauth-design.md with updates to the oauth design that will come through with this change.

Thanks,
Brian

verified_claims: Decoded and *signature-verified* JWT claims.
db: Active database session.
"""
raw_aud = verified_claims.get("aud")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we validate the structure of the 'aud' claim here? At the moment it is accepted and trusted by default.
The token is verified but it could still be complex, empty or an unexpected type.
We could validate that it is a non-empty string or a list of non-empty strings before continuing.


try:
server = db.execute(select(DbServer).where(DbServer.id == server_id)).scalar_one_or_none()
if server is None or not server.oauth_config:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function returns early if server.oauth_config is falsy, but this means a server with oauth_enabled=True but no oauth_config will silently skip audience learning. This could hide configuration issues. Consider logging a warning when oauth_enabled is True but oauth_config is missing.

updated_config = dict(server.oauth_config)
updated_config["resource"] = raw_aud
server.oauth_config = updated_config
db.flush()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential race condition here: Using db.flush() without explicit transaction isolation.
If multiple requests with different aud claims arrive concurrently for the same server, the last write wins without any conflict detection. Consider adding optimistic locking (version field) or documenting that the last-verified-token-wins behavior is intentional.

madhav165 and others added 5 commits April 28, 2026 06:24
…ce is not configured

IdPs that do not support RFC 8707 (e.g. Authentik) set the token aud
claim to the OAuth client_id rather than the resource URL the client
requested. Virtual servers with only authorization_servers configured
(no resource or client_id in oauth_config) always fail audience
verification because the only expected audience is the canonical
resource URL derived from APP_DOMAIN.

When resource is absent from oauth_config, skip audience enforcement
(signature + issuer are still verified) and persist the aud claim from
the first verified token as resource for strict enforcement on all
subsequent requests. The aud value is stored as-is (string or list per
RFC 7519) without normalization.

Closes #4171

Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
…way in streamablehttp_transport

Move Gateway as DbGateway to top-level import and remove all local
reimports of select, DbServer, and DbGateway that shadow the existing
module-level imports. Resolves pylint W0404 (reimported) and W0621
(redefined-outer-name) warnings.

Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
…onfusion

When the auto-learn audience pattern was introduced, three issues remained
that this commit addresses:

* First-request authentication was effectively issuer-only when no
  resource was configured. Any valid token from an allowed issuer would
  authenticate and pin its aud as resource, enabling cross-resource
  token confusion in shared-IdP deployments. The handler now falls
  back to a list of acceptable audiences derived from the canonical
  RFC 8707/9728 resource URL plus the legacy client_id field, and
  fails closed when neither anchor is available.

* The verified aud claim was persisted with no shape validation. A
  misconfigured IdP minting a token with a malformed aud (dict, int,
  empty list, etc.) would persist that value as resource and then
  crash PyJWT inside verify_oauth_access_token on the next request.
  A new _is_valid_audience helper enforces RFC 7519 §4.1.3 (string
  or non-empty list of non-empty strings) before persisting.

* The persist policy was 'always overwrite', which could silently
  collapse an operator-configured multi-audience list down to a
  single value, and could mask IdP-side audience changes that
  should surface as an explicit auth failure. The policy is now
  learn-once: existing resource values (operator-set or previously
  learned) are preserved. IdP rotation requires the operator to
  clear the field before the new value is learned.

Adds regression coverage for: stateful learn-then-enforce, fallback
audience derivation, malformed aud rejection (parametrized over
dict/int/empty/whitespace shapes), multi-audience list preservation,
oauth_config=None handling, no-aud-claim handler success, and the
fail-closed branch when no audience anchor can be derived.

Signed-off-by: Jonathan Springer <jps@s390x.com>
Follow-up coverage additions for the audience-learning hardening introduced
in the previous commit. Covers branches and assertions that the second-pass
review identified as gaps:

* C4: canonical URL unavailable but client_id configured → fallback uses
  client_id alone (the #4171 use case when app_domain is unset).
* C7/C8: parametrized matrix for non-string and whitespace-only client_id
  values (whitespace, empty, int, list, dict, None) all excluded from the
  fallback and producing a 401.
* B4: db.execute(...).scalar_one_or_none() returning None is handled
  gracefully (no flush, no error).
* B5: empty oauth_config dict at helper level (previously covered only
  for None).
* B12: db.flush() raising is caught by the outer except, swallowed, and
  the warning log is asserted via caplog.
* A1-A10: direct parametrized unit test for _is_valid_audience covering
  every shape (None, empty/whitespace string, valid string, empty list,
  list with None/empty/int, valid lists, dict, int, bytes).

Existing tests strengthened with previously-missing assertions:
* test_skips_when_aud_is_malformed now asserts the warning log emitted
  on rejection (previously only verified no DB calls).
* test_db_error_is_swallowed now asserts the warning log emitted on
  DB execute failure.
* test_no_resource_no_client_id_no_canonical_fails_closed now asserts
  the 401 status, WWW-Authenticate: Bearer header, and JSON detail
  body produced by the fail-closed branch (previously only verified the
  return value).

Signed-off-by: Jonathan Springer <jps@s390x.com>
@jonpspri jonpspri force-pushed the bugfix/virtual-server-oauth-audience branch from 7918c4b to a9117ad Compare April 28, 2026 05:54
@jonpspri jonpspri merged commit 6ec1992 into main Apr 28, 2026
30 checks passed
@jonpspri jonpspri deleted the bugfix/virtual-server-oauth-audience branch April 28, 2026 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api REST API Related item bug Something isn't working release-fix Critical bugfix required for the release SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: OAuth audience verification always fails with IdPs that do not support RFC 8707 Resource Indicators (e.g. Authentik)

3 participants