
fix: omit spark.livy.session.idle.timeout by default to keep Fabric starter-pool acceleration (#184) #189

Merged
mdrakiburrahman merged 2 commits into main from dev/mdrrahman/184
May 17, 2026

Conversation

@mdrakiburrahman

Collaborator

Why this change is needed

Fixes #184.

How the bug was discovered

@karelrappo reported that every fresh Livy session created by dbt-fabricspark
falls back from a Fabric starter pool to an on-demand cluster, so session
acquisition takes roughly 3 min instead of about 40 s on every cold run. Fabric returns
FallbackReasons: UserSparkConfigMismatch and explicitly names
spark.livy.session.idle.timeout as a session-immutable SparkConf that is
incompatible with the pool.

Repro confirmed

The bug was already firing on every CI run of this repo — visible in our own
fixture logs at tests/functional/fixtures/ws2_seed/logs/dbt.log (e.g.
lines 56, 355, 515):

'tags': {
  'FallbackReasons': 'UserSparkConfigMismatch',
  'FallbackMessages': 'Session falling back to on-demand cluster because no
  compatible starter pool exists which matches the requested configurations
  set by user. Incompatibility reason(s): The following SparkConfs in the
  SparkSettings are session immutable and incompatible with the pool:
  spark.livy.session.idle.timeout,'
}

@karelrappo verified that setting session_idle_timeout: "" drops the key
and restores starter-pool match: ~196 s acquire → ~43 s acquire.
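
To check for this fallback programmatically rather than by reading dbt.log, a minimal sketch follows. It assumes the session response is available as a dict with a 'tags' mapping shaped like the excerpt above; the function name and the sample dict are hypothetical and not part of the adapter.

```python
from typing import Any, Mapping, Optional


def starter_pool_fallback_reason(session: Mapping[str, Any]) -> Optional[str]:
    """Return the FallbackReasons tag if the session reports one, else None.

    Hypothetical helper: assumes the response carries a 'tags' mapping
    like the one shown in the log excerpt.
    """
    tags = session.get("tags") or {}
    return tags.get("FallbackReasons")


# Sample shaped like the log excerpt; the message text is abbreviated here.
sample = {
    "tags": {
        "FallbackReasons": "UserSparkConfigMismatch",
        "FallbackMessages": "(abbreviated) spark.livy.session.idle.timeout,",
    }
}

reason = starter_pool_fallback_reason(sample)
if reason is not None:
    print(f"Session fell back to on-demand: {reason}")
```

A response without a 'tags' entry, or with no FallbackReasons key, yields None, which would correspond to a session that matched a starter pool.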

How

FabricSparkCredentials.session_idle_timeout previously defaulted to "30m".
Because that value is truthy, the if credentials.session_idle_timeout:
guards inside concurrent_livy._build_acquire_payload and
singleton_livy._create_fabric_session always passed and added the key to
the session conf. Flipping the default to None keeps both guards in place
but leaves the key absent unless the user explicitly opts in.
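
The guard behavior can be sketched as follows. This is an illustration under stated assumptions, not the adapter's code: Credentials and build_session_conf are hypothetical stand-ins for FabricSparkCredentials and the two injection sites, and only the truthiness check mirrors what this PR describes.

```python
from dataclasses import dataclass
from typing import Dict, Optional

IDLE_TIMEOUT_KEY = "spark.livy.session.idle.timeout"


@dataclass
class Credentials:
    """Hypothetical stand-in for FabricSparkCredentials."""

    # The default was "30m" before this change; None keeps the key out.
    session_idle_timeout: Optional[str] = None


def build_session_conf(credentials: Credentials) -> Dict[str, str]:
    """Hypothetical builder showing the falsy guard described above."""
    conf: Dict[str, str] = {}
    # None and "" are both falsy, so neither injects the key.
    if credentials.session_idle_timeout:
        conf[IDLE_TIMEOUT_KEY] = credentials.session_idle_timeout
    return conf


print(build_session_conf(Credentials()))
print(build_session_conf(Credentials(session_idle_timeout="")))
print(build_session_conf(Credentials(session_idle_timeout="30m")))
```

Because the guard tests truthiness rather than comparing against None, an empty string in a profile behaves the same as the new default, which matches the workaround @karelrappo verified.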

Considerations

  • Backwards compatibility: profiles that set session_idle_timeout
    explicitly continue to behave as before, with the same on-demand fallback
    trade-off — now documented in the README config table.
  • No alternative Fabric mechanism exists: Fabric's Livy API does not
    expose a top-level/non-SparkConf way to set per-session idle timeout. The
    fix is purely client-side suppression.
  • HC and singleton parity: both backends share the same falsy guard so
    a single default flip fixes both. Unit tests cover both paths plus a
    credential-level default check.
  • Functional fixture keeps "60m" because the test suite genuinely
    needs long-lived sessions; that intentional override still trips the
    fallback in CI as it does today, which is the accepted trade-off.

Alternatives considered

  • Keep the "30m" default and emit a warning: doesn't fix anyone's
    cold-start; preserves the currently broken behavior.
  • Substitute a different Fabric-supported property: none exists per
    current Fabric Livy API docs; idle timeout is a pool/workspace setting.

Test

Microsoft Employee contributors

Full npx nx run dbt-fabricspark:test ran green locally on this branch
(unit + local-e2e + functional, both no_schema and with_schema
including cross-workspace):

[0] Pre-test workspace nuke (WS1 + WS2)     |   0 min  4 sec |  Pass
[1] Provision no_schema lakehouse (WS1)     |   0 min  4 sec |  Pass
[2] Provision with_schema lakehouse (WS1)   |   0 min  4 sec |  Pass
[3] Provision with_schema lakehouse (WS2 cross-workspace read source)|   0 min  4 sec |  Pass
[4] Seed cross_ws_fixture into WS2 lakehouse|   0 min 42 sec |  Pass
[5] Create Livy session (no_schema)         |   3 min  5 sec |  Pass
[6] Create Livy session (with_schema)       |   3 min 20 sec |  Pass
[8] Functional tests (with_schema, includes cross-workspace)|  17 min 24 sec |  Pass
[7] Functional tests (no_schema)            |  18 min 55 sec |  Pass
[9] Post-test workspace nuke (WS1 + WS2)    |   0 min  5 sec |  Pass
Pre-checkin validation passed.

New unit tests in this PR:

  • tests/unit/test_credentials.py::test_credentials_session_idle_timeout_defaults_to_none
  • tests/unit/test_concurrent_livy.py::TestBuildAcquirePayloadIdleTimeout (4 cases)
  • tests/unit/test_livysession.py::TestCreateFabricSessionIdleTimeout (4 cases)

covering: default omits key (HC + singleton), empty string omits key,
explicit value injects key, environmentId injection is independent.
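
The four cases can be pictured as a small table-driven check. The sketch below is not the test code in this PR: build_payload is a hypothetical builder, and placing environmentId as a top-level payload key is an assumption made only to show that the two injections are independent.

```python
from typing import Any, Dict, Optional

IDLE_TIMEOUT_KEY = "spark.livy.session.idle.timeout"


def build_payload(
    idle_timeout: Optional[str], environment_id: Optional[str]
) -> Dict[str, Any]:
    """Hypothetical payload builder; the real payload shape may differ."""
    payload: Dict[str, Any] = {"conf": {}}
    if idle_timeout:
        payload["conf"][IDLE_TIMEOUT_KEY] = idle_timeout
    if environment_id:
        # Assumed placement of environmentId, for illustration only.
        payload["environmentId"] = environment_id
    return payload


# (idle_timeout, environment_id, idle-timeout key expected in conf)
CASES = [
    (None, None, False),        # default omits the key
    ("", None, False),          # empty string omits the key
    ("30m", None, True),        # explicit value injects the key
    (None, "env-123", False),   # environmentId does not pull the key in
]


def run_cases() -> None:
    for idle_timeout, environment_id, key_expected in CASES:
        payload = build_payload(idle_timeout, environment_id)
        assert (IDLE_TIMEOUT_KEY in payload["conf"]) is key_expected
        assert ("environmentId" in payload) is bool(environment_id)


run_cases()
```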

mdrakiburrahman and others added 2 commits May 17, 2026 20:58
… starter-pool acceleration

The adapter unconditionally injected `spark.livy.session.idle.timeout` into
every Livy session `conf` because `session_idle_timeout` defaulted to
"30m". Fabric treats that key as session-immutable, so its presence —
even when the value matched the pool's own default — emitted
`FallbackReasons: UserSparkConfigMismatch` and forced an on-demand cluster
cold start (~3 min vs ~40 s on a warm starter pool). The same fallback
was firing on every CI run of this repo today (visible in
`tests/functional/fixtures/ws2_seed/logs/dbt.log`).

Flip the credential default to `None` so the key is dropped from the
acquire payload unless the user explicitly opts in. Existing profiles
that set `session_idle_timeout` keep their previous behavior with the
same starter-pool trade-off, now documented in the README config table.

- credentials.py: default `session_idle_timeout` `"30m"` → `None`.
- README: config-table row + profile example + a contributor-friendly
  bug-bashing section at the top.
- Issue templates: cross-link the contributor guide for repros.
- CHANGELOG: new v1.12.2 bug-fix entry.
- Unit tests: cover default-off + explicit-opt-in for both the HC
  (`_build_acquire_payload`) and singleton (`_create_fabric_session`)
  paths, plus a credential-level default check.

Fixes #184

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mdrakiburrahman mdrakiburrahman merged commit d315a56 into main May 17, 2026
2 checks passed
@mdrakiburrahman mdrakiburrahman deleted the dev/mdrrahman/184 branch May 17, 2026 22:00


Development

Successfully merging this pull request may close these issues.

[Bug] Fabric Livy sessions created by dbt-fabricspark always fall back to on-demand because spark.livy.session.idle.timeout is sent
