refactor(BA-5650): propagate owner_id through scheduler and sokovan [2/4] #11048

Closed
jopemachine wants to merge 17 commits into refactor/BA-5650-D-session-repo-owner-id from refactor/BA-5650-F-sokovan-owner-id

Conversation

@jopemachine jopemachine commented Apr 14, 2026

Resolves #10911 (BA-5650)

Summary

  • Collapse old scheduler signatures to use owner_id instead of separate access_key/user_uuid
  • Propagate owner_id rename into sokovan scheduler handlers and coordinator
  • Update SessionMetadata to remove access_key attribute
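
As a rough illustration of the first bullet, a lookup that used to need both an access key and a user UUID can take a single `owner_id`. This is only a sketch: `Session` and `sessions_owned_by` are hypothetical names made up for the example, not types or functions from this PR.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Session:
    # Hypothetical stand-in for a session record (not the project's SessionRow).
    name: str
    owner_id: UUID


def sessions_owned_by(sessions: list[Session], owner_id: UUID) -> list[Session]:
    # After the collapse, one owner_id parameter replaces the
    # separate access_key / user_uuid pair in the signature.
    return [s for s in sessions if s.owner_id == owner_id]


alice, bob = uuid4(), uuid4()
all_sessions = [Session("a", alice), Session("b", bob), Session("c", alice)]
alice_names = [s.name for s in sessions_owned_by(all_sessions, alice)]
```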

Test plan

  • pants check passes on this slice
  • Scheduler handler tests pass

Stack

  1. refactor(BA-5650): add owner_id helpers and rename session data types [1/4] #11046 — add owner_id helpers and rename session data types
  2. refactor(BA-5650): propagate owner_id through scheduler and sokovan [2/4] #11048 ← You are here — propagate owner_id through scheduler and sokovan
  3. refactor(BA-5650): unify session ownership to owner_id and main_access_key [1/2] #11050 — resolve owner_id in services and drop REST v1 owner_access_key
  4. refactor(BA-5650): update tests and remaining ORM references [2/2] #11051 — update tests and remaining ORM references

🤖 Generated with Claude Code

@github-actions github-actions Bot added the size:L (100~500 LoC) and comp:manager (Related to Manager component) labels Apr 14, 2026
@jopemachine jopemachine changed the title refactor(BA-5650-F): propagate owner_id rename into sokovan refactor(BA-5714): propagate owner_id rename into sokovan Apr 14, 2026
@jopemachine jopemachine marked this pull request as draft April 14, 2026 07:30
@jopemachine jopemachine force-pushed the refactor/BA-5650-E-scheduler-owner-id branch from 3b7a216 to 18a8ec2 Compare April 14, 2026 07:39
@jopemachine jopemachine force-pushed the refactor/BA-5650-F-sokovan-owner-id branch from 07a5509 to 89df267 Compare April 14, 2026 07:39
@jopemachine jopemachine requested a review from Copilot April 14, 2026 07:46
Copilot AI left a comment

Pull request overview

This PR propagates the owner_id / main_access_key renames through the Sokovan scheduling stack by updating Sokovan data classes and all in-repo call sites that construct or consume them.

Changes:

  • Rename Sokovan session/keypair identifiers across data models (access_key → main_access_key, user_uuid → owner_id) and update preparers/validators/provisioner/sequencers accordingly.
  • Update scheduler lifecycle handlers / controller / launcher / post-processors to use the renamed fields for hooks, env injection, cache invalidation, and history recording.
  • Adjust route health-record initialization logic and Prometheus preset label injection in deployment components.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ai/backend/manager/sokovan/scheduling_controller/scheduling_controller.py | Pass main_access_key through enqueue hook dispatch/notify. |
| src/ai/backend/manager/sokovan/scheduling_controller/preparers/preparer.py | Populate prepared session data with main_access_key and owner_id. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/validators/user_resource_limit.py | Switch user policy/occupancy lookups to owner_id. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/validators/pending_session_resource_limit.py | Switch keypair policy/pending lookups to main_access_key. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/validators/pending_session_count_limit.py | Switch keypair policy/pending lookups to main_access_key. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/validators/keypair_resource_limit.py | Switch keypair policy/occupancy lookups to main_access_key. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/validators/concurrency.py | Switch concurrency policy/count lookups to main_access_key. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/sequencers/fair_share.py | Switch fairness grouping/sorting to owner_id. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/sequencers/drf.py | Switch DRF sequencing key to main_access_key. |
| src/ai/backend/manager/sokovan/scheduler/provisioner/provisioner.py | Update snapshot occupancy/concurrency accounting to main_access_key + owner_id. |
| src/ai/backend/manager/sokovan/scheduler/post_processors/cache_invalidation.py | Invalidate cache based on main_access_key rather than access_key. |
| src/ai/backend/manager/sokovan/scheduler/launcher/launcher.py | Update launcher logging/env/agent payload to use main_access_key + owner_id. |
| src/ai/backend/manager/sokovan/scheduler/handlers/maintenance/sweep_sessions.py | Use main_access_key when emitting session transition info. |
| src/ai/backend/manager/sokovan/scheduler/handlers/lifecycle/start_sessions.py | Source main_access_key from SessionDataForStart for transitions. |
| src/ai/backend/manager/sokovan/scheduler/handlers/lifecycle/schedule_sessions.py | Use main_access_key in transition info; allow access_key=None for skipped. |
| src/ai/backend/manager/sokovan/scheduler/handlers/lifecycle/deprioritize_sessions.py | Resolve main_access_key via repository for cache invalidation consumer. |
| src/ai/backend/manager/sokovan/scheduler/handlers/lifecycle/check_precondition.py | Source main_access_key from SessionDataForPull for transitions. |
| src/ai/backend/manager/sokovan/scheduler/fair_share/aggregator.py | Thread owner_id into kernel usage record creation spec. |
| src/ai/backend/manager/sokovan/scheduler/coordinator.py | Resolve main_access_key for cache invalidation on promotion transitions. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/observer/health_check.py | Remove warning when no Valkey records exist for checkable routes. |
| src/ai/backend/manager/sokovan/deployment/route/executor.py | Adjust route replica/health record initialization flow and initial delay computation. |
| src/ai/backend/manager/sokovan/deployment/route/coordinator.py | Remove RUNNING-transition Valkey running_at marking logic. |
| src/ai/backend/manager/sokovan/deployment/executor.py | Change Prometheus preset label injection for deployment-scoped querying. |
| src/ai/backend/manager/sokovan/data/workload.py | Rename workload identifiers to main_access_key + owner_id. |
| src/ai/backend/manager/sokovan/data/lifecycle.py | Rename lifecycle DTO fields to main_access_key + owner_id. |
| src/ai/backend/manager/sokovan/data/allocation.py | Rename allocation DTO field to main_access_key and thread through allocator. |
| changes/11048.misc.md | Changelog entry for the propagation refactor. |

Comment thread src/ai/backend/manager/sokovan/deployment/route/coordinator.py
Comment thread src/ai/backend/manager/sokovan/data/allocation.py
Comment thread src/ai/backend/manager/sokovan/deployment/executor.py
Comment thread src/ai/backend/manager/sokovan/deployment/route/executor.py
Comment thread src/ai/backend/manager/sokovan/deployment/route/executor.py
@jopemachine jopemachine force-pushed the refactor/BA-5650-E-scheduler-owner-id branch from 18a8ec2 to 17d3c3a Compare April 14, 2026 07:52
@jopemachine jopemachine force-pushed the refactor/BA-5650-F-sokovan-owner-id branch from 89df267 to 203e3d2 Compare April 14, 2026 07:52
Comment thread changes/11048.misc.md Outdated
Comment thread src/ai/backend/manager/sokovan/data/allocation.py
Comment thread src/ai/backend/manager/sokovan/data/workload.py
Comment thread src/ai/backend/manager/sokovan/scheduler/coordinator.py
@jopemachine jopemachine force-pushed the refactor/BA-5650-E-scheduler-owner-id branch from 3ece8ba to 81cdac3 Compare April 14, 2026 08:08
@jopemachine jopemachine force-pushed the refactor/BA-5650-F-sokovan-owner-id branch from bd97fdd to 9d05db6 Compare April 14, 2026 08:13
@jopemachine jopemachine force-pushed the refactor/BA-5650-E-scheduler-owner-id branch from 7ab7788 to e10c89d Compare April 14, 2026 08:31
@jopemachine jopemachine force-pushed the refactor/BA-5650-F-sokovan-owner-id branch from a2e1502 to 350294e Compare April 14, 2026 08:34
jopemachine and others added 7 commits April 14, 2026 19:17
``SessionRepository`` and the underlying ``SessionDBSource`` now take
``owner_id: UUID`` on every method that previously accepted
``owner_access_key: AccessKey``. Affects:

- ``get_session_validated``
- ``match_sessions``
- ``update_session_name``
- ``find_dependency_sessions`` / ``_find_dependent_sessions``
- ``get_target_session_ids``
- ``get_session_with_group``

The matching ``dependency_graph`` helpers and ``creators`` are updated
in lockstep. Service-layer callers still pass ``owner_access_key``
temporarily; they will be migrated in the next slice.
Scheduler / predicates / scheduler-type collapse of the owner key:

- ``scheduler/predicates.py``: predicates now take SessionRow and
  resolve ``main_access_key`` from the owner via a helper when a
  keypair-scoped lookup (Redis concurrency, keypair resource policy)
  is required. Renames ``SessionRow.user_uuid`` references throughout.
- ``scheduler/drf.py``: per-user fairness tracking keyed by
  ``owner_id``/``main_access_key`` pair.
- ``repositories/scheduler/options.py``: drop the duplicated
  ``by_access_key_*`` factories — session filters go through
  ``SessionConditions`` helpers instead.
- ``repositories/scheduler/types/*``: rename ``access_key`` to
  ``main_access_key`` on ``ScheduledSessionData``,
  ``TerminatingSessionData``, ``SweptSessionInfo``, ``KernelEnqueueData``
  and ``SessionEnqueueData``.
- ``repositories/events/db_source/db_source.py`` and
  ``repositories/stream/db_source/db_source.py``: resolve the owner
  UUID from ``main_access_key`` via a sub-select shim while the schema
  still exposes ``sessions.access_key``.
- repositories/scheduler/types/session.py: rename
  PendingSessionData.access_key -> main_access_key (and drop the
  outdated resolved-main_access_key comment).
- repositories/scheduler/db_source/db_source.py: update the
  PendingSessionData call site to main_access_key + owner_id keyword
  names matching the dataclass.
- scheduler/drf.py: use existing_sess.user_uuid (SessionRow stores the
  owner UUID there, not owner_id).
- scheduler/predicates.py: guard every _resolve_main_access_key
  consumer (check_concurrency, check_keypair_resource_limit,
  check_pending_session_count_limit, check_pending_session_resource_limit)
  with an early main_ak-is-None return so that NULL main_access_key
  users don't fall through to keypair policy lookups that match with
  NULL.
- repositories/scheduler/types/allocation.py: rename SessionAllocation's
  ``access_key`` field to ``main_access_key`` and drop the stale
  explanatory comment.
- repositories/stream/db_source/db_source.py: raise UserNotFound when
  no user owns the supplied ``main_access_key`` — SessionNotFound was
  misleading since the failure is about the user lookup, not the
  session.
- Drop the leftover ``changes/BA-5650-D.misc.md`` file.
- Scheduler db_source and preparer now use ``main_access_key`` /
  ``owner_id`` on ``SessionEnqueueData``, ``KernelEnqueueData``,
  ``TerminatingSessionData``, ``SweptSessionInfo``, and
  ``ScheduledSessionData``.
- ``PendingSessions.owner_ids`` replaces ``user_uuids`` in db_source
  call sites.
- ``scheduling_controller`` reads ``session_spec.access_key``
  (SessionCreationSpec still uses this name pre-sokovan rename).
- ``PendingSessionData.to_session_workload`` maps
  ``main_access_key`` → ``SessionWorkload.access_key`` and
  ``owner_id`` → ``SessionWorkload.user_uuid`` so the bridge works
  until slice F renames the sokovan SessionWorkload fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
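
The `main_ak is None` guard described in the commit notes above can be sketched roughly as follows. Everything here is an assumption made for illustration: `PredicateResult`, `resolve_main_access_key` and `check_concurrency_sketch` are hypothetical names, the dictionaries stand in for the database and Redis lookups, and the notes do not say whether the real predicates pass or fail in the NULL case (this sketch fails the check).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PredicateResult:
    # Hypothetical result type for this sketch.
    passed: bool
    message: str = ""


def resolve_main_access_key(
    owner_keys: dict[str, Optional[str]], owner_id: str
) -> Optional[str]:
    # Stand-in for the helper: None means the owner has no main access key.
    return owner_keys.get(owner_id)


def check_concurrency_sketch(
    owner_keys: dict[str, Optional[str]],
    active_counts: dict[str, int],
    limits: dict[str, int],
    owner_id: str,
) -> PredicateResult:
    main_ak = resolve_main_access_key(owner_keys, owner_id)
    if main_ak is None:
        # Early return: never run a keypair-scoped lookup with a NULL key.
        # Failing here is an assumption of this sketch.
        return PredicateResult(False, "owner has no main access key")
    if active_counts.get(main_ak, 0) >= limits[main_ak]:
        return PredicateResult(False, "concurrency limit reached")
    return PredicateResult(True)


owner_keys = {"u1": "AK1", "u2": None}
no_key = check_concurrency_sketch(owner_keys, {"AK1": 1}, {"AK1": 2}, "u2")
under_limit = check_concurrency_sketch(owner_keys, {"AK1": 1}, {"AK1": 2}, "u1")
```

The point of the guard is that the `None` case is handled before any dictionary (or, in the real code, policy table) is indexed by the key.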
@jopemachine jopemachine force-pushed the refactor/BA-5650-E-scheduler-owner-id branch from e10c89d to b3067da Compare April 14, 2026 10:20
jopemachine and others added 8 commits April 14, 2026 19:20
Rename access_key -> main_access_key on sokovan data types
(SessionAllocation, PreparedSessionData, SessionDataForPull,
SessionDataForStart, SessionWorkload) and update every sokovan caller
accordingly. Affects:

- sokovan/data/{allocation,lifecycle,workload}.py
- sokovan/scheduler/handlers/lifecycle/*
- sokovan/scheduler/handlers/maintenance/sweep_sessions.py
- sokovan/scheduler/provisioner/{provisioner,sequencers,validators}/*
- sokovan/scheduler/launcher/launcher.py
- sokovan/scheduler/post_processors/cache_invalidation.py
- sokovan/scheduler/fair_share/aggregator.py
- sokovan/scheduling_controller/{preparers,scheduling_controller}.py
- sokovan/deployment/{executor,route}.py

No external behavior change.
…comments

- Revert deployment/executor.py, deployment/route/coordinator.py,
  deployment/route/executor.py, deployment/route/handlers/observer/
  health_check.py to the parent branch state — those edits pre-dated
  main and are unrelated to BA-5650; they slipped in when this slice
  pulled the whole sokovan/ subtree.
- sokovan/data/allocation.py: restore the explanatory comment above
  ``main_access_key`` and fix ``from_agent_selections`` to pass
  ``main_access_key=`` (matches the renamed field).
- sokovan/data/workload.py: restore the explanatory comment above
  ``SessionWorkload.main_access_key``.
…ss_keys

- Resolve merge conflicts from cascading onto the updated slice E
  (drf.py keeps ``existing_sess.user_uuid``; predicates.py keeps the
  ``main_ak is None`` early returns; scheduler types stay on
  ``main_access_key``; stream db_source keeps ``UserNotFound``).
- Add ``SchedulerRepository.resolve_main_access_keys`` and the matching
  ``ScheduleDBSource.resolve_main_access_keys`` so the coordinator's
  cache-invalidation step can look up each session's owner
  main_access_key in a single query. Required by
  sokovan/scheduler/coordinator.py and lifecycle/deprioritize_sessions.py
  call sites that were previously reading ``session_info.metadata.access_key``.
- ``SessionWorkload`` now uses ``main_access_key`` / ``owner_id``; the
  ``PendingSessionData.to_session_workload`` bridge is updated accordingly.
- Scheduler db_source constructions of ``SessionDataForPull`` /
  ``SessionDataForStart`` use the new kwargs.
- ``cache_invalidation`` reverts to ``info.access_key``
  (``SessionTransitionInfo`` still carries the old field at slice F).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
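
To illustrate what "look up each session's owner main_access_key in a single query" could look like, here is a minimal sketch using the standard-library `sqlite3` module. The table and column names (`sessions.user_uuid`, `users.main_access_key`) and the function name are assumptions made for this example; the actual `resolve_main_access_keys` implementation is not shown in this conversation and presumably uses the project's own ORM layer.

```python
import sqlite3
from typing import Optional


def resolve_main_access_keys_sketch(
    conn: sqlite3.Connection, session_ids: list[str]
) -> dict[str, Optional[str]]:
    # Map each session id to its owner's main access key with one JOIN
    # instead of issuing one query per session.
    if not session_ids:
        return {}
    placeholders = ", ".join("?" for _ in session_ids)
    rows = conn.execute(
        f"""
        SELECT s.id, u.main_access_key
        FROM sessions AS s
        JOIN users AS u ON u.uuid = s.user_uuid
        WHERE s.id IN ({placeholders})
        """,
        session_ids,
    ).fetchall()
    return dict(rows)


# Tiny in-memory schema, purely hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (uuid TEXT PRIMARY KEY, main_access_key TEXT);
    CREATE TABLE sessions (id TEXT PRIMARY KEY, user_uuid TEXT);
    INSERT INTO users VALUES ('u1', 'AK1'), ('u2', NULL);
    INSERT INTO sessions VALUES ('s1', 'u1'), ('s2', 'u2'), ('s3', 'u1');
    """
)
keys = resolve_main_access_keys_sketch(conn, ["s1", "s2"])
```

A caller such as a cache-invalidation step would then skip or special-case sessions whose value is `None`, since those owners have no main access key.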
@jopemachine jopemachine force-pushed the refactor/BA-5650-F-sokovan-owner-id branch from 350294e to efa2a91 Compare April 14, 2026 10:21
@jopemachine jopemachine changed the base branch from refactor/BA-5650-E-scheduler-owner-id to refactor/BA-5650-D-session-repo-owner-id April 15, 2026 01:41
@jopemachine jopemachine changed the title refactor(BA-5714): propagate owner_id rename into sokovan refactor(BA-5650): propagate owner_id through scheduler and sokovan [2/4] Apr 15, 2026
jopemachine and others added 2 commits April 15, 2026 10:50
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jopemachine jopemachine added this to the 26.5 milestone Apr 15, 2026
@jopemachine

Squashed into PR #11051 — stacked PRs cannot pass CI independently with cross-cutting type changes.

@jopemachine jopemachine reopened this Apr 15, 2026
@jopemachine

Restructured into 2 self-contained PRs (#11050, #11051) for CI pass.
