# ATOM Scheduling & KV Cache Guide

ATOM (AiTer Optimized Model) uses a prefill-first scheduler with paged KV cache block management to drive LLM inference on AMD ROCm/HIP GPUs. This guide covers the scheduling algorithm, batch construction, block-level KV cache management, prefix caching, postprocessing, speculative decoding integration, and sequence lifecycle.

## Quick Reference

| Class | File | Purpose |
| --- | --- | --- |
| Scheduler | atom/model_engine/scheduler.py | Orchestrates prefill/decode scheduling, preemption, and postprocessing |
| ScheduledBatch | atom/model_engine/scheduler.py | Immutable snapshot of a scheduled batch sent to the model runner |
| ScheduledBatchOutput | atom/model_engine/scheduler.py | Holds sampled token IDs and draft token IDs returned from the forward pass |
| BlockManager | atom/model_engine/block_manager.py | Manages paged KV cache blocks with allocation, deallocation, and prefix caching |
| Block | atom/model_engine/block_manager.py | Single KV cache block with ID, reference count, hash, and token IDs |
| Sequence | atom/model_engine/sequence.py | Tracks a single request through its lifetime (tokens, blocks, status, timing) |
| SequenceStatus | atom/model_engine/sequence.py | Enum: WAITING, RUNNING, FINISHED, EXIT_ENGINE |
| SequenceType | atom/model_engine/sequence.py | Enum: DUMMY, PREFILL, DECODE |
| RequestOutput | atom/model_engine/request.py | Dataclass streamed to clients with new tokens and finish status |
| Config | atom/config.py | Scheduling-related fields: max_num_seqs, max_num_batched_tokens, kv_cache_block_size, etc. |

Key config defaults:

| Field | Default | Description |
| --- | --- | --- |
| max_num_seqs | 512 | Maximum sequences in a single batch |
| max_num_batched_tokens | 16384 | Maximum tokens scheduled in a single step |
| kv_cache_block_size | 16 | Tokens per KV cache block (must be a multiple of 16, or 1) |
| enable_prefix_caching | False | Enable hash-based prefix block sharing |
| scheduler_delay_factor | 0.0 | Delay factor for batching prompt requests (0 = no delay) |
| gpu_memory_utilization | 0.9 | Fraction of GPU memory for KV cache |
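As a usage illustration only (the exact Config constructor lives in atom/config.py and may accept these values differently, e.g. via CLI flags), a setup favoring larger prefill batches and prefix reuse might look like:

```python
# Hypothetical configuration sketch -- assumes Config accepts these fields
# as keyword arguments; check atom/config.py for the real signature.
from atom.config import Config

config = Config(
    max_num_seqs=256,              # cap on concurrent sequences per batch
    max_num_batched_tokens=32768,  # larger prefill chunks per step
    kv_cache_block_size=16,        # tokens per KV cache block
    enable_prefix_caching=True,    # share blocks across common prefixes
    scheduler_delay_factor=0.5,    # wait up to 0.5x the last prompt latency
    gpu_memory_utilization=0.9,    # fraction of GPU memory for KV cache
)
```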

## 1. Scheduling Algorithm

The scheduler implements a prefill-first policy: all waiting (prefill) requests are scheduled before any running (decode) requests. The entry point is Scheduler.schedule(), which returns a (ScheduledBatch, dict[int, Sequence]) tuple or None if both queues are empty.

### 1.1 Scheduler Initialization

```python
class Scheduler:
    def __init__(self, config: Config):
        self.max_num_seqs = config.max_num_seqs
        self.max_num_batched_tokens = config.max_num_batched_tokens
        self.bos_token_id = config.bos_token_id
        self.eos_token_id = config.eos_token_id
        self.stop_token_ids = config.stop_token_ids
        self.block_manager = BlockManager(config)
        self.waiting: deque[Sequence] = deque()
        self.running: deque[Sequence] = deque()
        self.prev_time = 0.0
        self.prev_prompt = False
        self.last_prompt_latency = 0.0
        self.delay_factor = config.scheduler_delay_factor
        self.use_spec = config.speculative_config is not None
        self.mtp_k: int = (
            config.speculative_config.num_speculative_tokens if self.use_spec else 0
        )
        self.total_draft_tokens = 0
        self.total_accepted_tokens = 0
```

The scheduler maintains two deques -- waiting (pending prefill) and running (active decode) -- plus a BlockManager for KV cache allocation.

### 1.2 Schedule Flow

Scheduler.schedule() proceeds in two phases:

Phase 1 -- Prefill scheduling:

1. While the delay gate passes (_passed_delay), the waiting queue is non-empty, and num_seqs_prefill < max_num_seqs:
   - Peek the first waiting sequence.
   - Compute num_new_tokens = seq.num_tokens - seq.num_cached_tokens (prefix cache hits reduce the new-token count).
   - If num_batched_tokens + num_new_tokens > max_num_batched_tokens, or block_manager.can_allocate(seq) returns False, break.
   - Otherwise: allocate blocks, set seq.status = RUNNING and seq.type = PREFILL, and move the sequence from waiting to running.
2. If any prefill sequences were scheduled, return the batch immediately (prefills and decodes are never mixed).

Phase 2 -- Decode scheduling (only when zero prefills were scheduled):

1. Pop sequences from running, up to max_num_seqs.
2. For each sequence, check block_manager.can_append(seq).
3. If a block cannot be appended, preempt the last running sequence (move it back to waiting with status WAITING and deallocate its blocks).
4. If the sequence has speculative draft tokens (seq.spec_token_ids), record them in scheduled_spec_decode_tokens.
5. Call block_manager.may_append(seq, num_new_tokens), where num_new_tokens = mtp_k + 1.
6. Re-insert all scheduled sequences back into running, preserving order. A condensed sketch of the full two-phase flow follows.
### 1.3 Delay Factor

When scheduler_delay_factor > 0, the scheduler delays prefill scheduling to allow the waiting queue to accumulate more requests for better batching:

```python
def _passed_delay(self, now: float) -> bool:
    if self.prev_prompt:
        self.last_prompt_latency = now - self.prev_time
    self.prev_time, self.prev_prompt = now, False
    if self.delay_factor > 0 and self.waiting:
        earliest_arrival_time = min([seq.arrive_time for seq in self.waiting])
        passed_delay = (now - earliest_arrival_time) > (
            self.delay_factor * self.last_prompt_latency
        ) or not self.running
    else:
        passed_delay = True
    return passed_delay
```

A new prefill is scheduled only when the earliest waiting request has waited longer than delay_factor * last_prompt_latency, or when there are no running decode requests.
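A quick worked example of the gate arithmetic:

```python
# Illustrative numbers only.
delay_factor = 0.5
last_prompt_latency = 0.200          # seconds, from the previous prefill
now, earliest_arrival_time = 10.000, 9.950

waited = now - earliest_arrival_time                   # 0.050 s
passed = waited > delay_factor * last_prompt_latency   # 0.050 > 0.100 -> False
# The prefill is held back unless the running queue is empty.
```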

### 1.4 Preemption

When a decode step cannot extend a sequence's KV cache (no free blocks), the scheduler preempts the last running sequence:

```python
def preempt(self, seq: Sequence):
    seq.status = SequenceStatus.WAITING
    self.block_manager.deallocate(seq)
    self.waiting.appendleft(seq)
```

The preempted sequence is pushed to the front of the waiting queue and its blocks are fully deallocated, so it will be re-prefilled on the next scheduling cycle.


## 2. ScheduledBatch Structure

ScheduledBatch is constructed by Scheduler.schedule() and passed to the model runner. It is a frozen snapshot of batch metadata.

### 2.1 Constructor Signature

```python
class ScheduledBatch:
    def __init__(
        self,
        seqs: dict[int, Sequence],
        num_scheduled_tokens: list[int],
        total_tokens_num: int,
        total_tokens_num_prefill: int = 0,
        total_tokens_num_decode: int = 0,
        total_seqs_num: int = 0,
        total_seqs_num_prefill: int = 0,
        total_seqs_num_decode: int = 0,
        is_dummy_run: bool = False,
        num_spec_step: int = 0,
        scheduled_spec_decode_tokens: dict[int, list[int]] = {},
    ):
```

### 2.2 Fields

| Field | Type | Description |
| --- | --- | --- |
| req_ids | list[int] | Sequence IDs in batch order (list(seqs.keys())) |
| scheduled_tokens | list[list[int]] | Last num_tokens token IDs per sequence (the tokens to process) |
| temperatures | list[float] | Sampling temperature per sequence |
| context_lens | list[int] | Total token count per sequence (seq.num_tokens) |
| block_tables | list[list[int]] | Block ID tables for sequences that have block tables |
| last_block_num_tokens | list[int] | Number of valid tokens in each sequence's last block |
| num_cached_tokens | list[int] | Number of tokens served from the prefix cache per sequence |
| num_scheduled_tokens | list[int] | Number of new tokens scheduled per sequence |
| total_tokens_num | int | Sum of all scheduled tokens across all sequences |
| total_tokens_num_prefill | int | Total scheduled tokens for prefill sequences |
| total_tokens_num_decode | int | Total scheduled tokens for decode sequences |
| total_seqs_num | int | Total number of sequences in the batch |
| total_seqs_num_prefill | int | Number of prefill sequences |
| total_seqs_num_decode | int | Number of decode sequences |
| is_dummy_run | bool | Whether this is a dummy/warmup run |
| num_spec_step | int | Number of speculative decode steps (mtp_k) |
| scheduled_spec_decode_tokens | dict[int, list[int]] | Draft token IDs per sequence ID from the prior speculative step |
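The per-sequence lists are derived from the seqs dict inside the constructor. A sketch of the derivations implied by the table above (the real __init__ may differ in detail):

```python
# Illustrative derivation of per-sequence batch metadata.
self.req_ids = list(seqs.keys())
self.scheduled_tokens = [
    seq.token_ids[-n:] for seq, n in zip(seqs.values(), num_scheduled_tokens)
]
self.temperatures = [seq.temperature for seq in seqs.values()]
self.context_lens = [seq.num_tokens for seq in seqs.values()]
self.block_tables = [seq.block_table for seq in seqs.values() if seq.block_table]
self.last_block_num_tokens = [seq.last_block_num_tokens for seq in seqs.values()]
self.num_cached_tokens = [seq.num_cached_tokens for seq in seqs.values()]
self.num_scheduled_tokens = num_scheduled_tokens
```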

### 2.3 ScheduledBatchOutput

Returned by the model runner after a forward pass:

```python
class ScheduledBatchOutput:
    def __init__(
        self,
        token_ids: dict[int, tuple[int, ...]],
        draft_token_ids,
    ):
        self.req_ids = list(token_ids.keys())
        self.token_ids = token_ids              # {seq_id: (accepted_token_ids...)}
        self.draft_token_ids = draft_token_ids  # {seq_id: [draft_ids]} or None
```

- token_ids maps sequence ID to a tuple of accepted token IDs.
- draft_token_ids maps sequence ID to a list of speculative draft token IDs for the next step (when MTP is active).
- A special key -1 in token_ids signals deferred output mode.

## 3. Block Manager

The BlockManager implements paged KV cache management with fixed-size blocks.

### 3.1 Block Class

```python
class Block:
    def __init__(self, block_id):
        self.block_id = block_id   # Unique integer ID
        self.ref_count = 0         # Number of sequences referencing this block
        self.hash = -1             # xxhash64 digest for prefix caching (-1 = unhashed)
        self.token_ids = []        # Token IDs stored in this block
```

Methods:

- update(hash, token_ids) -- sets the block's hash and token content.
- reset() -- sets ref_count = 1, hash = -1, token_ids = [] (used on fresh allocation).

### 3.2 BlockManager Initialization

```python
class BlockManager:
    def __init__(self, config: Config):
        block_size = config.kv_cache_block_size   # Tokens per block (default 16)
        num_blocks = config.num_kvcache_blocks    # Total blocks in pool
        self.block_size = block_size
        self.blocks: list[Block] = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id: dict[int, int] = dict()
        self.free_block_ids: deque[int] = deque(range(num_blocks))
        self.used_block_ids: set[int] = set()
        self.enable_prefix_caching = config.enable_prefix_caching
```

The block pool is pre-allocated at startup. free_block_ids is a deque for O(1) pop/push, used_block_ids tracks active blocks, and hash_to_block_id maps content hashes to block IDs for prefix caching.

### 3.3 Allocation (allocate)

Called during prefill scheduling for new sequences:

```python
def allocate(self, seq: Sequence):
```

1. Iterates over seq.num_blocks logical blocks.
2. For each block, computes a hash if the block is full (len(token_ids) == block_size). Partial (last) blocks get hash = -1.
3. If prefix caching is enabled, looks up hash_to_block_id:
   - Cache hit: verifies that the token IDs match. If the block is already in used_block_ids, increments ref_count; if it was evicted but is still in the free list, re-allocates it. Increments seq.num_cached_tokens by block_size.
   - Cache miss: allocates from free_block_ids[0].
4. Full blocks are registered in hash_to_block_id. A sketch consistent with these steps appears below.
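This sketch assumes a seq.block(i) helper returning the token IDs of the i-th logical block and an internal _allocate_block() that pops the free list and calls reset() (both assumptions; compute_hash is shown in section 4.1):

```python
def allocate(self, seq: Sequence):
    """Illustrative allocation sketch following steps 1-4 above."""
    h = -1
    cache_miss = False
    for i in range(seq.num_blocks):
        token_ids = seq.block(i)                     # assumed helper
        full = len(token_ids) == self.block_size
        h = compute_hash(token_ids, prefix=h) if full else -1
        block_id = self.hash_to_block_id.get(h, -1) if self.enable_prefix_caching else -1
        if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
            cache_miss = True                        # sticky for remaining blocks
        if cache_miss:
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)   # reset(): ref_count = 1
        else:
            seq.num_cached_tokens += self.block_size
            if block_id in self.used_block_ids:
                block = self.blocks[block_id]
                block.ref_count += 1                 # shared prefix block
            else:
                block = self._allocate_block(block_id)  # revive evicted block
        if full:
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id      # register full block
        seq.block_table.append(block_id)
```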

### 3.4 Deallocation (deallocate)

Called when a sequence finishes or is preempted:

```python
def deallocate(self, seq: Sequence):
    for block_id in reversed(seq.block_table):
        block = self.blocks[block_id]
        block.ref_count -= 1
        if block.ref_count == 0:
            self._deallocate_block(block_id)
    seq.num_cached_tokens = 0
    seq.block_table.clear()
```

Blocks are released in reverse order. Shared blocks (with ref_count > 1 from prefix caching) are not freed until all referencing sequences release them.

### 3.5 Can-Allocate and Can-Append Checks

```python
def can_allocate(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= seq.num_blocks

def can_append(self, seq: Sequence) -> bool:
    # The boolean coerces to 0 or 1: a decode step needs at most one new
    # block, and only when the previous block has just filled up.
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)
```

- can_allocate checks that enough free blocks exist for the full sequence.
- can_append checks whether the next decode step needs a new block. A new block is needed only when len(seq) % block_size == 1 (the previous block just filled up), requiring exactly one free block; see the worked example below.
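The worked example below makes the arithmetic concrete:

```python
# With block_size = 16, a sequence of 33 tokens has just spilled into a new
# block (32 tokens filled two blocks exactly), so one free block is needed.
block_size = 16
needs_block = int(33 % block_size == 1)   # 33 % 16 == 1 -> 1 block needed
no_block = int(40 % block_size == 1)      # 40 % 16 == 8 -> 0 blocks needed
```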

### 3.6 May-Append (Decode Extension)

```python
def may_append(self, seq: Sequence, num_new_tokens: int = 1):
```

Called during decode scheduling to extend a sequence's block table:

1. If the sequence length modulo block_size falls within (0, num_new_tokens], or block_size == 1, a new block is needed:
   - Allocates from free_block_ids and appends the new block to block_table.
   - For block_size == 1, immediately computes and stores the hash.
2. If seq_len % block_size == 0, the last block is now full: computes and stores its hash using the chained prefix.
3. Otherwise the last block is partially filled and keeps hash = -1 (hashing is deferred until the block is full). A sketch of this logic appears below.
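A sketch of this logic, reusing the assumed seq.block(i) helper from the allocation sketch (the block_size == 1 immediate-hash case is omitted for brevity):

```python
def may_append(self, seq: Sequence, num_new_tokens: int = 1):
    """Illustrative decode-extension sketch following steps 1-3 above."""
    block_table = seq.block_table
    pos = len(seq) % self.block_size
    if 0 < pos <= num_new_tokens or self.block_size == 1:
        # Crossing a block boundary: reserve a fresh block for new tokens.
        block_id = self.free_block_ids[0]
        self._allocate_block(block_id)
        block_table.append(block_id)
    elif pos == 0:
        # Last block just became full: hash it and register it for reuse.
        last_block = self.blocks[block_table[-1]]
        prefix = self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        token_ids = seq.block(seq.num_blocks - 1)   # assumed helper
        h = compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
    # Otherwise: the last block is partial and keeps hash = -1 until full.
```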

## 4. Prefix Caching

Prefix caching enables sharing KV cache blocks across sequences that share a common prompt prefix, avoiding redundant computation.

### 4.1 Hash Function

ATOM uses xxhash64 (via the xxhash Python library) for fast, collision-resistant block hashing:

```python
import numpy as np
import xxhash

# Classmethod defined in atom/model_engine/block_manager.py.
@classmethod
def compute_hash(cls, token_ids: list[int], prefix: int = -1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()
```

### 4.2 Hash Chaining

Blocks form a hash chain: each block's hash incorporates the previous block's hash as a prefix. This ensures that two blocks with identical token content but different preceding context produce different hashes.

- First block: compute_hash(token_ids, prefix=-1) (no prefix).
- Subsequent blocks: compute_hash(token_ids, prefix=prev_block.hash).
- Only full blocks (where len(token_ids) == block_size) receive a hash; partial blocks keep hash = -1 and are never cached. A small worked example follows.
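A worked example of the chain (block contents are illustrative; assume a block size of 4 for brevity):

```python
import numpy as np
import xxhash

def compute_hash(token_ids, prefix=-1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()

h_a = compute_hash([1, 2, 3, 4])              # first block: no prefix
h_b = compute_hash([5, 6, 7, 8], prefix=h_a)  # chained off block A
h_c = compute_hash([5, 6, 7, 8], prefix=0xABCD)
assert h_b != h_c   # identical tokens, different preceding context
```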

### 4.3 Cache Lookup During Allocation

During allocate(), for each full block:

1. Compute the block hash via the chain.
2. Look up hash_to_block_id.get(h, -1).
3. If found, verify self.blocks[block_id].token_ids == token_ids (a guard against hash collisions).
4. Hit: reuse the block. If it is already in used_block_ids, increment ref_count. Add block_size to seq.num_cached_tokens.
5. Miss (or first miss in the chain): once a cache miss occurs, all subsequent blocks in the sequence are also misses (cache_miss = True is sticky). Allocate fresh blocks from the free list.

### 4.4 Reference Counting

- On allocation: block.reset() sets ref_count = 1.
- On a cache hit for an in-use block: ref_count += 1.
- On deallocation: ref_count -= 1. The block returns to the free list only when ref_count == 0.
- Shared blocks (prefix cache hits) have ref_count > 1.

### 4.5 Enabling Prefix Caching

Set enable_prefix_caching=True in Config. When disabled, the hash lookup in allocate() is skipped entirely (block_id is always -1).


## 5. Postprocessing

Scheduler.postprocess() is called after the model forward pass to update sequences with sampled tokens, check stop conditions, generate streaming output, and clean up finished sequences.

### 5.1 Signature

```python
def postprocess(
    self,
    seqs: list[Sequence],
    fwd_output: ScheduledBatchOutput,
    stream_output_queue=None,
) -> list[Sequence]:
```

### 5.2 Token Appending

For each running sequence whose ID appears in fwd_output.req_ids:

- Deferred output, or speculative decode with EOS: replaces placeholder tokens in place:

  ```python
  seq.token_ids[-num_placeholder:] = token_ids
  seq.output_tokens[-num_placeholder:] = token_ids
  ```

- Normal path: calls seq.append_token(token_id) for each accepted token, which appends to token_ids and updates output_tokens, last_token, and num_tokens.

### 5.3 Stop Condition Checking

The postprocessor checks stop conditions in priority order (sketched below):

1. Stop token sequences: compares the tail of seq.token_ids against each entry in seq.stop_token_sequences, and also checks the MTP-adjusted position during speculative decode. Sets leave_reason = "stop_sequence".
2. EOS token: if self.eos_token_id appears in the accepted tokens and seq.ignore_eos is False. Sets leave_reason = "eos".
3. Stop token IDs: if any accepted token is in self.stop_token_ids (from Config.stop_token_ids, derived from the model's generation config). Sets leave_reason = "stop_{token_id}".
4. Max tokens: if seq.num_completion_tokens >= seq.max_tokens. Sets leave_reason = "max_tokens".
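The checks condense to a sketch like the following (check_stop is a hypothetical helper; the real postprocess() inlines this logic and additionally handles the MTP-adjusted positions):

```python
def check_stop(self, seq, accepted_tokens) -> str | None:
    """Return a leave_reason string, or None to keep the sequence running."""
    for stop_seq in seq.stop_token_sequences or []:
        if seq.token_ids[-len(stop_seq):] == stop_seq:
            return "stop_sequence"
    if not seq.ignore_eos and self.eos_token_id in accepted_tokens:
        return "eos"
    for tok in accepted_tokens:
        if tok in self.stop_token_ids:
            return f"stop_{tok}"
    if seq.num_completion_tokens >= seq.max_tokens:
        return "max_tokens"
    return None
```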

### 5.4 Stream Output

When stream_output_queue is provided, the scheduler creates a RequestOutput for each processed sequence:

```python
request_output = RequestOutput(
    request_id=seq.id,
    output_tokens=output_tokens_list,
    finished=(leave_reason is not None),
    finish_reason=leave_reason,
)
```

RequestOutput fields:

| Field | Type | Description |
| --- | --- | --- |
| request_id | int | Sequence ID |
| output_tokens | list[int] | Newly generated tokens since the last callback |
| finished | bool | Whether the sequence is done |
| finish_reason | Optional[str] | One of "eos", "max_tokens", "stop_sequence", "stop_{token_id}", or None |

Stream outputs are batched and put onto stream_output_queue via put_nowait.

### 5.5 Sequence Cleanup

For finished sequences:

1. Set seq.status = SequenceStatus.FINISHED.
2. Call block_manager.deallocate(seq) to free its KV cache blocks.
3. Remove the sequence from the running deque.
4. Return it in the finished_seqs list.

### 5.6 Placeholder Insertion

When speculative decoding or deferred output is active, placeholder EOS tokens are appended to still-running sequences to reserve KV cache slots for the next step:

```python
if need_placeholder:
    for seq in seqs:
        if seq.status == SequenceStatus.RUNNING:
            for _ in range(seq.num_placeholder):
                seq.append_token(self.eos_token_id)
```

The placeholder count is determined as follows (sketched after this list):

- For sequences processed in this step (they had output in fwd_output): always 1 + mtp_k, regardless of mode.
- For sequences not processed (skipped in this step), the count depends on the batch-level mode:
  - Deferred output + speculative: mtp_k + 1
  - Deferred output only: 1
  - Speculative only: mtp_k
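The rule condenses to a small case analysis; the sketch below is an illustrative restatement, not the actual code:

```python
def placeholder_count(processed: bool, deferred: bool, speculative: bool,
                      mtp_k: int) -> int:
    """Placeholder tokens to append for one still-running sequence."""
    if processed:                  # had output in fwd_output this step
        return 1 + mtp_k
    if deferred and speculative:   # skipped, both modes active
        return mtp_k + 1
    if deferred:
        return 1
    if speculative:
        return mtp_k
    return 0
```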

## 6. Speculative Decoding Integration

ATOM supports Multi-Token Prediction (MTP) speculative decoding, where a draft model proposes mtp_k additional tokens per step.

### 6.1 Scheduler Tracking

```python
self.use_spec = config.speculative_config is not None
self.mtp_k: int = (
    config.speculative_config.num_speculative_tokens if self.use_spec else 0
)
self.total_draft_tokens = 0
self.total_accepted_tokens = 0
```

Note: SpeculativeConfig currently enforces num_speculative_tokens == 1.

### 6.2 Draft Tokens in Scheduling

During decode scheduling:

- If seq.spec_token_ids is non-empty, the draft tokens are recorded in scheduled_spec_decode_tokens[seq.id].
- num_new_tokens = mtp_k + 1 (one target token plus mtp_k draft tokens), so may_append reserves enough block space.
- The ScheduledBatch carries num_spec_step = mtp_k and the scheduled_spec_decode_tokens dict.

### 6.3 Acceptance Statistics

```python
def update_spec_stats(self, num_accepted_tokens):
    self.total_draft_tokens += self.mtp_k
    self.total_accepted_tokens += num_accepted_tokens - self.mtp_k
```

Every 1000 draft tokens, the acceptance rate is logged:

```
[MTP Stats] Total draft tokens: 5000, Accepted: 3750, Acceptance rate: 75.00%
```

### 6.4 Draft Token Storage on Sequences

After postprocessing, accepted draft token IDs for the next step are stored on the sequence:

```python
if draft_token_ids and seq.id in draft_token_ids:
    seq.spec_token_ids = draft_token_ids[seq.id]
```

These are picked up by the scheduler on the next schedule() call.


## 7. Sequence Management

The Sequence class represents a single request throughout its lifecycle.

### 7.1 Constructor

```python
class Sequence:
    def __init__(
        self,
        token_ids: list[int],
        block_size: int,
        sampling_params=SamplingParams(),
        stop_token_sequences: list[list[int]] = None,
        stream_callback: Optional[Callable[[Any], None]] = None,
        id=None,
    ):
```

### 7.2 Core Fields

| Field | Type | Description |
| --- | --- | --- |
| id | int | Auto-incrementing unique ID (from itertools.count) |
| token_ids | list[int] | Full token sequence (prompt + completion) |
| block_size | int | KV cache block size (from config) |
| status | SequenceStatus | Current lifecycle state |
| type | SequenceType | Current step type (DUMMY, PREFILL, DECODE) |
| num_tokens | int | Total tokens (prompt + completion); a property whose setter also updates num_blocks and last_block_num_tokens |
| num_prompt_tokens | int | Number of prompt tokens (fixed at init) |
| num_cached_tokens | int | Tokens served from the prefix cache |
| block_table | list[int] | Ordered list of block IDs assigned to this sequence |
| last_token | int | Most recently appended token ID |
| temperature | float | Sampling temperature (from SamplingParams) |
| max_tokens | int | Max completion tokens (from SamplingParams, default 64) |
| ignore_eos | bool | Whether to ignore EOS tokens (from SamplingParams) |
| stop_strings | Optional[list[str]] | Stop strings (from SamplingParams) |
| stop_token_sequences | list[list[int]] | Token-level stop sequences |
| stream_callback | Optional[Callable] | Per-sequence stream callback |
| output_tokens | list[int] | Cache of newly generated tokens |
| spec_token_ids | list[int] | Speculative draft token IDs for the next step |
| num_placeholder | int | Number of placeholder tokens inserted for speculative/deferred output |

### 7.3 Timing Fields

| Field | Type | Description |
| --- | --- | --- |
| arrive_time | float | Timestamp when the sequence entered the scheduler |
| first_token_time | float | Timestamp of the first completion token (TTFT measurement) |
| leave_time | float | Timestamp when the sequence finished |
| leave_reason | str | Reason for finishing (e.g., "eos", "max_tokens", "stop_sequence") |

### 7.4 Computed Properties

| Property | Returns |
| --- | --- |
| num_completion_tokens | num_tokens - num_prompt_tokens |
| prompt_token_ids | token_ids[:num_prompt_tokens] |
| completion_token_ids | token_ids[num_prompt_tokens:] |
| num_cached_blocks | num_cached_tokens // block_size |
| is_finished | status == SequenceStatus.FINISHED |

### 7.5 num_tokens Setter

Setting num_tokens triggers derived field updates:

```python
@num_tokens.setter
def num_tokens(self, value):
    self._num_tokens = value
    self.num_blocks = (value + self.block_size - 1) // self.block_size
    self.last_block_num_tokens = self._num_tokens - (self.num_blocks - 1) * self.block_size
```
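For example, with a 16-token block size:

```python
# Worked example: 35 tokens with block_size = 16.
block_size, value = 16, 35
num_blocks = (value + block_size - 1) // block_size            # ceil(35/16) = 3
last_block_num_tokens = value - (num_blocks - 1) * block_size  # 35 - 32 = 3
```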

### 7.6 Lifecycle

```
                          allocate blocks
   add(seq) ---------> WAITING ---------> RUNNING (PREFILL)
                          ^                    |
                          |                    | next schedule() step
                     preempt()                 v
                          |              RUNNING (DECODE) <--+
                          +--- can't append    |             |
                                               | stop condition met
                                               v
                                           FINISHED
                                               |
                                               | deallocate blocks
                                               v
                                         (removed from running)
```

### 7.7 SequenceStatus Enum

| Value | Meaning |
| --- | --- |
| WAITING | In the waiting queue, pending prefill |
| RUNNING | Actively being processed (prefill or decode) |
| FINISHED | Stop condition met, blocks deallocated |
| EXIT_ENGINE | Sentinel for engine shutdown |

### 7.8 SequenceType Enum

| Value | Meaning |
| --- | --- |
| DUMMY | Initial state before scheduling |
| PREFILL | Currently in prefill phase |
| DECODE | Currently in decode phase |

## Source Files

| File | Description |
| --- | --- |
| atom/model_engine/scheduler.py | Scheduler, ScheduledBatch, ScheduledBatchOutput -- scheduling algorithm, postprocessing, speculative decode stats |
| atom/model_engine/block_manager.py | Block, BlockManager -- paged KV cache block pool, allocation/deallocation, prefix caching with xxhash64 |
| atom/model_engine/sequence.py | Sequence, SequenceStatus, SequenceType -- request lifecycle, token management, timing |
| atom/model_engine/request.py | RequestOutput -- streaming output dataclass with request_id, output_tokens, finished, finish_reason |
| atom/config.py | Config -- scheduling-related fields (max_num_seqs, max_num_batched_tokens, kv_cache_block_size, enable_prefix_caching, scheduler_delay_factor), SpeculativeConfig |
| atom/sampling_params.py | SamplingParams -- temperature, max_tokens, ignore_eos, stop_strings |