
Commit 6b55b0c

Add tracing support in deepseek for vLLM flow (#36605)
### Ticket
#36604

### What's changed
DeepSeek supports tracing only in the decode path for now. This patch implements tracing in the vLLM workflow for DeepSeek.

### Checklist
- [ ] [![All post-commit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml/badge.svg?branch=pprajapati/vllm_tracing)](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch:pprajapati/vllm_tracing)
- [ ] [![Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml/badge.svg?branch=pprajapati/vllm_tracing)](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml?query=branch:pprajapati/vllm_tracing)
- [ ] [![cpp-unit-tests](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml/badge.svg?branch=pprajapati/vllm_tracing)](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml?query=branch:pprajapati/vllm_tracing)
- [ ] New/Existing tests provide coverage for changes

#### Model tests
If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers `models-mandatory` and `models-extended` presets. The former includes a minimal set of tests, to be run always. The latter extends that with additional ones; use your best judgement in deciding which is most appropriate for your PR.

- [ ] [![(Single) Choose your pipeline](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select.yaml/badge.svg?branch=pprajapati/vllm_tracing)](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select.yaml?query=branch:pprajapati/vllm_tracing)
  - [ ] `models-mandatory` preset (runs: [Device perf regressions](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) and [Frequent model and ttnn tests](https://github.com/tenstorrent/tt-metal/actions/workflows/fast-dispatch-full-regressions-and-models.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/single-card-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) tests)
  - [ ] other selection - specify runs
- [ ] [![(T3K) Choose your pipeline](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-t3k.yaml/badge.svg?branch=pprajapati/vllm_tracing)](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-t3k.yaml?query=branch:pprajapati/vllm_tracing)
  - [ ] `models-mandatory` preset (runs: [Unit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-unit-tests.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-model-perf-tests.yaml) tests)
  - [ ] other selection - specify runs
- [ ] [![(Galaxy) Choose your pipeline](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-galaxy.yaml/badge.svg?branch=pprajapati/vllm_tracing)](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-galaxy.yaml?query=branch:pprajapati/vllm_tracing)
  - [ ] `models-mandatory` preset (runs: [Quick tests](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-quick.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-model-perf-tests.yaml) tests)
  - [ ] other selection - specify runs

---------

Signed-off-by: Pratikkumar Prajapati <pprajapati@tenstorrent.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 7ca802b commit 6b55b0c

File tree

3 files changed: +115 -50 lines


models/demos/deepseek_v3/demo/demo.py

Lines changed: 12 additions & 1 deletion
@@ -277,7 +277,18 @@ def run_demo(
     logger.info(f"Opening mesh device with shape {mesh_shape}")
     if enable_trace:
         logger.info("Enabling trace for decode forward pass")
-        trace_region_size = 4880384 + int(0.20 * 4880384)  # 20% additional
+        # NOTE:
+        # The base trace region size below (~36.3 MiB) was empirically determined from
+        # vLLM decode workloads to be sufficient to keep the trace buffer from
+        # overflowing under typical DeepSeek-V3 demo settings (batch size, sequence
+        # length, and mesh configuration). We add 20% headroom as a conservative
+        # safety margin to accommodate variability across models / prompts without
+        # repeatedly re-tuning this value.
+        #
+        # If you are optimizing memory usage, this can be reduced after verifying
+        # that tracing completes without buffer exhaustion for your target workload.
+        BASE_TRACE_REGION_BYTES = 38_070_272
+        trace_region_size = BASE_TRACE_REGION_BYTES + int(0.20 * BASE_TRACE_REGION_BYTES)
         logger.info(f"Trace region size set to {trace_region_size}")
         mesh_device = ttnn.open_mesh_device(mesh_shape=mesh_shape, trace_region_size=trace_region_size)
     else:
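
To make the sizing concrete, here is a minimal, hardware-free sketch of the headroom arithmetic above. The constant and the 20% factor come from the diff; the MiB conversion and printout are illustrative only.

```python
# Sketch of the trace-region sizing logic from demo.py above.
BASE_TRACE_REGION_BYTES = 38_070_272  # ~36.3 MiB, empirically determined
HEADROOM = 0.20  # safety margin for variability across models / prompts

trace_region_size = BASE_TRACE_REGION_BYTES + int(HEADROOM * BASE_TRACE_REGION_BYTES)
print(f"{trace_region_size} bytes (~{trace_region_size / 2**20:.1f} MiB)")  # ~43.6 MiB
```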

models/demos/deepseek_v3/tt/generator.py

Lines changed: 94 additions & 42 deletions
@@ -132,12 +132,21 @@ def __init__(
         self.random_weights = random_weights
         self.single_layer = single_layer

+        # Model runtime state
+        self.model_state = None
+        self.model_shared_state = None
+        self.model_prefill_cfg = None
+        self.model_decode_cfg = None
+        self.model_weight_config = None
+        self.page_tables_tt = None
+
         # Trace state (decode)
         self._trace_id: int | None = None
         self._trace_tokens: ttnn.Tensor | None = None
         self._trace_positions: ttnn.Tensor | None = None
         self._trace_rot_idxs: ttnn.Tensor | None = None
         self._trace_output: ttnn.Tensor | None = None
+        self._trace_page_tables_to_use: tuple[ttnn.Tensor, ...] | None = None
         self.enable_trace = enable_trace
         self.signpost = signpost
         self.prefill_max_tokens = prefill_max_tokens
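
A side note on the initialization added above: declaring every piece of runtime state in `__init__` lets the cleanup path below use plain `is not None` checks instead of `hasattr` probing. A minimal sketch of the pattern (names abbreviated from the diff; the bodies are illustrative stand-ins):

```python
class GeneratorSketch:
    """Illustrates the None-initialization pattern used by the generator."""

    def __init__(self):
        # Every runtime attribute exists from construction onward,
        # so cleanup never needs hasattr().
        self.model_state = None
        self._trace_id = None

    def cleanup_all(self):
        if self._trace_id is not None:  # attribute is guaranteed to exist
            self._trace_id = None  # stand-in for releasing the trace
        if self.model_state is not None:
            self.model_state = None
```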
@@ -279,20 +288,47 @@ def cleanup_all(self) -> None:

         # Clean up model states
         try:
-            if hasattr(self, "model_state") and self.model_state is not None:
+            if self.model_state is not None:
                 del self.model_state
         except Exception as e:
             logger.warning(f"Failed to cleanup model state: {e}")

         try:
-            if hasattr(self, "model_shared_state") and self.model_shared_state is not None:
+            if self.model_shared_state is not None:
                 del self.model_shared_state
         except Exception as e:
             logger.warning(f"Failed to cleanup model shared state: {e}")

+        # Clean up trace state
+        try:
+            if self._trace_id is not None:
+                ttnn.release_trace(self.mesh_device, self._trace_id)
+                del self._trace_id
+            if self._trace_tokens is not None:
+                ttnn.deallocate(self._trace_tokens)
+                del self._trace_tokens
+            if self._trace_positions is not None:
+                ttnn.deallocate(self._trace_positions)
+                del self._trace_positions
+            if self._trace_rot_idxs is not None:
+                ttnn.deallocate(self._trace_rot_idxs)
+                del self._trace_rot_idxs
+            if self._trace_output is not None:
+                ttnn.deallocate(self._trace_output)
+                del self._trace_output
+            if self._trace_page_tables_to_use is not None and self._trace_page_tables_to_use is not self.page_tables_tt:
+                for i, page_table in enumerate(self._trace_page_tables_to_use):
+                    try:
+                        ttnn.deallocate(page_table)
+                    except Exception as e:
+                        logger.warning(f"Failed to deallocate trace page table {i}: {e}")
+            del self._trace_page_tables_to_use
+        except Exception as e:
+            logger.warning(f"Failed to cleanup trace state: {e}")
+
         # Clean up page tables (TTNN tensors)
         try:
-            if hasattr(self, "page_tables_tt") and self.page_tables_tt is not None:
+            if self.page_tables_tt is not None:
                 for i, page_table in enumerate(self.page_tables_tt):
                     try:
                         ttnn.deallocate(page_table)
@@ -304,45 +340,37 @@ def cleanup_all(self) -> None:

         # Clean up RoPE setup
         try:
-            if hasattr(self, "rope_setup") and self.rope_setup is not None:
+            if self.rope_setup is not None:
                 del self.rope_setup
         except Exception as e:
             logger.warning(f"Failed to cleanup RoPE setup: {e}")

         # Clean up CCL
         try:
-            if hasattr(self, "ccl") and self.ccl is not None:
+            if self.ccl is not None:
                 del self.ccl
         except Exception as e:
             logger.warning(f"Failed to cleanup CCL: {e}")

         # Clean up configs
         try:
-            if hasattr(self, "model_prefill_cfg") and self.model_prefill_cfg is not None:
+            if self.model_prefill_cfg is not None:
                 del self.model_prefill_cfg
-            if hasattr(self, "model_decode_cfg") and self.model_decode_cfg is not None:
+            if self.model_decode_cfg is not None:
                 del self.model_decode_cfg
-            if hasattr(self, "model_weight_config") and self.model_weight_config is not None:
+            if self.model_weight_config is not None:
                 del self.model_weight_config

         except Exception as e:
             logger.warning(f"Failed to cleanup model configs: {e}")

         # Clean up paged config
         try:
-            if hasattr(self, "paged_config") and self.paged_config is not None:
+            if self.paged_config is not None:
                 del self.paged_config
         except Exception as e:
             logger.warning(f"Failed to cleanup paged config: {e}")

-        # Clean up trace state
-        if self.enable_trace:
-            try:
-                if hasattr(self, "_trace_id") and self._trace_id is not None:
-                    ttnn.release_trace(self.mesh_device, self._trace_id)
-            except Exception as e:
-                logger.warning(f"Failed to release trace: {e}")
-
     def __enter__(self):
         """Context manager entry."""
         return self
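
The cleanup refactor above isolates each resource in its own `try`/`except` so that one failed deallocation cannot short-circuit the rest. A condensed, runnable sketch of that pattern, with a method call standing in for `ttnn.deallocate`:

```python
import logging

logger = logging.getLogger(__name__)

def cleanup_tensors(tensors) -> None:
    """Deallocate each tensor independently; log and continue on failure."""
    for i, tensor in enumerate(tensors or ()):
        try:
            tensor.deallocate()  # stand-in for ttnn.deallocate(tensor)
        except Exception as e:
            logger.warning(f"Failed to deallocate tensor {i}: {e}")
```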
@@ -414,7 +442,7 @@ def _decode_step(
         tokens_step: torch.Tensor,
         positions: torch.Tensor,
         batch_size_per_row: int,
-        page_table: torch.Tensor | None = None,
+        page_tables: torch.Tensor | None = None,
         return_rot_idxs: bool = False,
     ) -> torch.Tensor | Tuple[torch.Tensor, ttnn.Tensor]:
         """Run a single decode step and return logits on host as torch tensor [1, 1, B, V].

@@ -444,8 +472,8 @@ def _decode_step(
             dtype=ttnn.int32,
         )

-        if page_table is not None:
-            page_tables_to_use = self._convert_vllm_page_table_for_batch(page_table)
+        if page_tables is not None:
+            page_tables_to_use = self._convert_vllm_page_table_for_batch(page_tables, device=self.mesh_device)
         else:
             page_tables_to_use = self._get_page_tables()
         # RowBatchedModel forward
@@ -637,11 +665,11 @@ def generate(
             logger.info(f"Decoding step {gen_idx} for {num_of_prompts} user(s)...")
             profiler.start(f"decode_time_{gen_idx}")
             logits = self.decode_forward(
-                next_tokens,
-                positions,
-                self.batch_size_per_row,
-                profiler,
-                gen_idx,
+                tokens=next_tokens,
+                positions=positions,
+                batch_size_per_row=self.batch_size_per_row,
+                profiler=profiler,
+                gen_idx=gen_idx,
                 enable_trace=self.enable_trace,
             )
             profiler.end(f"decode_time_{gen_idx}")
@@ -818,14 +846,18 @@ def _prefill(
         return logits  # [1, 1, seq_len, V]

     def _capture_decode_trace(
-        self, init_tokens: torch.Tensor, positions: torch.Tensor, batch_size_per_row: int
+        self,
+        init_tokens: torch.Tensor,
+        positions: torch.Tensor,
+        batch_size_per_row: int,
+        page_tables: torch.Tensor | None = None,
     ) -> None:
         """Allocate persistent inputs, capture trace for one decode iteration, and store trace state."""
         assert self._trace_id is None, "Trace already captured"

         # 1) Warm-up compile run (no trace) to keep compilation out of capture
         logger.info("Running warm-up decode step (no trace)...")
-        _ = self._decode_step(init_tokens, positions, batch_size_per_row=batch_size_per_row)
+        _ = self._decode_step(init_tokens, positions, batch_size_per_row=batch_size_per_row, page_tables=page_tables)
         ttnn.synchronize_device(self.mesh_device)

         # 2) Allocate persistent device inputs

@@ -838,6 +870,13 @@ def _capture_decode_trace(
         )

         self._trace_rot_idxs = self.rope_setup.get_rot_idxs(positions)
+
+        if page_tables is not None:
+            self._trace_page_tables_to_use = self._convert_vllm_page_table_for_batch(
+                page_tables, device=self.mesh_device
+            )
+        else:
+            self._trace_page_tables_to_use = self._get_page_tables()
         ttnn.synchronize_device(self.mesh_device)

         # 3) Capture decode graph

@@ -847,15 +886,12 @@ def _capture_decode_trace(

         # Only capture the rot_mats generation from rot_idxs (all ttnn ops, no from_torch)
         rope_tensors = self.rope_setup.get_rot_mats_from_rot_idxs(self._trace_rot_idxs)
-        logger.info(f"Rope tensors done")
-
-        # TODO: Fix this for vLLM
         self._trace_output = RowBatchedModel.forward_decode(
             x=self._trace_tokens,
             position_idxs=self._trace_positions,
             cfg=self.model_run_config_decode,
             rope_tensors=rope_tensors,
-            page_tables=self.page_tables_tt,
+            page_tables=self._trace_page_tables_to_use,
         )
         ttnn.end_trace_capture(self.mesh_device, trace_id, cq_id=0)
         logger.info("Decode trace capture complete.")
@@ -866,16 +902,20 @@ def decode_forward(
         tokens: torch.Tensor,
         positions: torch.Tensor,
         batch_size_per_row: int,
-        profiler: BenchmarkProfiler,
-        gen_idx: int,
+        gen_idx: int = 0,
+        profiler: BenchmarkProfiler | None = None,
         enable_trace: bool = False,
+        page_tables: torch.Tensor | None = None,
     ) -> torch.Tensor:
+        # vLLM does not pass enable_trace when initializing the model; it sets it
+        # per decode/prefill call only, so we update it here as well.
+        self.enable_trace = enable_trace
         if not enable_trace:
-            return self._decode_step(tokens, positions, batch_size_per_row).squeeze(0).squeeze(0)
+            return self._decode_step(tokens, positions, batch_size_per_row, page_tables).squeeze(0).squeeze(0)
         else:
             # Capture trace and return trace output
             if self._trace_id is None:
-                self._capture_decode_trace(tokens, positions, batch_size_per_row)
+                self._capture_decode_trace(tokens, positions, batch_size_per_row, page_tables)
                 # First call: return the captured run's output
                 assert self._trace_output is not None
                 logits = ttnn.to_torch(

@@ -892,6 +932,7 @@ def decode_forward(
             and self._trace_positions is not None
             and self._trace_rot_idxs is not None
             and self._trace_id is not None
+            and self._trace_page_tables_to_use is not None
         )
         torch_input = tokens.view(1, 1, -1).to(torch.int32)
@@ -921,13 +962,20 @@ def decode_forward(
         host_rot_idxs = self.rope_setup.get_rot_idxs(positions, on_host=True)
         ttnn.copy_host_to_device_tensor(host_rot_idxs, self._trace_rot_idxs)

+        if page_tables is not None:
+            page_tables_to_use = self._convert_vllm_page_table_for_batch(page_tables, device=None)
+            for i, page_table in enumerate(page_tables_to_use):
+                ttnn.copy_host_to_device_tensor(page_table, self._trace_page_tables_to_use[i])
+
         self.ccl.reset_sem_counters()
-        profiler.start(f"trace_execution_{gen_idx}")
+        if profiler is not None:
+            profiler.start(f"trace_execution_{gen_idx}")
         ttnn.execute_trace(self.mesh_device, self._trace_id, cq_id=0, blocking=True)
-        profiler.end(f"trace_execution_{gen_idx}")
-        logger.info(
-            f"Trace execution t/s/user @ {gen_idx}th token: {1/profiler.get_duration(f'trace_execution_{gen_idx}')}"
-        )
+        if profiler is not None:
+            profiler.end(f"trace_execution_{gen_idx}")
+            logger.info(
+                f"Trace execution t/s/user @ {gen_idx}th token: {1/profiler.get_duration(f'trace_execution_{gen_idx}')}"
+            )
         assert self._trace_output is not None
         logits = ttnn.to_torch(
@@ -1034,13 +1082,17 @@ def _convert_vllm_page_table_for_user(
         num_layers = self.hf_config.num_hidden_layers
         return tuple(ttnn.clone(page_table_tt) for _ in range(num_layers))

-    def _convert_vllm_page_table_for_batch(self, page_table: torch.Tensor) -> tuple[ttnn.Tensor, ...]:
+    def _convert_vllm_page_table_for_batch(
+        self, page_table: torch.Tensor, device: ttnn.Device | ttnn.MeshDevice | None
+    ) -> tuple[ttnn.Tensor, ...]:
         """
         Convert vLLM's block_tables (page_table) to TTNN tensor format for the entire batch.
         Creates one page table per layer as expected by the model.

         Args:
             page_table: torch.Tensor of shape [batch_size, max_num_blocks_per_req] from vLLM
+            device: ttnn.Device, ttnn.MeshDevice, or None. If provided, creates device tensors on the specified device.
+                If None, creates host tensors instead of device tensors.

         Returns:
             Tuple of TTNN tensors, one per layer

@@ -1051,7 +1103,7 @@ def _convert_vllm_page_table_for_batch(

         page_table_tt = ttnn.from_torch(
             page_table,
-            device=self.mesh_device,
+            device=device,
             dtype=ttnn.int32,
             layout=ttnn.ROW_MAJOR_LAYOUT,
             mesh_mapper=ttnn.ShardTensorToMesh(self.mesh_device, dim=0),
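
The new `device` parameter makes the converter dual-use: `device=self.mesh_device` yields device tensors for direct execution, while `device=None` yields host tensors that can later be copied into the trace's persistent page tables. A simplified sketch (the per-layer clone mirrors the diff; the `mesh_mapper` argument is omitted for brevity):

```python
import torch
import ttnn  # assumes a tt-metal environment

def convert_page_table(page_table: torch.Tensor, device, num_layers: int):
    # device=None produces a host tensor; a device/mesh handle produces
    # device tensors ready for the model's forward pass.
    page_table_tt = ttnn.from_torch(
        page_table,
        device=device,
        dtype=ttnn.int32,
        layout=ttnn.ROW_MAJOR_LAYOUT,
    )
    # The model expects one page table per layer.
    return tuple(ttnn.clone(page_table_tt) for _ in range(num_layers))
```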

models/demos/deepseek_v3/tt/generator_vllm.py

Lines changed: 9 additions & 7 deletions
@@ -132,24 +132,26 @@ def prefill_forward(self, *args, **kwargs):
     def decode_forward(self, *args, **kwargs):
         assert self.model_run_config_decode is not None, "Model run config decode is not initialized"

-        page_table = kwargs.get("page_table", None)
+        page_tables = kwargs.get("page_table", None)
         kv_cache = kwargs.get("kv_cache", None)
+        enable_trace = kwargs.get("enable_trace", False)
         # Set kv_cache if provided and all entries are valid
         if kv_cache is not None and not any(entry is None for entry in kv_cache):
             self.set_kv_cache(kv_cache)

         tokens_step = kwargs["tokens"].squeeze(1)
+
         return_value = (
-            self._decode_step(
-                tokens_step=tokens_step,
+            super()
+            .decode_forward(
+                tokens=tokens_step,
                 positions=kwargs["start_pos"],
                 batch_size_per_row=USERS_PER_ROW,
-                page_table=page_table,
+                enable_trace=enable_trace,
+                page_tables=page_tables,
             )
-            .squeeze(0)
-            .squeeze(0)
             .unsqueeze(1)
-        )  # [1,1,B,V] -> [B, 1, V]
+        )  # [B, V] -> [B, 1, V]
         return return_value

     def allocate_kv_cache(self, kv_cache_shape, dtype, num_layers):
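
In effect, the vLLM adapter now defers the decode (traced or not) to the base generator's `decode_forward`, which returns `[B, V]` logits, and only reshapes to the `[B, 1, V]` layout vLLM expects. A condensed, runnable sketch of the resulting call pattern; the class names and the stub value of `USERS_PER_ROW` are illustrative stand-ins, not the diff's actual definitions:

```python
import torch

USERS_PER_ROW = 32  # placeholder; the real constant lives in the diff's module

class BaseGenerator:
    def decode_forward(self, *, tokens, positions, batch_size_per_row,
                       enable_trace=False, page_tables=None):
        # Stand-in for the base class's decode_forward, which returns [B, V] logits.
        return torch.zeros(tokens.shape[-1], 8)

class VLLMGenerator(BaseGenerator):
    def decode_forward(self, *args, **kwargs):
        logits = super().decode_forward(
            tokens=kwargs["tokens"].squeeze(1),
            positions=kwargs["start_pos"],
            batch_size_per_row=USERS_PER_ROW,
            enable_trace=kwargs.get("enable_trace", False),
            page_tables=kwargs.get("page_table", None),
        )  # [B, V]
        return logits.unsqueeze(1)  # [B, 1, V], as vLLM expects
```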
