
Can't access the source of the same module twice with vLLM integration #568

@Butanium

Description

I was playing around with vLLM to see whether the attention probabilities were easily hookable, and ran into this weird behavior where accessing the source of a module a second time crashes.

Replacing all the print statements with plain _ = [pathtosource].source assignments fails in the same way.

from nnsight.modeling.vllm import VLLM

# vLLM supports explicit parallelism
vllm = VLLM("meta-llama/Llama-3.1-8B", dispatch=True, tensor_parallel_size=1, gpu_memory_utilization=0.8)
print("\n"*8)



print("\n === First forward pass ===")
with vllm.trace("zzzzz!"):
    print("before source")
    _ = vllm.model.layers[0].self_attn.source
    # print(vllm.model.layers[0].self_attn.source)
    print("after source")


print("\n === 1.5th forward pass ===")
with vllm.trace("yyyyy!"):
    print("before source")
    _ = vllm.model.layers[1].mlp.source
    # print(vllm.model.layers[1].mlp.source)
    print("after source")

print("\n === 1.75th forward pass ===")
with vllm.trace("pppppp!"):
    print("before source")
    _ = vllm.model.layers[1].self_attn.source
    # print(vllm.model.layers[1].self_attn.source)
    print("after source")

print("\n === Second forward pass ===")
with vllm.trace("trtrt!"):
    print("before source")
    _ = vllm.model.layers[0].self_attn.source
    # print(vllm.model.layers[0].self_attn.source)
    print("after source")
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 466.40it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.16it/s, est. speed input: 10.83 toks/s, output: 34.64 toks/s]

 === First forward pass ===
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 227.88it/s]
Processed prompts:   0%|                                                                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]before source
Processed prompts: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.24it/s, est. speed input: 8.95 toks/s, output: 35.81 toks/s]

 === Second forward pass ===
Adding reques
 === First forward pass ===
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 256.44it/s]
Processed prompts:   0%|                                                                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]before source
                                    * def forward(
                                    0     self,
                                    1     positions: torch.Tensor,
                                    2     hidden_states: torch.Tensor,
                                    3 ) -> torch.Tensor:
 self_qkv_proj_0                ->  4     qkv, _ = self.qkv_proj(hidden_states)
 qkv_split_0                    ->  5     q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
 self_rotary_emb_0              ->  6     q, k = self.rotary_emb(positions, q, k)
                                    7     if self.do_llama_4_scaling:
 self__get_llama_4_attn_scale_0 ->  8         attn_scale = self._get_llama_4_attn_scale(positions)
 to_0                           ->  9         q = (q * attn_scale).to(q.dtype)
 self_attn_0                    -> 10     attn_output = self.attn(q, k, v)
 self_o_proj_0                  -> 11     output, _ = self.o_proj(attn_output)
                                   12     return output
                                   13 
after source
Processed prompts: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.98it/s, est. speed input: 9.90 toks/s, output: 31.67 toks/s]

 === 1.5th forward pass ===
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 220.29it/s]
Processed prompts:   0%|                                                                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]before source
                         * def forward(self, x):
 self_gate_up_proj_0 ->  0     x, _ = self.gate_up_proj(x)
 self_act_fn_0       ->  1     x = self.act_fn(x)
 self_down_proj_0    ->  2     x, _ = self.down_proj(x)
                         3     return x
                         4 
after source
Processed prompts: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.50it/s, est. speed input: 10.00 toks/s, output: 40.02 toks/s]

 === 1.75th forward pass ===
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 311.15it/s]
Processed prompts:   0%|                                                                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]before source
                                    * def forward(
                                    0     self,
                                    1     positions: torch.Tensor,
                                    2     hidden_states: torch.Tensor,
                                    3 ) -> torch.Tensor:
 self_qkv_proj_0                ->  4     qkv, _ = self.qkv_proj(hidden_states)
 qkv_split_0                    ->  5     q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
 self_rotary_emb_0              ->  6     q, k = self.rotary_emb(positions, q, k)
                                    7     if self.do_llama_4_scaling:
 self__get_llama_4_attn_scale_0 ->  8         attn_scale = self._get_llama_4_attn_scale(positions)
 to_0                           ->  9         q = (q * attn_scale).to(q.dtype)
 self_attn_0                    -> 10     attn_output = self.attn(q, k, v)
 self_o_proj_0                  -> 11     output, _ = self.o_proj(attn_output)
                                   12     return output
                                   13 
after source
Processed prompts: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.47it/s, est. speed input: 12.34 toks/s, output: 39.50 toks/s]

 === Second forward pass ===
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 341.72it/s]
Processed prompts:   0%|                                                                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]before source
[2025-12-29 18:13:35] ERROR dump_input.py:72: Dumping input data for V1 LLM engine (v0.13.0) with config: model='meta-llama/Llama-3.1-8B', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}, 
[2025-12-29 18:13:35] ERROR dump_input.py:79: Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=3,prompt_token_ids_len=5,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([7],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={3: 5}, total_num_scheduled_tokens=5, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[1], finished_req_ids=['2'], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
Processed prompts:   0%|                                                                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[rank0]: [rank0]: Traceback (most recent call last):
[rank0]: [rank0]:   File "/mnt/nw/home/c.dumas/projects/nnterp/junk_vllm.py", line 42, in <module>
[rank0]: [rank0]:     with vllm.trace("trtrt!"):
[rank0]: [rank0]:   File "/mnt/nw/home/c.dumas/projects/nnterp/.venv/lib/python3.10/site-packages/nnsight/intervention/tracing/base.py", line 659, in __exit__
[rank0]: [rank0]:     self.backend(self)
[rank0]: [rank0]:   File "/mnt/nw/home/c.dumas/projects/nnterp/.venv/lib/python3.10/site-packages/nnsight/intervention/backends/execution.py", line 24, in __call__
[rank0]: [rank0]:     raise wrap_exception(e, tracer.info) from None
[rank0]: [rank0]: nnsight.NNsightException: 

[rank0]: [rank0]: Traceback (most recent call last):
[rank0]: [rank0]:   File "/mnt/nw/home/c.dumas/projects/nnterp/junk_vllm.py", line 44, in <module>
[rank0]: [rank0]:     print(vllm.model.layers[0].self_attn.source)
[rank0]: [rank0]:   File "/usr/lib/python3.10/inspect.py", line 1139, in getsource
[rank0]: [rank0]:     lines, lnum = getsourcelines(object)
[rank0]: [rank0]:   File "/usr/lib/python3.10/inspect.py", line 1121, in getsourcelines
[rank0]: [rank0]:     lines, lnum = findsource(object)
[rank0]: [rank0]:   File "/usr/lib/python3.10/inspect.py", line 958, in findsource
[rank0]: [rank0]:     raise OSError('could not get source code')

[rank0]: [rank0]: OSError: could not get source code
[rank0]:[W1229 18:13:35.595722568 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
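
For context on the underlying exception (a guess at the mechanism, not a confirmed diagnosis): inspect.getsource raises exactly this OSError whenever a function's source lines cannot be located, e.g. for code compiled at runtime whose lines are not (or no longer) registered in linecache. If the integration swaps in a dynamically compiled forward after the first .source access, the second access would hit this path. A standalone illustration, independent of nnsight/vLLM:

import inspect

# Compile a function from a string; there is no file on disk and nothing
# registered in linecache for "<generated>", so inspect cannot find its lines.
namespace = {}
exec(compile("def forward(x):\n    return x + 1\n", "<generated>", "exec"), namespace)

try:
    inspect.getsource(namespace["forward"])
except OSError as e:
    print(e)  # -> could not get source code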
