comand:VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 vllm serve /mnt/autovision-cbs/DeepSeek-V4-Flash --port 30000 --served-model-name default --trust-remote-code --kv-cache-dtype fp8 --block-size 256 --tensor-parallel-size 4 --max-num-seqs 512 --max-num-batched-tokens 4096 --no-enable-flashinfer-autotune --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' --gpu-memory-utilization 0.95 --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice --reasoning-parser deepseek_v4 --speculative_config '{"method":"mtp","num_speculative_tokens":2}' --enable-log-requests --enable-logging-iteration-details --enable-expert-parallel --enable-prefix-caching --kv-offloading-backend native --kv-offloading-size 300
APIServer pid=11323) INFO 06-07 12:04:02 [async_llm.py:721] Aborted request(s) chatcmpl-8578d05b1c52d296-a6413c84.
(APIServer pid=11323) INFO 06-07 12:04:02 [async_llm.py:595] Request chatcmpl-8578d05b1c52d296 aborted.
(EngineCore pid=11599) INFO 06-07 12:04:02 [core.py:407] Iteration(57795): 0 context requests, 0 context tokens, 1 generation requests, 3 generation tokens, iteration elapsed time: 6.56 ms
(EngineCore pid=11599) INFO 06-07 12:04:02 [core.py:407] Iteration(57796): 0 context requests, 0 context tokens, 1 generation requests, 3 generation tokens, iteration elapsed time: 0.24 ms
(EngineCore pid=11599) INFO 06-07 12:04:02 [core.py:407] Iteration(57797): 0 context requests, 0 context tokens, 0 generation requests, 0 generation tokens, iteration elapsed time: 0.12 ms
(APIServer pid=11323) INFO 06-07 12:04:03 [logger.py:63] Received request chatcmpl-94a14e5f589d91c5: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=628319, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=11323) INFO 06-07 12:04:03 [async_llm.py:415] Added request chatcmpl-94a14e5f589d91c5-b7763ff8.
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in cuMemcpyBatchAsync
File "<unknown>", line 0, in swap_blocks_batch(at::Tensor const&, at::Tensor const&, at::Tensor const&, bool)
File "<unknown>", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<void (*)(at::Tensor const&, at::Tensor const&, at::Tensor const&, bool), void, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, bool> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "<unknown>", line 0, in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0]
File "<unknown>", line 0, in torch::jit::invokeOperatorFromPython(c10::ArrayRef<std::shared_ptr<torch::jit::Operator> >, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>)
File "<unknown>", line 0, in torch::jit::_get_operation_for_overload_or_packet(c10::ArrayRef<std::shared_ptr<torch::jit::Operator> >, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>)
File "<unknown>", line 0, in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>)
File "<unknown>", line 0, in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#2}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#2}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#1}::_FUN(pybind11::detail::function_call&)
File "<unknown>", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
File "<unknown>", line 0, in _PyObject_Call
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyObject_FastCallDictTstate
File "<unknown>", line 0, in _PyObject_Call_Prepend
File "<unknown>", line 0, in _PyObject_MakeTpCall
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in PyEval_EvalCode
File "<unknown>", line 0, in PyRun_StringFlags
File "<unknown>", line 0, in PyRun_SimpleStringFlags
File "<unknown>", line 0, in Py_RunMain
File "<unknown>", line 0, in Py_BytesMain
File "<unknown>", line 0, in _start
File "<unknown>", line 0, in 0xffffffffffffffff
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in cuMemcpyBatchAsync
File "<unknown>", line 0, in swap_blocks_batch(at::Tensor const&, at::Tensor const&, at::Tensor const&, bool)
File "<unknown>", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<void (*)(at::Tensor const&, at::Tensor const&, at::Tensor const&, bool), void, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, bool> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "<unknown>", line 0, in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0]
File "<unknown>", line 0, in torch::jit::invokeOperatorFromPython(c10::ArrayRef<std::shared_ptr<torch::jit::Operator> >, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>)
File "<unknown>", line 0, in torch::jit::_get_operation_for_overload_or_packet(c10::ArrayRef<std::shared_ptr<torch::jit::Operator> >, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>)
File "<unknown>", line 0, in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>)
File "<unknown>", line 0, in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#2}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#2}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#1}::_FUN(pybind11::detail::function_call&)
File "<unknown>", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
File "<unknown>", line 0, in _PyObject_Call
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyObject_FastCallDictTstate
File "<unknown>", line 0, in _PyObject_Call_Prepend
File "<unknown>", line 0, in _PyObject_MakeTpCall
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in _PyEval_EvalFrameDefault
File "<unknown>", line 0, in PyEval_EvalCode
File "<unknown>", line 0, in PyRun_StringFlags
File "<unknown>", line 0, in PyRun_SimpleStringFlags
File "<unknown>", line 0, in Py_RunMain
File "<unknown>", line 0, in Py_BytesMain
File "<unknown>", line 0, in _start
File "<unknown>", line 0, in 0xffffffffffffffff
(EngineCore pid=11599) ERROR 06-07 12:04:06 [multiproc_executor.py:283] Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
(Worker_TP0_EP0 pid=11739) INFO 06-07 12:04:06 [multiproc_executor.py:775] Parent process exited, terminating worker queues
(Worker_TP3_EP3 pid=11742) INFO 06-07 12:04:06 [multiproc_executor.py:872] WorkerProc shutting down.
(Worker_TP0_EP0 pid=11739) INFO 06-07 12:04:06 [multiproc_executor.py:872] WorkerProc shutting down.
(APIServer pid=11323) INFO 06-07 12:04:07 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 94.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%
(APIServer pid=11323) INFO 06-07 12:04:07 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.83, Accepted throughput: 61.00 tokens/s, Drafted throughput: 66.80 tokens/s, Accepted: 610 tokens, Drafted: 668 tokens, Per-position acceptance rate: 0.994, 0.832, Avg Draft acceptance rate: 91.3%
(APIServer pid=11323) INFO 06-07 12:04:07 [metrics.py:103] KV Transfer metrics: GPU_to_CPU_total_bytes=233522176, GPU_to_CPU_total_time=0.01344751998782158
(EngineCore pid=11599) ERROR 06-07 12:04:14 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.22.0) with config: model='/mnt/autovision-cbs/DeepSeek-V4-Flash', speculative_config=SpeculativeConfig(method='mtp', model='/mnt/autovision-cbs/DeepSeek-V4-Flash', num_spec_tokens=2), tokenizer='/mnt/autovision-cbs/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=True), seed=0, served_model_name=default, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [256, 256, 4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_DECODE_ONLY: (2, 0)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto', linear_backend='auto'),
(EngineCore pid=11599) ERROR 06-07 12:04:14 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=OffloadingConnectorMetadata(load_jobs={12684: TransferJob(req_id='chatcmpl-94a14e5f589d91c5-b7763ff8', transfer_spec=[CPULoadStoreSpec(block_ids=array([54722, 54723, 37112, ..., 46293, 46294, 46295], shape=(1663,))), GPULoadStoreSpec(block_ids=array([43696, 40476, 42880, ..., 47458, 1151, 1149], shape=(1663,)), group_sizes=[1641, 2, 2, 2, 16], block_indices=[0, 6562, 6562, 105022, 52496])])}, store_jobs={}, jobs_to_flush=[]), ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=11599) ERROR 06-07 12:04:14 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=0, num_waiting_reqs=0, num_skipped_waiting_reqs=1, step_counter=0, current_wave=0, kv_cache_usage=0.03480098773699414, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=420257, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=420257, hits=420096, preempted_requests=0, preempted_queries=0, preempted_hits=0), kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] EngineCore encountered a fatal error.
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] Traceback (most recent call last):
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1158, in run_engine_core
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] engine_core.run_busy_loop()
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1199, in run_busy_loop
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] self._process_engine_step()
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1238, in _process_engine_step
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] outputs, model_executed = self.step_fn()
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 547, in step_with_batch_queue
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] model_output = future.result()
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] return super().result()
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] return self.__get_result()
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] raise self._exception
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] response = self.aggregate(self.get_response())
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 386, in get_response
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 772, in dequeue
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] return next(self.gen)
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] ^^^^^^^^^^^^^^
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 694, in acquire_read
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] raise RuntimeError("cancelled")
(EngineCore pid=11599) ERROR 06-07 12:04:14 [core.py:1167] RuntimeError: cancelled
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] AsyncLLM output_handler failed.
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] Traceback (most recent call last):
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] outputs = await engine_core.get_output_async()
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1030, in get_output_async
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] raise self._format_exception(outputs) from None
(APIServer pid=11323) ERROR 06-07 12:04:14 [async_llm.py:704] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=11323) INFO 06-07 12:04:14 [async_llm.py:601] Request chatcmpl-94a14e5f589d91c5 failed (engine dead).
(APIServer pid=11323) INFO: 10.16.212.240:42086 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11323) INFO: Shutting down
(APIServer pid=11323) INFO: Waiting for application shutdown.
(APIServer pid=11323) INFO: Application shutdown complete.
(APIServer pid=11323) INFO: Finished server process [11323]
(APIServer pid=11323) Exception ignored in: <function AsyncLLM.__del__ at 0x7f32495c7560>
(APIServer pid=11323) Traceback (most recent call last):
(APIServer pid=11323) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 257, in __del__
(APIServer pid=11323) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 261, in shutdown
(APIServer pid=11323) TypeError: 'NoneType' object is not callable
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Your current environment
The output of
python collect_env.py🐛 Describe the bug
comand:VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 vllm serve /mnt/autovision-cbs/DeepSeek-V4-Flash --port 30000 --served-model-name default --trust-remote-code --kv-cache-dtype fp8 --block-size 256 --tensor-parallel-size 4 --max-num-seqs 512 --max-num-batched-tokens 4096 --no-enable-flashinfer-autotune --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' --gpu-memory-utilization 0.95 --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice --reasoning-parser deepseek_v4 --speculative_config '{"method":"mtp","num_speculative_tokens":2}' --enable-log-requests --enable-logging-iteration-details --enable-expert-parallel --enable-prefix-caching --kv-offloading-backend native --kv-offloading-size 300
evaluation command:
evalscope eval --eval-type openai_api --model default --api-url http://10.74.45.28:30000/v1 --eval-batch-size 1 --datasets longbench_v2 --dataset-args '{"longbench_v2":{"subset_list":["long"]}}' --generation-config '{"temperature":1.0,"top_p":1.0,"extra_body": {"chat_template_kwargs":{"thinking": true}}}'
Before submitting a new issue...