feat: experimental vllm provider #5443
Conversation
Greptile Overview
Greptile Summary
This PR adds an experimental vllm_prompt() function that enables optimized LLM inference using vLLM's prefix caching capabilities through a new execution path.
Key Changes
- New execution path: Adds `PyExpr.vllm()` that bypasses the standard UDF execution, creating a dedicated `VLLMProject` logical plan node and a `VLLMSink` streaming sink for optimized batch processing
- Python layer: New `VLLMExecutor` class manages an `AsyncLLMEngine` in a dedicated thread with async task submission and polling (see the sketch after this list)
- Rust infrastructure: Comprehensive integration including new expression types (`VLLMExpr`), optimization rules (`SplitVLLM`), and streaming sink refactoring to support iterative finalization
- API surface: Adds a `prompt()` function with vLLM provider support featuring configurable concurrency, buffer management, and batch sizing
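A minimal sketch of the dedicated event-loop-thread pattern described in the Python-layer bullet above. This is not the PR's `VLLMExecutor`; the class and method names are illustrative, and the real executor wraps prompts into vLLM `AsyncLLMEngine` generation coroutines before submitting them.

```python
import asyncio
import threading

class BackgroundLoopExecutor:
    """Illustrative only: run an asyncio event loop in a dedicated thread and
    accept work from synchronous callers (e.g. a streaming sink driving inference)."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        # The loop lives in its own thread so submitting work never blocks the caller.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()
        self._futures = []

    def submit(self, coro):
        # Schedule a coroutine on the background loop from the caller's thread.
        fut = asyncio.run_coroutine_threadsafe(coro, self._loop)
        self._futures.append(fut)
        return fut

    def poll(self):
        # Collect results from tasks that have finished; leave the rest running.
        done = [f for f in self._futures if f.done()]
        self._futures = [f for f in self._futures if not f.done()]
        return [f.result() for f in done]
```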
Critical Issues Found
- Data loss bug in `VLLMSink.finalize()` (src/daft-local-execution/src/streaming_sink/vllm.rs:297): Only processes the first worker state when `max_concurrency > 1`, silently dropping buffered data and running tasks from other workers
- Race condition in `VLLMExecutor.num_running_tasks()` (daft/execution/vllm.py:106): Reads `running_task_count` without lock protection despite concurrent modifications (a lock-protected sketch follows this list)
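For the second issue, the standard fix is to take the same lock for reads as for writes of the shared counter. A minimal sketch with assumed names, not the PR's code:

```python
import threading

class TaskCounter:
    """Illustrative counter shared between the caller thread and the engine thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0

    def task_started(self):
        with self._lock:
            self._running += 1

    def task_finished(self):
        with self._lock:
            self._running -= 1

    def num_running_tasks(self) -> int:
        # Reading under the same lock keeps the count consistent with
        # concurrent increments/decrements from the engine thread.
        with self._lock:
            return self._running
```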
Confidence Score: 1/5
- This PR has critical bugs that will cause data loss and race conditions in production
- Score reflects two critical logic errors: the `VLLMSink.finalize()` method only processes the first worker state when concurrency > 1 (causing silent data loss), and `VLLMExecutor.num_running_tasks()` has an unprotected read of shared state (race condition). Both issues will cause incorrect behavior in production workloads.
- Pay close attention to `src/daft-local-execution/src/streaming_sink/vllm.rs` (the finalize method must handle all states) and `daft/execution/vllm.py` (thread safety fix needed)
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| src/daft-local-execution/src/streaming_sink/vllm.rs | 1/5 | New VLLMSink implementation with critical bug in finalize() that loses data from concurrent workers | 
| daft/execution/vllm.py | 2/5 | New VLLMExecutor with async task management; has race condition in num_running_tasks() | 
| daft/functions/ai/__init__.py | 4/5 | Added prompt() vLLM support with PyExpr.vllm() optimization path; has inline import style issues | 
| src/daft-logical-plan/src/ops/vllm.rs | 5/5 | New VLLMProject logical plan node with schema generation and display methods | 
| src/daft-logical-plan/src/optimization/rules/split_vllm.rs | 5/5 | New optimization rule to extract VLLM expressions into separate nodes | 
| src/daft-local-execution/src/streaming_sink/base.rs | 4/5 | Refactored finalize to support iterative output via StreamingSinkFinalizeOutput enum | 
Sequence Diagram
sequenceDiagram
    participant User
    participant PromptFunc as prompt()
    participant PyExpr as PyExpr.vllm()
    participant LogicalPlan as VLLMProject
    participant Optimizer as SplitVLLM Rule
    participant LocalExec as VLLMSink
    participant VLLMExec as VLLMExecutor
    participant vLLM as AsyncLLMEngine
    User->>PromptFunc: prompt(messages, provider="vllm-prefix-cached")
    PromptFunc->>PromptFunc: Resolve VLLMPrefixCachedPrompterDescriptor
    PromptFunc->>PyExpr: vllm(model, concurrency, args...)
    PyExpr->>LogicalPlan: Create VLLMProject node
    LogicalPlan->>Optimizer: Optimization pass
    Optimizer->>Optimizer: SplitVLLM extracts VLLM expr to separate node
    Optimizer->>LocalExec: Translate to VLLMSink
    
    LocalExec->>VLLMExec: make_state() creates VLLMExecutor
    VLLMExec->>vLLM: Initialize AsyncLLMEngine in new thread
    
    loop For each input batch
        LocalExec->>LocalExec: Buffer input (max_buffer_size)
        LocalExec->>VLLMExec: submit(prompts, rows)
        VLLMExec->>vLLM: asyncio.run_coroutine_threadsafe(_generate)
        LocalExec->>VLLMExec: poll() for completed tasks
        VLLMExec-->>LocalExec: Return completed (outputs, rows)
        LocalExec-->>User: Stream results
    end
    
    LocalExec->>LocalExec: finalize(states) - drain remaining
    LocalExec->>VLLMExec: poll() until all tasks complete
    VLLMExec-->>LocalExec: Final results
    LocalExec-->>User: Final output batch
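The per-batch loop in the diagram (buffer, submit, poll, stream) can be roughly illustrated in Python. This is a hand-written sketch, not code from the PR; the `submit(prompts, rows)` / `poll()` interface is taken from the diagram, and everything else (names, buffer accounting) is an assumption.

```python
def run_batches(batches, executor, max_buffer_size=32):
    """Sketch of the per-batch flow: buffer (prompts, rows) pairs, hand full
    buffers to the executor, and stream back whatever has completed so far."""
    buffer = []
    for prompts, rows in batches:
        buffer.append((prompts, rows))
        if len(buffer) >= max_buffer_size:
            for p, r in buffer:
                executor.submit(p, r)  # non-blocking: tasks run on the engine thread
            buffer = []
        # Emit any generations that finished while we were buffering.
        for outputs, out_rows in executor.poll():
            yield outputs, out_rows
    # Anything still buffered is handled by the finalize/drain phase.
```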
Additional Comments (2)
- `daft/functions/ai/__init__.py`, line 263: style: move import to top of file per custom style guide
  Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met... (source)
- `daft/functions/ai/__init__.py`, line 275: style: move import to top of file per custom style guide
  Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met... (source)
46 files reviewed, 4 comments
Greptile Overview
Greptile Summary
This PR introduces experimental VLLM provider support for daft.functions.ai.prompt() with async batching, currently only for the local executor. The implementation bypasses the standard UDF execution path by routing VLLM expressions through a custom streaming sink operator.
Major changes:
- New expression type: `Expr::VLLM` added to the DSL with a `VLLMExpr` struct containing model config, concurrency settings, and buffer parameters
- Streaming sink API enhancements: Modified the `StreamingSink` trait to support iterative finalization via a `StreamingSinkFinalizeOutput` enum (allowing sinks to return `HasMoreOutput` for async processing) and made `make_state()` fallible
- New VLLMProject plan node: Added throughout the logical plan, local plan, and pipeline layers with proper optimizer rule integration
- Python executor: `VLLMExecutor` manages a dedicated event loop thread for the async vLLM engine with proper locking
- Execution restrictions: Correctly blocked for Ray/distributed execution with `todo!()` and `NotImplemented` errors
API Impact:
All existing streaming sinks updated to the new trait signature (`make_state()` returns `DaftResult`, `finalize()` returns `StreamingSinkFinalizeOutput`). Changes are mechanical and maintain existing behavior.
Known limitations (per PR description):
- Only one VLLM expression per projection supported
- Local executor only
- Prefix bucketing/routing not yet implemented
 
Confidence Score: 4/5
- Safe to merge with minor caveats - streaming sink API changes are well-structured and the VLLM implementation is properly isolated
- The streaming sink API changes are cleanly implemented and all existing sinks have been updated correctly. The VLLM implementation is experimental and properly restricted to local execution. One point deducted because the SplitVLLM optimization rule's hardcoded column name could cause conflicts, and the interaction with the SplitUDFs rule needs verification
- src/daft-logical-plan/src/optimization/rules/split_vllm.rs - verify interaction with the SplitUDFs rule and potential column name conflicts with "daft_vllm_output"
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| src/daft-local-execution/src/streaming_sink/base.rs | 5/5 | Enhanced streaming sink API to support iterative finalization with StreamingSinkFinalizeOutput enum and made make_state() fallible | 
| src/daft-local-execution/src/streaming_sink/vllm.rs | 4/5 | New VLLM streaming sink implementation with async batching and buffer management, correctly set max_concurrency() to 1 | 
| daft/execution/vllm.py | 5/5 | Python VLLMExecutor with dedicated event loop thread, proper locking on shared state, and async batch submission | 
| src/daft-dsl/src/expr/mod.rs | 5/5 | Added new VLLM expression variant with proper semantic ID, display, and visitor integration | 
| src/daft-logical-plan/src/ops/vllm.rs | 5/5 | New VLLMProject logical plan node with proper schema handling and stats state management | 
| src/daft-logical-plan/src/optimization/rules/split_vllm.rs | 4/5 | Optimizer rule to extract VLLM expressions from projections into dedicated VLLMProject nodes, currently supports one VLLM expr per project | 
| src/daft-logical-plan/src/logical_plan.rs | 5/5 | Integrated VLLMProject into logical plan enum with proper schema, stats, and child handling | 
| daft/functions/ai/__init__.py | 5/5 | Updated prompt() to detect VLLMPrefixCachedPrompterDescriptor and route to PyExpr.vllm() instead of UDF execution | 
Sequence Diagram
sequenceDiagram
    participant User
    participant prompt() as daft.functions.ai.prompt()
    participant PyExpr as PyExpr.vllm()
    participant Optimizer as Logical Plan Optimizer
    participant Pipeline as Pipeline Builder
    participant VLLMSink as VLLMSink (Rust)
    participant VLLMExecutor as VLLMExecutor (Python)
    participant AsyncEngine as vLLM AsyncLLMEngine
    User->>prompt(): prompt(messages, provider="vllm-prefix-cached")
    prompt()->>prompt(): Detect VLLMPrefixCachedPrompterDescriptor
    prompt()->>PyExpr: .vllm(model, concurrency, buffer_size, ...)
    PyExpr->>Optimizer: Create Expr::VLLM in logical plan
    
    Optimizer->>Optimizer: SplitVLLM rule extracts VLLM expr
    Optimizer->>Optimizer: Create VLLMProject logical plan node
    
    Pipeline->>VLLMSink: Translate to StreamingSink
    
    loop For each input batch
        VLLMSink->>VLLMSink: Buffer input until max_buffer_size
        VLLMSink->>VLLMExecutor: submit(prompts, rows)
        VLLMExecutor->>AsyncEngine: asyncio.run_coroutine_threadsafe(_generate())
        AsyncEngine-->>VLLMExecutor: Stream completions to completed_tasks queue
        VLLMSink->>VLLMExecutor: poll() for completed tasks
        VLLMExecutor-->>VLLMSink: Return (outputs, rows) or None
        VLLMSink-->>Pipeline: Return output with NeedMoreInput/HasMoreOutput
    end
    
    Pipeline->>VLLMSink: finalize(states)
    loop Until all tasks complete
        VLLMSink->>VLLMSink: Submit remaining buffered tasks
        VLLMSink->>VLLMExecutor: poll() for results
        alt Tasks still running
            VLLMSink-->>Pipeline: HasMoreOutput with partial results
        else All complete
            VLLMSink-->>Pipeline: Finished with final results
        end
    end
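The finalize loop at the end of the diagram can be sketched the same way: flush the remaining buffer, then keep polling and emitting partial results (`HasMoreOutput`) until nothing is running. Again a hand-written sketch with an assumed executor interface, not the PR's Rust code.

```python
import time

def finalize(executor, buffer):
    """Drain phase sketch: submit leftover buffered work, then poll until all
    tasks complete, yielding partial results each round instead of blocking."""
    for prompts, rows in buffer:
        executor.submit(prompts, rows)  # flush whatever is still buffered
    while executor.num_running_tasks() > 0:
        completed = executor.poll()
        if completed:
            yield completed             # "HasMoreOutput": partial results
        else:
            time.sleep(0.01)            # brief wait while tasks finish
    remaining = executor.poll()
    if remaining:
        yield remaining                 # "Finished": final results
```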
46 files reviewed, 1 comment
Thoughts on just leveraging async udfs instead of making dedicated logical + physical ops for vllm?
This PR here #5451 makes a streaming sink for async UDFs that I think can also work with vllm. The idea is the same as what you have here, but it uses a JoinSet as the async task pool. The only thing missing is that you have max_buffer_size and max_running_tasks params here, but you can also just control that with the JoinSet, i.e. if the limit is reached, force a join_next().await
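For context, a rough asyncio analogue of this suggestion (the PR and #5451 use Rust/tokio; names and the cap value here are illustrative only):

```python
import asyncio

async def bounded_gather(coros, max_running_tasks=8):
    """Keep at most max_running_tasks in flight; when the cap is reached, wait
    for one task to finish before admitting more work (the JoinSet idea)."""
    pending, results = set(), []
    for coro in coros:
        if len(pending) >= max_running_tasks:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(t.result() for t in done)
        pending.add(asyncio.ensure_future(coro))
    while pending:
        done, pending = await asyncio.wait(pending)
        results.extend(t.result() for t in done)
    return results
```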
          
It's true that what's implemented here can probably be done with async UDFs, but this is just the first step in our work on prefix routing for vllm. We will need to add additional logic to the swordfish side to allow for bucketing the buffer by prefix before emitting output, which is why I have this buffer. There's also additional work on the Flotilla side for routing that will require a distributed operator.
    
Overall LGTM, just a couple of nits.
Greptile Overview
Greptile Summary
This PR adds experimental support for vLLM-based LLM inference with prefix caching and async batching optimization. When using VLLMPrefixCachedProvider, the prompt function creates a VLLMExpr instead of a UDF, which gets optimized into a custom streaming sink operator.
Key Changes:
- New `VLLMSink` streaming sink with prefix bucketing and async batching
- `SplitVLLM` optimizer rule extracts VLLM expressions into dedicated `VLLMProject` nodes
- `LocalVLLMExecutor` and `RemoteVLLMExecutor` handle local and distributed (Ray) execution
- `PrefixRouter` load balances requests across multiple Ray actors based on prefix similarity
- Integration in `daft.functions.ai.prompt()` detects the vLLM provider and bypasses the standard UDF path
Architecture:
The implementation uses a streaming sink pattern where prompts are buffered, sorted by prefix similarity, bucketed together, and submitted to vLLM's async engine. Results are polled and returned incrementally. For distributed mode, multiple Ray actors are spawned with a prefix-aware router.
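As a rough illustration of the buffer-sort-bucket step described above (a sketch only; `prefix_len` and `max_bucket_size` are made-up parameters, not the PR's configuration):

```python
from itertools import groupby

def bucket_by_prefix(prompts, prefix_len=64, max_bucket_size=32):
    """Group prompts that share a leading prefix so the engine can reuse
    cached prefix KV blocks within each bucket."""
    # Sorting first puts prompts with a common prefix next to each other.
    prompts = sorted(prompts, key=lambda p: p[:prefix_len])
    buckets = []
    for _, group in groupby(prompts, key=lambda p: p[:prefix_len]):
        group = list(group)
        # Split oversized groups so each submission stays a manageable batch.
        for i in range(0, len(group), max_bucket_size):
            buckets.append(group[i : i + max_bucket_size])
    return buckets
```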
Confidence Score: 4/5
- Safe to merge as an experimental feature with minor style improvements needed
- The implementation is well-structured with proper async handling, and the critical `max_concurrency` bug from previous comments is fixed. One inline import violates project style guidelines. The experimental nature is clearly documented.
- `daft/execution/vllm.py` has an inline import that should be moved to the top of the file
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| daft/execution/vllm.py | 4/5 | Implements VLLMExecutor classes for local, blocking, and distributed execution with async batching and prefix routing | 
| src/daft-local-execution/src/streaming_sink/vllm.rs | 5/5 | VLLMSink implementation with prefix bucketing logic, correctly sets max_concurrency to 1 | 
| daft/ai/vllm/provider.py | 5/5 | Simple provider class for vLLM prefix caching | 
| daft/ai/vllm/protocols/prompter.py | 5/5 | PrompterDescriptor configuration for vLLM with prefix caching parameters | 
| src/daft-logical-plan/src/ops/vllm.rs | 5/5 | VLLMProject logical plan node definition | 
| src/daft-logical-plan/src/optimization/rules/split_vllm.rs | 5/5 | Optimizer rule to extract VLLM expressions from projections into dedicated VLLMProject nodes | 
| src/daft-distributed/src/pipeline_node/vllm.rs | 5/5 | Distributed execution node for VLLM with Ray actors initialization | 
| daft/functions/ai/__init__.py | 5/5 | Updated prompt function to detect and use vLLM provider via PyExpr.vllm() instead of UDF path | 
Sequence Diagram
sequenceDiagram
    participant User as User Code
    participant Prompt as daft.functions.ai.prompt()
    participant Optimizer as SplitVLLM Rule
    participant Sink as VLLMSink
    participant Executor as LocalVLLMExecutor
    participant VLLM as vLLM AsyncEngine
    
    User->>Prompt: prompt(col("text"), provider=vllm)
    Prompt->>Prompt: Detect VLLMPrefixCachingPrompterDescriptor
    Prompt->>Prompt: Create VLLMExpr (not UDF)
    Prompt-->>User: Return Expression
    
    User->>User: Execute dataframe operation
    
    Note over Optimizer: Logical Plan Optimization
    Optimizer->>Optimizer: Extract VLLMExpr from Project
    Optimizer->>Optimizer: Create VLLMProject node
    
    Note over Sink: Physical Execution
    Sink->>Sink: Buffer incoming data
    Sink->>Sink: Sort by prefix similarity
    Sink->>Sink: Bucket prompts by prefix
    Sink->>Executor: submit(prefix, prompts, rows)
    Executor->>VLLM: Generate async (streaming)
    VLLM-->>Executor: Yield outputs
    Executor->>Executor: Store completed results
    Sink->>Executor: poll()
    Executor-->>Sink: Return (outputs, rows)
    Sink-->>User: Yield results incrementally
58 files reviewed, 1 comment
Changes Made
Adds the experimental `VLLMPrefixCachedProvider` for `daft.functions.ai.prompt`. Does async batching and prefix routing.
When using the `VLLMPrefixCachedProvider`, `prompt` will create a `VLLMExpr` instead of a UDF, which Daft will turn into a custom VLLM operator. This operator is implemented as a streaming sink, and I had to make some minor changes to our streaming sink APIs to make the async batching mechanism work.
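For reference, a hedged usage sketch pieced together from the calls shown in the review sequence diagrams; the exact argument names and provider string may differ in the final API:

```python
import daft
from daft.functions.ai import prompt

df = daft.from_pydict({"text": ["Summarize: ...", "Summarize: ..."]})
# Provider string as shown in the sequence diagrams; other keyword arguments
# (model, concurrency, buffer sizes) are omitted here and may be required.
df = df.select(prompt(daft.col("text"), provider="vllm-prefix-cached"))
df.collect()
```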
Related Issues
Checklist
- [ ] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation