Conversation

@kevinzwang kevinzwang (Member) commented Oct 24, 2025

Changes Made

Adds the experimental VLLMPrefixCachedProvider for daft.functions.ai.prompt. It performs async batching and prefix routing.

When using the VLLMPrefixCachedProvider, prompt creates a VLLMExpr instead of a UDF, which Daft turns into a custom vLLM operator. This operator is implemented as a streaming sink, and I had to make some minor changes to our streaming sink APIs so that the async batching mechanism works.
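
For illustration, roughly how this is meant to be used (a minimal sketch; the provider string and keyword arguments such as `model` are assumptions here, not the final experimental API):

```python
import daft
from daft.functions.ai import prompt

df = daft.from_pydict({"question": ["What is Daft?", "What does vLLM's prefix cache do?"]})

# Selecting the experimental vLLM prefix-cached provider routes prompt()
# through the custom streaming-sink operator instead of a UDF.
# Provider string and kwargs below are illustrative only.
df = df.with_column(
    "answer",
    prompt(
        daft.col("question"),
        provider="vllm-prefix-cached",
        model="Qwen/Qwen2.5-0.5B-Instruct",
    ),
)
df.show()
```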

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

@github-actions github-actions bot added the feat label Oct 24, 2025
@codecov codecov bot commented Oct 24, 2025

Codecov Report

❌ Patch coverage is 15.19757% with 279 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.30%. Comparing base (47f5aaf) to head (9f3eb52).

Files with missing lines Patch % Lines
...rc/daft-local-execution/src/streaming_sink/vllm.rs 0.00% 89 Missing ⚠️
src/daft-logical-plan/src/ops/vllm.rs 0.00% 36 Missing ⚠️
src/daft-dsl/src/expr/mod.rs 3.12% 31 Missing ⚠️
src/daft-local-plan/src/plan.rs 3.70% 26 Missing ⚠️
src/daft-dsl/src/python.rs 0.00% 18 Missing ⚠️
src/daft-logical-plan/src/logical_plan.rs 20.00% 16 Missing ⚠️
src/daft-local-execution/src/pipeline.rs 0.00% 13 Missing ⚠️
daft/execution/vllm.py 0.00% 10 Missing ⚠️
...-logical-plan/src/optimization/rules/split_vllm.rs 67.74% 10 Missing ⚠️
src/daft-local-plan/src/translate.rs 0.00% 9 Missing ⚠️
... and 10 more
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5443      +/-   ##
==========================================
+ Coverage   70.91%   71.30%   +0.38%     
==========================================
  Files         996     1000       +4     
  Lines      127688   126547    -1141     
==========================================
- Hits        90556    90234     -322     
+ Misses      37132    36313     -819     
Files with missing lines Coverage Δ
src/common/metrics/src/ops.rs 0.00% <ø> (ø)
src/daft-dsl/src/expr/bound_expr.rs 78.57% <ø> (ø)
src/daft-dsl/src/optimization.rs 100.00% <100.00%> (ø)
...on/src/streaming_sink/anti_semi_hash_join_probe.rs 87.56% <100.00%> (+0.12%) ⬆️
...rc/daft-local-execution/src/streaming_sink/base.rs 79.41% <ø> (ø)
.../daft-local-execution/src/streaming_sink/concat.rs 90.00% <100.00%> (+0.71%) ⬆️
...c/daft-local-execution/src/streaming_sink/limit.rs 93.33% <100.00%> (ø)
.../src/streaming_sink/monotonically_increasing_id.rs 94.59% <100.00%> (ø)
...cution/src/streaming_sink/outer_hash_join_probe.rs 94.42% <100.00%> (+0.03%) ⬆️
...rc/daft-logical-plan/src/optimization/optimizer.rs 94.05% <100.00%> (+0.01%) ⬆️
... and 25 more

... and 19 files with indirect coverage changes


@kevinzwang kevinzwang marked this pull request as ready for review October 27, 2025 08:22
@kevinzwang kevinzwang changed the title from "feat: experimental vllm_prompt function" to "feat: experimental vllm provider" Oct 27, 2025
@greptile-apps greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR adds an experimental vllm_prompt() function that enables optimized LLM inference using vLLM's prefix caching capabilities through a new execution path.

Key Changes

  • New execution path: Adds PyExpr.vllm() that bypasses the standard UDF execution, creating a dedicated VLLMProject logical plan node and VLLMSink streaming sink for optimized batch processing
  • Python layer: New VLLMExecutor class manages an AsyncLLMEngine in a dedicated thread with async task submission and polling
  • Rust infrastructure: Comprehensive integration including new expression types (VLLMExpr), optimization rules (SplitVLLM), and streaming sink refactoring to support iterative finalization
  • API surface: Adds prompt() function with vLLM provider support featuring configurable concurrency, buffer management, and batch sizing
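
The dedicated-thread pattern mentioned in the Python layer bullet looks roughly like this (a generic sketch with the engine call stubbed out; class and attribute names are illustrative, not Daft's actual VLLMExecutor):

```python
import asyncio
import threading
from queue import Queue


class LoopThreadExecutor:
    """Runs an asyncio event loop on a dedicated thread and accepts work
    from the synchronous caller via run_coroutine_threadsafe."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self.completed: Queue = Queue()  # thread-safe handoff back to the caller
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    async def _generate(self, prompt: str) -> None:
        # Stand-in for an AsyncLLMEngine.generate(...) call.
        await asyncio.sleep(0.01)
        self.completed.put(f"output for: {prompt!r}")

    def submit(self, prompt: str) -> None:
        # Schedule the coroutine onto the loop thread without blocking the caller.
        asyncio.run_coroutine_threadsafe(self._generate(prompt), self.loop)

    def poll(self) -> list[str]:
        # Non-blocking drain of whatever has finished so far.
        done: list[str] = []
        while not self.completed.empty():
            done.append(self.completed.get())
        return done
```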

Critical Issues Found

  1. Data loss bug in VLLMSink.finalize() (src/daft-local-execution/src/streaming_sink/vllm.rs:297): Only processes the first worker state when max_concurrency > 1, silently dropping buffered data and running tasks from other workers
  2. Race condition in VLLMExecutor.num_running_tasks() (daft/execution/vllm.py:106): Reads running_task_count without lock protection despite concurrent modifications
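
For reference, a minimal sketch of the lock-protected read the second issue asks for (class and method names mirror the review comment, not Daft's actual code):

```python
import threading


class Executor:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.running_task_count = 0

    def _task_started(self) -> None:
        with self._lock:
            self.running_task_count += 1

    def _task_finished(self) -> None:
        with self._lock:
            self.running_task_count -= 1

    def num_running_tasks(self) -> int:
        # Read under the same lock that guards the writers so the count can
        # never be observed mid-update.
        with self._lock:
            return self.running_task_count
```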

Confidence Score: 1/5

  • This PR has critical bugs that will cause data loss and race conditions in production
  • Score reflects two critical logic errors: the VLLMSink.finalize() method only processes the first worker state when concurrency > 1 (causing silent data loss), and VLLMExecutor.num_running_tasks() has an unprotected read of shared state (race condition). Both issues will cause incorrect behavior in production workloads.
  • Pay close attention to src/daft-local-execution/src/streaming_sink/vllm.rs (finalize method must handle all states) and daft/execution/vllm.py (thread safety fix needed)

Important Files Changed

File Analysis

Filename Score Overview
src/daft-local-execution/src/streaming_sink/vllm.rs 1/5 New VLLMSink implementation with critical bug in finalize() that loses data from concurrent workers
daft/execution/vllm.py 2/5 New VLLMExecutor with async task management; has race condition in num_running_tasks()
daft/functions/ai/__init__.py 4/5 Added prompt() vLLM support with PyExpr.vllm() optimization path; has inline import style issues
src/daft-logical-plan/src/ops/vllm.rs 5/5 New VLLMProject logical plan node with schema generation and display methods
src/daft-logical-plan/src/optimization/rules/split_vllm.rs 5/5 New optimization rule to extract VLLM expressions into separate nodes
src/daft-local-execution/src/streaming_sink/base.rs 4/5 Refactored finalize to support iterative output via StreamingSinkFinalizeOutput enum

Sequence Diagram

sequenceDiagram
    participant User
    participant PromptFunc as prompt()
    participant PyExpr as PyExpr.vllm()
    participant LogicalPlan as VLLMProject
    participant Optimizer as SplitVLLM Rule
    participant LocalExec as VLLMSink
    participant VLLMExec as VLLMExecutor
    participant vLLM as AsyncLLMEngine

    User->>PromptFunc: prompt(messages, provider="vllm-prefix-cached")
    PromptFunc->>PromptFunc: Resolve VLLMPrefixCachedPrompterDescriptor
    PromptFunc->>PyExpr: vllm(model, concurrency, args...)
    PyExpr->>LogicalPlan: Create VLLMProject node
    LogicalPlan->>Optimizer: Optimization pass
    Optimizer->>Optimizer: SplitVLLM extracts VLLM expr to separate node
    Optimizer->>LocalExec: Translate to VLLMSink
    
    LocalExec->>VLLMExec: make_state() creates VLLMExecutor
    VLLMExec->>vLLM: Initialize AsyncLLMEngine in new thread
    
    loop For each input batch
        LocalExec->>LocalExec: Buffer input (max_buffer_size)
        LocalExec->>VLLMExec: submit(prompts, rows)
        VLLMExec->>vLLM: asyncio.run_coroutine_threadsafe(_generate)
        LocalExec->>VLLMExec: poll() for completed tasks
        VLLMExec-->>LocalExec: Return completed (outputs, rows)
        LocalExec-->>User: Stream results
    end
    
    LocalExec->>LocalExec: finalize(states) - drain remaining
    LocalExec->>VLLMExec: poll() until all tasks complete
    VLLMExec-->>LocalExec: Final results
    LocalExec-->>User: Final output batch

Additional Comments (2)

  1. daft/functions/ai/__init__.py, line 263

    style: move import to top of file per custom style guide

    Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met...

  2. daft/functions/ai/__init__.py, line 275

    style: move import to top of file per custom style guide

    Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met...

46 files reviewed, 4 comments


@kevinzwang kevinzwang (Member Author) commented:

@greptileai

@greptile-apps greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR introduces experimental VLLM provider support for daft.functions.ai.prompt() with async batching, currently only for the local executor. The implementation bypasses the standard UDF execution path by routing VLLM expressions through a custom streaming sink operator.

Major changes:

  • New expression type: Expr::VLLM added to DSL with VLLMExpr struct containing model config, concurrency settings, and buffer parameters
  • Streaming sink API enhancements: Modified StreamingSink trait to support iterative finalization via StreamingSinkFinalizeOutput enum (allowing sinks to return HasMoreOutput for async processing) and made make_state() fallible
  • New VLLMProject plan node: Added throughout logical plan, local plan, and pipeline layers with proper optimizer rule integration
  • Python executor: VLLMExecutor manages dedicated event loop thread for async vLLM engine with proper locking
  • Execution restrictions: Correctly blocked for Ray/distributed execution with todo!() and NotImplemented errors

API Impact:
All existing streaming sinks updated to new trait signature (make_state() returns DaftResult, finalize() returns StreamingSinkFinalizeOutput). Changes are mechanical and maintain existing behavior.
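
To make the new finalize contract concrete, here is a rough Python rendering of how a caller drives iterative finalization (the real API is the Rust StreamingSinkFinalizeOutput enum; the names below are illustrative only, not Daft code):

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class HasMoreOutput:
    """The sink produced a partial batch and wants finalize() called again."""
    batch: object


@dataclass
class Finished:
    """The sink is done; `batch` is the final output, if any."""
    batch: Optional[object]


def drain(sink, states) -> Iterator[object]:
    # Keep calling finalize() until the sink reports Finished, yielding each
    # partial batch as soon as it is available instead of waiting for the end.
    while True:
        out = sink.finalize(states)
        if isinstance(out, HasMoreOutput):
            yield out.batch
        else:
            if out.batch is not None:
                yield out.batch
            return
```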

Known limitations (per PR description):

  • Only one VLLM expression per projection supported
  • Local executor only
  • Prefix bucketing/routing not yet implemented

Confidence Score: 4/5

  • Safe to merge with minor caveats - streaming sink API changes are well-structured and VLLM implementation is properly isolated
  • The streaming sink API changes are cleanly implemented and all existing sinks have been updated correctly. The VLLM implementation is experimental and properly restricted to local execution. One point deducted because the SplitVLLM optimization rule's hardcoded column name could cause conflicts, and the interaction with SplitUDFs rule needs verification
  • src/daft-logical-plan/src/optimization/rules/split_vllm.rs - verify interaction with SplitUDFs rule and potential column name conflicts with "daft_vllm_output"

Important Files Changed

File Analysis

Filename Score Overview
src/daft-local-execution/src/streaming_sink/base.rs 5/5 Enhanced streaming sink API to support iterative finalization with StreamingSinkFinalizeOutput enum and made make_state() fallible
src/daft-local-execution/src/streaming_sink/vllm.rs 4/5 New VLLM streaming sink implementation with async batching and buffer management, correctly set max_concurrency() to 1
daft/execution/vllm.py 5/5 Python VLLMExecutor with dedicated event loop thread, proper locking on shared state, and async batch submission
src/daft-dsl/src/expr/mod.rs 5/5 Added new VLLM expression variant with proper semantic ID, display, and visitor integration
src/daft-logical-plan/src/ops/vllm.rs 5/5 New VLLMProject logical plan node with proper schema handling and stats state management
src/daft-logical-plan/src/optimization/rules/split_vllm.rs 4/5 Optimizer rule to extract VLLM expressions from projections into dedicated VLLMProject nodes, currently supports one VLLM expr per project
src/daft-logical-plan/src/logical_plan.rs 5/5 Integrated VLLMProject into logical plan enum with proper schema, stats, and child handling
daft/functions/ai/__init__.py 5/5 Updated prompt() to detect VLLMPrefixCachedPrompterDescriptor and route to PyExpr.vllm() instead of UDF execution

Sequence Diagram

sequenceDiagram
    participant User
    participant prompt() as daft.functions.ai.prompt()
    participant PyExpr as PyExpr.vllm()
    participant Optimizer as Logical Plan Optimizer
    participant Pipeline as Pipeline Builder
    participant VLLMSink as VLLMSink (Rust)
    participant VLLMExecutor as VLLMExecutor (Python)
    participant AsyncEngine as vLLM AsyncLLMEngine

    User->>prompt(): prompt(messages, provider="vllm-prefix-cached")
    prompt()->>prompt(): Detect VLLMPrefixCachedPrompterDescriptor
    prompt()->>PyExpr: .vllm(model, concurrency, buffer_size, ...)
    PyExpr->>Optimizer: Create Expr::VLLM in logical plan
    
    Optimizer->>Optimizer: SplitVLLM rule extracts VLLM expr
    Optimizer->>Optimizer: Create VLLMProject logical plan node
    
    Pipeline->>VLLMSink: Translate to StreamingSink
    
    loop For each input batch
        VLLMSink->>VLLMSink: Buffer input until max_buffer_size
        VLLMSink->>VLLMExecutor: submit(prompts, rows)
        VLLMExecutor->>AsyncEngine: asyncio.run_coroutine_threadsafe(_generate())
        AsyncEngine-->>VLLMExecutor: Stream completions to completed_tasks queue
        VLLMSink->>VLLMExecutor: poll() for completed tasks
        VLLMExecutor-->>VLLMSink: Return (outputs, rows) or None
        VLLMSink-->>Pipeline: Return output with NeedMoreInput/HasMoreOutput
    end
    
    Pipeline->>VLLMSink: finalize(states)
    loop Until all tasks complete
        VLLMSink->>VLLMSink: Submit remaining buffered tasks
        VLLMSink->>VLLMExecutor: poll() for results
        alt Tasks still running
            VLLMSink-->>Pipeline: HasMoreOutput with partial results
        else All complete
            VLLMSink-->>Pipeline: Finished with final results
        end
    end

46 files reviewed, 1 comment


@colin-ho colin-ho (Contributor) left a comment

Thoughts on just leveraging async UDFs instead of making dedicated logical + physical ops for vLLM?

This PR here #5451 makes a streaming sink for async UDFs that I think can also work with vLLM. The idea is the same as what you have here, but it uses a JoinSet as the async task pool. The only thing missing is that you have max_buffer_size and max_running_tasks params here, but you can also just control that with the JoinSet, i.e. if the limit is reached, force a join_next().await.
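
Sketching that bounded-pool idea in Python asyncio terms for comparison (the referenced PR uses a tokio JoinSet in Rust; this is only an analogy, not code from either PR):

```python
import asyncio


async def bounded_pool(coros, limit: int) -> list:
    """Run coroutines with at most `limit` in flight; when the pool is full,
    wait for one to finish before admitting the next -- the asyncio analogue
    of forcing a join_next().await on a full JoinSet."""
    pending: set[asyncio.Task] = set()
    results: list = []
    for coro in coros:
        if len(pending) >= limit:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            results.extend(t.result() for t in done)
        pending.add(asyncio.create_task(coro))
    if pending:
        done, _ = await asyncio.wait(pending)
        results.extend(t.result() for t in done)
    return results
```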

@kevinzwang kevinzwang (Member Author) commented:

Thoughts on just leveraging async UDFs instead of making dedicated logical + physical ops for vLLM?

It's true that what's implemented here can probably be done with async UDFs, but this is just the first step in our work on prefix routing for vLLM. We will need to add additional logic on the Swordfish side to allow bucketing the buffer by prefix before emitting, which is why I have this buffer. There's also additional work on the Flotilla side for routing that will require a distributed operator.

@srilman srilman (Contributor) left a comment

Overall LGTM, just a couple of nits.

@kevinzwang kevinzwang enabled auto-merge (squash) November 4, 2025 04:30
@kevinzwang kevinzwang (Member Author) commented:

@greptileai

@greptile-apps greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR adds experimental support for vLLM-based LLM inference with prefix caching and async batching optimization. When using VLLMPrefixCachedProvider, the prompt function creates a VLLMExpr instead of a UDF, which gets optimized into a custom streaming sink operator.

Key Changes:

  • New VLLMSink streaming sink with prefix bucketing and async batching
  • SplitVLLM optimizer rule extracts VLLM expressions into dedicated VLLMProject nodes
  • LocalVLLMExecutor and RemoteVLLMExecutor handle local and distributed (Ray) execution
  • PrefixRouter load balances requests across multiple Ray actors based on prefix similarity
  • Integration in daft.functions.ai.prompt() detects vLLM provider and bypasses standard UDF path

Architecture:
The implementation uses a streaming sink pattern where prompts are buffered, sorted by prefix similarity, bucketed together, and submitted to vLLM's async engine. Results are polled and returned incrementally. For distributed mode, multiple Ray actors are spawned with a prefix-aware router.
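
A rough illustration of the buffer/sort/bucket step described above (illustrative Python only; the actual sink implements this in Rust, and the fixed prefix length is a made-up parameter):

```python
from collections import defaultdict


def bucket_by_prefix(prompts: list[str], prefix_len: int = 64) -> dict[str, list[str]]:
    # Sorting places prompts that share a prefix next to each other; grouping
    # them by a leading slice means every request in a bucket after the first
    # can hit vLLM's prefix cache.
    buckets: dict[str, list[str]] = defaultdict(list)
    for p in sorted(prompts):
        buckets[p[:prefix_len]].append(p)
    return dict(buckets)


prompts = [
    "You are a helpful assistant. Summarize: report A",
    "You are a helpful assistant. Summarize: report B",
    "Translate to French: hello",
]
for prefix, group in bucket_by_prefix(prompts, prefix_len=32).items():
    print(len(group), "prompt(s) share prefix", repr(prefix))
```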

Confidence Score: 4/5

  • Safe to merge as experimental feature with minor style improvements needed
  • The implementation is well-structured with proper async handling and the critical max_concurrency bug from previous comments is fixed. One inline import violates project style guidelines. The experimental nature is clearly documented.
  • daft/execution/vllm.py has an inline import that should be moved to the top of the file

Important Files Changed

File Analysis

Filename Score Overview
daft/execution/vllm.py 4/5 Implements VLLMExecutor classes for local, blocking, and distributed execution with async batching and prefix routing
src/daft-local-execution/src/streaming_sink/vllm.rs 5/5 VLLMSink implementation with prefix bucketing logic, correctly sets max_concurrency to 1
daft/ai/vllm/provider.py 5/5 Simple provider class for vLLM prefix caching
daft/ai/vllm/protocols/prompter.py 5/5 PrompterDescriptor configuration for vLLM with prefix caching parameters
src/daft-logical-plan/src/ops/vllm.rs 5/5 VLLMProject logical plan node definition
src/daft-logical-plan/src/optimization/rules/split_vllm.rs 5/5 Optimizer rule to extract VLLM expressions from projections into dedicated VLLMProject nodes
src/daft-distributed/src/pipeline_node/vllm.rs 5/5 Distributed execution node for VLLM with Ray actors initialization
daft/functions/ai/__init__.py 5/5 Updated prompt function to detect and use vLLM provider via PyExpr.vllm() instead of UDF path

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Prompt as daft.functions.ai.prompt()
    participant Optimizer as SplitVLLM Rule
    participant Sink as VLLMSink
    participant Executor as LocalVLLMExecutor
    participant VLLM as vLLM AsyncEngine
    
    User->>Prompt: prompt(col("text"), provider=vllm)
    Prompt->>Prompt: Detect VLLMPrefixCachingPrompterDescriptor
    Prompt->>Prompt: Create VLLMExpr (not UDF)
    Prompt-->>User: Return Expression
    
    User->>User: Execute dataframe operation
    
    Note over Optimizer: Logical Plan Optimization
    Optimizer->>Optimizer: Extract VLLMExpr from Project
    Optimizer->>Optimizer: Create VLLMProject node
    
    Note over Sink: Physical Execution
    Sink->>Sink: Buffer incoming data
    Sink->>Sink: Sort by prefix similarity
    Sink->>Sink: Bucket prompts by prefix
    Sink->>Executor: submit(prefix, prompts, rows)
    Executor->>VLLM: Generate async (streaming)
    VLLM-->>Executor: Yield outputs
    Executor->>Executor: Store completed results
    Sink->>Executor: poll()
    Executor-->>Sink: Return (outputs, rows)
    Sink-->>User: Yield results incrementally

58 files reviewed, 1 comment


@kevinzwang kevinzwang merged commit 4829f43 into main Nov 4, 2025
39 checks passed
@kevinzwang kevinzwang deleted the kevin/vllm-prompt branch November 4, 2025 05:06