refactor: Use embed text in document benchmark #5365
base: main
Greptile Overview
Summary
Refactored the document embedding benchmark to use Daft's built-in embed_text function instead of a custom UDF class. This simplifies the code by removing ~30 lines of custom embedding logic.
Key changes:
- Replaced custom `EmbedderUDF` class (30 lines) with `embed_text(df["chunk"], provider="sentence_transformers", model=EMBED_MODEL_ID)`
- Removed unused imports: `torch`
- Removed constants: `EMBEDDING_DIM`, `NUM_GPU_NODES`, `EMBEDDING_BATCH_SIZE` (now handled internally by `embed_text`)
- Added import: `from daft.functions.ai import embed_text`

The new embed_text API automatically handles:
- GPU/CPU device selection via `SentenceTransformer`'s automatic device detection
- Concurrency settings based on available GPUs using `get_gpu_udf_options()` (detects Ray cluster GPUs)
- Embedding dimensions via `AutoConfig.from_pretrained()`
- Model compilation and inference mode optimizations
 
This change aligns the benchmark with Daft's recommended patterns for AI operations and makes the code more maintainable.
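
The replacement described in the summary can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the helper name `add_embeddings`, the output column name `embedding`, and the value assigned to `EMBED_MODEL_ID` are assumptions made for the example; only the `embed_text(...)` call shape and the import path come from the summary above.

```python
# Assumed value: the summary names EMBED_MODEL_ID but does not show what it is set to.
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"


def add_embeddings(df, text_column="chunk", output_column="embedding"):
    """Return `df` with an extra column of embeddings for `text_column`.

    Hypothetical helper for illustration; `df` is expected to be a Daft DataFrame.
    """
    # Imported lazily so this module can be loaded without Daft installed.
    from daft.functions.ai import embed_text

    # Call shape as quoted in the PR summary. Device selection, concurrency,
    # and embedding dimensions are left to embed_text rather than set here.
    return df.with_column(
        output_column,
        embed_text(
            df[text_column],
            provider="sentence_transformers",
            model=EMBED_MODEL_ID,
        ),
    )
```

In a pipeline this would sit between the chunking step and the Parquet write, for example `df = add_embeddings(df)`. The model is only loaded when the DataFrame is actually executed, since building the column is lazy.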
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The refactoring replaces custom UDF code with Daft's built-in `embed_text` API. The new API provides equivalent or better functionality with automatic GPU detection and concurrency management that matches the original hardcoded settings (8 GPU nodes, 1 GPU per worker). The implementation has been verified against test cases and follows project conventions.
- No files require special attention
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| benchmarking/ai/document_embedding/daft_main.py | 5/5 | Refactored to use the new embed_text API, replacing custom Embedder UDF class with built-in function | 
Sequence Diagram
sequenceDiagram
    participant Benchmark as Benchmark Script
    participant Daft as Daft DataFrame
    participant EmbedText as embed_text()
    participant Provider as SentenceTransformersProvider
    participant UDF as UDF System
    participant Model as SentenceTransformer Model
    Benchmark->>Daft: Load & process PDFs
    Daft->>Daft: Extract text, chunk documents
    Benchmark->>EmbedText: embed_text(chunks, provider, model)
    EmbedText->>Provider: get_text_embedder(model)
    Provider->>Provider: get_gpu_udf_options()
    Note over Provider: Detects Ray cluster GPUs<br/>Sets concurrency automatically
    Provider->>EmbedText: Return TextEmbedderDescriptor
    EmbedText->>UDF: Create UDF with GPU settings
    UDF->>Model: Initialize SentenceTransformer
    Note over Model: Auto device selection<br/>(CUDA/MPS/CPU)
    UDF->>Model: encode(text_batches)
    Model->>UDF: Return embeddings
    UDF->>Daft: Add embedding column
    Daft->>Benchmark: Write to Parquet
1 file reviewed, no comments
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5365      +/-   ##
==========================================
- Coverage   75.27%   74.70%   -0.57%     
==========================================
  Files         988      988              
  Lines      124736   125664     +928     
==========================================
- Hits        93889    93879      -10     
- Misses      30847    31785     +938     
Changes Made
Related Issues
Checklist
`docs/mkdocs.yml` navigation