- 
                Notifications
    
You must be signed in to change notification settings  - Fork 328
 
fix: Lower json inflation factor #5461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
          Codecov Report❌ Patch coverage is  Additional details and impacted files@@            Coverage Diff             @@
##             main    #5461      +/-   ##
==========================================
- Coverage   71.72%   70.07%   -1.65%     
==========================================
  Files         997      996       -1     
  Lines      126858   128630    +1772     
==========================================
- Hits        90984    90138     -846     
- Misses      35874    38492    +2618     
 🚀 New features to boost your workflow:
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR lowers the default JSON inflation factor from 0.5 to 0.25 based on real-world workload testing and makes it configurable via API and environment variables. It threads the execution config through the optimizer to ensure size estimations use the correct inflation factors during query planning.
Key Changes:
- Changed default 
json_inflation_factorfrom 0.5 to 0.25 inDaftExecutionConfig - Added 
json_inflation_factorparameter to Python API (set_execution_config) - Added environment variable support: 
DAFT_JSON_INFLATION_FACTOR,DAFT_PARQUET_INFLATION_FACTOR,DAFT_CSV_INFLATION_FACTOR - Fixed JSON file size estimation to use 
json_inflation_factorinstead ofcsv_inflation_factor - Threaded 
execution_configthrough optimizer →EnrichWithStats→Source→ScanTaskfor accurate size estimates - Added test coverage for JSON inflation factor configuration
 
Issues Found:
- Critical bug in 
optimizer.rs:221whereReorderJoinsdoesn't receive the passed config, using default values instead 
Confidence Score: 3/5
- PR has a critical logic bug preventing join reordering from using configured inflation factors
 - The implementation correctly adds the new config parameter and threads it through most of the codebase, but there's a critical bug where ReorderJoins ignores the passed config parameter and uses default values. This means join reordering decisions won't use the configured inflation factors, defeating part of the PR's purpose. The rest of the changes are sound.
 - src/daft-logical-plan/src/optimization/optimizer.rs requires immediate attention to fix the config threading bug
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| src/daft-logical-plan/src/optimization/optimizer.rs | 2/5 | Added execution config threading through optimizer; critical bug where ReorderJoins doesn't receive config | 
| src/common/daft-config/src/lib.rs | 5/5 | Changed default json_inflation_factor from 0.5 to 0.25; added env var support for inflation factors | 
| src/daft-scan/src/lib.rs | 5/5 | Fixed JSON file size estimation to use json_inflation_factor instead of csv_inflation_factor | 
| tests/test_size_estimations.py | 5/5 | Added new test for json_inflation_factor configuration | 
| src/daft-logical-plan/src/builder/mod.rs | 4/5 | Updated optimize methods to accept and pass execution config to optimizer and join reordering | 
Sequence Diagram
sequenceDiagram
    participant User
    participant DataFrame
    participant LogicalPlanBuilder
    participant Optimizer
    participant EnrichWithStats
    participant ReorderJoins
    participant JoinGraphBuilder
    participant Source
    participant ScanTask
    
    User->>DataFrame: set_execution_config(json_inflation_factor=0.25)
    User->>DataFrame: read_json() / optimize()
    DataFrame->>LogicalPlanBuilder: optimize(execution_config)
    LogicalPlanBuilder->>Optimizer: optimize with execution_config
    
    Note over Optimizer: Default optimizations (pushdowns, etc.)
    
    Optimizer->>EnrichWithStats: enrich_with_stats(Some(execution_config))
    EnrichWithStats->>Source: with_materialized_stats(cfg)
    Source->>ScanTask: estimate_in_memory_size_bytes(Some(cfg))
    Note over ScanTask: Uses cfg.json_inflation_factor=0.25<br/>instead of 0.5
    ScanTask-->>Source: estimated size
    Source-->>EnrichWithStats: stats with updated size
    EnrichWithStats-->>Optimizer: enriched plan
    
    Optimizer->>ReorderJoins: reorder_joins(Some(execution_config))
    Note over ReorderJoins: BUG: Uses default config<br/>instead of passed config
    ReorderJoins->>JoinGraphBuilder: from_logical_plan(plan, cfg)
    JoinGraphBuilder->>Source: with_materialized_stats(cfg)
    Source->>ScanTask: estimate_in_memory_size_bytes(Some(cfg))
    ScanTask-->>Source: estimated size
    Source-->>JoinGraphBuilder: stats
    JoinGraphBuilder-->>ReorderJoins: reordered joins
    ReorderJoins->>EnrichWithStats: enrich_with_stats(Some(execution_config))
    EnrichWithStats-->>Optimizer: final enriched plan
    
    Optimizer-->>LogicalPlanBuilder: optimized plan
    LogicalPlanBuilder-->>DataFrame: optimized plan
    DataFrame-->>User: result
    21 files reviewed, 1 comment
| pub fn reorder_joins(mut self, cfg: Option<Arc<DaftExecutionConfig>>) -> Self { | ||
| self.rule_batches.push(RuleBatch::new( | ||
| vec![ | ||
| Box::new(ReorderJoins::new()), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: ReorderJoins::new() ignores the cfg parameter passed to the function, should use with_config
| Box::new(ReorderJoins::new()), | |
| Box::new(ReorderJoins::with_config(cfg.clone().unwrap_or_default())), | 
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daft-logical-plan/src/optimization/optimizer.rs
Line: 221:221
Comment:
**logic:** `ReorderJoins::new()` ignores the `cfg` parameter passed to the function, should use `with_config`
```suggestion
                Box::new(ReorderJoins::with_config(cfg.clone().unwrap_or_default())),
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh hm do we really want to pass execution configs into the optimizer?
I'm also kind of curious, why were we able to handle CSV and Parquet inflation factors without having to do this?
          
 We weren't. Currently the optimizer has an  The   | 
    
Changes Made
This PR changes our estimate for json inflation factor based on a real workload. Additionally it allows users to configure the estimated inflation factor via config.
This required threading the execution config through to the optimizer.
Related Issues
Checklist
docs/mkdocs.ymlnavigation