xXSup3rN0v4Xx
Description

This PR introduces a new PolarsQueryEngine alongside the existing PandasQueryEngine, enabling natural language querying of Polars DataFrames with LLMs. The implementation provides complete feature parity with PandasQueryEngine while leveraging Polars' performance benefits for large-scale columnar data processing.

Key Features Added

  • PolarsQueryEngine: Full-featured query engine supporting natural language to Polars code conversion
  • Security: Generated code runs through LlamaIndex's exec_utils restricted execution (the same mechanism as PandasQueryEngine) to guard against RCE
  • Optimized Prompts: Polars-specific syntax rules and examples for reliable LLM code generation
  • Async Support: Both sync (_query) and async (_aquery) query methods
  • Comprehensive Documentation: API reference and Jupyter notebook following LlamaIndex patterns

Files Added/Modified

Core Implementation

  • llama-index-experimental/llama_index/experimental/query_engine/polars/polars_query_engine.py - Main PolarsQueryEngine class
  • llama-index-experimental/llama_index/experimental/query_engine/polars/output_parser.py - Secure execution with PolarsInstructionParser
  • llama-index-experimental/llama_index/experimental/query_engine/polars/prompts.py - Optimized Polars-specific prompts
  • llama-index-experimental/llama_index/experimental/query_engine/polars/__init__.py - Module exports

Documentation & Integration

  • docs/api_reference/api_reference/query_engine/polars.md - API reference documentation
  • docs/examples/query_engine/polars_query_engine.ipynb - Comprehensive Jupyter notebook tutorial
  • docs/src/content/docs/framework/module_guides/deploying/query_engine/modules.md - Added to structured data query engines list
  • llama-index-experimental/llama_index/experimental/__init__.py - Added PolarsQueryEngine export

Testing

  • llama-index-experimental/tests/test_polars.py - Complete test suite (5 tests covering functionality, security, and complex operations)

Cleanup

  • Removed llama-index-experimental/demos/demo_polars.py (replaced with proper Jupyter notebook following LlamaIndex patterns)

Testing Results

5 passed, 0 failed, 1 warning in 12.90s
✅ test_polars_query_engine - Basic functionality validation
✅ test_default_output_processor_rce - Security/RCE protection 
✅ test_default_output_processor_rce2 - Advanced security validation
✅ test_default_output_processor_e2e - End-to-end functionality
✅ test_polars_query_engine_complex_operations - Complex query scenarios

Usage Example

```python
import polars as pl
from llama_index.experimental.query_engine import PolarsQueryEngine

# Create DataFrame
df = pl.DataFrame({
    "city": ["Toronto", "Tokyo", "Berlin"],
    "population": [2930000, 13960000, 3645000]
})

# Initialize query engine
query_engine = PolarsQueryEngine(df=df, verbose=True)

# Natural language query
response = query_engine.query("What is the city with the highest population?")
print(response)  # Tokyo
```

Performance Benefits

  • Columnar Storage: Uses Apache Arrow for efficient memory layout
  • Lazy Evaluation: Optimizes query plans before execution
  • Parallel Processing: Multi-threaded operations by default
  • Memory Efficiency: Lower memory usage compared to pandas for large datasets

Fixes # (N/A - this is a new feature enhancement)

New Package?

Yes
No (extends existing llama-index-experimental package)

Version Bump?

Yes
No (no version bump needed as this is an addition to existing experimental package)

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

I added new unit tests to cover this change
I believe this change is already covered by existing unit tests

Testing Details:

  • Complete test suite with 5 comprehensive tests covering all functionality
  • Security validation including RCE protection tests
  • Complex operations testing (filtering, grouping, aggregations)
  • Mock LLM testing for reliable CI/CD execution
  • End-to-end integration testing
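
The mock-LLM approach mentioned above can be sketched roughly as follows. This is an illustration only: `MockLLM` and `answer` are hypothetical names, not the test doubles or engine used in tests/test_polars.py, and the toy engine works on a plain dict rather than a Polars DataFrame so the sketch has no third-party dependencies.

```python
class MockLLM:
    """Hypothetical test double that returns a canned instruction string."""

    def __init__(self, canned: str) -> None:
        self.canned = canned
        self.prompts = []  # records every prompt so tests can inspect it

    def predict(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned


def answer(llm, question: str, populations: dict) -> str:
    """Toy stand-in for a query engine: the LLM picks the operation to run."""
    operation = llm.predict(f"Question: {question}")
    if operation == "argmax":
        return max(populations, key=populations.get)
    raise ValueError(f"unsupported operation: {operation}")


def test_answer_uses_canned_instruction() -> None:
    llm = MockLLM("argmax")
    data = {"Toronto": 2930000, "Tokyo": 13960000, "Berlin": 3645000}
    assert answer(llm, "Which city has the highest population?", data) == "Tokyo"
    assert llm.prompts == ["Question: Which city has the highest population?"]


test_answer_uses_canned_instruction()
```

Because the LLM output is fixed, the test is deterministic and needs no network access or API key, which is what makes this style suitable for CI.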

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran uv run make format; uv run make lint to appease the lint gods

Additional Notes

This implementation maintains complete compatibility with the existing LlamaIndex ecosystem while adding Polars support for users who need the performance benefits of columnar data processing. The API is consistent with PandasQueryEngine, making it easy for users to switch between implementations based on their performance requirements.

The documentation follows LlamaIndex patterns exactly, with the Jupyter notebook structured identically to the pandas equivalent for consistency and ease of use.

…y features

- Add new PolarsQueryEngine alongside existing PandasQueryEngine
- Support for Polars DataFrame querying with expression-based API
- Implement PolarsInstructionParser with safe code execution
- Add polars-specific prompts with syntax guidance for LLM
- Comprehensive test suite with 5 test cases covering:
  * Basic query engine functionality
  * RCE protection and security validation
  * End-to-end operations testing
  * Complex operations (filtering, grouping, aggregations)
- Add polars to ALLOWED_IMPORTS in exec_utils.py for secure execution
- Full integration with LlamaIndex ecosystem
- Demo script showing usage examples and comparisons with PandasQueryEngine
- All tests pass with security measures validated
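
As a rough illustration of the import allow-list idea, the sketch below scans generated code with `ast` and reports imports outside an allowed set. It is not the code in exec_utils.py: `ALLOWED_IMPORTS_SKETCH` and `find_disallowed_imports` are hypothetical names, and the set contents are assumed for the example.

```python
import ast

# Hypothetical allow-list for illustration; the real set lives in exec_utils.py
ALLOWED_IMPORTS_SKETCH = {"math", "time", "datetime", "pandas", "polars", "numpy"}


def find_disallowed_imports(code: str) -> list:
    """Return the root module names imported by `code` that are not allowed."""
    disallowed = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have module=None; treat them as disallowed
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in ALLOWED_IMPORTS_SKETCH:
                disallowed.append(root)
    return disallowed


print(find_disallowed_imports("import polars as pl"))  # → []
print(find_disallowed_imports("import os\nfrom subprocess import run"))
```

An import check like this is only one layer. It does not catch calls such as `__import__` or attribute tricks, so a real sandbox needs further restrictions on builtins and names.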

Files added:
- llama_index/experimental/query_engine/polars/__init__.py
- llama_index/experimental/query_engine/polars/polars_query_engine.py
- llama_index/experimental/query_engine/polars/output_parser.py
- llama_index/experimental/query_engine/polars/prompts.py
- tests/test_polars.py
- demos/demo_polars.py

Files modified:
- llama_index/experimental/exec_utils.py (added polars to ALLOWED_IMPORTS)
- llama_index/experimental/query_engine/__init__.py (added exports)
… API integration

- Add PolarsQueryEngine API reference documentation (polars.md)
- Create comprehensive Jupyter notebook following LlamaIndex patterns (polars_query_engine.ipynb)
- Update main experimental __init__.py to export PolarsQueryEngine
- Add PolarsQueryEngine to query engine modules documentation
- Optimize Polars prompts for better LLM code generation
- Remove demo file following LlamaIndex documentation patterns
- All tests passing (5/5) with comprehensive coverage including security tests

dosubot (bot) added the size:XL label (this PR changes 500-999 lines, ignoring generated files) on Oct 10, 2025
Member

@AstraBert left a comment

Looks good overall, though I would add the async implementation.

Comment on lines 19 to 21
```python
import ast
import sys
import traceback
```
Member

Can we place imports at the top?

Comment on lines +138 to +140
```python
def _get_prompt_modules(self) -> PromptMixinType:
    """Get prompt sub-modules."""
    return {}
```
Member

Not super sure why we need this function(?) (if it's necessary for inheritance, you can just pass)

Comment on lines 195 to 197
```python
async def _aquery(self, query_bundle: QueryBundle) -> Response:
    return self._query(query_bundle)
```

Member

Can we actually add an async implementation? Using the async methods for the LLMs (like llm.apredict etc)
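
A minimal sketch of the shape this request points at, assuming an LLM object that exposes an `apredict` coroutine as the comment suggests. `FakeLLM`, `SketchEngine`, and `run_instruction` are hypothetical stand-ins, not the classes in this PR. The real method would take a `QueryBundle` and return a `Response`.

```python
import asyncio


class FakeLLM:
    """Hypothetical stand-in for an LLM client (not a LlamaIndex class)."""

    def predict(self, prompt: str) -> str:
        return "len(rows)"

    async def apredict(self, prompt: str) -> str:
        # A real client would await a network call here
        await asyncio.sleep(0)
        return "len(rows)"


def run_instruction(code: str, rows: list) -> str:
    """Hypothetical placeholder for the instruction parser."""
    # Only one fixed instruction is recognized; nothing is eval'd
    if code == "len(rows)":
        return str(len(rows))
    raise ValueError(f"unsupported instruction: {code}")


class SketchEngine:
    def __init__(self, llm, rows: list) -> None:
        self._llm = llm
        self._rows = rows

    def _query(self, query_str: str) -> str:
        code = self._llm.predict(query_str)
        return run_instruction(code, self._rows)

    async def _aquery(self, query_str: str) -> str:
        # Await the async LLM call instead of delegating to the sync _query
        code = await self._llm.apredict(query_str)
        return run_instruction(code, self._rows)


engine = SketchEngine(FakeLLM(), rows=["Toronto", "Tokyo", "Berlin"])
result = asyncio.run(engine._aquery("How many cities are there?"))
print(result)  # → 3
```

The only structural difference from the sync path is the awaited LLM call, which lets the event loop run other work while the request is in flight.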

…fix import ordering

- Move imports (ast, sys, traceback) to module top in polars/output_parser.py
- Implement proper async _aquery methods using await and llm.apredict() in both:
  - polars/polars_query_engine.py
  - pandas/pandas_query_engine.py (bonus fix)
- Replace simple sync wrapper with full async implementation for true concurrency
- Addresses feedback from @AstraBert on PR review
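
To illustrate why a sync wrapper does not give concurrency, the sketch below runs two queries with `asyncio.gather` and records the order of events. The function names are hypothetical and the sleeps stand in for LLM latency. This is not code from the PR.

```python
import asyncio
import time

events = []


async def blocking_query(name: str) -> None:
    # Mimics an async method that simply calls the sync path
    events.append(f"start {name}")
    time.sleep(0.01)  # blocks the event loop, so nothing else can run
    events.append(f"end {name}")


async def awaiting_query(name: str) -> None:
    # Mimics an async method that awaits the LLM call
    events.append(f"start {name}")
    await asyncio.sleep(0.01)  # yields control to the event loop
    events.append(f"end {name}")


async def run_pair(query_fn) -> list:
    events.clear()
    await asyncio.gather(query_fn("a"), query_fn("b"))
    return list(events)


blocking_order = asyncio.run(run_pair(blocking_query))
awaiting_order = asyncio.run(run_pair(awaiting_query))
print(blocking_order)  # query "a" finishes before "b" starts
print(awaiting_order)  # both queries start before either finishes
```

With the blocking version the two queries run back to back even under `gather`. With the awaiting version their waits overlap.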
dosubot (bot) added the size:XXL label (this PR changes 1000+ lines, ignoring generated files) and removed the size:XL label (this PR changes 500-999 lines, ignoring generated files) on Oct 13, 2025