-
Notifications
You must be signed in to change notification settings - Fork 6.4k
feat: Add PolarsQueryEngine with comprehensive documentation and API integration #20065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…y features - Add new PolarsQueryEngine alongside existing PandasQueryEngine - Support for Polars DataFrame querying with expression-based API - Implement PolarsInstructionParser with safe code execution - Add polars-specific prompts with syntax guidance for LLM - Comprehensive test suite with 5 test cases covering: * Basic query engine functionality * RCE protection and security validation * End-to-end operations testing * Complex operations (filtering, grouping, aggregations) - Add polars to ALLOWED_IMPORTS in exec_utils.py for secure execution - Full integration with LlamaIndex ecosystem - Demo script showing usage examples and comparisons with PandasQueryEngine - All tests pass with security measures validated Files added: - llama_index/experimental/query_engine/polars/__init__.py - llama_index/experimental/query_engine/polars/polars_query_engine.py - llama_index/experimental/query_engine/polars/output_parser.py - llama_index/experimental/query_engine/polars/prompts.py - tests/test_polars.py - demos/demo_polars.py Files modified: - llama_index/experimental/exec_utils.py (added polars to ALLOWED_IMPORTS) - llama_index/experimental/query_engine/__init__.py (added exports)
… API integration - Add PolarsQueryEngine API reference documentation (polars.md) - Create comprehensive Jupyter notebook following LlamaIndex patterns (polars_query_engine.ipynb) - Update main experimental __init__.py to export PolarsQueryEngine - Add PolarsQueryEngine to query engine modules documentation - Optimize Polars prompts for better LLM code generation - Remove demo file following LlamaIndex documentation patterns - All tests passing (5/5) with comprehensive coverage including security tests
… API integration - Add PolarsQueryEngine API reference documentation (polars.md) - Create comprehensive Jupyter notebook following LlamaIndex patterns (polars_query_engine.ipynb) - Update main experimental __init__.py to export PolarsQueryEngine - Add PolarsQueryEngine to query engine modules documentation - Optimize Polars prompts for better LLM code generation - Remove demo file following LlamaIndex documentation patterns - All tests passing (5/5) with comprehensive coverage including security tests
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good overall, I would add the async implementation tho
import ast | ||
import sys | ||
import traceback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we place imports at the top?
def _get_prompt_modules(self) -> PromptMixinType: | ||
"""Get prompt sub-modules.""" | ||
return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super sure why we need this function(?) (if it's necessary for inheritance, you can just pass)
async def _aquery(self, query_bundle: QueryBundle) -> Response: | ||
return self._query(query_bundle) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we actually add an async implementation? Using the async methods for the LLMs (like llm.apredict etc)
…fix import ordering - Move imports (ast, sys, traceback) to module top in polars/output_parser.py - Implement proper async _aquery methods using await and llm.apredict() in both: - polars/polars_query_engine.py - pandas/pandas_query_engine.py (bonus fix) - Replace simple sync wrapper with full async implementation for true concurrency - Addresses feedback from @AstraBert on PR review
Description
This PR introduces a new PolarsQueryEngine alongside the existing PandasQueryEngine, enabling natural language querying of Polars DataFrames with LLMs. The implementation provides complete feature parity with PandasQueryEngine while leveraging Polars' performance benefits for large-scale columnar data processing.
Key Features Added
exec_utils
sandboxing (same as PandasQueryEngine)_query
) and async (_aquery
) query methodsFiles Added/Modified
Core Implementation
llama-index-experimental/llama_index/experimental/query_engine/polars/polars_query_engine.py
- Main PolarsQueryEngine classllama-index-experimental/llama_index/experimental/query_engine/polars/output_parser.py
- Secure execution with PolarsInstructionParserllama-index-experimental/llama_index/experimental/query_engine/polars/prompts.py
- Optimized Polars-specific promptsllama-index-experimental/llama_index/experimental/query_engine/polars/__init__.py
- Module exportsDocumentation & Integration
docs/api_reference/api_reference/query_engine/polars.md
- API reference documentationdocs/examples/query_engine/polars_query_engine.ipynb
- Comprehensive Jupyter notebook tutorialdocs/src/content/docs/framework/module_guides/deploying/query_engine/modules.md
- Added to structured data query engines listllama-index-experimental/llama_index/experimental/__init__.py
- Added PolarsQueryEngine exportTesting
llama-index-experimental/tests/test_polars.py
- Complete test suite (5 tests covering functionality, security, and complex operations)Cleanup
llama-index-experimental/demos/demo_polars.py
(replaced with proper Jupyter notebook following LlamaIndex patterns)Testing Results
Performance Benefits
Columnar Storage: Uses Apache Arrow for efficient memory layout
Lazy Evaluation: Optimizes query plans before execution
Parallel Processing: Multi-threaded operations by default
Memory Efficiency: Lower memory usage compared to pandas for large datasets
Fixes # (N/A - this is a new feature enhancement)
New Package?
Yes
No (extends existing llama-index-experimental package)
Version Bump?
Yes
No (no version bump needed as this is an addition to existing experimental package)
Type of Change
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update
How Has This Been Tested?
I added new unit tests to cover this change
I believe this change is already covered by existing unit tests
Testing Details:
Complete test suite with 5 comprehensive tests covering all functionality
Security validation including RCE protection tests
Complex operations testing (filtering, grouping, aggregations)
Mock LLM testing for reliable CI/CD execution
End-to-end integration testing
Suggested Checklist:
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran uv run make format; uv run make lint to appease the lint gods
Additional Notes
This implementation maintains complete compatibility with the existing LlamaIndex ecosystem while adding Polars support for users who need the performance benefits of columnar data processing. The API is consistent with PandasQueryEngine, making it easy for users to switch between implementations based on their performance requirements.
The documentation follows LlamaIndex patterns exactly, with the Jupyter notebook structured identically to the pandas equivalent for consistency and ease of use.