Improve metadata access for pristine DataStore sources #536
Merged
explain() now only prints to stdout and returns None instead of a string. Tests updated to capture stdout via redirect_stdout for assertions.
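The test-side change can be sketched as follows. This is a minimal illustration, not the project's actual test code: the `explain()` function here is a hypothetical stand-in that only mimics the print-only behaviour described above, and the plan text it prints is invented.

```python
import io
from contextlib import redirect_stdout


def explain() -> None:
    """Hypothetical stand-in for a print-only explain(): prints, returns None."""
    print("Execution Plan")
    print("  1. scan source")


def capture_explain() -> str:
    """Run explain() and return everything it wrote to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        result = explain()
    # Print-only contract: there is no return value to assert on.
    assert result is None
    return buffer.getvalue()


output = capture_explain()
assert "Execution Plan" in output
```

Assertions that previously inspected the returned string now inspect the captured buffer instead.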
For unmodified DataStores (freshly created from DataFrame, file, or remote
ClickHouse), dtypes/columns/shape/size/empty now read metadata directly
from the source without triggering full data loading:
- DataFrame source: reads from the in-memory source DataFrame directly
- File/remote source: uses DESCRIBE, COUNT(*), LIMIT 0 SQL queries
- ndim: always returns 2 (constant for DataFrame-like objects)
Once operations are applied (filter, groupby, assign, etc.), all
properties fall back to full execution as before.
This fixes the issue where session.table("trips").columns would load
all data from a remote ClickHouse table into memory.
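The idea behind the file/remote path can be illustrated with metadata-only queries. The sketch below uses the standard-library `sqlite3` module as a stand-in for ClickHouse/chDB, so it is not the code in this PR: the `pristine_metadata` helper and the `trips` schema are hypothetical, and `DESCRIBE` (used for dtypes in the PR) has no direct SQLite equivalent, so dtypes are omitted here.

```python
import sqlite3


def pristine_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Collect column names and shape with metadata-only queries.

    `table` is interpolated into SQL, so it must be a trusted identifier.
    """
    # LIMIT 0 returns no rows, but the cursor still describes the columns.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [col[0] for col in cursor.description]

    # COUNT(*) yields the row count without transferring any row data.
    (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()

    return {
        "columns": columns,
        "shape": (row_count, len(columns)),
        "size": row_count * len(columns),
        "empty": row_count == 0 or len(columns) == 0,
        "ndim": 2,  # constant for DataFrame-like objects
    }


# Hypothetical table standing in for a remote "trips" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, distance REAL, city TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, 2.5, "Oslo"), (2, 7.1, "Lima")],
)
meta = pristine_metadata(conn, "trips")
```

Only a column description and a single count cross the connection, regardless of how large the table is.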
95f77e6 to f3b0f37
37 tests covering all three source types:
- DataFrame source: columns, dtypes, shape, size, empty, ndim, edge cases
- File source (parquet/csv): metadata via SQL without full data loading
- Remote ClickHouse source: system tables and test_db via clickhouse_server
  fixture, including verification that metadata matches full execution and
  does not populate _cached_result
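The two properties these tests check (metadata matches full execution, and metadata access leaves `_cached_result` empty) can be sketched with a toy lazy store. `ToyStore` is a hypothetical miniature written for this illustration; only the attribute name `_cached_result` comes from the PR, and the real DataStore internals may differ.

```python
class ToyStore:
    """Hypothetical miniature of a lazy store; not the real DataStore class."""

    def __init__(self, rows, columns):
        self._source_rows = rows            # stands in for the source data
        self._source_columns = list(columns)
        self._ops = []                      # pending lazy operations
        self._cached_result = None          # filled only by full execution

    def filter(self, predicate):
        """Return a new, no-longer-pristine store with one pending filter."""
        new = ToyStore(self._source_rows, self._source_columns)
        new._ops = self._ops + [predicate]
        return new

    def _execute(self):
        """Full execution: apply every pending operation and cache the rows."""
        if self._cached_result is None:
            rows = self._source_rows
            for predicate in self._ops:
                rows = [r for r in rows if predicate(r)]
            self._cached_result = rows
        return self._cached_result

    @property
    def shape(self):
        if not self._ops:
            # Pristine: answer from the source, no execution, no caching.
            return (len(self._source_rows), len(self._source_columns))
        # Modified: fall back to full execution.
        return (len(self._execute()), len(self._source_columns))


store = ToyStore([(1, "a"), (2, "b"), (3, "c")], ["id", "tag"])
assert store.shape == (3, 2)
assert store._cached_result is None               # metadata did not execute
assert store.shape == (len(store._execute()), 2)  # matches full execution

filtered = store.filter(lambda r: r[0] > 1)
assert filtered.shape == (2, 2)                   # fallback path
assert filtered._cached_result is not None        # execution populated cache
```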
f3b0f37 to 39cf238
Reduces max_server_memory_usage_to_ram_ratio to 0.25 and per-query max_memory_usage to 500MB. Test data is only a few rows; the previous 10GB limit risked OOM-killing the CI runner when multiple test rounds share the same ClickHouse server process.
89943f2 to 23cc817
Summary
- dtypes, columns, shape, size, empty, ndim on pristine (unmodified) DataStore sources read metadata directly from the source without full data loading
- File/remote sources use DESCRIBE, COUNT(*), LIMIT 0 SQL queries instead of loading all data
- explain() refactored to print-only (no return value), with all dependent tests updated

Fixes the issue reported in the chDB Tutorial notebook where ds.columns, ds.shape, ds.size on a remote ClickHouse table triggered full data loading.

Test plan
- 37 tests (test_pristine_metadata_optimization.py)
- Remote ClickHouse tests use the clickhouse_server fixture (local ClickHouse auto-start)
- _cached_result is NOT populated by metadata access