Improve metadata access for pristine DataStore sources #536

Merged
auxten merged 4 commits into main from improve/pristine-metadata-optimization on Mar 10, 2026

Conversation

auxten (Member) commented on Mar 10, 2026

Summary

  • Avoid full data loading when accessing dtypes, columns, shape, size, empty, ndim on pristine (unmodified) DataStore sources
  • For DataFrame sources: reads directly from the in-memory source DataFrame (zero-cost)
  • For file/remote sources: uses DESCRIBE, COUNT(*), LIMIT 0 SQL queries instead of loading all data
  • Once operations are applied (filter, groupby, etc.), falls back to full execution as before
  • Also includes: explain() refactored to print-only (no return value), with all dependent tests updated

Fixes the issue reported in the chDB Tutorial notebook where ds.columns, ds.shape, ds.size on a remote ClickHouse table triggered full data loading.
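The pristine check itself is simple to picture. A minimal sketch of the idea for the DataFrame-source case (hypothetical class and attribute names, not the actual chDB implementation): metadata properties read from the source while no operations are pending, and fall back to full execution otherwise.

```python
import pandas as pd

class DataStore:
    """Sketch of the pristine-metadata pattern (names are illustrative)."""

    def __init__(self, source_df):
        self._source = source_df
        self._ops = []             # pending operations (filter, groupby, ...)
        self._cached_result = None

    @property
    def _pristine(self):
        return not self._ops       # no operations applied yet

    def _execute(self):
        # Stand-in for full query execution; cached on first use.
        if self._cached_result is None:
            self._cached_result = self._source.copy()
        return self._cached_result

    @property
    def columns(self):
        if self._pristine:
            return self._source.columns   # zero-cost: read from the source
        return self._execute().columns    # fallback: full execution

    @property
    def shape(self):
        if self._pristine:
            return self._source.shape
        return self._execute().shape

ds = DataStore(pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}))
print(ds.columns.tolist())        # ['a', 'b']
print(ds.shape)                   # (2, 2)
print(ds._cached_result is None)  # True: metadata access loaded nothing
```

The same structure generalizes to size, empty, and dtypes; the only moving part is which cheap source lookup backs each property.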

Test plan

  • 37 new tests covering DataFrame, file, and remote ClickHouse sources (test_pristine_metadata_optimization.py)
  • Remote tests use clickhouse_server fixture (local ClickHouse auto-start)
  • Verifies metadata matches full execution results
  • Verifies _cached_result is NOT populated by metadata access
  • Full test suite: 9240 passed, 0 failed

auxten added 2 commits March 10, 2026 14:07
explain() now only prints to stdout and returns None instead of a string.
Tests updated to capture stdout via redirect_stdout for assertions.
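A test following this pattern can capture the printed plan with the standard library's redirect_stdout. A minimal sketch with a hypothetical print-only explain():

```python
import io
from contextlib import redirect_stdout

def explain():
    # Hypothetical print-only explain(): writes the plan to stdout and
    # returns None (the old API returned the plan as a string).
    print("ScanSource -> Project[a, b]")

buf = io.StringIO()
with redirect_stdout(buf):
    result = explain()

assert result is None                  # new contract: no return value
assert "ScanSource" in buf.getvalue()  # assertions run on captured stdout
```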
For unmodified DataStores (freshly created from DataFrame, file, or remote
ClickHouse), dtypes/columns/shape/size/empty now read metadata directly
from the source without triggering full data loading:

- DataFrame source: reads from the in-memory source DataFrame directly
- File/remote source: uses DESCRIBE, COUNT(*), LIMIT 0 SQL queries
- ndim: always returns 2 (constant for DataFrame-like objects)

Once operations are applied (filter, groupby, assign, etc.), all
properties fall back to full execution as before.

This fixes the issue where session.table("trips").columns would load
all data from a remote ClickHouse table into memory.
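The file/remote path can be illustrated with the stdlib sqlite3 module standing in for ClickHouse (SQLite has no DESCRIBE, so PRAGMA table_info plays that role here); the point is that each metadata query touches only the schema or a single aggregate, never the row data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# Column names and types without reading rows
# (ClickHouse equivalent: DESCRIBE TABLE trips)
schema = conn.execute("PRAGMA table_info(trips)").fetchall()
dtypes = {row[1]: row[2] for row in schema}

# Row count without materializing the table
nrows = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]

# LIMIT 0 returns an empty result set that still carries column metadata
cur = conn.execute("SELECT * FROM trips LIMIT 0")
columns = [d[0] for d in cur.description]

shape = (nrows, len(columns))
print(dtypes)  # {'id': 'INTEGER', 'fare': 'REAL'}
print(shape)   # (2, 2)
```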
auxten force-pushed the improve/pristine-metadata-optimization branch from 95f77e6 to f3b0f37 on March 10, 2026 06:24
37 tests covering all three source types:
- DataFrame source: columns, dtypes, shape, size, empty, ndim, edge cases
- File source (parquet/csv): metadata via SQL without full data loading
- Remote ClickHouse source: system tables and test_db via clickhouse_server
  fixture, including verification that metadata matches full execution and
  does not populate _cached_result
auxten force-pushed the improve/pristine-metadata-optimization branch from f3b0f37 to 39cf238 on March 10, 2026 06:38
Reduces max_server_memory_usage_to_ram_ratio to 0.25 and per-query
max_memory_usage to 500MB. Test data is only a few rows; the previous
10GB limit risked OOM-killing the CI runner when multiple test rounds
shared the same ClickHouse server process.
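Both names are real ClickHouse settings. A hedged sketch of where such limits could sit in a merged server config (the exact file layout, config.xml versus users.xml, depends on the deployment and on how the test fixture launches the server):

```xml
<clickhouse>
    <!-- Server-wide cap: at most 25% of the host's RAM -->
    <max_server_memory_usage_to_ram_ratio>0.25</max_server_memory_usage_to_ram_ratio>
    <profiles>
        <default>
            <!-- Per-query cap: 500 MB, in bytes -->
            <max_memory_usage>500000000</max_memory_usage>
        </default>
    </profiles>
</clickhouse>
```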
auxten force-pushed the improve/pristine-metadata-optimization branch from 89943f2 to 23cc817 on March 10, 2026 07:20
auxten merged commit db937b1 into main on Mar 10, 2026
6 checks passed