Add DataFrame API, Expr DSL, functions module, and SDK ergonomics#151

Merged
lukekim merged 7 commits into trunk from claude/expand-sdk-functionality-wEBm4 on May 11, 2026

lukekim commented May 10, 2026

Summary

Expands the spicepy SDK from a thin SQL-over-Flight client into a fluent DataFrame API with catalog introspection, an expression DSL, a built-in function library, and richer output ergonomics. Everything is implemented client-side on top of the existing Flight + ADBC FlightSQL transport — no Spice runtime changes required.

What's in

Output ergonomics on Client

  • query_arrow(sql, *, params=None, timeout=None) -> pa.Table
  • query_pandas(...) -> pd.DataFrame
  • query_polars(...) -> pl.DataFrame (optional spicepy[polars] extra)
  • query_pylist(...) -> list[dict]
  • query_pydict(...) -> dict[str, list]
  • query_batches(sql) -> Iterator[pa.RecordBatch] for streaming

All accept an optional params list and route through the existing ADBC parameterized path when set.
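
A minimal sketch of that routing rule, for illustration only. The transport callables, `make_query`, and `rows_to_pydict` are hypothetical stand-ins rather than spicepy internals, and treating "set" as "not None" is an assumption.

```python
from typing import Any, Callable, Optional, Sequence

Rows = list[dict[str, Any]]


def make_query(
    flight_path: Callable[[str], Rows],
    adbc_path: Callable[[str, Sequence[Any]], Rows],
) -> Callable[..., Rows]:
    """Build a query function that picks a transport based on `params`."""

    def query(sql: str, *, params: Optional[Sequence[Any]] = None) -> Rows:
        # No params: plain Flight path. Params given: parameterized ADBC path.
        if params is None:
            return flight_path(sql)
        return adbc_path(sql, params)

    return query


def rows_to_pydict(rows: Rows) -> dict[str, list[Any]]:
    """Pivot row dicts (query_pylist shape) into columns (query_pydict shape)."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Fake transports that only record which path was taken.
calls: list[str] = []


def fake_flight(sql: str) -> Rows:
    calls.append("flight")
    return [{"n": 1}, {"n": 2}]


def fake_adbc(sql: str, params: Sequence[Any]) -> Rows:
    calls.append("adbc")
    return [{"n": params[0]}]


query = make_query(fake_flight, fake_adbc)
```

The point of the design is that every output helper can share one entry point and differ only in the final conversion step.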

Catalog & introspection on Client

  • catalogs(), schemas(catalog=None), tables(schema=None): information_schema queries
  • describe(table) — column metadata
  • get_schema(sql) — Arrow schema of a query via LIMIT 0
  • explain(sql, analyze=False, verbose=False) -> str
  • show(sql, n=20) — pretty-print to stdout
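
The LIMIT 0 technique behind get_schema can be sketched with the standard library's sqlite3 module standing in for the Spice runtime. `schema_sql` and `column_names` are hypothetical names, and sqlite3 only reports column names, whereas an Arrow schema would also carry types.

```python
import sqlite3


def schema_sql(sql: str) -> str:
    """Wrap a query so it returns its columns but no rows."""
    # Strip a trailing semicolon so the query can sit inside a subquery.
    inner = sql.strip().rstrip(";")
    return f"SELECT * FROM ({inner}) AS _q LIMIT 0"


def column_names(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the result column names without fetching any data."""
    cursor = conn.execute(schema_sql(sql))
    return [description[0] for description in cursor.description]


# In-memory database used only to demonstrate the wrapper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.execute("INSERT INTO trips VALUES ('Seoul', 12.5)")
```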

Writers on Client

  • write_parquet(sql, path, **kwargs) — streams Flight batches to a Parquet file
  • write_csv(sql, path, **kwargs) — streams to a CSV file
  • write_json(sql, path) — newline-delimited JSON
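
A rough sketch of the streaming shape these writers share, using newline-delimited JSON as the example. Batches here are plain lists of row dicts, an assumption made to keep the sketch free of pyarrow; `write_ndjson` and `fake_batches` are hypothetical names.

```python
import io
import json
from typing import Any, Iterable, Iterator, TextIO

Batch = list[dict[str, Any]]


def write_ndjson(batches: Iterable[Batch], out: TextIO) -> int:
    """Write one JSON object per line, batch by batch; return the row count."""
    written = 0
    for batch in batches:
        # Only one batch is held in memory at a time.
        for row in batch:
            out.write(json.dumps(row, default=str))
            out.write("\n")
            written += 1
    return written


def fake_batches() -> Iterator[Batch]:
    """Stand-in for a stream of record batches from the server."""
    yield [{"city": "Seoul", "fare": 12.5}]
    yield [{"city": "Lima", "fare": 8.0}, {"city": "Oslo", "fare": None}]


buffer = io.StringIO()
row_count = write_ndjson(fake_batches(), buffer)
```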

DataFrame entry points on Client

  • client.table(name) -> SpiceDataFrame referencing a table
  • client.sql(query) -> SpiceDataFrame wrapping arbitrary SQL
  • client.from_arrow(table), from_pandas(df), from_pydict(data) — small literal tables via inline VALUES
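
One way the inline VALUES path could compile a small column dict to SQL, as a sketch only. `sql_literal` and `values_sql` are hypothetical helpers, the supported types are a deliberately small subset, and the exact SQL that spicepy emits is not shown in this PR description.

```python
from typing import Any


def sql_literal(value: Any) -> str:
    """Render a Python scalar as a SQL literal (illustrative subset of types)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def values_sql(data: dict[str, list[Any]], alias: str = "t") -> str:
    """Compile {column: values} into SELECT * FROM (VALUES ...) AS alias(cols)."""
    names = list(data)
    lengths = {len(column) for column in data.values()}
    if not names or len(lengths) != 1 or 0 in lengths:
        raise ValueError("need at least one column and equal, non-zero lengths")
    # Transpose columns into rows, then render each row as a tuple.
    rows = zip(*(data[name] for name in names))
    tuples = ", ".join(
        "(" + ", ".join(sql_literal(value) for value in row) + ")" for row in rows
    )
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in names)
    quoted_alias = '"' + alias.replace('"', '""') + '"'
    return f"SELECT * FROM (VALUES {tuples}) AS {quoted_alias}({columns})"
```

Because every row becomes SQL text, this approach only suits small literal tables, which matches the scope stated above.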

Expression DSL (spicepy._expr, exported as Expr, col, lit, case)

  • Operator overloads: + - * / % == != < <= > >= & | ~
  • alias, cast(arrow_type|str), is_null, is_not_null, in_, between, asc/desc
  • case().when(pred, val)...otherwise(val)
  • Window qualifier: _Func.over(partition_by=, order_by=)
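
A stripped-down sketch of how operator overloads can build SQL text. The class below reuses the public names Expr, col, lit, and case for readability, but its internals, the `Case` class, and the exact rendered SQL are assumptions and not the code in spicepy._expr.

```python
from __future__ import annotations

from typing import Any


class Expr:
    """A node holding a SQL fragment; operators return larger fragments."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def _binary(self, op: str, other: Any) -> Expr:
        rhs = other if isinstance(other, Expr) else lit(other)
        return Expr(f"({self.sql} {op} {rhs.sql})")

    def __add__(self, other: Any) -> Expr:
        return self._binary("+", other)

    def __gt__(self, other: Any) -> Expr:
        return self._binary(">", other)

    def __eq__(self, other: Any) -> Expr:  # type: ignore[override]
        return self._binary("=", other)

    def __and__(self, other: Any) -> Expr:
        return self._binary("AND", other)

    def __invert__(self) -> Expr:
        return Expr(f"(NOT {self.sql})")

    # Defining __eq__ would otherwise make instances unhashable.
    __hash__ = object.__hash__

    def is_null(self) -> Expr:
        return Expr(f"({self.sql} IS NULL)")

    def alias(self, name: str) -> Expr:
        return Expr(f'{self.sql} AS "{name}"')


def col(name: str) -> Expr:
    return Expr('"' + name.replace('"', '""') + '"')


def lit(value: Any) -> Expr:
    """Render a Python scalar as a SQL literal (small illustrative subset)."""
    if value is None:
        return Expr("NULL")
    if isinstance(value, bool):
        return Expr("TRUE" if value else "FALSE")
    if isinstance(value, (int, float)):
        return Expr(repr(value))
    return Expr("'" + str(value).replace("'", "''") + "'")


class Case:
    """Collects WHEN branches; otherwise() closes the CASE expression."""

    def __init__(self) -> None:
        self._branches: list[tuple[Expr, Expr]] = []

    def when(self, predicate: Expr, value: Any) -> Case:
        result = value if isinstance(value, Expr) else lit(value)
        self._branches.append((predicate, result))
        return self

    def otherwise(self, value: Any) -> Expr:
        default = value if isinstance(value, Expr) else lit(value)
        whens = " ".join(f"WHEN {p.sql} THEN {v.sql}" for p, v in self._branches)
        return Expr(f"CASE {whens} ELSE {default.sql} END")


def case() -> Case:
    return Case()
```

Parenthesizing every binary node avoids having to reason about SQL operator precedence when fragments are nested.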

Function library (spicepy.functions)

  • Aggregates: sum, avg/mean, min, max, count, count_distinct, stddev, variance, median, approx_distinct, array_agg
  • Math: abs, round, ceil, floor, sqrt, power, ln, log, exp
  • Strings: lower, upper, length, trim, concat, substr, replace, regexp_match, starts_with, ends_with
  • Date/time: now, current_date, current_timestamp, date_trunc, date_part, extract
  • Null / control flow: coalesce, nullif, ifnull, case
  • Window: row_number, rank, dense_rank, percent_rank, cume_dist, lag, lead, first_value, last_value, nth_value
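
As an illustration of how such a library can stay thin, the sketch below renders function calls as SQL text and attaches an optional OVER clause. `Func` is a hypothetical stand-in for the `_Func` class named earlier, arguments are assumed to be already-rendered SQL fragments, and rendering count_distinct as `count(DISTINCT ...)` is an assumption.

```python
from __future__ import annotations

from typing import Sequence


class Func:
    """A rendered SQL function call that can take an OVER clause."""

    def __init__(self, name: str, *args: str) -> None:
        self.sql = f"{name}({', '.join(args)})"

    def over(
        self,
        partition_by: Sequence[str] = (),
        order_by: Sequence[str] = (),
    ) -> str:
        clauses = []
        if partition_by:
            clauses.append("PARTITION BY " + ", ".join(partition_by))
        if order_by:
            clauses.append("ORDER BY " + ", ".join(order_by))
        return f"{self.sql} OVER ({' '.join(clauses)})"


# Each library function is a one-line wrapper around Func.
def avg(column: str) -> Func:
    return Func("avg", column)


def count_distinct(column: str) -> Func:
    return Func("count", f"DISTINCT {column}")


def row_number() -> Func:
    return Func("row_number")


def lag(column: str, offset: int = 1) -> Func:
    return Func("lag", column, str(int(offset)))
```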

SpiceDataFrame (spicepy._dataframe)

A lazy SQL-compiling builder. Each method returns a new DataFrame holding a SQL fragment; terminal operations ship the SQL through Client.

  • Projection: select, with_column, with_columns, drop, rename, cast
  • Filter / slice: filter/where, limit(n, offset=), head
  • Order / dedup: sort/order_by, distinct
  • Set ops: union(all=), intersect, except_
  • Joins: join(other, on, how) — inner/left/right/full/semi/anti — accepts string key, list of keys, or Expr; cross_join
  • Aggregation: group_by(*keys).aggregate(*aggs); aggregate(*aggs) for global aggregation
  • Schema/plans: schema(), explain(analyze=, verbose=)
  • Materialization: collect(), to_arrow(), to_pandas(), to_polars(), to_pylist(), to_pydict(), count(), show(n=), to_sql()
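
The lazy pattern described above can be sketched in a few lines. `LazyFrame` and its `run` callable are hypothetical, predicates and columns are plain SQL strings rather than Expr objects, and nesting each step as a subquery is one possible compilation strategy, not necessarily the one SpiceDataFrame uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class LazyFrame:
    """Immutable holder of a SQL string; nothing runs until collect()."""

    sql: str
    run: Callable[[str], Any]  # executes SQL; stands in for the client

    def _derive(self, sql: str) -> LazyFrame:
        # Every transformation returns a new frame and leaves self untouched.
        return LazyFrame(sql, self.run)

    def select(self, *columns: str) -> LazyFrame:
        return self._derive(f"SELECT {', '.join(columns)} FROM ({self.sql}) AS _t")

    def filter(self, predicate: str) -> LazyFrame:
        return self._derive(f"SELECT * FROM ({self.sql}) AS _t WHERE {predicate}")

    def limit(self, n: int, offset: int = 0) -> LazyFrame:
        tail = f" OFFSET {int(offset)}" if offset else ""
        return self._derive(f"SELECT * FROM ({self.sql}) AS _t LIMIT {int(n)}{tail}")

    def to_sql(self) -> str:
        return self.sql

    def collect(self) -> Any:
        # Terminal operation: the only place SQL is actually executed.
        return self.run(self.sql)


# Fake executor that records what was sent.
executed: list[str] = []


def fake_run(sql: str) -> list[Any]:
    executed.append(sql)
    return []


base = LazyFrame("SELECT * FROM trips", fake_run)
top = base.filter("fare > 10").select("city", "fare").limit(5)
```

Building `top` sends nothing to the executor; only `top.collect()` does, and `base` can still be reused for other chains.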

Usage

from spicepy import Client, col, case
from spicepy import functions as F

client = Client(api_key="...")

# Output ergonomics
df = client.query_pandas("SELECT * FROM trips LIMIT 100")

# Catalog introspection
print(client.tables(schema="public"))
print(client.describe("trips"))

# Lazy DataFrame composition
result = (
    client.table("trips")
    .filter(col("fare") > 10)
    .group_by(col("city"))
    .aggregate(
        F.sum(col("fare")).alias("total"),
        F.count_distinct(col("driver_id")).alias("drivers"),
    )
    .sort(col("total").desc())
    .limit(10)
    .to_pandas()
)

Tests

295 unit tests pass (170 new across test_sql, test_expr, test_functions, test_dataframe, plus extensions to test_client). Black, ruff, mypy, bandit all clean.

Not in this PR (follow-up work)

  • Substrait plan submission from the client: runtime work is tracked in spiceai#10761 (Implement FlightSQL CommandStatementSubstraitPlan support). Once that lands, the DataFrame layer can emit Substrait directly without round-tripping through SQL.
  • Client-side table registration with real data — today only inline VALUES for small literals. The runtime already provides the distributed cayenne catalog as a session-scoped catalog backend; wiring client.register_arrow/parquet/pandas/record_batches on top of it is a follow-up SDK PR.
  • User-defined functions: partial runtime support is in flight in spiceai#10675 (Add table user functions and gate HTTP servers); an SDK register_udf surface can land alongside it.
  • Object-store registration from the client — follow-up; needs the runtime to accept dynamic object-store configs on a per-session basis.

Notes

  • New optional extra: spicepy[polars] (added in addition to the existing params extra).
  • Bandit's B608 (hardcoded SQL expressions) is now skipped globally — composing SQL with quote_ident/quote_literal-escaped inputs is this package's job.
  • A wave of github-code-quality[bot] review threads on this PR complain that Expr subclasses don't override __eq__. The rule is misapplied: Expr.__eq__ is intentionally the DSL builder (returns an Expr representing SQL equality, not a Python bool). Implementing the bot's suggestion would silently break df.filter(col("x") == 5). The class docstring on Expr now documents this; the threads can be resolved as won't-fix.
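
For the B608 note above, this is roughly what escape helpers of this kind do. The function names match the ones mentioned in the note, but the bodies and signatures below are assumptions based on standard SQL quoting rules, not the code in spicepy._sql; `select_where_equals` is purely illustrative.

```python
def quote_ident(name: str) -> str:
    """Quote an identifier: wrap in double quotes and double any embedded ones."""
    if not name or "\x00" in name:
        raise ValueError("identifier must be non-empty and contain no NUL bytes")
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a string literal: wrap in single quotes and double any embedded ones."""
    return "'" + value.replace("'", "''") + "'"


def select_where_equals(table: str, column: str, value: str) -> str:
    """Compose SQL from untrusted parts, escaping each one before interpolation."""
    return (
        f"SELECT * FROM {quote_ident(table)} "
        f"WHERE {quote_ident(column)} = {quote_literal(value)}"
    )
```

With escaping applied at every interpolation point, a value such as `x'; DROP TABLE t; --` stays inside the string literal instead of terminating it.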

claude added 2 commits May 9, 2026 00:11
Adds one-shot output ergonomics so callers don't have to reach for
.read_all().to_pandas() on the underlying Flight or ADBC reader. Each
helper accepts an optional `params` list that routes through the existing
ADBC parameterized path; otherwise it goes through Flight as before.

polars is gated behind an optional extra (`spicepy[polars]`) and raises
a clear ImportError if missing.
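
A sketch of the optional-extra gating pattern, using only importlib. `require_extra` is a hypothetical helper and the error wording is an assumption; the commit only states that a clear ImportError is raised.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str, package: str = "spicepy") -> ModuleType:
    """Import an optional dependency or raise an ImportError with install advice."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that names the extra to install.
        raise ImportError(
            f"{module_name} is required for this method; "
            f"install it with: pip install '{package}[{extra}]'"
        ) from exc
```

Importing lazily inside the method keeps the base install free of the optional dependency.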
lukekim self-assigned this May 10, 2026
lukekim added the enhancement label May 10, 2026
claude added 2 commits May 11, 2026 05:22
…g helpers

Adds a credible subset of the datafusion-python surface on top of the
existing Flight + ADBC client.

Tier 1 (Client method additions):
- catalog introspection: catalogs(), schemas(), tables(), describe(),
  get_schema()
- explain(sql, analyze=, verbose=) wrapping EXPLAIN
- streaming/output: query_pydict(), query_batches() iterator, show()
- writers: write_parquet(), write_csv(), write_json() streaming Flight
  output to local files
- DataFrame entry points: table(), sql(), from_arrow(), from_pandas(),
  from_pydict()

Tier 2 (new modules):
- spicepy._sql: identifier and literal escape helpers
- spicepy._expr: Expr DSL with arithmetic/comparison/logical operator
  overloads, alias, cast, is_null, in_, between, asc/desc, CASE WHEN,
  window OVER(); col(), lit(), case() public builders
- spicepy.functions: aggregates (sum/avg/min/max/count/count_distinct/
  stddev/variance/median/...), math (abs/round/ceil/floor/sqrt/power/
  ln/log/exp), strings (lower/upper/length/trim/concat/substr/replace/
  regexp_match/starts_with/ends_with), date/time (now/current_date/
  date_trunc/date_part/extract), null/control flow (coalesce/nullif/
  ifnull/case), window-only (row_number/rank/dense_rank/percent_rank/
  cume_dist/lag/lead/first_value/last_value/nth_value)
- spicepy._dataframe.SpiceDataFrame: lazy SQL-compiling builder with
  select/with_column(s)/drop/rename/cast, filter/where/limit/head/
  offset, sort/order_by/distinct, union/intersect/except_,
  join (inner/left/right/full/semi/anti/cross) with key list or Expr,
  group_by().aggregate(), aggregate() (global), schema/explain,
  collect/to_arrow/to_pandas/to_polars/to_pylist/to_pydict/count/show
- inline VALUES path for small client-side data via from_arrow/
  from_pandas/from_pydict

239 new tests (test_sql, test_expr, test_functions, test_dataframe,
extensions to test_client).
Comment thread spicepy/_expr.py Fixed
Adds a note to Expr's docstring documenting that the comparison operators
build SQL expression trees (DSL pattern, same as SQLAlchemy, pandas,
polars, Ibis, datafusion-python) and that subclass __eq__ overrides
would silently break filtering/joins. Addresses a wave of code-quality
bot reviews that misapply a value-semantics __eq__ rule to a DSL.
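
The difference between the two `__eq__` styles can be shown with a toy pair of classes. `ValueColumn`, `DslColumn`, and `where_clause` are hypothetical and only handle integer operands; they are not the classes in spicepy._expr.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ValueColumn:
    """Value semantics: == compares objects and returns a bool."""

    name: str


class DslColumn:
    """DSL semantics: == returns a new node describing SQL equality."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def __eq__(self, other: object) -> DslColumn:  # type: ignore[override]
        # Only integer operands are handled in this sketch.
        return DslColumn(f"({self.sql} = {other!r})")

    # Keep instances hashable even though __eq__ is overridden.
    __hash__ = object.__hash__


def where_clause(predicate: object) -> str:
    """Render a WHERE clause; only DSL nodes carry SQL."""
    if isinstance(predicate, DslColumn):
        return f"WHERE {predicate.sql}"
    raise TypeError(f"expected an expression, got {type(predicate).__name__}")
```

With value semantics, `ValueColumn("x") == 5` collapses to `False` before any filter sees it, so the predicate is lost; with DSL semantics the comparison survives as SQL.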
Comment thread spicepy/_expr.py
lukekim changed the title from "Add query helper methods for Arrow, pandas, polars, and pylist" to "Add DataFrame API, Expr DSL, functions module, and SDK ergonomics" May 11, 2026
lukekim added this to the v4.0.0 milestone May 11, 2026
lukekim merged commit 3ba272a into trunk May 11, 2026
66 checks passed
lukekim deleted the claude/expand-sdk-functionality-wEBm4 branch May 11, 2026 22:58