Add DataFrame API, Expr DSL, functions module, and SDK ergonomics by lukekim · Pull Request #151 · spiceai/spicepy

lukekim · 2026-05-10T01:03:16Z

Summary

Expands the spicepy SDK from a thin SQL-over-Flight client into a fluent DataFrame API with catalog introspection, an expression DSL, a built-in function library, and richer output ergonomics. Everything is implemented client-side on top of the existing Flight + ADBC FlightSQL transport — no Spice runtime changes required.

What's in

Output ergonomics on `Client`

query_arrow(sql, *, params=None, timeout=None) -> pa.Table
query_pandas(...) -> pd.DataFrame
query_polars(...) -> pl.DataFrame (optional spicepy[polars] extra)
query_pylist(...) -> list[dict]
query_pydict(...) -> dict[str, list]
query_batches(sql) -> Iterator[pa.RecordBatch] for streaming

All accept an optional params list and route through the existing ADBC parameterized path when set.

Catalog & introspection on `Client`

catalogs(), schemas(catalog=None), tables(schema=None) — information_schema queries
describe(table) — column metadata
get_schema(sql) — Arrow schema of a query via LIMIT 0
explain(sql, analyze=False, verbose=False) -> str
show(sql, n=20) — pretty-print to stdout
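
The `get_schema(sql)` entry above describes discovering a query's schema through `LIMIT 0`. A minimal sketch of that idea follows, with the standard-library `sqlite3` module standing in for the Spice runtime. The `probe_columns` helper and the `trips` table contents are hypothetical, and the sketch returns column names rather than the Arrow schema the PR describes.

```python
import sqlite3


def probe_columns(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the column names of `sql` without fetching any rows.

    Hypothetical helper: the query is wrapped in a subquery with LIMIT 0,
    so the engine plans it but produces an empty result set.
    """
    cursor = conn.execute(f"SELECT * FROM ({sql}) AS q LIMIT 0")
    assert cursor.fetchall() == []  # LIMIT 0 yields no rows
    return [column[0] for column in cursor.description]


# Made-up table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL, driver_id INTEGER)")
conn.execute("INSERT INTO trips VALUES ('Seoul', 12.5, 1)")

columns = probe_columns(conn, "SELECT city, fare * 2 AS double_fare FROM trips")
print(columns)
```

Wrapping the query instead of appending `LIMIT 0` to it keeps the probe valid even when the inner SQL already ends in its own `ORDER BY` or `LIMIT` clause.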

Writers on `Client`

write_parquet(sql, path, **kwargs) — streams Flight batches to a Parquet file
write_csv(sql, path, **kwargs) — streams to a CSV file
write_json(sql, path) — newline-delimited JSON
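
The writers above stream batches to a file instead of materializing the full result first. The newline-delimited JSON case can be sketched with the standard library alone. Here `write_ndjson` and `fake_batches` are hypothetical names, and plain lists of dicts stand in for Arrow record batches, so this shows the streaming pattern rather than spicepy's implementation.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def write_ndjson(batches: Iterable[list[dict]], out: TextIO) -> int:
    """Write each row as one JSON object per line and return the row count."""
    rows_written = 0
    for batch in batches:  # only one batch is held in memory at a time
        for row in batch:
            out.write(json.dumps(row) + "\n")
            rows_written += 1
    return rows_written


def fake_batches() -> Iterator[list[dict]]:
    """Stand-in for a stream of record batches (made-up data)."""
    yield [{"city": "Seoul", "fare": 12.5}, {"city": "Busan", "fare": 8.0}]
    yield [{"city": "Seoul", "fare": 30.0}]


buffer = io.StringIO()  # an open file object would work the same way
row_count = write_ndjson(fake_batches(), buffer)
lines = buffer.getvalue().splitlines()
```

Because every line is a complete JSON document, a reader can process the output line by line without parsing the whole file.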

DataFrame entry points on `Client`

client.table(name) → SpiceDataFrame referencing a table
client.sql(query) → SpiceDataFrame wrapping arbitrary SQL
client.from_arrow(table), from_pandas(df), from_pydict(data) — small literal tables via inline VALUES
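
The `from_*` entry points above turn small client-side data into an inline `VALUES` list. A rough sketch of how column-oriented data could be rendered that way is shown below. `render_value` and `values_query` are hypothetical helpers that handle only a few scalar types, and the exact SQL spicepy emits may differ.

```python
def render_value(value: object) -> str:
    """Render a Python scalar as a SQL literal (covers only a few types)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def values_query(data: dict[str, list]) -> str:
    """Build a SELECT over an inline VALUES list from column-oriented data."""
    if not data or len({len(column) for column in data.values()}) != 1:
        raise ValueError("need at least one column, all of equal length")
    rows = list(zip(*data.values()))  # transpose columns into row tuples
    if not rows:
        raise ValueError("need at least one row")
    tuples = ", ".join(
        "(" + ", ".join(render_value(v) for v in row) + ")" for row in rows
    )
    names = ", ".join('"' + name.replace('"', '""') + '"' for name in data)
    return f"SELECT * FROM (VALUES {tuples}) AS t({names})"


sql = values_query({"city": ["Seoul", "O'Hare"], "fare": [12.5, None]})
print(sql)
```

Since every value is embedded in the SQL text, the approach only suits small literal tables, which matches the scope stated above.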

Expression DSL (`spicepy._expr`, exported as `Expr`, `col`, `lit`, `case`)

Operator overloads: + - * / % == != < <= > >= & | ~
alias, cast(arrow_type|str), is_null, is_not_null, in_, between, asc/desc
case().when(pred, val)...otherwise(val)
Window qualifier: _Func.over(partition_by=, order_by=)
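
The operator overloads listed above can be illustrated with a toy expression class that compiles to SQL text. `MiniExpr` and its helpers are hypothetical and cover only a handful of operators. This is a sketch of the general technique, not the code in `spicepy._expr`.

```python
from __future__ import annotations


def _to_expr(value: object) -> MiniExpr:
    """Wrap plain Python values as SQL literals; pass expressions through."""
    if isinstance(value, MiniExpr):
        return value
    if isinstance(value, str):
        return MiniExpr("'" + value.replace("'", "''") + "'")
    return MiniExpr(repr(value))


class MiniExpr:
    """Toy expression node holding a SQL fragment."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def _binary(self, op: str, other: object) -> MiniExpr:
        # Parenthesize every binary node so SQL precedence never surprises.
        return MiniExpr(f"({self.sql} {op} {_to_expr(other).sql})")

    def __add__(self, other: object) -> MiniExpr:
        return self._binary("+", other)

    def __gt__(self, other: object) -> MiniExpr:
        return self._binary(">", other)

    def __eq__(self, other: object) -> MiniExpr:  # type: ignore[override]
        return self._binary("=", other)

    def __and__(self, other: object) -> MiniExpr:
        return self._binary("AND", other)

    def __invert__(self) -> MiniExpr:
        return MiniExpr(f"(NOT {self.sql})")

    def alias(self, name: str) -> MiniExpr:
        return MiniExpr(f'{self.sql} AS "{name}"')


def col(name: str) -> MiniExpr:
    """Reference a column by its quoted name."""
    return MiniExpr('"' + name.replace('"', '""') + '"')


predicate = (col("fare") > 10) & ~(col("city") == "Seoul")
total = (col("fare") + col("tip")).alias("total")
print(predicate.sql)
print(total.sql)
```

Python's `&`, `|`, and `~` bind more tightly than comparison operators, so each comparison needs its own parentheses when predicates are combined this way.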

Function library (`spicepy.functions`)

Aggregates: sum, avg/mean, min, max, count, count_distinct, stddev, variance, median, approx_distinct, array_agg
Math: abs, round, ceil, floor, sqrt, power, ln, log, exp
Strings: lower, upper, length, trim, concat, substr, replace, regexp_match, starts_with, ends_with
Date/time: now, current_date, current_timestamp, date_trunc, date_part, extract
Null / control flow: coalesce, nullif, ifnull, case
Window: row_number, rank, dense_rank, percent_rank, cume_dist, lag, lead, first_value, last_value, nth_value
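
Window functions in the list above pair a function call with an `OVER` clause, which is what the `over(partition_by=, order_by=)` qualifier expresses. A small sketch of how such a call could render to SQL follows. `FuncCall` is a hypothetical stand-in that takes pre-rendered SQL strings as arguments, unlike the real module, which works with `Expr` objects.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FuncCall:
    """Hypothetical function-call node: a name plus already-rendered arguments."""

    name: str
    args: tuple[str, ...] = ()

    def to_sql(self) -> str:
        return f"{self.name}({', '.join(self.args)})"

    def over(
        self,
        partition_by: Sequence[str] = (),
        order_by: Sequence[str] = (),
    ) -> str:
        """Attach an OVER clause built from the optional window parts."""
        clauses = []
        if partition_by:
            clauses.append("PARTITION BY " + ", ".join(partition_by))
        if order_by:
            clauses.append("ORDER BY " + ", ".join(order_by))
        return f"{self.to_sql()} OVER ({' '.join(clauses)})"


ranked = FuncCall("row_number").over(
    partition_by=['"city"'], order_by=['"fare" DESC']
)
lagged = FuncCall("lag", ('"fare"', "1")).over(order_by=['"ts"'])
print(ranked)
print(lagged)
```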

`SpiceDataFrame` (`spicepy._dataframe`)

A lazy SQL-compiling builder. Each method returns a new DataFrame holding a SQL fragment; terminal operations ship the SQL through Client.

Projection: select, with_column, with_columns, drop, rename, cast
Filter / slice: filter/where, limit(n, offset=), head
Order / dedup: sort/order_by, distinct
Set ops: union(all=), intersect, except_
Joins: join(other, on, how) — inner/left/right/full/semi/anti — accepts string key, list of keys, or Expr; cross_join
Aggregation: group_by(*keys).aggregate(*aggs); aggregate(*aggs) for global aggregation
Schema/plans: schema(), explain(analyze=, verbose=)
Materialization: collect(), to_arrow(), to_pandas(), to_polars(), to_pylist(), to_pydict(), count(), show(n=), to_sql()
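
The lazy, SQL-compiling design described above can be sketched in a few lines: each method returns a new immutable frame that wraps the previous SQL as a subquery, and nothing runs until a terminal call. `LazyFrame` is a hypothetical illustration that takes raw SQL strings and executes against an in-memory `sqlite3` database. `SpiceDataFrame` itself accepts `Expr` objects and ships SQL through `Client`.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Immutable holder of a SQL string; methods only compose new SQL."""

    sql: str

    @classmethod
    def table(cls, name: str) -> LazyFrame:
        return cls(f"SELECT * FROM {name}")

    def filter(self, predicate: str) -> LazyFrame:
        return LazyFrame(f"SELECT * FROM ({self.sql}) AS t WHERE {predicate}")

    def select(self, *columns: str) -> LazyFrame:
        return LazyFrame(f"SELECT {', '.join(columns)} FROM ({self.sql}) AS t")

    def sort(self, *keys: str) -> LazyFrame:
        return LazyFrame(f"SELECT * FROM ({self.sql}) AS t ORDER BY {', '.join(keys)}")

    def collect(self, conn: sqlite3.Connection) -> list[tuple]:
        """Terminal operation: the only place a query is actually executed."""
        return conn.execute(self.sql).fetchall()


# Made-up data for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Seoul", 12.5), ("Busan", 8.0), ("Seoul", 30.0)],
)

base = LazyFrame.table("trips")
query = base.filter("fare > 10").select("city", "fare").sort("fare DESC")
rows = query.collect(conn)
print(query.sql)
print(rows)
```

Returning a new frame from every method means `base` can be reused for several independent queries without one chain affecting another.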

Usage

from spicepy import Client, col, case
from spicepy import functions as F

client = Client(api_key="...")

# Output ergonomics
df = client.query_pandas("SELECT * FROM trips LIMIT 100")

# Catalog introspection
print(client.tables(schema="public"))
print(client.describe("trips"))

# Lazy DataFrame composition
result = (
    client.table("trips")
    .filter(col("fare") > 10)
    .group_by(col("city"))
    .aggregate(
        F.sum(col("fare")).alias("total"),
        F.count_distinct(col("driver_id")).alias("drivers"),
    )
    .sort(col("total").desc())
    .limit(10)
    .to_pandas()
)

Tests

295 unit tests pass (170 new across test_sql, test_expr, test_functions, test_dataframe, plus extensions to test_client). Black, ruff, mypy, bandit all clean.

Not in this PR (follow-up work)

Substrait plan submission from the client: runtime work is tracked in "Implement FlightSQL CommandStatementSubstraitPlan support" (spiceai#10761). Once that lands, the DataFrame layer can emit Substrait directly without round-tripping through SQL.
Client-side table registration with real data: today only inline VALUES for small literals is supported. The runtime already provides the distributed cayenne catalog as a session-scoped catalog backend; wiring client.register_arrow/parquet/pandas/record_batches on top of it is a follow-up SDK PR.
User-defined functions: partial runtime support is in flight in "Add table user functions and gate HTTP servers" (spiceai#10675); an SDK register_udf surface can land alongside it.
Object-store registration from the client: a follow-up that needs the runtime to accept dynamic object-store configs on a per-session basis.

Notes

New optional extra: spicepy[polars] (added in addition to the existing params extra).
Bandit's B608 (hardcoded SQL expressions) is now skipped globally — composing SQL with quote_ident/quote_literal-escaped inputs is this package's job.
A wave of github-code-quality[bot] review threads on this PR complain that Expr subclasses don't override __eq__. The rule is misapplied: Expr.__eq__ is intentionally the DSL builder (returns an Expr representing SQL equality, not a Python bool). Implementing the bot's suggestion would silently break df.filter(col("x") == 5). The class docstring on Expr now documents this; the threads can be resolved as won't-fix.
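
The B608 note above rests on identifiers and literals being escaped before they are spliced into SQL. The sketch below shows the standard doubling rules and checks them against an in-memory `sqlite3` database. The function names mirror the `quote_ident`/`quote_literal` helpers mentioned in that note, but the bodies here are assumptions, not the code in `spicepy._sql`.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Single-quote a string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


hostile = "x'; DROP TABLE trips; --"  # classic injection attempt
table = quote_ident("trip log")  # identifier containing a space
column = quote_ident('rider "nick"')  # identifier containing double quotes

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {table} ({column} TEXT)")
conn.execute(f"INSERT INTO {table} VALUES ({quote_literal(hostile)})")

# The hostile text comes back unchanged: it was stored as data, not run as SQL.
stored = conn.execute(f"SELECT {column} FROM {table}").fetchone()[0]
print(stored == hostile)
```

Where the transport supports it, bound parameters remain the simpler option for values. Escaping is what is left for identifiers and for SQL that has to be composed as text.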

Adds one-shot output ergonomics so callers don't have to reach for .read_all().to_pandas() on the underlying Flight or ADBC reader. Each helper accepts an optional `params` list that routes the query through the existing ADBC parameterized path; without it, the query goes through Flight as before. polars is gated behind an optional extra (`spicepy[polars]`), and the helper raises a clear ImportError if it is missing.

…g helpers

Adds a credible subset of the datafusion-python surface on top of the existing Flight + ADBC client.

Tier 1 (Client method additions):
- catalog introspection: catalogs(), schemas(), tables(), describe(), get_schema()
- explain(sql, analyze=, verbose=) wrapping EXPLAIN
- streaming/output: query_pydict(), query_batches() iterator, show()
- writers: write_parquet(), write_csv(), write_json() streaming Flight output to local files
- DataFrame entry points: table(), sql(), from_arrow(), from_pandas(), from_pydict()

Tier 2 (new modules):
- spicepy._sql: identifier and literal escape helpers
- spicepy._expr: Expr DSL with arithmetic/comparison/logical operator overloads, alias, cast, is_null, in_, between, asc/desc, CASE WHEN, window OVER(); col(), lit(), case() public builders
- spicepy.functions: aggregates (sum/avg/min/max/count/count_distinct/stddev/variance/median/...), math (abs/round/ceil/floor/sqrt/power/ln/log/exp), strings (lower/upper/length/trim/concat/substr/replace/regexp_match/starts_with/ends_with), date/time (now/current_date/date_trunc/date_part/extract), null/control flow (coalesce/nullif/ifnull/case), window-only (row_number/rank/dense_rank/percent_rank/cume_dist/lag/lead/first_value/last_value/nth_value)
- spicepy._dataframe.SpiceDataFrame: lazy SQL-compiling builder with select/with_column(s)/drop/rename/cast, filter/where/limit/head/offset, sort/order_by/distinct, union/intersect/except_, join (inner/left/right/full/semi/anti/cross) with key list or Expr, group_by().aggregate(), aggregate() (global), schema/explain, collect/to_arrow/to_pandas/to_polars/to_pylist/to_pydict/count/show
- inline VALUES path for small client-side data via from_arrow/from_pandas/from_pydict

239 new tests (test_sql, test_expr, test_functions, test_dataframe, extensions to test_client).

Adds a note to Expr's docstring documenting that the comparison operators build SQL expression trees (DSL pattern, same as SQLAlchemy, pandas, polars, Ibis, datafusion-python) and that subclass __eq__ overrides would silently break filtering/joins. Addresses a wave of code-quality bot reviews that misapply a value-semantics __eq__ rule to a DSL.
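
The difference between the two `__eq__` styles discussed above can be made concrete with a pair of toy classes. `ValueColumn`, `DslColumn`, and `Predicate` are hypothetical names for illustration only. The point is that a value-semantics `__eq__` evaluates the comparison immediately in Python, so no SQL condition survives to reach a filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """A SQL comparison produced by the DSL-style column below."""

    sql: str


class ValueColumn:
    """Column with conventional value-semantics equality."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ValueColumn) and self.name == other.name

    def __hash__(self) -> int:
        return hash(self.name)


class DslColumn:
    """Column whose == builds a SQL predicate instead of comparing objects."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> Predicate:  # type: ignore[override]
        return Predicate(f'"{self.name}" = {other!r}')


value_result = ValueColumn("x") == 5  # plain bool: the filter condition is lost
dsl_result = DslColumn("x") == 5  # Predicate carrying the SQL text
print(value_result, dsl_result.sql)
```

One side effect worth knowing: a class that defines `__eq__` without `__hash__` becomes unhashable in Python, so DSL-style nodes generally cannot be used as dict keys or set members unless `__hash__` is defined explicitly.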


claude added 2 commits May 9, 2026 00:11

style: apply black formatting to _client.py

7477695

lukekim self-assigned this May 10, 2026

lukekim added the enhancement New feature or request label May 10, 2026

claude added 2 commits May 11, 2026 05:22

chore(bandit): skip B608 — escaped SQL composition is the SDK's job

c75fc62

github-code-quality Bot found potential problems May 11, 2026


lukekim changed the title ~~Add query helper methods for Arrow, pandas, polars, and pylist~~ Add DataFrame API, Expr DSL, functions module, and SDK ergonomics May 11, 2026

phillipleblanc approved these changes May 11, 2026


lukekim added this to the v4.0.0 milestone May 11, 2026

chore(deps): fold dependency updates into feature PR

ceb97d6

This was referenced May 11, 2026

chore(deps): bump the actions group across 1 directory with 2 updates #148

Closed

chore(deps): bump the uv group across 1 directory with 2 updates #149

Closed

chore(deps): bump the python-dependencies group with 8 updates #150

Closed

feat: add GITHUB_TOKEN environment variable for Spice installation in…

2fdf2db

… WSL on Windows

lukekim merged commit 3ba272a into trunk May 11, 2026
66 checks passed

lukekim deleted the claude/expand-sdk-functionality-wEBm4 branch May 11, 2026 22:58

lukekim merged 7 commits into trunk from claude/expand-sdk-functionality-wEBm4


3 participants
