Add DataFrame API, Expr DSL, functions module, and SDK ergonomics #151
Merged
Adds one-shot output ergonomics so callers don't have to reach for .read_all().to_pandas() on the underlying Flight or ADBC reader. Each helper accepts an optional `params` list that routes through the existing ADBC parameterized path, otherwise it goes through Flight as before. polars is gated behind an optional extra (`spicepy[polars]`) and raises a clear ImportError if missing.
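The two behaviors described in this commit (transport selection based on `params`, and the optional polars dependency) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `pick_transport` and `require_extra` are hypothetical names, and treating "params set" as `params is not None` is an assumption.

```python
import importlib
from types import ModuleType
from typing import Any, List, Optional


def pick_transport(params: Optional[List[Any]]) -> str:
    """Hypothetical helper: choose the transport for a one-shot query.

    Assumption: any non-None params list counts as "set" and selects the
    ADBC parameterized path; everything else stays on Flight.
    """
    return "adbc" if params is not None else "flight"


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Hypothetical helper: import an optional dependency or fail clearly."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable hint that names the extra to install.
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install 'spicepy[{extra}]'"
        ) from exc
```

A `query_polars`-style helper would call something like `require_extra("polars", "polars")` only at the point of conversion, so importing the package itself never requires polars.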
…g helpers

Adds a credible subset of the datafusion-python surface on top of the existing Flight + ADBC client.

Tier 1 (Client method additions):
- catalog introspection: catalogs(), schemas(), tables(), describe(), get_schema()
- explain(sql, analyze=, verbose=) wrapping EXPLAIN
- streaming/output: query_pydict(), query_batches() iterator, show()
- writers: write_parquet(), write_csv(), write_json() streaming Flight output to local files
- DataFrame entry points: table(), sql(), from_arrow(), from_pandas(), from_pydict()

Tier 2 (new modules):
- spicepy._sql: identifier and literal escape helpers
- spicepy._expr: Expr DSL with arithmetic/comparison/logical operator overloads, alias, cast, is_null, in_, between, asc/desc, CASE WHEN, window OVER(); col(), lit(), case() public builders
- spicepy.functions: aggregates (sum/avg/min/max/count/count_distinct/stddev/variance/median/...), math (abs/round/ceil/floor/sqrt/power/ln/log/exp), strings (lower/upper/length/trim/concat/substr/replace/regexp_match/starts_with/ends_with), date/time (now/current_date/date_trunc/date_part/extract), null/control flow (coalesce/nullif/ifnull/case), window-only (row_number/rank/dense_rank/percent_rank/cume_dist/lag/lead/first_value/last_value/nth_value)
- spicepy._dataframe.SpiceDataFrame: lazy SQL-compiling builder with select/with_column(s)/drop/rename/cast, filter/where/limit/head/offset, sort/order_by/distinct, union/intersect/except_, join (inner/left/right/full/semi/anti/cross) with key list or Expr, group_by().aggregate(), aggregate() (global), schema/explain, collect/to_arrow/to_pandas/to_polars/to_pylist/to_pydict/count/show
- inline VALUES path for small client-side data via from_arrow/from_pandas/from_pydict

239 new tests (test_sql, test_expr, test_functions, test_dataframe, extensions to test_client).
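Two of the Tier 1 helpers are thin SQL wrappers: `explain()` wraps `EXPLAIN`, and the PR summary says `get_schema()` obtains a query's schema via `LIMIT 0`. A sketch of how such statements might be composed is below. The function names, the `_probe` alias, and the trailing-semicolon handling are assumptions for illustration, not the PR's implementation.

```python
def build_explain(sql: str, analyze: bool = False, verbose: bool = False) -> str:
    """Hypothetical: prefix a query with EXPLAIN [ANALYZE] [VERBOSE]."""
    parts = ["EXPLAIN"]
    if analyze:
        parts.append("ANALYZE")
    if verbose:
        parts.append("VERBOSE")
    # Drop surrounding whitespace and one trailing semicolon, if present.
    parts.append(sql.strip().rstrip(";").rstrip())
    return " ".join(parts)


def build_schema_probe(sql: str) -> str:
    """Hypothetical: wrap a query so it returns zero rows but keeps its schema."""
    inner = sql.strip().rstrip(";").rstrip()
    return f"SELECT * FROM ({inner}) AS _probe LIMIT 0"


print(build_explain("SELECT 1;", analyze=True))
# EXPLAIN ANALYZE SELECT 1
print(build_schema_probe("SELECT a FROM t"))
# SELECT * FROM (SELECT a FROM t) AS _probe LIMIT 0
```

Wrapping the query in a subquery, rather than appending `LIMIT 0` to the text, avoids clashing with a `LIMIT` or `ORDER BY` already present in the caller's SQL.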
Adds a note to Expr's docstring documenting that the comparison operators build SQL expression trees (DSL pattern, same as SQLAlchemy, pandas, polars, Ibis, datafusion-python) and that subclass __eq__ overrides would silently break filtering/joins. Addresses a wave of code-quality bot reviews that misapply a value-semantics __eq__ rule to a DSL.
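The DSL pattern this commit documents can be illustrated with a toy expression class. The sketch below is not spicepy's `Expr`; the class layout, `ValueEqExpr`, and the simplified `col`/`lit` builders are assumptions made only to show why `__eq__` must return an expression rather than a `bool`.

```python
from __future__ import annotations

from typing import Any


class Expr:
    """Toy SQL expression node; a stand-in, not spicepy's real Expr."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def _binary(self, op: str, other: Any) -> Expr:
        # Plain Python values on the right-hand side become SQL literals.
        rhs = other if isinstance(other, Expr) else lit(other)
        return Expr(f"({self.sql} {op} {rhs.sql})")

    # Comparison operators build SQL text instead of returning a bool.
    def __eq__(self, other: Any) -> Expr:  # type: ignore[override]
        return self._binary("=", other)

    def __gt__(self, other: Any) -> Expr:
        return self._binary(">", other)

    def __and__(self, other: Any) -> Expr:
        return self._binary("AND", other)

    # Defining __eq__ would otherwise set __hash__ to None.
    __hash__ = object.__hash__


class ValueEqExpr(Expr):
    """What a value-semantics __eq__ override would look like."""

    def __eq__(self, other: Any) -> bool:  # type: ignore[override]
        return isinstance(other, Expr) and self.sql == other.sql


def col(name: str) -> Expr:
    return Expr('"' + name.replace('"', '""') + '"')


def lit(value: Any) -> Expr:
    if value is None:
        return Expr("NULL")
    if isinstance(value, bool):
        return Expr("TRUE" if value else "FALSE")
    if isinstance(value, str):
        return Expr("'" + value.replace("'", "''") + "'")
    return Expr(str(value))


predicate = (col("x") == 5) & (col("y") > 2)
print(predicate.sql)  # (("x" = 5) AND ("y" > 2))
```

With the override in `ValueEqExpr`, `ValueEqExpr('"x"') == 5` evaluates to `False` immediately, so a filter built from it would receive a Python boolean instead of a SQL predicate, which is the silent breakage the docstring warns about.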
phillipleblanc
approved these changes
May 11, 2026
This was referenced May 11, 2026
Summary
Expands the spicepy SDK from a thin SQL-over-Flight client into a fluent DataFrame API with catalog introspection, an expression DSL, a built-in function library, and richer output ergonomics. Everything is implemented client-side on top of the existing Flight + ADBC FlightSQL transport — no Spice runtime changes required.
What's in
Output ergonomics on `Client`
- `query_arrow(sql, *, params=None, timeout=None) -> pa.Table`
- `query_pandas(...) -> pd.DataFrame`
- `query_polars(...) -> pl.DataFrame` (optional `spicepy[polars]` extra)
- `query_pylist(...) -> list[dict]`
- `query_pydict(...) -> dict[str, list]`
- `query_batches(sql) -> Iterator[pa.RecordBatch]` for streaming

All accept an optional `params` list and route through the existing ADBC parameterized path when set.

Catalog & introspection on `Client`
- `catalogs()`, `schemas(catalog=None)`, `tables(schema=None)`: `information_schema` queries
- `describe(table)`: column metadata
- `get_schema(sql)`: Arrow schema of a query via `LIMIT 0`
- `explain(sql, analyze=False, verbose=False) -> str`
- `show(sql, n=20)`: pretty-print to stdout

Writers on `Client`
- `write_parquet(sql, path, **kwargs)`: streams Flight batches to a Parquet file
- `write_csv(sql, path, **kwargs)`: streams to a CSV file
- `write_json(sql, path)`: newline-delimited JSON

DataFrame entry points on `Client`
- `client.table(name)` → `SpiceDataFrame` referencing a table
- `client.sql(query)` → `SpiceDataFrame` wrapping arbitrary SQL
- `client.from_arrow(table)`, `from_pandas(df)`, `from_pydict(data)`: small literal tables via inline `VALUES`

Expression DSL (`spicepy._expr`, exported as `Expr`, `col`, `lit`, `case`)
- `+ - * / % == != < <= > >= & | ~`
- `alias`, `cast(arrow_type|str)`, `is_null`, `is_not_null`, `in_`, `between`, `asc`/`desc`
- `case().when(pred, val)...otherwise(val)`
- `_Func.over(partition_by=, order_by=)`

Function library (`spicepy.functions`)
- `sum`, `avg`/`mean`, `min`, `max`, `count`, `count_distinct`, `stddev`, `variance`, `median`, `approx_distinct`, `array_agg`
- `abs`, `round`, `ceil`, `floor`, `sqrt`, `power`, `ln`, `log`, `exp`
- `lower`, `upper`, `length`, `trim`, `concat`, `substr`, `replace`, `regexp_match`, `starts_with`, `ends_with`
- `now`, `current_date`, `current_timestamp`, `date_trunc`, `date_part`, `extract`
- `coalesce`, `nullif`, `ifnull`, `case`
- `row_number`, `rank`, `dense_rank`, `percent_rank`, `cume_dist`, `lag`, `lead`, `first_value`, `last_value`, `nth_value`

`SpiceDataFrame` (`spicepy._dataframe`)

A lazy SQL-compiling builder. Each method returns a new DataFrame holding a SQL fragment; terminal operations ship the SQL through `Client`.
- `select`, `with_column`, `with_columns`, `drop`, `rename`, `cast`
- `filter`/`where`, `limit(n, offset=)`, `head`
- `sort`/`order_by`, `distinct`
- `union(all=)`, `intersect`, `except_`
- `join(other, on, how)`: inner/left/right/full/semi/anti; accepts string key, list of keys, or `Expr`; `cross_join`
- `group_by(*keys).aggregate(*aggs)`; `aggregate(*aggs)` for global aggregation
- `schema()`, `explain(analyze=, verbose=)`
- `collect()`, `to_arrow()`, `to_pandas()`, `to_polars()`, `to_pylist()`, `to_pydict()`, `count()`, `show(n=)`, `to_sql()`

Usage
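As a rough illustration of the lazy builder pattern described for `SpiceDataFrame` (each method returns a new frame holding SQL, and only terminal operations execute), here is a self-contained toy. `LazyFrame`, its `execute` callback, and the `_t` alias are hypothetical stand-ins. It takes predicates as raw strings for brevity, whereas the PR describes an `Expr`-based API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class LazyFrame:
    """Toy immutable builder: holds SQL text plus a callback that runs it."""

    sql: str
    execute: Callable[[str], List[Any]]  # stand-in for a real client

    def _derive(self, sql: str) -> LazyFrame:
        # Every transformation returns a new frame; the receiver is unchanged.
        return LazyFrame(sql, self.execute)

    def filter(self, predicate: str) -> LazyFrame:
        return self._derive(f"SELECT * FROM ({self.sql}) AS _t WHERE {predicate}")

    def select(self, *columns: str) -> LazyFrame:
        cols = ", ".join(columns)
        return self._derive(f"SELECT {cols} FROM ({self.sql}) AS _t")

    def limit(self, n: int) -> LazyFrame:
        return self._derive(f"SELECT * FROM ({self.sql}) AS _t LIMIT {int(n)}")

    def to_sql(self) -> str:
        return self.sql

    def collect(self) -> List[Any]:
        # Terminal operation: the only place SQL is actually sent anywhere.
        return self.execute(self.sql)


executed: List[str] = []


def fake_execute(sql: str) -> List[Any]:
    """Records the SQL instead of contacting a server."""
    executed.append(sql)
    return []


base = LazyFrame("SELECT * FROM trips", fake_execute)
df = base.filter("fare > 10").select("vendor", "fare").limit(5)
print(df.to_sql())
```

Nesting each step as a subquery keeps the compiler trivial at the cost of verbose SQL; a query engine's optimizer would normally be expected to flatten it.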
Tests
295 unit tests pass (170 new across `test_sql`, `test_expr`, `test_functions`, `test_dataframe`, plus extensions to `test_client`). Black, ruff, mypy, bandit all clean.

Not in this PR (follow-up work)
- `VALUES` for small literals. The runtime already provides the distributed cayenne catalog as a session-scoped catalog backend; wiring `client.register_arrow/parquet/pandas/record_batches` on top of it is a follow-up SDK PR.
- `register_udf` surface can land alongside it.

Notes
- `spicepy[polars]` (added in addition to the existing `params` extra).
- `B608` (hardcoded SQL expressions) is now skipped globally: composing SQL with `quote_ident`/`quote_literal`-escaped inputs is this package's job.
- `github-code-quality[bot]` review threads on this PR complain that `Expr` subclasses don't override `__eq__`. The rule is misapplied: `Expr.__eq__` is intentionally the DSL builder (returns an `Expr` representing SQL equality, not a Python `bool`). Implementing the bot's suggestion would silently break `df.filter(col("x") == 5)`. The class docstring on `Expr` now documents this; the threads can be resolved as won't-fix.
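The B608 note relies on every interpolated name and value passing through escaping helpers before it reaches a SQL string. The sketch below shows what helpers with the names `quote_ident` and `quote_literal` might look like under standard-SQL quoting rules. The bodies are assumptions, not the PR's code, and the set of supported value types is deliberately small.

```python
import math
from typing import Any


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: Any) -> str:
    """Render a Python value as a SQL literal (assumed standard-SQL rules)."""
    if value is None:
        return "NULL"
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("non-finite floats have no portable SQL literal")
        return repr(value)
    if isinstance(value, str):
        # Single quotes are escaped by doubling them.
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


hostile = "x'; DROP TABLE t; --"
print(f"SELECT * FROM {quote_ident('my table')} WHERE name = {quote_literal(hostile)}")
# SELECT * FROM "my table" WHERE name = 'x''; DROP TABLE t; --'
```

Because the hostile input stays inside a single quoted literal, the statement's structure is unchanged. That property is why a blanket string-composition warning is treated as a false positive for a package whose job is composing SQL.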