feat: dataframe api #70

Draft
wants to merge 1 commit into
base: main
@tokoko (Contributor) commented Mar 14, 2025

This PR is strictly for demo purposes. It introduces a higher-level dataframe API based on my own subframe project, though I've tried to keep it as unopinionated as possible here. The key features are:

  • Introduces a DataFrame class whose methods map (almost) one-to-one to Rel types.
  • Introduces helper functions for building Expressions (literal, col, scalar_function).
  • Adds a layer of convenience helpers that wrap scalar_function calls for functions in the default extensions, so substrait.dataframe.functions.add(...) acts as an alias for substrait.dataframe.scalar_function("functions_arithmetic.yaml", "add", ...).
  • ADBC integration: the named_table function can use an ADBC connection as a catalog to detect ReadRel schemas.
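
To make the aliasing idea concrete, here is a minimal, self-contained sketch of how a convenience helper could pin the extension file and function name. `ScalarCall` and the stub `scalar_function` below are hypothetical stand-ins written for illustration; they are not the code in this PR, which would build real Substrait expressions instead.

```python
from dataclasses import dataclass
from functools import partial
from typing import Any, Tuple


@dataclass(frozen=True)
class ScalarCall:
    """Hypothetical stand-in for a scalar function expression node."""
    extension: str
    name: str
    args: Tuple[Any, ...]


def scalar_function(extension: str, name: str, *args: Any) -> ScalarCall:
    # Stand-in only: a real helper would emit a Substrait expression here.
    return ScalarCall(extension, name, args)


# The convenience alias fixes the extension file and the function name,
# so callers only supply the arguments.
add = partial(scalar_function, "functions_arithmetic.yaml", "add")

# Aliases nest the same way the explicit calls do.
nested = add(add("BigNumber", "BigNumber2"), "BigNumber2")
```

Under these assumptions, `add(a, b)` and `scalar_function("functions_arithmetic.yaml", "add", a, b)` produce equal nodes, which is the property the alias layer is meant to provide.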

Example usage with adbc:

```python
import adbc_driver_duckdb.dbapi
import pyarrow
from substrait.dataframe import named_table, literal, col, scalar_function
from substrait.dataframe.functions import add

data = pyarrow.record_batch(
    [[1, 2, 3, 4], ["a", "b", "c", "d"]],
    names=["ints", "strs"],
)

with adbc_driver_duckdb.dbapi.connect(":memory:") as conn:
    with conn.cursor() as cur:
        # Load the sample data into DuckDB.
        cur.adbc_ingest("AnswerToEverything", data)

        # Enable DuckDB's substrait extension so it can execute plans.
        cur.executescript("INSTALL substrait;")
        cur.executescript("LOAD substrait;")

        # The connection acts as a catalog for resolving the table schema.
        table = named_table("AnswerToEverything", conn)
        table = table.project(
            literal(1001, type="i64").alias("BigNumber"),
            col("ints").alias("BigNumber2"),
        )

        # The explicit scalar_function call and the add alias can be mixed.
        table = table.project(
            scalar_function("functions_arithmetic.yaml", "add",
                add(col("BigNumber"), col("BigNumber2")),
                col("BigNumber2")
            ).alias("BigNumber3")
        )

        # Execute the serialized Substrait plan.
        cur.execute(table.plan.SerializeToString())
        print(cur.fetch_arrow_table())
```
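
As a sanity check on what the final projection computes, the nested call works out to (BigNumber + BigNumber2) + BigNumber2 for each row. The plain-Python sketch below reproduces that arithmetic for the sample `ints` column. It assumes `add` is ordinary integer addition and does not touch DuckDB or Substrait, so it says nothing about the exact shape of the returned Arrow table.

```python
ints = [1, 2, 3, 4]

# First projection: a constant column and a renamed copy of "ints".
big_number = [1001 for _ in ints]
big_number2 = list(ints)

# Second projection: add(add(BigNumber, BigNumber2), BigNumber2).
big_number3 = [(a + b) + b for a, b in zip(big_number, big_number2)]

print(big_number3)  # [1003, 1005, 1007, 1009]
```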


ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
