[Umbrella] Kyuubi Spark TPC-DS Connector #2538

Open
@pan3793

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the proposal

The Spark DataSource V2 API[1] has been available since Spark 3.0. It provides a set of interfaces for developers to implement a connector, and Spark exposes that connector to the SQL and DataFrame APIs automatically with only a few configuration settings.

The TPC-DS[2] dataset is very useful for benchmarking and demonstration. Previously, we had to generate the dataset with dsdgen or the kyuubi-tpcds tool before running queries. With the proposed connector, users only need to:

  1. Add the jar kyuubi-spark-connector-tpcds_2.12-${kyuubi_version}.jar
  2. Set the conf spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog
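
As a rough illustration of the two steps above, the sketch below collects the jar and catalog settings into a plain dictionary and, optionally, applies them to a PySpark session. The helper names (`connector_jar_name`, `tpcds_spark_conf`, `build_session`) are hypothetical and not part of Kyuubi. Passing the jar through `spark.jars` is an assumption about how the jar gets added; placing it on the classpath some other way should work as well.

```python
from typing import Dict

# Catalog implementation class, taken from step 2 above.
CATALOG_CLASS = "org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog"


def connector_jar_name(kyuubi_version: str, scala_version: str = "2.12") -> str:
    """Build the connector jar file name from the pattern in step 1."""
    return f"kyuubi-spark-connector-tpcds_{scala_version}-{kyuubi_version}.jar"


def tpcds_spark_conf(jar_path: str, catalog_name: str = "tpcds") -> Dict[str, str]:
    """Return the Spark settings corresponding to steps 1 and 2."""
    return {
        # Step 1 (assumption): ship the connector jar via spark.jars.
        "spark.jars": jar_path,
        # Step 2: register the catalog implementation under `catalog_name`.
        f"spark.sql.catalog.{catalog_name}": CATALOG_CLASS,
    }


def build_session(jar_path: str):
    """Create a SparkSession with the TPC-DS catalog (requires pyspark)."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("tpcds-connector-demo")
    for key, value in tpcds_spark_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session built this way, `spark.sql("show tables in tpcds.sf1")` would be expected to list the tables, provided the jar path points at a real connector build.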

They can then query the TPC-DS tables at different scale factors under the tpcds.sf{scale} database. For instance:

0: jdbc:hive2://0.0.0.0:10009/> show tables in tpcds.sf1;
+------------+-------------------------+--------------+
| namespace  |        tableName        | isTemporary  |
+------------+-------------------------+--------------+
| sf1        | call_center             | false        |
| sf1        | catalog_page            | false        |
| sf1        | catalog_returns         | false        |
| sf1        | catalog_sales           | false        |
| sf1        | customer                | false        |
| sf1        | customer_address        | false        |
| sf1        | customer_demographics   | false        |
| sf1        | date_dim                | false        |
| sf1        | household_demographics  | false        |
| sf1        | income_band             | false        |
| sf1        | inventory               | false        |
| sf1        | item                    | false        |
| sf1        | promotion               | false        |
| sf1        | reason                  | false        |
| sf1        | ship_mode               | false        |
| sf1        | store                   | false        |
| sf1        | store_returns           | false        |
| sf1        | store_sales             | false        |
| sf1        | time_dim                | false        |
| sf1        | warehouse               | false        |
| sf1        | web_page                | false        |
| sf1        | web_returns             | false        |
| sf1        | web_sales               | false        |
| sf1        | web_site                | false        |
+------------+-------------------------+--------------+
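
To make the tpcds.sf{scale} naming concrete, here is a small sketch that builds fully qualified table names and a simple query string for the tables listed above. The function names are hypothetical, and treating the scale as a plain integer is an assumption; which scale values the connector accepts is not stated here.

```python
# Table names copied from the `show tables in tpcds.sf1` output above.
TPCDS_TABLES = [
    "call_center", "catalog_page", "catalog_returns", "catalog_sales",
    "customer", "customer_address", "customer_demographics", "date_dim",
    "household_demographics", "income_band", "inventory", "item",
    "promotion", "reason", "ship_mode", "store", "store_returns",
    "store_sales", "time_dim", "warehouse", "web_page", "web_returns",
    "web_sales", "web_site",
]


def qualified_table(table: str, scale: int, catalog: str = "tpcds") -> str:
    """Return `<catalog>.sf<scale>.<table>` for a known TPC-DS table."""
    if table not in TPCDS_TABLES:
        raise ValueError(f"unknown TPC-DS table: {table}")
    # Assumption: the scale is an integer appended directly after "sf".
    return f"{catalog}.sf{scale}.{table}"


def count_query(table: str, scale: int, catalog: str = "tpcds") -> str:
    """Build a row-count query against one table at the given scale."""
    return f"SELECT count(*) FROM {qualified_table(table, scale, catalog)}"


if __name__ == "__main__":
    # Print one count query per table for scale factor 1.
    for name in TPCDS_TABLES:
        print(count_query(name, 1))
```

Each generated string can be pasted into the same JDBC session used for `show tables`, or passed to `spark.sql(...)`.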

[1] https://github.com/apache/spark/tree/v3.2.1/sql/catalyst/src/main/java/org/apache/spark/sql/connector
[2] https://tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v3.2.0.pdf

Task list

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
