Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the proposal
The Spark DataSource V2 API[1] has been available since Spark 3.0. It provides a set of APIs for developers to implement a connector, and Spark exposes the connector to the SQL/DataFrame API automatically with a few configurations.
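For context, a DSv2 connector is plugged in through the `CatalogPlugin`/`TableCatalog` interfaces. Below is a minimal, read-only sketch of such a catalog; the class `DemoCatalog` and its fixed table are hypothetical illustrations, not Kyuubi's actual implementation:

```scala
import java.util

import org.apache.spark.sql.connector.catalog._
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical read-only catalog. Spark instantiates the class named by
// `spark.sql.catalog.<name>` via reflection, then calls initialize().
class DemoCatalog extends TableCatalog {
  private var catalogName: String = _

  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit =
    catalogName = name

  override def name(): String = catalogName

  // Expose a fixed table list; a real connector derives this from its source.
  override def listTables(namespace: Array[String]): Array[Identifier] =
    Array(Identifier.of(namespace, "demo_table"))

  override def loadTable(ident: Identifier): Table = new Table {
    override def name(): String = ident.toString
    override def schema(): StructType = new StructType().add("id", "int")
    // A real readable table would also mix in SupportsRead and provide a ScanBuilder.
    override def capabilities(): util.Set[TableCapability] =
      util.Collections.singleton(TableCapability.BATCH_READ)
  }

  // A generated-data catalog is read-only, so write/DDL paths are rejected.
  override def createTable(
      ident: Identifier,
      schema: StructType,
      partitions: Array[Transform],
      properties: util.Map[String, String]): Table =
    throw new UnsupportedOperationException("read-only catalog")

  override def alterTable(ident: Identifier, changes: TableChange*): Table =
    throw new UnsupportedOperationException("read-only catalog")

  override def dropTable(ident: Identifier): Boolean = false

  override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit =
    throw new UnsupportedOperationException("read-only catalog")
}
```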
The TPC-DS[2] dataset is widely used for benchmarking and demonstration. Previously, we had to generate the dataset with the `dsdgen` or `kyuubi-tpcds` tool before running queries. With the connector proposed here, users just need to:
- Add the jar: `kyuubi-spark-connector-tpcds_2.12-${kyuubi_version}.jar`
- Add the conf: `spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog`
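For example, the same two steps in a self-contained Scala application; this is a hedged sketch, and the jar path and version are placeholders to substitute:

```scala
import org.apache.spark.sql.SparkSession

object TpcdsQuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      // Placeholder path: point at the actual connector jar for your Kyuubi version.
      .config("spark.jars", "/path/to/kyuubi-spark-connector-tpcds_2.12-<kyuubi_version>.jar")
      // Register the TPC-DS catalog under the name `tpcds`.
      .config("spark.sql.catalog.tpcds", "org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog")
      .getOrCreate()

    // Data is produced on the fly by the connector; no dsdgen run is required.
    spark.sql("SELECT COUNT(*) FROM tpcds.sf1.date_dim").show()

    spark.stop()
  }
}
```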
Then they can query TPC-DS tables at different scale factors under the `tpcds.sf{scale}` databases. For instance,
```
0: jdbc:hive2://0.0.0.0:10009/> show tables in tpcds.sf1;
+------------+-------------------------+--------------+
| namespace  |        tableName        | isTemporary  |
+------------+-------------------------+--------------+
| sf1        | call_center             | false        |
| sf1        | catalog_page            | false        |
| sf1        | catalog_returns         | false        |
| sf1        | catalog_sales           | false        |
| sf1        | customer                | false        |
| sf1        | customer_address        | false        |
| sf1        | customer_demographics   | false        |
| sf1        | date_dim                | false        |
| sf1        | household_demographics  | false        |
| sf1        | income_band             | false        |
| sf1        | inventory               | false        |
| sf1        | item                    | false        |
| sf1        | promotion               | false        |
| sf1        | reason                  | false        |
| sf1        | ship_mode               | false        |
| sf1        | store                   | false        |
| sf1        | store_returns           | false        |
| sf1        | store_sales             | false        |
| sf1        | time_dim                | false        |
| sf1        | warehouse               | false        |
| sf1        | web_page                | false        |
| sf1        | web_returns             | false        |
| sf1        | web_sales               | false        |
| sf1        | web_site                | false        |
+------------+-------------------------+--------------+
```
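The same catalog can also be explored through the DataFrame API. A hedged sketch, reusing the `spark` session from the example above (listing namespaces assumes the SupportsNamespaces subtask below):

```scala
// List the scale-factor namespaces exposed by the catalog.
spark.sql("SHOW NAMESPACES IN tpcds").show()

// Read one of the tables from the listing above as a DataFrame.
val storeSales = spark.table("tpcds.sf1.store_sales")
storeSales.printSchema()
println(storeSales.count())
```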
[1] https://github.com/apache/spark/tree/v3.2.1/sql/catalyst/src/main/java/org/apache/spark/sql/connector
[2] https://tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v3.2.0.pdf
Task list
- [Subtask] Kyuubi Spark TPC-DS Connector - Initial implementation #2531
- [Subtask] Kyuubi Spark TPC-DS Connector - SupportsReportStatistics #2539
- [Subtask] Kyuubi Spark TPC-DS Connector - SupportsNamespaces #2540
- [Subtask] Kyuubi Spark TPC-DS Connector - Set nullable in table schema #2541
- [Subtask] Kyuubi Spark TPC-DS Connector - Make useAnsiStringType configurable #2542
- [Subtask] Kyuubi Spark TPC-DS Connector - Make inputPartitionSize configurable #2543
- [Subtask] Kyuubi Spark TPC-DS Connector - Add tiny scale #2553
- [KYUUBI #2672] Check if the table exists #2673
- [Subtask] Handle column name change of customer table #2679
- Handle SPARK-37929 breaking change in TPCDSCatalog #2700
- Kyuubi Spark TPC-DS Connector - Rework SupportsReportStatistics and code refactor #2701
- Fix TPC-DS columns name and add TPC-DS queries verification #2702
- [Subtask] Kyuubi Spark TPC-DS Connector - Verify TPC-DS query output #2704
- Improve TPCDSTable display in Spark Web UI #2709
- [KYUUBI #2543] Add TPCDSTable generate benchmark #2729
- [Subtask] Add excludeDatabases for TPC-DS catalogs #2759
- [KYUUBI #2741] Add kyuubi-spark-connector-common module #2777
Are you willing to submit PR?
- Yes I am willing to submit a PR!