Commit 0bdee8e
Merge pull request #143 from osopardo1/unified-catalog
Unified Catalog for Delta and Qbeast Tables.
2 parents d503833 + 8d3354c commit 0bdee8e
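
In short, this change lets a single `QbeastCatalog` act as the session catalog and delegate other providers (currently Delta) to their own catalog implementation. Below is a minimal sketch of how the feature is picked up on the user side, assuming a local Spark 3.2 session with the packages listed in the README diff further down; the DataFrame and table names are illustrative only, not part of the commit:

```scala
import org.apache.spark.sql.SparkSession

// Sketch (assumption: qbeast-spark 0.3.0 and delta-core 1.2.0 are on the classpath).
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.extensions", "io.qbeast.spark.internal.QbeastSparkSessionExtension")
  .config(
    "spark.sql.catalog.spark_catalog",
    "io.qbeast.spark.internal.sources.catalog.QbeastCatalog")
  .getOrCreate()

import spark.implicits._
// Toy data with hypothetical columns, only to make the example self-contained.
val df = Seq((1L, 10L), (2L, 20L)).toDF("user_id", "product_id")

// One unified catalog handles both table formats in the default namespace.
df.write.format("qbeast").option("columnsToIndex", "user_id,product_id").saveAsTable("qbeast_table")
df.write.format("delta").saveAsTable("delta_table")
```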

File tree: 7 files changed (+173, -35 lines)


CONTRIBUTING.md (+1, -1)

@@ -125,7 +125,7 @@ For example:
sbt assembly

$SPARK_HOME/bin/spark-shell \
- --jars ./target/scala-2.12/qbeast-spark-assembly-0.3.0-alpha.jar \
+ --jars ./target/scala-2.12/qbeast-spark-assembly-0.3.0.jar \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--packages io.delta:delta-core_2.12:1.2.0
```

README.md (+7, -6)

@@ -91,9 +91,10 @@ export SPARK_HOME=$PWD/spark-3.1.1-bin-hadoop3.2

```bash
$SPARK_HOME/bin/spark-shell \
+ --repositories https://s01.oss.sonatype.org/content/repositories/snapshots \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog \
- --packages io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0
+ --packages io.qbeast:qbeast-spark_2.12:0.3.0-SNAPSHOT,io.delta:delta-core_2.12:1.2.0
```

### 2. Indexing a dataset

@@ -173,11 +174,11 @@ qbeastTable.analyze()
Go to [QbeastTable documentation](./docs/QbeastTable.md) for more detailed information.

# Dependencies and Version Compatibility
- | Version | Spark | Hadoop | Delta Lake |
- |-------------|:-----:|:------:|:----------:|
- | 0.1.0 | 3.0.0 | 3.2.0 | 0.8.0 |
- | 0.2.0 | 3.1.x | 3.2.0 | 1.0.0 |
- | 0.3.0-alpha | 3.2.x | 3.3.x | 1.2.x |
+ | Version | Spark | Hadoop | Delta Lake |
+ |------------|:-----:|:------:|:----------:|
+ | 0.1.0 | 3.0.0 | 3.2.0 | 0.8.0 |
+ | 0.2.0 | 3.1.x | 3.2.0 | 1.0.0 |
+ | 0.3.0 | 3.2.x | 3.3.x | 1.2.x |

Check [here](https://docs.delta.io/latest/releases.html) for **Delta Lake** and **Apache Spark** version compatibility.

build.sbt (+1, -1)

@@ -1,7 +1,7 @@
import Dependencies._
import xerial.sbt.Sonatype._

- val mainVersion = "0.3.0-alpha"
+ val mainVersion = "0.3.0"

lazy val qbeastCore = (project in file("core"))
  .settings(

docs/AdvancedConfiguration.md (+50)

@@ -2,6 +2,56 @@

There's different configurations for the index that can affect the performance on read or the writing process. Here is a resume of some of them.

+ ## Catalogs
+ 
+ We designed the `QbeastCatalog` to work as an **entry point for other formats' catalogs** as well.
+ 
+ However, you can also handle different catalogs simultaneously.
+ 
+ ### 1. Unified Catalog
+ 
+ ```bash
+ --conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
+ ```
+ 
+ Using the `spark_catalog` configuration, you can write **qbeast** and **delta** (or upcoming formats ;) ) tables into the `default` namespace.
+ 
+ ```scala
+ df.write
+   .format("qbeast")
+   .option("columnsToIndex", "user_id,product_id")
+   .saveAsTable("qbeast_table")
+ 
+ df.write
+   .format("delta")
+   .saveAsTable("delta_table")
+ ```
+ ### 2. Secondary catalog
+ 
+ To use **more than one catalog in the same session**, you can set it up under a different name.
+ 
+ ```bash
+ --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
+ --conf spark.sql.catalog.qbeast_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
+ ```
+ 
+ Notice that the `QbeastCatalog` conf parameter is no longer `spark_catalog`; it now has a custom name such as `qbeast_catalog`. Each table written using the **qbeast** implementation should carry the `qbeast_catalog` prefix.
+ 
+ For example:
+ 
+ ```scala
+ // DataFrame API
+ df.write
+   .format("qbeast")
+   .option("columnsToIndex", "user_id,product_id")
+   .saveAsTable("qbeast_catalog.default.qbeast_table")
+ 
+ // SQL
+ spark.sql("CREATE TABLE qbeast_catalog.default.qbeast_table USING qbeast AS SELECT * FROM ecommerce")
+ ```
+ 

## ColumnsToIndex

These are the columns you want to index. Try to find those which are interesting for your queries, or your data pipelines.
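
As a usage note on the secondary-catalog setup added above: tables created through the extra catalog are read back with the same qualified identifier, while tables in `spark_catalog` keep their short names. A hedged sketch reusing the catalog and table names from the examples in this diff:

```scala
// Sketch only: assumes the spark_catalog / qbeast_catalog configuration shown above
// and that both example tables already exist.
val qbeastDf = spark.read.table("qbeast_catalog.default.qbeast_table")
val deltaDf = spark.read.table("delta_table") // a delta table in the default catalog keeps its short name

// The same qualified name works from SQL.
spark.sql("SELECT COUNT(*) FROM qbeast_catalog.default.qbeast_table").show()
```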

docs/Quickstart.md (+42, -16)

@@ -16,13 +16,8 @@ Inside the project folder, launch a spark-shell with the required **dependencies**
```bash
$SPARK_HOME/bin/spark-shell \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
- --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
- --packages io.qbeast:qbeast-spark_2.12:0.2.0,\
- io.delta:delta-core_2.12:1.0.0,\
- com.amazonaws:aws-java-sdk:1.12.20,\
- org.apache.hadoop:hadoop-common:3.2.0,\
- org.apache.hadoop:hadoop-client:3.2.0,\
- org.apache.hadoop:hadoop-aws:3.2.0
+ --conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog \
+ --packages io.qbeast:qbeast-spark_2.12:0.3.0,io.delta:delta-core_2.12:1.2.0
```
As an **_extra configuration_**, you can also change two global parameters of the index:

@@ -37,26 +32,28 @@ As an **_extra configuration_**, you can also change two global parameters of the index:
```
Consult the [Qbeast-Spark advanced configuration](AdvancedConfiguration.md) for more information.

- Read the ***store_sales*** public dataset from `TPC-DS`, the table has with **23** columns in total and was generated with a `scaleFactor` of 1. Check [The Making of TPC-DS](http://www.tpc.org/tpcds/presentations/the_making_of_tpcds.pdf) for more details on the dataset.
-
+ Read the ***ecommerce*** test dataset from [Kaggle](https://www.kaggle.com/code/adilemrebilgic/e-commerce-analytics/data).
```scala
- val parquetTablePath = "s3a://qbeast-public-datasets/store_sales"
-
- val parquetDf = spark.read.format("parquet").load(parquetTablePath).na.drop()
+ val ecommerce = spark.read
+   .format("csv")
+   .option("header", "true")
+   .option("inferSchema", "true")
+   .load("src/test/resources/ecommerce100K_2019_Oct.csv")
```

- Indexing the data with the desired columns, in this case `ss_cdemo_sk` and `ss_cdemo_sk`.
+ Indexing the data with the desired columns, in this case `user_id` and `product_id`.
```scala
val qbeastTablePath = "/tmp/qbeast-test-data/qtable"

- (parquetDf.write
+ (ecommerce.write
  .mode("overwrite")
  .format("qbeast") // Saving the dataframe in a qbeast datasource
-   .option("columnsToIndex", "ss_cdemo_sk,ss_cdemo_sk") // Indexing the table
-   .option("cubeSize", 300000) // The desired number of records of the resulting files/cubes. Default is 100000
+   .option("columnsToIndex", "user_id,product_id") // Indexing the table
+   .option("cubeSize", "500") // The desired number of records of the resulting files/cubes. Default is 5M
  .save(qbeastTablePath))
```

+
## Sampling

Allow the sample operator to be pushed down to the source when sampling, reducing i/o and computational cost.

@@ -80,6 +77,35 @@ qbeastDf.sample(0.1).explain()

Notice that the sample operator is no longer present in the physical plan. It's converted into a `Filter (qbeast_hash)` instead and is used to select files during data scanning(`DataFilters` from `FileScan`). We skip reading many files in this way, involving less I/O.

+ ## SQL
+ 
+ Thanks to the `QbeastCatalog`, you can use plain SQL, such as `CREATE TABLE` or `INSERT INTO`, with the qbeast format.
+ 
+ To check the different catalog configurations, please go to [Advanced Configuration](AdvancedConfiguration.md).
+ 
+ ```scala
+ ecommerce.createOrReplaceTempView("ecommerce_october")
+ 
+ spark.sql("CREATE OR REPLACE TABLE ecommerce_qbeast USING qbeast AS SELECT * FROM ecommerce_october")
+ 
+ // OR
+ 
+ val ecommerceNovember = spark.read
+   .format("csv")
+   .option("header", "true")
+   .option("inferSchema", "true")
+   .load("./src/test/resources/ecommerce100K_2019_Nov.csv")
+ 
+ ecommerceNovember.createOrReplaceTempView("ecommerce_november")
+ 
+ spark.sql("INSERT INTO ecommerce_qbeast SELECT * FROM ecommerce_november")
+ ```
+ Sampling also has a SQL operator, `TABLESAMPLE`, which can be expressed in terms of either rows or a percentage.
+ 
+ ```scala
+ spark.sql("SELECT avg(price) FROM ecommerce_qbeast TABLESAMPLE(10 PERCENT)").show()
+ ```

## Analyze and Optimize
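
The new Quickstart section only demonstrates the percentage form of `TABLESAMPLE`, although its prose mentions sampling by rows as well. A hedged companion sketch on the same `ecommerce_qbeast` table (the row-count form is standard Spark SQL; whether it is pushed down is not covered by this diff):

```scala
// Sampling a fixed number of rows instead of a percentage.
spark.sql("SELECT avg(price) FROM ecommerce_qbeast TABLESAMPLE(1000 ROWS)").show()
```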

src/main/scala/io/qbeast/spark/internal/sources/catalog/QbeastCatalog.scala (+45, -10)

@@ -16,6 +16,7 @@ import org.apache.spark.sql.catalyst.analysis.{
import org.apache.spark.sql.{SparkCatalogUtils, SparkSession}
import org.apache.spark.sql.connector.catalog._
import org.apache.spark.sql.connector.expressions.Transform
+ import org.apache.spark.sql.delta.catalog.DeltaCatalog
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

@@ -37,21 +38,44 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]

  private val tableFactory = QbeastContext.indexedTableFactory

+   private val deltaCatalog: DeltaCatalog = new DeltaCatalog()
+ 
  private var delegatedCatalog: CatalogPlugin = null

  private var catalogName: String = null

-   private def getSessionCatalog(): T = {
+   /**
+    * Gets the delegated catalog of the session
+    * @return
+    */
+   private def getDelegatedCatalog(): T = {
    val sessionCatalog = delegatedCatalog match {
      case null =>
        // In this case, any catalog has been delegated, so we need to search for the default
        SparkCatalogUtils.getV2SessionCatalog(SparkSession.active)
      case o => o
    }
-
    sessionCatalog.asInstanceOf[T]
  }

+   /**
+    * Gets the session catalog depending on provider properties, if any
+    *
+    * The intention is to include the different catalog providers
+    * while we add the integrations with the formats.
+    * For example, for "delta" provider it will return a DeltaCatalog instance.
+    *
+    * In this way, users may only need to instantiate one single unified catalog.
+    * @param properties the properties with the provider parameter
+    * @return
+    */
+   private def getSessionCatalog(properties: Map[String, String] = Map.empty): T = {
+     properties.get("provider") match {
+       case Some("delta") => deltaCatalog.asInstanceOf[T]
+       case _ => getDelegatedCatalog()
+     }
+   }
+ 
  override def loadTable(ident: Identifier): Table = {
    try {
      getSessionCatalog().loadTable(ident) match {

@@ -93,7 +117,11 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]
      // Load the table
      loadTable(ident)
    } else {
-       getSessionCatalog().createTable(ident, schema, partitions, properties)
+       getSessionCatalog(properties.asScala.toMap).createTable(
+         ident,
+         schema,
+         partitions,
+         properties)
    }

  }

@@ -119,12 +147,13 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]
        properties,
        tableFactory)
    } else {
-       if (getSessionCatalog().tableExists(ident)) {
-         getSessionCatalog().dropTable(ident)
+       val sessionCatalog = getSessionCatalog(properties.asScala.toMap)
+       if (sessionCatalog.tableExists(ident)) {
+         sessionCatalog.dropTable(ident)
      }
      DefaultStagedTable(
        ident,
-         getSessionCatalog().createTable(ident, schema, partitions, properties),
+         sessionCatalog.createTable(ident, schema, partitions, properties),
        this)
    }
  }

@@ -143,12 +172,13 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]
        properties,
        tableFactory)
    } else {
-       if (getSessionCatalog().tableExists(ident)) {
-         getSessionCatalog().dropTable(ident)
+       val sessionCatalog = getSessionCatalog(properties.asScala.toMap)
+       if (sessionCatalog.tableExists(ident)) {
+         sessionCatalog.dropTable(ident)
      }
      DefaultStagedTable(
        ident,
-         getSessionCatalog().createTable(ident, schema, partitions, properties),
+         sessionCatalog.createTable(ident, schema, partitions, properties),
        this)

  }

@@ -170,7 +200,8 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]
    } else {
      DefaultStagedTable(
        ident,
-         getSessionCatalog().createTable(ident, schema, partitions, properties),
+         getSessionCatalog(properties.asScala.toMap)
+           .createTable(ident, schema, partitions, properties),
        this)
    }
  }

@@ -208,6 +239,8 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]
  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
    // Initialize the catalog with the corresponding name
    this.catalogName = name
+     // Initialize the catalog in any other provider that we can integrate with
+     this.deltaCatalog.initialize(name, options)
  }

  override def name(): String = catalogName

@@ -216,6 +249,8 @@ class QbeastCatalog[T <: TableCatalog with SupportsNamespaces]
    // Check if the delegating catalog has Table and SupportsNamespace properties
    if (delegate.isInstanceOf[TableCatalog] && delegate.isInstanceOf[SupportsNamespaces]) {
      this.delegatedCatalog = delegate
+       // Set delegated catalog in any other provider that we can integrate with
+       this.deltaCatalog.setDelegateCatalog(delegate)
    } else throw new IllegalArgumentException("Invalid session catalog: " + delegate)
  }
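
To see what the new provider-based dispatch in `getSessionCatalog` means for users: with the unified catalog installed, a plain `CREATE TABLE ... USING delta` is handed to the embedded `DeltaCatalog`, `USING qbeast` stays on the qbeast path, and any other provider falls back to the delegated session catalog. A hedged sketch mirroring the documentation examples in this PR (view and table names are illustrative):

```scala
// Sketch only: assumes spark_catalog is set to QbeastCatalog as in the README diff
// and that `ecommerce` is the DataFrame loaded in the Quickstart.
ecommerce.createOrReplaceTempView("ecommerce_view")

// Routed to the embedded DeltaCatalog by the "delta" provider property.
spark.sql("CREATE TABLE ecommerce_delta USING delta AS SELECT * FROM ecommerce_view")

// Handled by the qbeast code path of the same catalog.
spark.sql("CREATE TABLE ecommerce_indexed USING qbeast AS SELECT * FROM ecommerce_view")
```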

src/test/scala/io/qbeast/spark/internal/sources/catalog/QbeastCatalogIntegrationTest.scala (+27, -1)

@@ -21,7 +21,6 @@ class QbeastCatalogIntegrationTest extends QbeastIntegrationTestSpec with Catalo

      data.write.format("delta").saveAsTable("delta_table") // delta catalog

-       // spark.sql("USE CATALOG qbeast_catalog")
      data.write
        .format("qbeast")
        .option("columnsToIndex", "id")

@@ -41,6 +40,33 @@ class QbeastCatalogIntegrationTest extends QbeastIntegrationTestSpec with Catalo

    }))

+   it should
+     "coexist with Delta tables in the same catalog" in withQbeastContextSparkAndTmpWarehouse(
+       (spark, _) => {
+ 
+         val data = createTestData(spark)
+ 
+         data.write.format("delta").saveAsTable("delta_table") // delta catalog
+ 
+         data.write
+           .format("qbeast")
+           .option("columnsToIndex", "id")
+           .saveAsTable("qbeast_table") // qbeast catalog
+ 
+         val tables = spark.sessionState.catalog.listTables("default")
+         tables.size shouldBe 2
+ 
+         val deltaTable = spark.read.table("delta_table")
+         val qbeastTable = spark.read.table("qbeast_table")
+ 
+         assertSmallDatasetEquality(
+           deltaTable,
+           qbeastTable,
+           orderedComparison = false,
+           ignoreNullable = true)
+ 
+       })
+ 
  it should "crate table" in withQbeastContextSparkAndTmpWarehouse((spark, _) => {

    spark.sql(
