@@ -26,24 +26,16 @@ This will build and store the fat JAR in the `target` directory by default.

For use with Java and Scala projects, the package can be found [here](https://central.sonatype.com/artifact/io.qdrant/spark).

- ```xml
- <dependency>
-     <groupId>io.qdrant</groupId>
-     <artifactId>spark</artifactId>
-     <version>2.0.1</version>
- </dependency>
- ```
-
## Usage 📝

- ### Creating a Spark session (Single-node) with Qdrant support 🌟
+ ### Creating a Spark session (Single-node) with Qdrant support

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
    "spark.jars",
-     "spark-2.0.1.jar",  # specify the downloaded JAR file
+     "spark-2.1.0.jar",  # specify the downloaded JAR file
)
.master("local[*]")
.appName("qdrant")
@@ -52,30 +44,150 @@ spark = SparkSession.builder.config(

### Loading data 📊

- To load data into Qdrant, a collection has to be created beforehand with the appropriate vector dimensions and configurations.
+ > [!IMPORTANT]
+ > Before loading the data using this connector, a collection has to be [created](https://qdrant.tech/documentation/concepts/collections/#create-a-collection) in advance with the appropriate vector dimensions and configurations.
+
+ The connector supports ingesting multiple named/unnamed, dense/sparse vectors.
+
+ <details>
+   <summary><b>Unnamed/Default vector</b></summary>
+
+ ```python
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", <QDRANT_GRPC_URL>)
+     .option("collection_name", <QDRANT_COLLECTION_NAME>)
+     .option("embedding_field", <EMBEDDING_FIELD_NAME>)  # Expected to be a field of type ArrayType(FloatType)
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
+ ```
+
+ </details>
+
+ <details>
+   <summary><b>Named vector</b></summary>
+
+ ```python
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", <QDRANT_GRPC_URL>)
+     .option("collection_name", <QDRANT_COLLECTION_NAME>)
+     .option("embedding_field", <EMBEDDING_FIELD_NAME>)  # Expected to be a field of type ArrayType(FloatType)
+     .option("vector_name", <VECTOR_NAME>)
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
+ ```
+
+ > #### NOTE
+ >
+ > The `embedding_field` and `vector_name` options are maintained for backward compatibility. It is recommended to use `vector_fields` and `vector_names` for named vectors, as shown below.
+
+ </details>
+
+ <details>
+   <summary><b>Multiple named vectors</b></summary>
+
+ ```python
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", "<QDRANT_GRPC_URL>")
+     .option("collection_name", "<QDRANT_COLLECTION_NAME>")
+     .option("vector_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
+     .option("vector_names", "<VECTOR_NAME>,<ANOTHER_VECTOR_NAME>")
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
+ ```
+
+ </details>
+
+ <details>
+   <summary><b>Sparse vectors</b></summary>
+
+ ```python
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", "<QDRANT_GRPC_URL>")
+     .option("collection_name", "<QDRANT_COLLECTION_NAME>")
+     .option("sparse_vector_value_fields", "<COLUMN_NAME>")
+     .option("sparse_vector_index_fields", "<COLUMN_NAME>")
+     .option("sparse_vector_names", "<SPARSE_VECTOR_NAME>")
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
+ ```
+
+ </details>
+
+ <details>
+   <summary><b>Multiple sparse vectors</b></summary>
+
+ ```python
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", "<QDRANT_GRPC_URL>")
+     .option("collection_name", "<QDRANT_COLLECTION_NAME>")
+     .option("sparse_vector_value_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
+     .option("sparse_vector_index_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
+     .option("sparse_vector_names", "<SPARSE_VECTOR_NAME>,<ANOTHER_SPARSE_VECTOR_NAME>")
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
+ ```
+
+ </details>
+
+ <details>
+   <summary><b>Combination of named dense and sparse vectors</b></summary>
+
+ ```python
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", "<QDRANT_GRPC_URL>")
+     .option("collection_name", "<QDRANT_COLLECTION_NAME>")
+     .option("vector_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
+     .option("vector_names", "<VECTOR_NAME>,<ANOTHER_VECTOR_NAME>")
+     .option("sparse_vector_value_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
+     .option("sparse_vector_index_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
+     .option("sparse_vector_names", "<SPARSE_VECTOR_NAME>,<ANOTHER_SPARSE_VECTOR_NAME>")
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
+ ```
+
+ </details>
+
+ <details>
+   <summary><b>No vectors - Entire dataframe is stored as payload</b></summary>

```python
- <pyspark.sql.DataFrame>
-     .write
-     .format("io.qdrant.spark.Qdrant")
-     .option("qdrant_url", <QDRANT_GRPC_URL>)
-     .option("collection_name", <QDRANT_COLLECTION_NAME>)
-     .option("embedding_field", <EMBEDDING_FIELD_NAME>)  # Expected to be a field of type ArrayType(FloatType)
-     .option("schema", <pyspark.sql.DataFrame>.schema.json())
-     .mode("append")
-     .save()
+ <pyspark.sql.DataFrame>
+     .write
+     .format("io.qdrant.spark.Qdrant")
+     .option("qdrant_url", "<QDRANT_GRPC_URL>")
+     .option("collection_name", "<QDRANT_COLLECTION_NAME>")
+     .option("schema", <pyspark.sql.DataFrame>.schema.json())
+     .mode("append")
+     .save()
```

- - By default, UUIDs are generated for each row. If you need to use custom IDs, you can do so by setting the `id_field` option.
- - An API key can be set using the `api_key` option to make authenticated requests.
+ </details>

## Databricks

- You can use the `qdrant-spark` connector as a library in Databricks to ingest data into Qdrant.
+ You can use the connector as a library in Databricks to ingest data into Qdrant.

- Go to the `Libraries` section in your cluster dashboard.
- Select `Install New` to open the library installation modal.
- - Search for `io.qdrant:spark:2.0.1` in the Maven packages and click `Install`.
+ - Search for `io.qdrant:spark:2.1.0` in the Maven packages and click `Install`.

<img width="1064" alt="Screenshot 2024-01-05 at 17 20 01 (1)" src="https://github.com/qdrant/qdrant-spark/assets/46051506/d95773e0-c5c6-4ff2-bf50-8055bb08fd1b">
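For intuition about what the writes above produce: each DataFrame row becomes one Qdrant point, with dense columns mapped through `vector_fields`/`vector_names`, sparse index/value columns paired up through the `sparse_vector_*` options, and the remaining columns stored as payload. A rough sketch of the equivalent point using the Python client instead of the connector (hypothetical names, `:memory:` mode for illustration only):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # hypothetical; stands in for a real instance
client.create_collection(
    collection_name="demo",
    vectors_config={"dense": models.VectorParams(size=4, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

# One DataFrame row ≈ one point: a named dense vector, a named sparse vector,
# and the leftover columns ("title" here) stored as payload.
client.upsert(
    collection_name="demo",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": [0.1, 0.2, 0.3, 0.4],
                "sparse": models.SparseVector(indices=[3, 17], values=[0.6, 0.4]),
            },
            payload={"title": "example row"},
        )
    ],
)

point = client.retrieve(collection_name="demo", ids=[1], with_payload=True)[0]
print(point.payload)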
@@ -85,17 +197,22 @@ Qdrant supports all the Spark data types. The appropriate types are mapped based

## Options and Spark types 🛠️

- | Option            | Description                                                               | DataType               | Required |
- | :---------------- | :------------------------------------------------------------------------ | :--------------------- | :------- |
- | `qdrant_url`      | GRPC URL of the Qdrant instance. E.g.: <http://localhost:6334>            | `StringType`           | ✅       |
- | `collection_name` | Name of the collection to write data into                                 | `StringType`           | ✅       |
- | `embedding_field` | Name of the field holding the embeddings                                  | `ArrayType(FloatType)` | ✅       |
- | `schema`          | JSON string of the dataframe schema                                       | `StringType`           | ✅       |
- | `id_field`        | Name of the field holding the point IDs. Default: Generates a random UUID | `StringType`           | ❌       |
- | `batch_size`      | Max size of the upload batch. Default: 100                                | `IntType`              | ❌       |
- | `retries`         | Number of upload retries. Default: 3                                      | `IntType`              | ❌       |
- | `api_key`         | Qdrant API key to be sent in the header. Default: null                    | `StringType`           | ❌       |
- | `vector_name`     | Name of the vector in the collection. Default: null                       | `StringType`           | ❌       |
+ | Option                       | Description                                                        | Column DataType               | Required |
+ | :--------------------------- | :----------------------------------------------------------------- | :---------------------------- | :------- |
+ | `qdrant_url`                 | GRPC URL of the Qdrant instance. E.g.: <http://localhost:6334>     | -                             | ✅       |
+ | `collection_name`            | Name of the collection to write data into                          | -                             | ✅       |
+ | `schema`                     | JSON string of the dataframe schema                                | -                             | ✅       |
+ | `embedding_field`            | Name of the column holding the embeddings                          | `ArrayType(FloatType)`        | ❌       |
+ | `id_field`                   | Name of the column holding the point IDs. Default: Random UUID     | `StringType` or `IntegerType` | ❌       |
+ | `batch_size`                 | Max size of the upload batch. Default: 64                          | -                             | ❌       |
+ | `retries`                    | Number of upload retries. Default: 3                               | -                             | ❌       |
+ | `api_key`                    | Qdrant API key for authentication                                  | -                             | ❌       |
+ | `vector_name`                | Name of the vector in the collection                               | -                             | ❌       |
+ | `vector_fields`              | Comma-separated names of columns holding the vectors               | `ArrayType(FloatType)`        | ❌       |
+ | `vector_names`               | Comma-separated names of vectors in the collection                 | -                             | ❌       |
+ | `sparse_vector_index_fields` | Comma-separated names of columns holding the sparse vector indices | `ArrayType(IntegerType)`      | ❌       |
+ | `sparse_vector_value_fields` | Comma-separated names of columns holding the sparse vector values  | `ArrayType(FloatType)`        | ❌       |
+ | `sparse_vector_names`        | Comma-separated names of the sparse vectors in the collection      | -                             | ❌       |

## LICENSE 📜