
Commit 246d075

Doc for DuckDB

1 parent c7690c2 commit 246d075

File tree

2 files changed (+57 −6)


docs/core_concepts/11_persistent_storage/large_data_files.mdx

Lines changed: 22 additions & 1 deletion

@@ -31,6 +31,27 @@ Windmill S3 bucket browser will not work for buckets containing more than 20 fil
 ETLs can be easily implemented in Windmill using its integration with Polars and DuckDB to facilitate working with tabular data. In this case, you don't need to manually interact with the S3 bucket; Polars/DuckDB does it natively and efficiently. Reading and writing datasets to S3 can be done seamlessly.
 
 <Tabs className="unique-tabs">
+<TabItem value="duckdb-script" label="DuckDB" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+
+```sql
+-- $file1 (s3object)
+
+-- Run queries directly on an S3 parquet file passed as an argument
+SELECT * FROM read_parquet($file1);
+
+-- Or use an explicit path in the workspace storage
+SELECT * FROM read_json('s3:///demo/data.json');
+
+-- You can also specify a secondary workspace storage
+SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');
+
+-- Write the result of a query to a different parquet file on S3
+COPY (
+    SELECT COUNT(*) FROM read_parquet($file1)
+) TO 's3:///demo/output.pq' (FORMAT 'parquet');
+```
+
+</TabItem>
 <TabItem value="polars" label="Polars" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
 
 ```python
@@ -77,7 +98,7 @@ def main(input_file: S3Object):
 ```
 
 </TabItem>
-<TabItem value="duckdb" label="DuckDB" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+<TabItem value="duckdb" label="DuckDB (Python)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
 
 ```python
 #requirements:
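The "DuckDB (Python)" tab is truncated here, but it follows the same pattern as the Python snippets diffed in `index.mdx` below: fetch a connection string from the Windmill SDK, apply it to an in-memory DuckDB connection, then query S3 URIs directly. A minimal sketch of that pattern; the helper `wmill.duckdb_connection_settings()` and its `connection_settings_str` field are assumed from the snippets below, so verify them against your SDK version:

```python
#requirements:
#wmill
#duckdb

import duckdb
import wmill
from wmill import S3Object


def main(input_file: S3Object):
    # create a DuckDB database in memory
    conn = duckdb.connect()

    # connect DuckDB to the workspace S3 bucket; this helper name is an
    # assumption based on the index.mdx snippets - verify against your SDK
    connection_str = wmill.duckdb_connection_settings().connection_settings_str
    conn.execute(connection_str)

    # the S3Object argument plays the same role as `-- $file1 (s3object)`
    # in the SQL tab above: a {"s3": "<path>"} wrapper around an object key
    input_uri = f"s3://{input_file['s3']}"
    count = conn.execute(f"SELECT COUNT(*) FROM read_parquet('{input_uri}')").fetchone()[0]

    conn.close()
    return count
```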

docs/core_concepts/27_data_pipelines/index.mdx

Lines changed: 35 additions & 5 deletions

@@ -168,7 +168,7 @@ def main(input_file: S3Object):
 ```
 
 </TabItem>
-<TabItem value="duckdb (AWS S3)" label="DuckDB (AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+<TabItem value="duckdb (Python / AWS S3)" label="DuckDB (Python / AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
 
 ```python
 import wmill
@@ -221,7 +221,7 @@ def main(input_file: S3Object):
 ```
 
 </TabItem>
-<TabItem value="duckdb (Azure Blob Storage)" label="DuckDB (Azure Blob Storage)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+<TabItem value="duckdb (Python / Azure Blob Storage)" label="DuckDB (Python / Azure Blob Storage)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
 
 ```python
 import wmill
@@ -241,7 +241,7 @@ def main(input_file: S3Object):
 # create a DuckDB database in memory
 # see https://duckdb.org/docs/api/python/dbapi
 conn = duckdb.connect()
-
+
 # connect DuckDB to the S3 bucket - this will default to the workspace S3 resource
 conn.execute(connection_str)
 
@@ -259,13 +259,34 @@ def main(input_file: S3Object):
 
 # NOTE: DuckDB doesn't support writing to Azure Blob Storage as of Jan 30 2025
 # Write the result of a query to a different parquet file on Azure Blob Storage
-# using Polars
+# using Polars
 storage_options = wmill.polars_connection_settings().storage_options
 query_result.pl().write_parquet(output_uri, storage_options=storage_options)
 conn.close()
 return S3Object(s3=output_file)
 ```
 
+</TabItem>
+<TabItem value="duckdb" label="DuckDB (AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+```sql
+-- $file1 (s3object)
+
+-- Run queries directly on an S3 parquet file passed as an argument
+SELECT * FROM read_parquet($file1);
+
+-- Or use an explicit path in the workspace storage
+SELECT * FROM read_json('s3:///demo/data.json');
+
+-- You can also specify a secondary workspace storage
+SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');
+
+-- Write the result of a query to a different parquet file on S3
+COPY (
+    SELECT COUNT(*) FROM read_parquet($file1)
+) TO 's3:///demo/output.pq' (FORMAT 'parquet');
+```
+
 </TabItem>
 </Tabs>
 
@@ -283,7 +304,16 @@ With S3 as the external store, a transformation script in a flow will typically
 2. Running some computation on the data.
 3. Storing the result back to S3 for the next scripts to be run.
 
-Windmill SDKs now expose helpers to simplify code and help you connect Polars or DuckDB to the Windmill workspace S3 bucket. In your usual IDE, you would need to write for _each script_:
+When running a DuckDB script, Windmill automatically handles the connection to your workspace storage:
+
+```sql
+-- This queries the Windmill API under the hood to figure out the
+-- correct connection string
+SELECT * FROM read_parquet('s3:///path/to/file.parquet');
+SELECT * FROM read_csv('s3://secondary_storage/path/to/file.csv');
+```
+
+If you want to use a scripting language, Windmill SDKs now expose helpers to simplify code and help you connect Polars or DuckDB to the Windmill workspace S3 bucket. In your usual IDE, you would need to write the following for _each script_:
 
 ```python
 conn = duckdb.connect()
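The diff truncates right after `conn = duckdb.connect()`, but the point of that passage is the manual S3 wiring the Windmill helpers replace. A sketch of what that per-script boilerplate typically looks like with DuckDB's httpfs extension; the credential values are placeholders and the exact lines in the docs may differ:

```python
import duckdb

# the boilerplate the Windmill SDK helpers generate for you:
# load DuckDB's httpfs extension and set the S3 credentials by hand
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region='us-east-1';")           # placeholder values:
conn.execute("SET s3_endpoint='s3.amazonaws.com';")  # substitute your bucket's
conn.execute("SET s3_access_key_id='***';")          # region, endpoint and keys
conn.execute("SET s3_secret_access_key='***';")

# once configured, s3:// URIs resolve directly
conn.sql("SELECT COUNT(*) FROM read_parquet('s3://my-bucket/demo/data.parquet')").show()
conn.close()
```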
