Commit c1adfcc

DuckDB (#955)

* DuckDB
* talk about not needing scripting language
* Doc for DuckDB

1 parent f2b9e27 · commit c1adfcc

File tree: 11 files changed (+128 / -7 lines)

changelog/2025-05-22-duckdb/index.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+---
+slug: duckdb
+version: v1.493.0
+title: DuckDB
+tags: ['scripts', 'storage']
+description: You can run DuckDB scripts in-memory, with access to S3 objects and other database resources. You no longer need a scripting language for your DuckDB/Polars ETL pipelines; you can do it entirely in SQL.
+features:
+  - S3 object integration
+  - Attach to BigQuery, Postgres and MySQL database resources with all CRUD operations
+image: ./duckdb.png
+docs: /docs/getting_started/scripts_quickstart/sql#duckdb-1
+---

docs/assets/integrations/duckdb.png

127 KB

docs/core_concepts/11_persistent_storage/large_data_files.mdx

Lines changed: 22 additions & 1 deletion
@@ -31,6 +31,27 @@ Windmill S3 bucket browser will not work for buckets containing more than 20 fil
 ETLs can be easily implemented in Windmill using its integration with Polars and DuckDB to facilitate working with tabular data. In this case, you don't need to manually interact with the S3 bucket; Polars/DuckDB does it natively and in an efficient way. Reading and writing datasets to S3 can be done seamlessly.

 <Tabs className="unique-tabs">
+<TabItem value="duckdb-script" label="DuckDB" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+
+```sql
+-- $file1 (s3object)
+
+-- Run queries directly on an S3 parquet file passed as an argument
+SELECT * FROM read_parquet($file1);
+
+-- Or using an explicit path in a workspace storage
+SELECT * FROM read_json('s3:///demo/data.json');
+
+-- You can also specify a secondary workspace storage
+SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');
+
+-- Write the result of a query to a different parquet file on S3
+COPY (
+    SELECT COUNT(*) FROM read_parquet($file1)
+) TO 's3:///demo/output.pq' (FORMAT 'parquet');
+```
+
+</TabItem>
 <TabItem value="polars" label="Polars" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

 ```python
@@ -77,7 +98,7 @@ def main(input_file: S3Object):
 ```

 </TabItem>
-<TabItem value="duckdb" label="DuckDB" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+<TabItem value="duckdb" label="DuckDB (Python)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

 ```python
 #requirements:
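Worth noting about the object paths used throughout the added tab (inferred from the examples themselves, not stated explicitly): an empty storage name, as in `s3:///demo/data.json`, resolves to the default workspace storage, while a named prefix such as `s3://secondary_storage/...` targets a configured secondary storage. A minimal sketch with illustrative file paths:

```sql
-- default workspace storage (empty storage name after 's3://')
SELECT COUNT(*) FROM read_parquet('s3:///demo/data.parquet');

-- a secondary storage configured in the workspace settings
SELECT COUNT(*) FROM read_parquet('s3://secondary_storage/demo/data.parquet');
```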

docs/core_concepts/27_data_pipelines/index.mdx

Lines changed: 35 additions & 5 deletions
@@ -168,7 +168,7 @@ def main(input_file: S3Object):
 ```

 </TabItem>
-<TabItem value="duckdb (AWS S3)" label="DuckDB (AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+<TabItem value="duckdb (Python / AWS S3)" label="DuckDB (Python / AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

 ```python
 import wmill
@@ -221,7 +221,7 @@ def main(input_file: S3Object):
 ```

 </TabItem>
-<TabItem value="duckdb (Azure Blob Storage)" label="DuckDB (Azure Blob Storage)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+<TabItem value="duckdb (Python / Azure Blob Storage)" label="DuckDB (Python / Azure Blob Storage)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

 ```python
 import wmill
@@ -241,7 +241,7 @@ def main(input_file: S3Object):
 # create a DuckDB database in memory
 # see https://duckdb.org/docs/api/python/dbapi
 conn = duckdb.connect()
-
+
 # connect duck db to the S3 bucket - this will default to the workspace S3 resource
 conn.execute(connection_str)

@@ -259,13 +259,34 @@ def main(input_file: S3Object):

 # NOTE: DuckDB doesn't support writing to Azure Blob Storage as of Jan 30 2025
 # Write the result of a query to a different parquet file on Azure Blob Storage
-# using Polars
+# using Polars
 storage_options = wmill.polars_connection_settings().storage_options
 query_result.pl().write_parquet(output_uri, storage_options=storage_options)
 conn.close()
 return S3Object(s3=output_file)
 ```

+</TabItem>
+<TabItem value="duckdb" label="DuckDB (AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
+```sql
+-- $file1 (s3object)
+
+-- Run queries directly on an S3 parquet file passed as an argument
+SELECT * FROM read_parquet($file1);
+
+-- Or using an explicit path in a workspace storage
+SELECT * FROM read_json('s3:///demo/data.json');
+
+-- You can also specify a secondary workspace storage
+SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');
+
+-- Write the result of a query to a different parquet file on S3
+COPY (
+    SELECT COUNT(*) FROM read_parquet($file1)
+) TO 's3:///demo/output.pq' (FORMAT 'parquet');
+```
+
 </TabItem>
 </Tabs>

@@ -283,7 +304,16 @@ With S3 as the external store, a transformation script in a flow will typically
 2. Running some computation on the data.
 3. Storing the result back to S3 for the next scripts to be run.

-Windmill SDKs now expose helpers to simplify code and help you connect Polars or DuckDB to the Windmill workspace S3 bucket. In your usual IDE, you would need to write for _each script_:
+When running a DuckDB script, Windmill automatically handles the connection to your workspace storage:
+
+```sql
+-- This queries the Windmill API under the hood to figure out the
+-- correct connection string
+SELECT * FROM read_parquet('s3:///path/to/file.parquet');
+SELECT * FROM read_csv('s3://secondary_storage/path/to/file.csv');
+```
+
+If you want to use a scripting language, Windmill SDKs now expose helpers to simplify code and help you connect Polars or DuckDB to the Windmill workspace S3 bucket. In your usual IDE, you would need to write for _each script_:

 ```python
 conn = duckdb.connect()
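For comparison, the per-script setup that these docs call boilerplate looks roughly like the sketch below when written by hand. This is a sketch only: `wmill.duckdb_connection_settings()` is assumed to mirror the `polars_connection_settings()` helper used in the Azure tab above, so treat the exact helper and attribute names as assumptions.

```python
import duckdb
import wmill

# create an in-memory DuckDB database
conn = duckdb.connect()

# fetch S3 connection settings for the workspace storage from Windmill;
# helper and attribute names are assumed from the Polars equivalent above
connection_str = wmill.duckdb_connection_settings().connection_settings_str
conn.execute(connection_str)

# query a parquet file in the bucket (path is illustrative)
count = conn.sql("SELECT COUNT(*) FROM read_parquet('s3://demo/data.parquet')").fetchone()[0]
conn.close()
```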

docs/getting_started/0_scripts_quickstart/5_sql_quickstart/index.mdx

Lines changed: 29 additions & 1 deletion
@@ -7,7 +7,7 @@ import DocCard from '@site/src/components/DocCard';
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';

-# PostgreSQL, MySQL, MS SQL, BigQuery, Snowflake, Redshift, Oracle
+# PostgreSQL, MySQL, MS SQL, BigQuery, Snowflake, Redshift, Oracle, DuckDB

 In this quick start guide, we will write our first script in SQL. We will see how to connect a Windmill instance to an external SQL service and then send queries to the database using Windmill Scripts.

@@ -344,6 +344,10 @@ Here's a step-by-step guide on where to find each detail.

 You can directly "Test connection" if needed.

+### DuckDB
+
+DuckDB scripts run in-memory out of the box.
+
 ## Create script

 Next, let's create a script that will use the newly created Resource. From the Home page,
@@ -517,6 +521,30 @@ UPDATE demo SET col2 = :name3 WHERE col2 = :name2;

 "name1", "name2", "name3" being the names of the arguments, and "default arg" the optional default value.

+### DuckDB
+
+DuckDB arguments need to be passed in the following format:
+```sql
+-- $name1 (text) = default arg
+-- $name2 (int)
+INSERT INTO demo VALUES ($name1, $name2)
+```
+"name1", "name2" being the names of the arguments, and "default arg" the optional default value.
+
+You can pass a file on S3 as an argument of type s3object. This will automatically set up httpfs with the S3 credentials of the corresponding storage.
+You can then query this file using the standard read_csv/read_parquet/read_json functions:
+```sql
+-- $file (s3object)
+SELECT * FROM read_parquet($file)
+```
+
+You can also attach to other database resources (BigQuery, PostgreSQL and MySQL). We use the official and community DuckDB extensions under the hood:
+```sql
+ATTACH '$res:u/demo/amazed_postgresql' AS db (TYPE postgres);
+SELECT * FROM db.public.friends;
+```
+
 Database resource can be specified from the UI or directly within the script with a line `-- database resource_path`.

 <video
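The changelog entry in this commit advertises attaching database resources "with all CRUD operations"; a sketch of what that permits on the attached database from the example above (table and column names are illustrative):

```sql
ATTACH '$res:u/demo/amazed_postgresql' AS db (TYPE postgres);

-- writes go through the attachment just like reads
INSERT INTO db.public.friends (name) VALUES ('Alice');
UPDATE db.public.friends SET name = 'Bob' WHERE name = 'Alice';
DELETE FROM db.public.friends WHERE name = 'Bob';
SELECT * FROM db.public.friends;
```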

docs/integrations/0_integrations_on_windmill.mdx

Lines changed: 1 addition & 0 deletions
@@ -108,6 +108,7 @@ On [self-hosted instances](../advanced/1_self_host/index.mdx), integrating OAuth
 | [Cloudflare R2](./cloudflare-r2.mdx) | Cloud object storage service for data-intensive applications |
 | [Datadog](./datadog.md) | Monitoring and analytics platform for cloud-scale infrastructure and applications |
 | [Discord](./discord.md) | Voice, video, and text communication platform for gamers |
+| [DuckDB](./duckdb.md) | Open-source, in-process SQL OLAP database management system |
 | [FaunaDB](./faunadb.md) | Serverless, document-oriented database for modern applications |
 | [Funkwhale](./funkwhale.md) | Open-source music streaming and sharing platform |
 | [Git repository](./git_repository.mdx) | Remote git repository for distributed version control systems |

docs/integrations/duckdb.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# DuckDB integration
+
+[DuckDB](https://duckdb.org/) is an open-source, in-process SQL OLAP database management system designed for fast analytical query workloads.
+
+Windmill supports seamless integration with DuckDB, allowing you to manipulate data from S3 (CSV, Parquet, JSON), BigQuery, PostgreSQL, and MySQL.
+
+![Integration between DuckDB and Windmill](../assets/integrations/duckdb.png 'Run a DuckDB script with Windmill')
+
+To get started, check out the [SQL Getting Started section](/docs/getting_started/scripts_quickstart/sql#duckdb-1).
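Since the same script can open S3 files and attached databases, the two sources can be combined in one query; a sketch reusing the resource path and argument style from the quickstart, with illustrative column names:

```sql
-- $file (s3object)
ATTACH '$res:u/demo/amazed_postgresql' AS db (TYPE postgres);

-- join an S3 parquet file against an attached Postgres table
SELECT p.name, f.*
FROM read_parquet($file) AS f
JOIN db.public.friends AS p ON p.id = f.friend_id;
```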

sidebars.js

Lines changed: 5 additions & 0 deletions
@@ -429,6 +429,11 @@ const sidebars = {
 			id: 'integrations/discord',
 			label: 'Discord'
 		},
+		{
+			type: 'doc',
+			id: 'integrations/duckdb',
+			label: 'DuckDB'
+		},
 		{
 			type: 'doc',
 			id: 'integrations/faunadb',

src/landing/IntergrationList.jsx

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ const integrations = [
 	{ name: 'Cloudflare-r2', src: 'third_party_logos/cloudflare.svg' },
 	{ name: 'Datadog', src: 'third_party_logos/datadog.svg' },
 	{ name: 'Discord', src: 'third_party_logos/discord.svg' },
+	{ name: 'DuckDB', src: 'third_party_logos/duckdb.svg' },
 	{ name: 'FaunaDB', src: 'third_party_logos/faunadb.svg' },
 	{ name: 'Funkwhale', src: 'third_party_logos/funkwhale.svg' },
 	{ name: 'Gcal', src: 'third_party_logos/gcal.svg' },

static/third_party_logos/duckdb.svg

Lines changed: 14 additions & 0 deletions
