DuckDB #955

Merged
merged 4 commits on May 27, 2025
Binary file added changelog/2025-05-22-duckdb/duckdb.png
12 changes: 12 additions & 0 deletions changelog/2025-05-22-duckdb/index.md
@@ -0,0 +1,12 @@
---
slug: duckdb
version: v1.493.0
title: DuckDB
tags: ['scripts', 'storage']
description: You can run DuckDB scripts in-memory, with access to S3 objects and other database resources. You no longer need a scripting language for your ETL pipelines with DuckDB/Polars; you can do it entirely in SQL
features:
- S3 object integration
- Attach to BigQuery, Postgres and MySQL database resources with all CRUD operations
image: ./duckdb.png
docs: /docs/getting_started/scripts_quickstart/sql#duckdb-1
---
Binary file added docs/assets/integrations/duckdb.png
23 changes: 22 additions & 1 deletion docs/core_concepts/11_persistent_storage/large_data_files.mdx
@@ -31,6 +31,27 @@ Windmill S3 bucket browser will not work for buckets containing more than 20 fil
ETLs can be easily implemented in Windmill using its integration with Polars and DuckDB, which facilitates working with tabular data. In this case, you don't need to manually interact with the S3 bucket; Polars/DuckDB does it natively and efficiently. Reading and writing datasets to S3 can be done seamlessly.

<Tabs className="unique-tabs">
<TabItem value="duckdb-script" label="DuckDB" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```sql
-- $file1 (s3object)

-- Run queries directly on an S3 parquet file passed as an argument
SELECT * FROM read_parquet($file1);

-- Or using an explicit path in a workspace storage
SELECT * FROM read_json('s3:///demo/data.json');

-- You can also specify a secondary workspace storage
SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');

-- Write the result of a query to a different parquet file on S3
COPY (
SELECT COUNT(*) FROM read_parquet($file1)
) TO 's3:///demo/output.pq' (FORMAT 'parquet');
```

</TabItem>
<TabItem value="polars" label="Polars" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```python
@@ -77,7 +98,7 @@ def main(input_file: S3Object):
```

</TabItem>
<TabItem value="duckdb" label="DuckDB" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
<TabItem value="duckdb" label="DuckDB (Python)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```python
#requirements:
40 changes: 35 additions & 5 deletions docs/core_concepts/27_data_pipelines/index.mdx
@@ -168,7 +168,7 @@ def main(input_file: S3Object):
```

</TabItem>
<TabItem value="duckdb (AWS S3)" label="DuckDB (AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
<TabItem value="duckdb (Python / AWS S3)" label="DuckDB (Python / AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```python
import wmill
@@ -221,7 +221,7 @@ def main(input_file: S3Object):
```

</TabItem>
<TabItem value="duckdb (Azure Blob Storage)" label="DuckDB (Azure Blob Storage)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>
<TabItem value="duckdb (Python / Azure Blob Storage)" label="DuckDB (Python / Azure Blob Storage)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```python
import wmill
@@ -241,7 +241,7 @@ def main(input_file: S3Object):
# create a DuckDB database in memory
# see https://duckdb.org/docs/api/python/dbapi
conn = duckdb.connect()

# connect duck db to the S3 bucket - this will default to the workspace S3 resource
conn.execute(connection_str)

@@ -259,13 +259,34 @@ def main(input_file: S3Object):

# NOTE: DuckDB doesn't support writing to Azure Blob Storage as of Jan 30 2025
# Write the result of a query to a different parquet file on Azure Blob Storage
# using Polars
storage_options = wmill.polars_connection_settings().storage_options
query_result.pl().write_parquet(output_uri, storage_options=storage_options)
conn.close()
return S3Object(s3=output_file)
```

</TabItem>
<TabItem value="duckdb" label="DuckDB (AWS S3)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```sql
-- $file1 (s3object)

-- Run queries directly on an S3 parquet file passed as an argument
SELECT * FROM read_parquet($file1);

-- Or using an explicit path in a workspace storage
SELECT * FROM read_json('s3:///demo/data.json');

-- You can also specify a secondary workspace storage
SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');

-- Write the result of a query to a different parquet file on S3
COPY (
SELECT COUNT(*) FROM read_parquet($file1)
) TO 's3:///demo/output.pq' (FORMAT 'parquet');
```

</TabItem>
</Tabs>

@@ -283,7 +304,16 @@ With S3 as the external store, a transformation script in a flow will typically
2. Running some computation on the data.
3. Storing the result back to S3 for the next scripts to be run.

Windmill SDKs now expose helpers to simplify code and help you connect Polars or DuckDB to the Windmill workspace S3 bucket. In your usual IDE, you would need to write for _each script_:
When running a DuckDB script, Windmill automatically handles the connection to your workspace storage:

```sql
-- This queries the Windmill API under the hood to figure out the
-- correct connection string
SELECT * FROM read_parquet('s3:///path/to/file.parquet');
SELECT * FROM read_csv('s3://secondary_storage/path/to/file.csv');
```

If you want to use a scripting language, Windmill SDKs now expose helpers to simplify code and help you connect Polars or DuckDB to the Windmill workspace S3 bucket. In your usual IDE, you would need to write for _each script_:

```python
conn = duckdb.connect()
# ...
```
@@ -7,7 +7,7 @@ import DocCard from '@site/src/components/DocCard';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# PostgreSQL, MySQL, MS SQL, BigQuery, Snowflake, Redshift, Oracle
# PostgreSQL, MySQL, MS SQL, BigQuery, Snowflake, Redshift, Oracle, DuckDB

In this quick start guide, we will write our first script in SQL. We will see how to connect a Windmill instance to an external SQL service and then send queries to the database using Windmill Scripts.

@@ -344,6 +344,10 @@ Here's a step-by-step guide on where to find each detail.

You can directly "Test connection" if needed.

### DuckDB

DuckDB scripts run in-memory out of the box.
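For example, a minimal sketch that needs no resource or connection setup (the query is purely illustrative):

```sql
-- Runs against an in-memory DuckDB database; no resource required
SELECT 42 AS answer;
```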

## Create script

Next, let's create a script that will use the newly created Resource. From the Home page,
@@ -517,6 +521,30 @@ UPDATE demo SET col2 = :name3 WHERE col2 = :name2;

"name1", "name2", "name3" being the names of the arguments, and "default arg" the optional default value.

### DuckDB

DuckDB arguments need to be passed in the following format:
```sql
-- $name1 (text) = default arg
-- $name2 (int)
INSERT INTO demo VALUES ($name1, $name2)
```
"name1", "name2" being the names of the arguments, and "default arg" the optional default value.

You can pass a file on S3 as an argument of type s3object. This will automatically set up httpfs with the S3 credentials of the corresponding storage.
You can then query this file using the standard read_csv/read_parquet/read_json functions:
```sql
-- $file (s3object)
SELECT * FROM read_parquet($file)
```

You can also attach other database resources (BigQuery, PostgreSQL and MySQL). We use the official and community DuckDB extensions under the hood:
```sql
ATTACH '$res:u/demo/amazed_postgresql' AS db (TYPE postgres);
SELECT * FROM db.public.friends;
```
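All CRUD operations are supported on attached resources. A minimal sketch of writes, assuming the same attached `db` and a hypothetical `name` column on the `friends` table:

```sql
-- Writes go through the DuckDB postgres extension to the attached database
INSERT INTO db.public.friends (name) VALUES ('Alice');
UPDATE db.public.friends SET name = 'Bob' WHERE name = 'Alice';
DELETE FROM db.public.friends WHERE name = 'Bob';
```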


A database resource can be specified from the UI or directly within the script with a line `-- database resource_path`.
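For instance, a sketch with a hypothetical resource path, assuming the annotation selects the default database for the script's queries:

```sql
-- database u/demo/amazed_postgresql
SELECT * FROM public.friends;
```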

<video
1 change: 1 addition & 0 deletions docs/integrations/0_integrations_on_windmill.mdx
@@ -108,6 +108,7 @@ On [self-hosted instances](../advanced/1_self_host/index.mdx), integrating OAuth
| [Cloudflare R2](./cloudflare-r2.mdx) | Cloud object storage service for data-intensive applications |
| [Datadog](./datadog.md) | Monitoring and analytics platform for cloud-scale infrastructure and applications |
| [Discord](./discord.md) | Voice, video, and text communication platform for gamers |
| [DuckDB](./duckdb.md) | Open-source, in-process SQL OLAP database management system |
| [FaunaDB](./faunadb.md) | Serverless, document-oriented database for modern applications |
| [Funkwhale](./funkwhale.md) | Open-source music streaming and sharing platform |
| [Git repository](./git_repository.mdx) | Remote git repository for distributed version control systems |
9 changes: 9 additions & 0 deletions docs/integrations/duckdb.md
@@ -0,0 +1,9 @@
# DuckDB integration

[DuckDB](https://duckdb.org/) is an open-source, in-process SQL OLAP database management system designed for fast analytical query workloads.

Windmill supports seamless integration with DuckDB, allowing you to manipulate data from S3 (CSV, Parquet, JSON), BigQuery, PostgreSQL, and MySQL.
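As a quick sketch of what this enables (the paths, table, and column names are hypothetical; the syntax follows the quickstart linked below):

```sql
-- Join a Parquet file on S3 against a table in an attached PostgreSQL resource
ATTACH '$res:u/demo/amazed_postgresql' AS db (TYPE postgres);
SELECT f.name, COUNT(*) AS n
FROM read_parquet('s3:///demo/data.parquet') p
JOIN db.public.friends f ON f.id = p.friend_id
GROUP BY f.name;
```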

![Integration between DuckDB and Windmill](../assets/integrations/duckdb.png 'Run a DuckDB script with Windmill')

To get started, check out the [SQL Getting Started section](/docs/getting_started/scripts_quickstart/sql#duckdb-1).
5 changes: 5 additions & 0 deletions sidebars.js
@@ -429,6 +429,11 @@ const sidebars = {
id: 'integrations/discord',
label: 'Discord'
},
{
type: 'doc',
id: 'integrations/duckdb',
label: 'DuckDB'
},
{
type: 'doc',
id: 'integrations/faunadb',
1 change: 1 addition & 0 deletions src/landing/IntergrationList.jsx
@@ -32,6 +32,7 @@ const integrations = [
{ name: 'Cloudflare-r2', src: 'third_party_logos/cloudflare.svg' },
{ name: 'Datadog', src: 'third_party_logos/datadog.svg' },
{ name: 'Discord', src: 'third_party_logos/discord.svg' },
{ name: 'DuckDB', src: 'third_party_logos/duckdb.svg' },
{ name: 'FaunaDB', src: 'third_party_logos/faunadb.svg' },
{ name: 'Funkwhale', src: 'third_party_logos/funkwhale.svg' },
{ name: 'Gcal', src: 'third_party_logos/gcal.svg' },
14 changes: 14 additions & 0 deletions static/third_party_logos/duckdb.svg