Add DuckDB Dialect Support #738

Closed
wants to merge 24 commits into from

Conversation

hughcameron

This pull request adds support for the DuckDB SQL dialect to the SQL Formatter library.

Description:

  • Extends the supported dialects to include DuckDB.
  • Updates the documentation (README.md and potentially others) to reflect the addition of DuckDB support.
  • Includes any necessary tests to ensure proper formatting for DuckDB queries.

Benefits:

  • Users working with DuckDB can now leverage the SQL Formatter library for consistent and readable SQL code.
  • Enhances the overall library coverage by including a popular in-memory database.

Testing:

  • A test suite is included to verify accurate formatting for various DuckDB queries.

Please review the changes and provide feedback.

codesandbox-ci bot commented Apr 30, 2024

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

@nene
Collaborator

nene commented Apr 30, 2024

Thanks for the PR. A few quick questions and thoughts:

  • What's the relationship between DuckDB and PostgreSQL? I saw in the DuckDB docs that it uses the PostgreSQL parser. But does it actually support all of the PostgreSQL syntax? For example, does it support all the operators you have listed?
  • You probably have listed too many keywords. Best to only include reserved keywords. Otherwise you'll have common field names like id, name, location, type detected as keywords and converted to uppercase when using keywordCase: "upper".
  • Your data types list seems to be missing some basic stuff like INT, CHAR, ARRAY.
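
To illustrate the second point above, here is a minimal sketch of how an over-broad keyword list affects casing. It is a toy, not SQL Formatter's implementation: `uppercaseKeywords` is a hypothetical helper, and the keyword lists are made up for the example.

```typescript
// Toy illustration only; this is not how SQL Formatter is implemented.
// uppercaseKeywords is a hypothetical helper that uppercases every word
// found in the given keyword list and leaves all other text untouched.
function uppercaseKeywords(sql: string, keywords: string[]): string {
  const set = new Set(keywords.map((k) => k.toUpperCase()));
  return sql.replace(/[A-Za-z_][A-Za-z0-9_]*/g, (word) =>
    set.has(word.toUpperCase()) ? word.toUpperCase() : word
  );
}

const query = 'select id, name, type from users';

// Reserved keywords only: column names keep their original case.
const reservedOnly = uppercaseKeywords(query, ['SELECT', 'FROM']);

// Over-broad list: ordinary column names get uppercased as well.
const overBroad = uppercaseKeywords(query, ['SELECT', 'FROM', 'NAME', 'TYPE']);

console.log(reservedOnly); // SELECT id, name, type FROM users
console.log(overBroad); // SELECT id, NAME, TYPE FROM users
```

With the narrower list, only `select` and `from` change; with the broader one, the `name` and `type` columns are rewritten too, which is the behaviour the comment warns about.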

I'm pretty busy this week... not sure how much time I have to properly review this.

@nene
Collaborator

nene commented Apr 30, 2024

For bonus points, you can update the wiki with information about DuckDB. That will also make it easier for me to review this. Otherwise I'll have to go and figure out all of this about DuckDB by myself.

@PMassicotte

I was about to submit a PR as well. I'll leave some comments on your code.

@PMassicotte

@hughcameron
Author

Thanks for the comments above. The errors from the test suite are now down to five:

DuckDBFormatter
    ✕ supports ARRAY[] literals (2 ms)
    ✕ dataTypeCase option does NOT affect ARRAY[] literal case
    ✕ keywordCase option affects ARRAY[] literal case
    ✕ dataTypeCase option affects ARRAY type case (2 ms)
    ✕ supports array slice operator

@nene - I'll look into filling out the wiki 👍

@nene
Collaborator

nene commented May 2, 2024

To fix ARRAY[] tests:

  • add ARRAY to dataTypes and remove it from keywords and functions lists.

To fix the bar[1:] test:

  • remove BAR from list of functions

I would guess the builtin BAR() function is rarely used. Well... at least I failed to find documentation of it, because it's super hard to google as Postgres docs contain loads of foo and bar in example code. I think it's better for the formatter to also support the more common use case of bar as a name used in example code.

PS. Make sure to run yarn pretty. (Looks like that currently also changes the .pre-commit-hooks.yaml file... you can let Prettier reformat that, or leave it as-is. Either way is fine.)

@hughcameron
Author

The test suite is passing completely now 🎉

I'll collate some notes for the wiki over the next week. Are there any other steps needed before merging?

@nene
Collaborator

nene commented May 3, 2024

Thanks. I don't think there's anything else.

Collaborator

@nene nene left a comment

Sorry for the delay...

Took a bit of a look at this and noticed that several things (most notably all these operators) are not actually supported by DuckDB.

Also, when I simply compare it to the PostgreSQL implementation, it looks almost the same (ignoring the keyword and function name lists). But my very brief scan of the DuckDB documentation revealed several things that are different in DuckDB.

'EXPLAIN',
'FETCH',
'GRANT',
'INSTALL',
Collaborator

Looks like INSTALL is the only name added to this list compared to PostgreSQL. I would suspect there are more differences between the statements supported by PostgreSQL and DuckDB.

Author

Hi @nene - you're correct. DuckDB uses the PostgreSQL parser. That's why DuckDB’s SQL dialect closely follows the conventions of the PostgreSQL dialect with only a few exceptions as listed here.

Many features are unsupported so I've removed unsupported elements in the formatter.

Author

Hey @nene - let me know if it's OK to resolve this conversation.

I've also added some specific DuckDB features below.

For future readers: deviations from the PostgreSQL dialect are now listed here.

@Zank94

Zank94 commented Nov 20, 2024

Hello,
We're working on a project using DuckDB and have a few issues with the standard SQL format (->> replaced with - > > for instance).
Do you know when this PR will be merged?
Thanks

@nene
Collaborator

nene commented Nov 20, 2024

Thanks for the interest. This thing has indeed been sitting here for a while now. I will need to dig back into this to see if there are any reasons why it hasn't been merged yet. I think one of the main reasons was that it seemed really, really similar to PostgreSQL.

So a question to you @Zank94: if you just configure SQL Formatter to treat your SQL as PostgreSQL, would that solve your problems? (For example, the ->> operator is also present in PostgreSQL.)

@Zank94

Zank94 commented Nov 21, 2024

I see, thank you for the advice I will give it a try 👍

@karanpopat

I was using the postgres formatter for DuckDB queries, but it fails when the query contains <<=, which gets formatted to << =
Any update on whether this PR will be merged soon?
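
A minimal sketch of why an unlisted operator gets split, assuming a simple longest-match tokenizer. This is a toy, not SQL Formatter's actual tokenizer; `tokenize` and the operator lists below are hypothetical.

```typescript
// Toy longest-match operator tokenizer. An assumption-laden sketch,
// not SQL Formatter's real tokenizer.
function tokenize(sql: string, operators: string[]): string[] {
  // Try longer operators first so '<<=' wins over '<<' when it is listed.
  const sorted = [...operators].sort((a, b) => b.length - a.length);
  const tokens: string[] = [];
  let i = 0;
  while (i < sql.length) {
    const ch = sql.charAt(i);
    if (/\s/.test(ch)) {
      i++; // skip whitespace between tokens
      continue;
    }
    const op = sorted.find((o) => sql.startsWith(o, i));
    if (op) {
      tokens.push(op);
      i += op.length;
      continue;
    }
    // Otherwise consume a run of word characters as a single token.
    const word = /^\w+/.exec(sql.slice(i));
    if (word) {
      tokens.push(word[0]);
      i += word[0].length;
    } else {
      tokens.push(ch); // unknown single character
      i++;
    }
  }
  return tokens;
}

// Without '<<=' in the list, the longest known match is '<<', then '='.
const withoutOp = tokenize('foo <<= bar', ['<<', '=', '<=']).join(' ');
// With '<<=' listed, it is matched as one token.
const withOp = tokenize('foo <<= bar', ['<<', '=', '<=', '<<=']).join(' ');

console.log(withoutOp); // foo << = bar
console.log(withOp); // foo <<= bar
```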

@nene
Collaborator

nene commented Apr 18, 2025

Well, the thing is that this PR has a load of failing tests. Interestingly, all failing tests were fixed at one point in eb7d9b1, but after that a bunch more changes were added and now it has a total of 92 failing tests. (Don't know why GitHub says that "All checks have passed".)

I personally have no real knowledge of DuckDB and the documentation of DuckDB seems to be lacking. For example I tried to find information about that <<= operator, but was not able to. Really I couldn't even find information about which basic arithmetic operators are supported. The expressions documentation is kinda short. It does mention the == operator, which however this PR does not include, just as it doesn't include the <<= operator.

Initially I had the false impression that DuckDB is pretty much just PostgreSQL with a few minor differences. I have now come to the conclusion that it's more like DuckDB supports some small subset of PostgreSQL syntax (plus some DuckDB-specific additions like CREATE MACRO).

Might be that this whole PR should be started from scratch. I would personally start with filling out the wiki with information about DuckDB. But because of DuckDB's lack of documentation, it seems like an inconvenient task to undertake.

@riziles

riziles commented Apr 18, 2025

@nene , I have to disagree with you here. I think DuckDB has fantastic documentation, but it has a LOT of functionality above and beyond Postgres, so the documentation can be pretty dense for the uninitiated.

@karanpopat , what the heck is the <<= operator? I use DuckDB every day, but I'm not familiar with that one, and I can't find it anywhere in the docs.

@karanpopat

@nene
Collaborator

nene commented Apr 18, 2025

@riziles It might very well be that the documentation is great and I just don't know how to use it. Like this page which at first glance seems to document differences from PostgreSQL, but it's a pretty short page. I guess it actually tries to document the "important" differences for ordinary users.

I think I have now finally found that most of the operators are documented in the Functions section. But not all operators can be found there. For example, the ~ operator isn't mentioned under Regular expression functions, while it is in fact equivalent to regexp_full_match.

@riziles

riziles commented Apr 18, 2025

@karanpopat , that seems like a pretty niche requirement. Can't you just use the "denseOperators" flag?
https://github.com/sql-formatter-org/sql-formatter/blob/master/docs/denseOperators.md

@nene
Collaborator

nene commented Apr 19, 2025

I think the denseOperators flag is really a pretty crappy solution as it forces all operators to be formatted with no spaces around them. I frankly don't know why anyone would ever use it.

One can instead just extend the postgresql formatter configuration with an additional operator. Something like:

import { formatDialect, postgresql } from 'sql-formatter';

// Reuse the PostgreSQL dialect and add the two extra operators.
const duckdb = {
    ...postgresql,
    tokenizerOptions: {
        ...postgresql.tokenizerOptions,
        operators: [...postgresql.tokenizerOptions.operators, '<<=', '>>='],
    },
};

formatDialect('SELECT foo <<= bar FROM tbl', { dialect: duckdb });
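
The nested-spread pattern in the snippet above can be tried in isolation, without the library installed. In this sketch `ToyDialect` and `base` are hypothetical stand-ins; the real `postgresql` export has more fields and its exact shape is assumed here only for illustration.

```typescript
// Hypothetical stand-in for a dialect object. The real `postgresql`
// export from sql-formatter has more fields; this shape is assumed.
interface ToyDialect {
  tokenizerOptions: {
    keywords: string[];
    operators: string[];
  };
}

const base: ToyDialect = {
  tokenizerOptions: {
    keywords: ['SELECT', 'FROM'],
    operators: ['<<', '>>', '->>'],
  },
};

// Copy the base, then override only the nested operators list.
// Both levels need a spread, otherwise the other tokenizer options
// (here: keywords) would be dropped.
const extended: ToyDialect = {
  ...base,
  tokenizerOptions: {
    ...base.tokenizerOptions,
    operators: [...base.tokenizerOptions.operators, '<<=', '>>='],
  },
};

console.log(extended.tokenizerOptions.operators.length); // 5
console.log(base.tokenizerOptions.operators.length); // 3, base is not mutated
```

Building a new array rather than pushing onto the existing one keeps the base configuration unchanged for any other code that still formats plain PostgreSQL.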

@nene
Collaborator

nene commented Apr 19, 2025

I have now dug a bit deeper into the DuckDB syntax, and I think I was misled earlier when I read in the docs that DuckDB uses the PostgreSQL parser. I frankly can't find that in the docs any more; I guess that part was removed. Turns out they instead used a (likely heavily modified) fork of the PostgreSQL parser, and for all I know they might be using a completely custom parser by now.

There are just so many differences in syntax that I think it only makes sense to treat it as a completely different dialect. Some of the most notable DuckDB-specific syntax I've found so far:

  • Prefix aliases: SELECT foo: 1, bar: 2
  • List literals: SELECT [1, 2, 3]
  • Struct literals: SELECT {foo: 1, bar: 2}
  • List slice operator: SELECT list[1:10]
  • POSITIONAL and ASOF joins.
  • Percentage-based limit: SELECT * FROM tbl LIMIT 10%

@riziles

riziles commented Apr 19, 2025

It's definitely based on Postgres syntax, i.e. most vanilla Postgres queries would work fine in DuckDB, but it has capabilities far above and beyond for analytics workflows:
https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/

@nene nene mentioned this pull request Apr 20, 2025
@nene
Collaborator

nene commented Apr 20, 2025

So, I ended up putting this PR aside and creating a new DuckDB configuration from scratch: #857

I ended up using the functions, keywords and data types lists from this PR. Thanks for that @hughcameron.

Also thanks to everybody else who has provided information about DuckDB in this thread.

@nene nene closed this Apr 20, 2025
nene added a commit that referenced this pull request Apr 20, 2025
Created a brand new DuckDB configuration to replace the old #738 pull
request.

This should now support the most important bits of DuckDB. There are
some caveats though:

- No support for the percentage syntax, like `LIMIT 10%` (conflicts with
modulo operator).
- No support for named parameters like `$foo` (conflicts with $$-quoted
strings).
- No support for array-slice operator (conflicts with `:` in struct
literals and prefix aliases).

There's definitely quite a bit more than the above three little things.
But I think at least for a start we have some DuckDB support that should
work for most users.
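
A small sketch of the first caveat above, assuming a tokenizer that classifies each token without looking at its neighbours. `toyTokens` is a hypothetical helper, not part of SQL Formatter.

```typescript
// Toy tokenizer: words/numbers, '*' and '%' become separate tokens.
// Not SQL Formatter's tokenizer; just enough to show the ambiguity.
function toyTokens(sql: string): string[] {
  return sql.match(/\w+|[*%]/g) ?? [];
}

const modulo = toyTokens('SELECT 10 % 3');
const percentLimit = toyTokens('SELECT * FROM tbl LIMIT 10%');

// Both streams contain an identical '%' token right after '10'.
// Only the surrounding context (LIMIT before it, nothing after it)
// tells the percentage apart from the modulo operator.
console.log(modulo.join(' ')); // SELECT 10 % 3
console.log(percentLimit.join(' ')); // SELECT * FROM tbl LIMIT 10 %
```

Looked at on its own, the `%` token is the same in both queries, so a formatter that always puts spaces around operators would print `LIMIT 10 %`.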