Replies: 4 comments 2 replies
-
@Hylianist can you share the notebook? Do you still see the issue if you change the SQL output type from …
-
Apologies for the delay in responding. I have iterated on my testing and have come up with the attached file. I am already using native mode. If the attached file is run interactively using marimo in the browser, the final print statement takes around 20 seconds to run; if I run the same file using … Changing from native back to auto, and trying lazy Polars and pandas, all took 20 seconds in the browser and 3 seconds in the console.

I'm using the latest version of marimo, the latest version of Python, and Firefox, running in a podman Arch Linux container on a Linux host. I'm not an expert with this sort of thing, but this appears to show that running it interactively in the browser is what adds the significant slowdown. The volume of output is minimal once it has been processed by DuckDB, so I can't understand such different timings. Is there a way for me to extract debug logs from the marimo environment?

My raw data is needed to reproduce the issue, but it is 2.5 GB in size. I can try uploading it somewhere if that would be helpful.
-
Yes, I have restarted the notebook many times since setting it to native. Thank you for the code; I ran it, and although it took 18 seconds on my computer, there was at least parity between the notebook and running directly in Python. I will iterate some more and see if I can get closer to the root cause.
-
I have tried a number of things and appear to have finally found the issue. I had been consistently getting 20-second run times in the notebook, but when I created an empty DuckDB database file and used that (as opposed to the default in-memory DuckDB instance), the runtime went down to 2 seconds, consistent with running in the terminal. I don't know why there is such a difference, as nothing is actually written to that empty database, but I at least have a workaround. Edit: in fact, just explicitly creating an in-memory …
-
I'm not used to using notebooks, DuckDB, or asking questions, so please bear with me, and let me know if there is anything more I need to do or provide to resolve this.
As a very basic test of marimo, I have four parquet files that together contain 1 billion rows; they are my version of the 1 billion row challenge (just a string for a weather station name and a double for a temperature). [I know the choice to use double is questionable]
Just using DuckDB from the command line, I have tried:

```shell
time duckdb -c "SELECT random(), location, min(measurement) AS min, max(measurement) AS max, avg(measurement) AS average FROM read_parquet('measurements/*') GROUP BY location LIMIT 10"
```

It completed in about two seconds:

```
real    0m1.851s
user    0m26.020s
sys     0m1.011s
```
I then tried the same query using marimo, exactly the same SQL as above (after a cell importing marimo). The SQL cell completed in 9 ms (which I assume is just the sending of the command to DuckDB?), but the output itself above the cell took 20 seconds to appear (I timed it on my phone).
I do not understand why there is such a difference, as my understanding was that under the hood these two should be equivalent (I'm using native mode), with marimo taking only a fraction longer to display the 10 rows.
Is this just a misunderstanding on my part (if so, please explain)? Or, if the two should be equivalent, how can I help explore further?
[The random() in the SQL was just to assist me with identifying when the table in Marimo updated]