Replies: 4 comments 2 replies
-
@Hylianist can you share the notebook? Do you still see the issue if you change the SQL output type from …
-
Apologies for the delay in responding. I have iterated on my testing and have come up with the attached file. I am already using native mode. If the attached file is run interactively using marimo in the browser, the final print statement takes around 20 seconds to run; if I run the same file using … Changing from native back to auto, and trying lazy Polars and pandas, all took 20 seconds in the browser and 3 seconds in the console.

I'm using the latest version of marimo, the latest version of Python, and Firefox, running in a podman Arch Linux container on a Linux host. I'm not an expert with this sort of thing, but this appears to show that running it interactively in the browser is what adds the significant slowdown. The volume of output is minimal once it has been processed by DuckDB, so I can't understand such different timings. Is there a way for me to extract debug logs from the marimo environment?

My raw data is needed to reproduce the issue, but it is 2.5 GB in size. I can try uploading it somewhere if that would be helpful.
-
Yes, I have restarted the notebook many times since setting it to native. Thank you for the code; I ran it, and although it took 18 seconds on my computer, there was at least parity between the notebook and running directly in Python. I will iterate some more and see if I can get closer to the root cause.
-
I have tried a number of things and appear to have finally found the issue. I had been consistently getting 20-second run times in the notebook, but when I created an empty DuckDB database file and used that (as opposed to the default in-memory DuckDB instance), the runtime went down to 2 seconds, consistent with running in the terminal. I don't know why there is such a difference, as nothing is actually written to that empty database, but I at least have a workaround. Edit: in fact, just explicitly creating an in-memory …
-
I'm not used to using notebooks, DuckDB, or asking questions, so please bear with me, and let me know if there is anything more I need to do or provide to resolve this.
As a very basic test of marimo, I have four parquet files that together contain 1 billion rows; they are my version of the 1 billion row challenge (just a string for a weather station name and a double for a temperature). [I know the choice to use double is questionable]
Just using DuckDB from the command line, I have tried:

```shell
time duckdb -c "SELECT random(), location, min(measurement) AS min, max(measurement) AS max, avg(measurement) AS average FROM read_parquet('measurements/*') GROUP BY location LIMIT 10"
```

It completed in about two seconds:

```
real    0m1.851s
user    0m26.020s
sys     0m1.011s
```
I then tried the same query using marimo, exactly the same SQL as above (after a cell importing marimo). The SQL cell completed in 9 ms (which I assume is just the sending of the command to DuckDB?), but the output itself above the cell took 20 seconds to appear (I timed it on my phone).
I do not understand why there is such a difference, as my understanding was that under the hood these two should be equivalent (I'm using native mode), with marimo taking only a fraction longer to display the 10 rows.
Is this just a misunderstanding on my part (if so, please explain)? Or, if the two should be equivalent, how can I help explore further?
[The random() in the SQL was just to assist me with identifying when the table in Marimo updated]