Description
Is your feature request related to a problem?
Consider the following script:
import ibis

conn = ibis.duckdb.connect("mydb.duckdb")
if "my_table" not in conn.tables:
    conn.create_table("my_table", schema={"c": "int64"}, overwrite=False)


def get_new_data():
    t = ibis.memtable([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
    t = t.select(c=t.a + t.b)
    print(t._find_backends())
    # ([], False)
    t = t.cache()
    print(t._find_backends())
    # ([<ibis.backends.duckdb.Backend object at 0x11df823d0>], False)
    t = t.mutate(c=t.c + 1)
    return t


def ingest(conn, new):
    print(conn)
    # <ibis.backends.duckdb.Backend object at 0x11fb4ef10>
    print(f"adding {new.count().execute()} rows to new_table")
    already_there = new.semi_join(conn.table("my_table"), "c")
    print(f"skipping {already_there.count().execute()} rows already in my_table")
    # IbisError: Multiple backends found for this expression
    return conn.insert("my_table", new)
    # Catalog Error: Table with name ibis_cached_ao56qedyyva3hjxvhrew7g3utu does not exist!


ingest(conn, get_new_data())
A freshly created memtable has no backend. But as soon as you call .cache() on it, the result ends up bound to the default backend. This is a problem when I am trying to make this memtable interact with a backend that is not the default.
I have been sporadically encountering this issue for literally 2 years, and I am only now realizing what the root cause is. It was so hard to track down because it's such spooky action at a distance: adding a .cache() on some distant line of code made the error pop up only much later in the script.
What is the motivation behind your request?
The workaround is to never .cache() any memtables. But some of the intermediate computations I am doing are expensive, so I really do want to be able to cache them in the middle of the computation chain.
Describe the solution you'd like
A few ideas, none of which I love:
1. Optional backend param to memtable
Add a backend=None param to ibis.memtable. Then, whenever a subsequent .cache() happens, the expression uses this backend. Unfortunately, this means that if you do mt = ibis.memtable(..., backend=conn), then mt is forever compatible only with conn, and you can't use it with conn2. That is a little counterintuitive to my mental model of a memtable, which is an ibis table that is in-memory and thus works with any backend.
2. Optional backend param to .cache()
Really, the time at which we need to decide on a backend for a memtable is only when we start running computations. We ideally shouldn't need to specify the backend at expression creation time. So, perhaps an API is Table.cache(backend=None). But then this is awkward, because you could do backend1.table("t").cache(backend=backend2), which should be illegal. It would be better if our API made this impossible to do.
3. Add Backend.cache() method
This keeps the API of ibis.memtable and Table.cache from needing the backend param, which is nice.
4. On memtable.cache(), you get another backend-agnostic memtable
It would use the default backend (usually duckdb) to compute the result, but then return a special Op that, when required, goes to that backend, calls e.g. .to_pyarrow() on the result, and hands you back a new ibis.memtable built from it, i.e. equivalent to ibis.memtable(t.to_pyarrow(), schema=t.schema()). Ideally this would be lazy, so that the memtable is only materialized on demand when crossing the boundary between two backends.
The most complex solution, but the best user API.
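To make the intended behavior of option 4 a bit more concrete, here is a toy sketch in plain Python. It does not use ibis at all: ToyBackend, LazyCachedMemtable, resolve and compute_calls are made-up names for illustration, and a real implementation would presumably live in an Op and in the backends' memtable registration rather than in a class like this. The only point is the semantics: compute once in the default backend, and copy the result into another backend only when an expression actually crosses into it.

```python
class ToyBackend:
    """Stand-in for a database connection (hypothetical, not an ibis class)."""

    def __init__(self, name):
        self.name = name
        self.tables = {}  # table name -> list of row dicts


class LazyCachedMemtable:
    """Hypothetical result of memtable.cache(): computed once, bound to no backend."""

    def __init__(self, compute, home, table_name="cached"):
        self._compute = compute        # zero-argument callable that produces the rows
        self._home = home              # default backend that runs the computation
        self._table_name = table_name
        self.compute_calls = 0         # exposed only so the example can be checked

    def _ensure_cached(self):
        # Run the expensive computation at most once, inside the home backend.
        if self._table_name not in self._home.tables:
            self.compute_calls += 1
            self._home.tables[self._table_name] = list(self._compute())
        return self._home.tables[self._table_name]

    def resolve(self, backend):
        """Return a table name valid in `backend`, copying rows on first use there."""
        rows = self._ensure_cached()
        if self._table_name not in backend.tables:
            # Crossing the backend boundary: ship the materialized rows over.
            backend.tables[self._table_name] = list(rows)
        return self._table_name


default = ToyBackend("default")
other = ToyBackend("other")

# The lambda stands in for the expensive intermediate computation.
cached = LazyCachedMemtable(lambda: [{"c": 3}, {"c": 7}], home=default)

cached.resolve(other)    # computes in `default`, then copies into `other`
cached.resolve(other)    # already present in `other`: no recompute, no copy
cached.resolve(default)  # already present in `default`: nothing to do
```

In this model the caller never names a backend at cache time, which is the property I care about; the cost is that the boundary crossing has to be detected somewhere at compile or execute time.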
Out of all these, I really want to avoid the middle two options, because I want to make my expensive_computation function backend-agnostic. I don't want to have to worry about which backend the given table is in, and both of those options require the user to choose a backend at cache time.
What version of ibis are you running?
main
What backend(s) are you using, if any?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct