feat: improve API for choosing a backend on memtable.cache() #10942

Open
@NickCrews

Description

Is your feature request related to a problem?

Consider the following script:

import ibis

conn = ibis.duckdb.connect("mydb.duckdb")
if "my_table" not in conn.tables:
    conn.create_table("my_table", schema={"c": "int64"}, overwrite=False)


def get_new_data():
    t = ibis.memtable([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
    t = t.select(c=t.a + t.b)
    print(t._find_backends())
    # ([], False)
    t = t.cache()
    print(t._find_backends())
    # ([<ibis.backends.duckdb.Backend object at 0x11df823d0>], False)
    t = t.mutate(c=t.c + 1)
    return t

def ingest(conn, new):
    print(conn)
    # <ibis.backends.duckdb.Backend object at 0x11fb4ef10>
    print(f"adding {new.count().execute()} rows to my_table")
    already_there = new.semi_join(conn.table("my_table"), "c")
    print(f"skipping {already_there.count().execute()} rows already in my_table")
    # IbisError: Multiple backends found for this expression
    return conn.insert("my_table", new)
    # Catalog Error: Table with name ibis_cached_ao56qedyyva3hjxvhrew7g3utu does not exist!

ingest(conn, get_new_data())

A freshly created memtable has no backend. But as soon as you .cache() it, it ends up bound to the default backend. This is a problem when I am trying to make this memtable interact with a backend that is not the default.

I have actually been encountering this issue sporadically for literally 2 years, and I am only now realizing what the root cause is. It was so hard to track down because it's such spooky action at a distance: adding the .cache() to some distant line of code made the error pop up only much later in the script.

What is the motivation behind your request?

The workaround is to never .cache() any memtables. But some of the intermediate computations I am doing are expensive, so I really do want to be able to cache them in the middle of the computation chain.

Describe the solution you'd like

A few ideas, none of which I love:

1. Optional backend param to memtable

Add a backend=None param to ibis.memtable. Any subsequent .cache() on the expression would then use this backend. Unfortunately, this means that after mt = ibis.memtable(..., backend=conn), mt is forever compatible only with conn and can't be used with conn2. That is a little counterintuitive to my mental model of a memtable: an ibis table that lives in memory and thus works with any backend.

2. Optional backend param to .cache()

Really, the only time we need to decide on a backend for a memtable is when we start running computations. Ideally we shouldn't need to specify the backend at expression creation time. So perhaps the API could be Table.cache(backend=None). But this is awkward, because you could then write backend1.table("t").cache(backend=backend2), which should be illegal. It would be better if our API made this impossible to express.
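The awkward case could at least be rejected at runtime. The following is a plain-Python toy, not the ibis API: `ToyTable`, `bound_to`, and this `cache` signature are hypothetical names made up for illustration, and backends are modeled as plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTable:
    """Hypothetical stand-in for a table expression (not an ibis class)."""

    name: str
    bound_to: Optional[str] = None  # None models a backend-less memtable

    def cache(self, backend: Optional[str] = None, default: str = "duckdb") -> "ToyTable":
        if self.bound_to is not None:
            # Already tied to a backend: asking for a different one is an error.
            if backend is not None and backend != self.bound_to:
                raise ValueError(
                    f"{self.name} lives in {self.bound_to!r}; cannot cache in {backend!r}"
                )
            target = self.bound_to
        else:
            # Pure memtable: honor the caller's choice, else use the default.
            target = backend if backend is not None else default
        return ToyTable(f"cached_{self.name}", bound_to=target)


memtable = ToyTable("mt")
in_default = memtable.cache()            # falls back to the default backend
in_pg = memtable.cache(backend="pg")     # caller picks the backend
```

This only moves the failure from query time to cache time. It does not make the invalid call impossible to write, which is the drawback described above.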

3. Add Backend.cache() method

This keeps the API of ibis.memtable and Table.cache from needing the backend param, which is nice.

4. On memtable.cache(), you get another backend-agnostic memtable

It would use the default backend (usually duckdb) to compute the result, but then return a special Op that, when required, actually goes to the backend, calls e.g. .to_pyarrow() on the result, and hands you back a new ibis.memtable built from that.
In other words, it would be equivalent to ibis.memtable(t.to_pyarrow(), schema=t.schema()). Ideally this would be lazy, materializing the memtable only on demand when crossing the boundary between two backends.
This is the most complex solution, but it gives the best user API.
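To make the lazy boundary-crossing idea concrete, here is a toy model in plain Python. It is not ibis code and says nothing about how ibis would implement this: `ToyBackend`, `LazyCachedTable`, `resolve_for`, and `cache` are hypothetical names, and the sketch assumes only that a backend can store named tables and hand their rows back.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # generates unique names for cached tables


@dataclass(eq=False)  # eq=False keeps identity-based hashing
class ToyBackend:
    """Hypothetical stand-in for a backend connection: a bag of named tables."""

    name: str
    tables: dict = field(default_factory=dict)

    def store(self, rows):
        key = f"cached_{next(_ids)}"
        self.tables[key] = list(rows)
        return key


@dataclass
class LazyCachedTable:
    """Handle returned by a hypothetical backend-agnostic cache()."""

    home: ToyBackend
    key: str
    exports: int = 0  # number of times rows were copied out of `home`
    _copies: dict = field(default_factory=dict)  # target backend -> table name

    def resolve_for(self, target):
        """Return the name of a table in `target` that holds the cached rows."""
        if target is self.home:
            return self.key  # same backend: reuse the cached table, no copy
        if target not in self._copies:
            # Crossing a backend boundary: materialize once, then remember it.
            self.exports += 1
            self._copies[target] = target.store(self.home.tables[self.key])
        return self._copies[target]


def cache(rows, default_backend):
    """Compute eagerly on the default backend and return a re-homable handle."""
    return LazyCachedTable(default_backend, default_backend.store(rows))


default = ToyBackend("default")
mydb = ToyBackend("mydb")
cached = cache([{"c": 3}, {"c": 7}], default)

same = cached.resolve_for(default)   # no copy needed
moved = cached.resolve_for(mydb)     # copied into mydb on first use
moved_again = cached.resolve_for(mydb)  # reuses the earlier copy
```

The point of the sketch is that the caller of `cache` never names a backend; the copy happens only when a query against a different backend first needs the rows.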

Of all these, I really want to avoid the middle two options (2 and 3), because I want to make my expensive_computation function backend-agnostic. I don't want to have to worry about which backend the given table is in, and both of those options require the user to choose a backend at cache time.

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

Metadata

Assignees: No one assigned
Labels: feature (Features or general enhancements)
Status: backlog