Skip to content

🗝️ Add KeyPresent queries#1025

Merged
Kezzsim merged 24 commits intobluesky:mainfrom
Kezzsim:to_be_or_not_to_be
Aug 25, 2025
Merged

🗝️ Add KeyPresent queries#1025
Kezzsim merged 24 commits intobluesky:mainfrom
Kezzsim:to_be_or_not_to_be

Conversation

@Kezzsim
Copy link
Contributor

@Kezzsim Kezzsim commented Jul 21, 2025

Requested by @cjtitus in NSLS-II Ticket SXSS-592.
This attempts to register queries that can tell if a container's metadata has a KEY matching a particular string in it.

  • Register the queries
  • Finalize the business logic to perform the queries
  • Add specific tests for methods
  • Add queries to Tiled Docs

@Kezzsim
Copy link
Contributor Author

Kezzsim commented Jul 28, 2025

This is in a state where it more or less works as expected:

from tiled.client import from_uri
from tiled.queries import KeyIn, KeyNotIn 
c = from_uri('http://localhost:8000', api_key="secret")
# Put some data in the db
c.write_array([37,83], metadata={"name" : "FB326", "start" : { "Detectors" : "none" }})
c.write_array([24,18], metadata={"sample" : "TEST"})
# Demonstrate searching for a record with a surface level key
c.search(KeyIn("name"))
>> <Container {'96f7f6ca-7a9d-4cea-9f05-9b6c661ce82f'}>
# Demonstrate searching for a record with a given NESTED key
c.search(KeyIn("Detectors"))
>> <Container {'96f7f6ca-7a9d-4cea-9f05-9b6c661ce82f'}>
# Search for a nonsense key that is in no records
c.search(KeyIn("skeytty"))
>> <Container {}>
# Find records that DO NOT contain a given (nested) key
c.search(KeyNotIn("Detectors"))
>> <Container {'02a11fff-5b29-4dab-9080-e80084e88702'}> 
# Find all records that don't have a nonsense key (in this case all of them)
c.search(KeyNotIn("Sketty")) 
<Container {'96f7f6ca-7a9d-4cea-9f05-9b6c661ce82f', ...} ~2 entries>

However I should request some quick stylistic feedback from @danielballan before continuing with docs and tests.

  1. This implementation heavily leverages Full Text Search indexes and methods for efficiency, having about 95% idempotency with the existing FullText query that we have already. Like, the only difference is in FullText for PostGreSQL we use func.jsonb_to_tsvector(cast(["string"], JSONB)) but for KeyIn we can ensure that we search only the keys by specifying func.jsonb_to_tsvector(cast(["key"], JSONB)). Should KeyIn and KeyNotIn be refactored into using FullText under the hood? Or should FullText be extended to have a keys_only argument and an invert argument to cover negation?
  2. SQLite's FTS5 .match() creates a solution that technically works to search for keys, but it cannot limit itself to only searching keys; there is no distinction in the VIRTUAL TABLE. Should we exclude SQLite from this implementation until a solution can be found that only searches keys?
  3. Are KeyIn and KeyNotIn acceptable names for these queries? (Since they aren't direct permutations of _in and _not_in)

@Kezzsim Kezzsim marked this pull request as ready for review July 28, 2025 19:19
@Kezzsim Kezzsim requested a review from danielballan July 28, 2025 19:19
@Kezzsim Kezzsim changed the title Add "KeyIn" & "KeyNotIn" queries 🗝️ Add "KeyIn" & "KeyNotIn" queries Jul 29, 2025
@danielballan
Copy link
Member

danielballan commented Jul 31, 2025

The existing In and NotIn queries ask, "Is the value [not] one of these choices?" As in:

is_primary_color = In("color", ["red", "yellow", "blue"])
c.search(is_primary_color)

I believe SXSS-592 wants to query, "Was any value recorded for this key?"

we_wrote_down_the_color = HasKey("color")
c.search(we_wrote_down_the_color)

The idea is to eliminate results where some specific metadata was not even captured. And of course HasNotKey would do the inverse: find results where a given key was not recorded.

With respect to this PR, then, I would expect:

  • No tsvector or any kind of sub-string matching
  • Looking for a specific key at a specific level of nesting, not searching all possible nesting levels for a string

@cjtitus
Copy link
Contributor

cjtitus commented Jul 31, 2025

Dan's summary of the intent of SXSS-592 is exactly correct. We want to know if a key exists (possibly explicitly nested). This is extremely useful for cleaning up old metadata, e.g,

Find all runs that don't have any data_session key HasNoKey("data_session")
Find all runs that don't have a PI name recorded HasNoKey("proposal.pi_name")

In the latter case, we probably will have done a HasKey("proposal") first, if we are looking for plans that have proposal metadata, but no recorded PI. If you do just search for HasNoKey("something.nested"), I think it should also return runs where "something" isn't a key. But I could equally see an argument for returning only runs that have "something" but not "nested". It's a choice that will need to be clearly documented.

Also useful for looking for metadata from certain instruments, etc, to identify only runs where a certain detector was used.

@cjtitus
Copy link
Contributor

cjtitus commented Jul 31, 2025

And of course, we want the nesting to be explicit. If we search for
HasKey("type"), I don't want to match on "proposal.type", or "my_detector.type". There are many pretty generic key-names that could exist at any level, and we are typically assuming we know exactly what we're looking for.

@Kezzsim
Copy link
Contributor Author

Kezzsim commented Jul 31, 2025

Removing any variety of fulltext makes the function a bit more terse, though some branching is required in order to get consistent behavior. SQLite can resolve nested keys with the standard JSONB -> operator from a string containing . syntax, but Postgres requires the keys to be split up into their own discrete ARRAY datatype to use the #> operator which is the equivalent of multiple -> operators repeated by n to the number of keys.

def has_key(query, tree, invert):
    if tree.engine.url.get_dialect().name == "sqlite":
        condition = orm.Node.metadata_.op("->")(query.key) != None
    else:
        keys = query.key.split(".")
        condition = orm.Node.metadata_.op("#>")(cast(keys, ARRAY(TEXT))) != None
    condition = not_(condition) if invert else condition
    return tree.new_variation(conditions=tree.conditions + [condition])

This appears to yield the desired results:

In [5]: from tiled.client import from_uri
In [6]: from tiled.queries import KeyIn, KeyNotIn
In [7]: c = from_uri('http://localhost:8000', api_key="secret") 
# Write three samples
In [8]: c.write_array([37,83], metadata={"name" : "FB326", "start" : { "Detectors" : "none" }}) 
In [9]: c.write_array([37,83], metadata={"name" : "GSXR750", "start" : { "Detectors" : ["Suzuki"] }})
In [10]: c.write_array([37,83], metadata={"name" : "CBR1000", "start" : { "BHP" : 120 }})
In [11]: c.search(KeyIn("start.Detectors")) 
Out[11]: <Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~2 entries> 
# All containers have start key in root
In [12]: c.search(KeyIn("start")) 
Out[12]: <Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~3 entries>
# No containers have Detectors key in root
In [13]: c.search(KeyIn("Detectors")) 
Out[13]: <Container {}> 
# Only one container has explicitly nested key BHP
In [14]: c.search(KeyNotIn("Detectors.BHP"))
Out[14]: <Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~3 entries>  
# No containers have the nonsense key
c.search(KeyNotIn("Sketty"))
<Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~3 entries> 

@Kezzsim Kezzsim closed this Jul 31, 2025
@Kezzsim Kezzsim reopened this Jul 31, 2025
@danielballan
Copy link
Member

The behavior looks aligned with the requirements now. Seems ready for adding docs and unit tests in tiled._tests.test_queries, particular on nested keys.

We should also give some consideration for the name. I think KeyIn and KeyNotIn is confusing here. Either of @cjtitus' suggestions seem reasonable:

Possible name pairs: HasKey/NotHasKey ? Exists/NotExists? Something better?

I'll add KeyExists / KeyMissing for consideration. (But not sure if that's better than the above.)

@cjtitus
Copy link
Contributor

cjtitus commented Jul 31, 2025

Ooh, I like the KeyX/KeyY naming pattern more than KeyX/KeyNotX. Exists/Missing is a fairly good pair.

Off the top of my head, KeyPresent/KeyAbsent may also be a good option, as "Absent" is a bit more neutral than "Missing", which does somehow carry the implication that the key should be there, and Present/Absent are definitely antonyms in English, whereas exists and missing have just slightly different contexts.

Either option sounds good to me though.

@Kezzsim
Copy link
Contributor Author

Kezzsim commented Jul 31, 2025

Any objection to keyExists("Key", True) and keyExists("Key", False)? since it basically works that way under the hood.
(The boolean defaults to True unless specified)

If not I'll elect KeyExists(), KeyAbsent()

@danielballan
Copy link
Member

Oh, great idea. Let’s go with that.

If we start with just KeyExists we can always elect to add KeyAbsent later as an alias—if we find that the optional boolean parameter is hard for users to discover.

@Kezzsim Kezzsim changed the title 🗝️ Add "KeyIn" & "KeyNotIn" queries 🗝️ Add "KeyExists" queries Jul 31, 2025
@Kezzsim
Copy link
Contributor Author

Kezzsim commented Aug 12, 2025

I'm having problems with postgresql after merging in the latest Tiled code:
image

According to the official docs the #> operator is expecting an array of text to perform an operation on a JSON/JSONB column:
image

However this stopped working after 9b9fb46

Casting to ARRAY(TEXT) is correct here, but something changed which is causing it to try serializing that to JSON

orm.Node.metadata_.op("#>")(cast(keys, ARRAY(TEXT))) != None # noqa: E711

The same exact code works in earlier commit 6fd9ede
image

@genematx or @danielballan can you pull this PR and see what's up when you get a chance?

@genematx
Copy link
Contributor

I think the issue was with the import; there is

from typing import Callable, Dict, List, Optional, Union, cast

and also

from sqlalchemy.sql.expression import cast as sql_cast

@danielballan
Copy link
Member

Thanks, @genematx! Looking good, @Kezzsim.

Last thing: Above, I think @cjtitus made a persuasive case that KeyPresent/KeyAbsent are a nice pair of names, at the same time that you made the case for KeyExists with an optional boolean.

Thoughts on renaming KeyExists to KeyPresent? Then, if later on we find that we want to make the "reverse" more discoverable to users, we can always come back and add KeyAbsent. My feelings is that the boolean is likely good enough, but if we choose a name with a good antonym, we keep the option open for later.

@danielballan danielballan changed the title 🗝️ Add "KeyExists" queries 🗝️ Add KeyPresent queries Aug 13, 2025
@danielballan
Copy link
Member

danielballan commented Aug 13, 2025

Testing interactively, I'm not convinced the nested keys work as advertised in the docstring.

In [29]: x = c.write_dataframe({'a': [1,2,3]}, key='x', metadata={"sample": {"color": "blue"}})

In [30]: c
Out[30]: <Container {'x'}>

In [31]: c.search(KeyPresent("sample.color"))
Out[31]: <Container {}>

Since this is almost certainly important for our applications (e.g. querying on keys nested inside start.) I think we should hold for that.

@danielballan
Copy link
Member

Last time I tested I was having local environment issues, but now I think those are sorted and I still see a problem with nested keys (at least on SQLite).

Server:

❯ uv run --all-extras tiled --version
0.1.0b35.dev22+g6af4cb181
❯ uv run --all-extras tiled serve catalog --temp --api-key secret

Client:

❯ uv run --all-extras --with ipython ipython
Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tiled

In [2]: tiled.__version__
Out[2]: '0.1.0b35.dev22+g6af4cb181'

In [3]: from tiled.queries import KeyPresent

In [4]: from tiled.client import from_uri
c
In [5]: c = from_uri('http://localhost:8000', api_key='secret')

In [6]: y = c.create_container("y", metadata={"sample": {"color": "red"}})

In [7]: c.search(KeyPresent("sample")) # good
Out[7]: <Container {'y'}>

In [8]: c.search(KeyPresent("asdf")) # good
Out[8]: <Container {}>

In [9]: c.search(KeyPresent("sample.color")) # bad
Out[9]: <Container {}>

@danielballan
Copy link
Member

danielballan commented Aug 19, 2025

Same, with pixi instead of uv:

❯ pixi run -e server  tiled --version
0.1.0b35.dev22+g6af4cb181
❯ pixi run -e server tiled serve catalog --temp --api-key secret
❯ pixi run -e client python
Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tiled
>>> tiled.__version__
'0.1.0b35.dev22+g6af4cb181'
>>> from tiled.queries import KeyPresent
>>> from tiled.client import from_uri
>>> c = from_uri('http://localhost:8000', api_key='secret')
>>> y = c.create_container("y", metadata={"sample": {"color": "red"}})
>>> c.search(KeyPresent("sample")) # good
<Container {'y'}>
>>> c.search(KeyPresent("asdf")) # good
<Container {}>
>>> c.search(KeyPresent("sample.color")) # bad
<Container {}>

@Kezzsim
Copy link
Contributor Author

Kezzsim commented Aug 20, 2025

As discussed with @danielballan there's now a way to insert more specific test data on the fly to tiled which eventually will make db tests more atomic. prep_test exists now:

def prep_test(client, mapping):
    for k, v in mapping.items():
        client.write_array(
            v.read(),
            key=k,
            metadata=dict(v.metadata()),
            specs=v.specs,
        )

It does not work with map adapter presently because that appears to be static, but this does let us test nested keys.
Eventually the goal would also be to add a clear_test function to remove what was inserted prior to each test.

@Kezzsim Kezzsim merged commit 7158cd6 into bluesky:main Aug 25, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants