🗝️ Add `KeyPresent` queries by Kezzsim · Pull Request #1025 · bluesky/tiled

Kezzsim · 2025-07-21T20:18:41Z

Requested by @cjtitus in NSLS-II Ticket SXSS-592.
This attempts to register queries that can tell if a container's metadata has a KEY matching a particular string in it.

Register the queries
Finalize the business logic to perform the queries
Add specific tests for methods
Add queries to Tiled Docs

Kezzsim · 2025-07-28T19:19:08Z

This is in a state where it more or less works as expected:

from tiled.client import from_uri
from tiled.queries import KeyIn, KeyNotIn 
c = from_uri('http://localhost:8000', api_key="secret")
# Put some data in the db
c.write_array([37,83], metadata={"name" : "FB326", "start" : { "Detectors" : "none" }})
c.write_array([24,18], metadata={"sample" : "TEST"})
# Demonstrate searching for a record with a surface level key
c.search(KeyIn("name"))
>> <Container {'96f7f6ca-7a9d-4cea-9f05-9b6c661ce82f'}>
# Demonstrate searching for a record with a given NESTED key
c.search(KeyIn("Detectors"))
>> <Container {'96f7f6ca-7a9d-4cea-9f05-9b6c661ce82f'}>
# Search for a nonsense key that is in no records
c.search(KeyIn("skeytty"))
>> <Container {}>
# Find records that DO NOT contain a given (nested) key
c.search(KeyNotIn("Detectors"))
>> <Container {'02a11fff-5b29-4dab-9080-e80084e88702'}> 
# Find all records that don't have a nonsense key (in this case all of them)
c.search(KeyNotIn("Sketty")) 
<Container {'96f7f6ca-7a9d-4cea-9f05-9b6c661ce82f', ...} ~2 entries>

However I should request some quick stylistic feedback from @danielballan before continuing with docs and tests.

This implementation heavily leverages Full Text Search indexes and methods for efficiency, having about 95% idempotency with the existing FullText query that we have already. Like, the only difference is in FullText for PostGreSQL we use func.jsonb_to_tsvector(cast(["string"], JSONB)) but for KeyIn we can ensure that we search only the keys by specifying func.jsonb_to_tsvector(cast(["key"], JSONB)). Should KeyIn and KeyNotIn be refactored into using FullText under the hood? Or should FullText be extended to have a keys_only argument and an invert argument to cover negation?
SQLite's FTS5 .match() creates a solution that technically works to search for keys, but it cannot limit itself to only searching keys; there is no distinction in the VIRTUAL TABLE. Should we exclude SQLite from this implementation until a solution can be found that only searches keys?
Are KeyIn and KeyNotIn acceptable names for these queries? (Since they aren't direct permutations of _in and _not_in)

danielballan · 2025-07-31T17:38:50Z

The existing In and NotIn queries ask, "Is the value [not] one of these choices?" As in:

is_primary_color = In("color", ["red", "yellow", "blue"])
c.search(is_primary_color)

I believe SXSS-592 wants to query, "Was any value recorded for this key?"

we_wrote_down_the_color = HasKey("color")
c.search(we_wrote_down_the_color)

The idea is to eliminate results where some specific metadata was not even captured. And of course HasNotKey would do the inverse: find results where a given key was not recorded.

With respect to this PR, then, I would expect:

No tsvector or any kind of sub-string matching
Looking for a specific key at a specific level of nesting, not searching all possible nesting levels for a string

cjtitus · 2025-07-31T18:07:02Z

Dan's summary of the intent of SXSS-592 is exactly correct. We want to know if a key exists (possibly explicitly nested). This is extremely useful for cleaning up old metadata, e.g,

Find all runs that don't have any data_session key HasNoKey("data_session")
Find all runs that don't have a PI name recorded HasNoKey("proposal.pi_name")

In the latter case, we probably will have done a HasKey("proposal") first, if we are looking for plans that have proposal metadata, but no recorded PI. If you do just search for HasNoKey("something.nested"), I think it should also return runs where "something" isn't a key. But I could equally see an argument for returning only runs that have "something" but not "nested". It's a choice that will need to be clearly documented.

Also useful for looking for metadata from certain instruments, etc, to identify only runs where a certain detector was used.

cjtitus · 2025-07-31T18:10:23Z

And of course, we want the nesting to be explicit. If we search for
HasKey("type"), I don't want to match on "proposal.type", or "my_detector.type". There are many pretty generic key-names that could exist at any level, and we are typically assuming we know exactly what we're looking for.

Kezzsim · 2025-07-31T18:47:39Z

Removing any variety of fulltext makes the function a bit more terse, though some branching is required in order to get consistent behavior. SQLite can resolve nested keys with the standard JSONB -> operator from a string containing . syntax, but Postgres requires the keys to be split up into their own discrete ARRAY datatype to use the #> operator which is the equivalent of multiple -> operators repeated by n to the number of keys.

def has_key(query, tree, invert):
    if tree.engine.url.get_dialect().name == "sqlite":
        condition = orm.Node.metadata_.op("->")(query.key) != None
    else:
        keys = query.key.split(".")
        condition = orm.Node.metadata_.op("#>")(cast(keys, ARRAY(TEXT))) != None
    condition = not_(condition) if invert else condition
    return tree.new_variation(conditions=tree.conditions + [condition])

This appears to yield the desired results:

In [5]: from tiled.client import from_uri
In [6]: from tiled.queries import KeyIn, KeyNotIn
In [7]: c = from_uri('http://localhost:8000', api_key="secret") 
# Write three samples
In [8]: c.write_array([37,83], metadata={"name" : "FB326", "start" : { "Detectors" : "none" }}) 
In [9]: c.write_array([37,83], metadata={"name" : "GSXR750", "start" : { "Detectors" : ["Suzuki"] }})
In [10]: c.write_array([37,83], metadata={"name" : "CBR1000", "start" : { "BHP" : 120 }})
In [11]: c.search(KeyIn("start.Detectors")) 
Out[11]: <Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~2 entries> 
# All containers have start key in root
In [12]: c.search(KeyIn("start")) 
Out[12]: <Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~3 entries>
# No containers have Detectors key in root
In [13]: c.search(KeyIn("Detectors")) 
Out[13]: <Container {}> 
# Only one container has explicitly nested key BHP
In [14]: c.search(KeyNotIn("Detectors.BHP"))
Out[14]: <Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~3 entries>  
# No containers have the nonsense key
c.search(KeyNotIn("Sketty"))
<Container {'d92a8370-8eb6-4ddc-9410-e327bcaf2b1f', ...} ~3 entries>

danielballan · 2025-07-31T20:17:33Z

The behavior looks aligned with the requirements now. Seems ready for adding docs and unit tests in tiled._tests.test_queries, particular on nested keys.

We should also give some consideration for the name. I think KeyIn and KeyNotIn is confusing here. Either of @cjtitus' suggestions seem reasonable:

Possible name pairs: HasKey/NotHasKey ? Exists/NotExists? Something better?

I'll add KeyExists / KeyMissing for consideration. (But not sure if that's better than the above.)

cjtitus · 2025-07-31T20:32:48Z

Ooh, I like the KeyX/KeyY naming pattern more than KeyX/KeyNotX. Exists/Missing is a fairly good pair.

Off the top of my head, KeyPresent/KeyAbsent may also be a good option, as "Absent" is a bit more neutral than "Missing", which does somehow carry the implication that the key should be there, and Present/Absent are definitely antonyms in English, whereas exists and missing have just slightly different contexts.

Either option sounds good to me though.

Kezzsim · 2025-07-31T20:34:03Z

Any objection to keyExists("Key", True) and keyExists("Key", False)? since it basically works that way under the hood.
(The boolean defaults to True unless specified)

If not I'll elect KeyExists(), KeyAbsent()

danielballan · 2025-07-31T21:22:51Z

Oh, great idea. Let’s go with that.

If we start with just KeyExists we can always elect to add KeyAbsent later as an alias—if we find that the optional boolean parameter is hard for users to discover.

tiled/catalog/adapter.py

Kezzsim · 2025-08-12T21:23:37Z

I'm having problems with postgresql after merging in the latest Tiled code:

According to the official docs the #> operator is expecting an array of text to perform an operation on a JSON/JSONB column:

However this stopped working after 9b9fb46

Casting to ARRAY(TEXT) is correct here, but something changed which is causing it to try serializing that to JSON

tiled/tiled/catalog/adapter.py

Line 1442 in bc57e94

orm.Node.metadata_.op("#>")(cast(keys, ARRAY(TEXT))) != None # noqa: E711

The same exact code works in earlier commit 6fd9ede

@genematx or @danielballan can you pull this PR and see what's up when you get a chance?

genematx · 2025-08-12T22:01:24Z

I think the issue was with the import; there is

from typing import Callable, Dict, List, Optional, Union, cast

and also

from sqlalchemy.sql.expression import cast as sql_cast

danielballan · 2025-08-13T10:57:14Z

Thanks, @genematx! Looking good, @Kezzsim.

Last thing: Above, I think @cjtitus made a persuasive case that KeyPresent/KeyAbsent are a nice pair of names, at the same time that you made the case for KeyExists with an optional boolean.

Thoughts on renaming KeyExists to KeyPresent? Then, if later on we find that we want to make the "reverse" more discoverable to users, we can always come back and add KeyAbsent. My feelings is that the boolean is likely good enough, but if we choose a name with a good antonym, we keep the option open for later.

tiled/adapters/mapping.py

tiled/catalog/adapter.py

danielballan · 2025-08-13T18:32:03Z

Testing interactively, I'm not convinced the nested keys work as advertised in the docstring.

In [29]: x = c.write_dataframe({'a': [1,2,3]}, key='x', metadata={"sample": {"color": "blue"}})

In [30]: c
Out[30]: <Container {'x'}>

In [31]: c.search(KeyPresent("sample.color"))
Out[31]: <Container {}>

Since this is almost certainly important for our applications (e.g. querying on keys nested inside start.) I think we should hold for that.

danielballan · 2025-08-19T19:21:33Z

Last time I tested I was having local environment issues, but now I think those are sorted and I still see a problem with nested keys (at least on SQLite).

Server:

❯ uv run --all-extras tiled --version
0.1.0b35.dev22+g6af4cb181
❯ uv run --all-extras tiled serve catalog --temp --api-key secret

Client:

❯ uv run --all-extras --with ipython ipython
Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tiled

In [2]: tiled.__version__
Out[2]: '0.1.0b35.dev22+g6af4cb181'

In [3]: from tiled.queries import KeyPresent

In [4]: from tiled.client import from_uri
c
In [5]: c = from_uri('http://localhost:8000', api_key='secret')

In [6]: y = c.create_container("y", metadata={"sample": {"color": "red"}})

In [7]: c.search(KeyPresent("sample")) # good
Out[7]: <Container {'y'}>

In [8]: c.search(KeyPresent("asdf")) # good
Out[8]: <Container {}>

In [9]: c.search(KeyPresent("sample.color")) # bad
Out[9]: <Container {}>

danielballan · 2025-08-19T19:27:13Z

Same, with pixi instead of uv:

❯ pixi run -e server  tiled --version
0.1.0b35.dev22+g6af4cb181
❯ pixi run -e server tiled serve catalog --temp --api-key secret

❯ pixi run -e client python
Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tiled
>>> tiled.__version__
'0.1.0b35.dev22+g6af4cb181'
>>> from tiled.queries import KeyPresent
>>> from tiled.client import from_uri
>>> c = from_uri('http://localhost:8000', api_key='secret')
>>> y = c.create_container("y", metadata={"sample": {"color": "red"}})
>>> c.search(KeyPresent("sample")) # good
<Container {'y'}>
>>> c.search(KeyPresent("asdf")) # good
<Container {}>
>>> c.search(KeyPresent("sample.color")) # bad
<Container {}>

Kezzsim · 2025-08-20T19:15:19Z

As discussed with @danielballan there's now a way to insert more specific test data on the fly to tiled which eventually will make db tests more atomic. prep_test exists now:

def prep_test(client, mapping):
    for k, v in mapping.items():
        client.write_array(
            v.read(),
            key=k,
            metadata=dict(v.metadata()),
            specs=v.specs,
        )

It does not work with map adapter presently because that appears to be static, but this does let us test nested keys.
Eventually the goal would also be to add a clear_test function to remove what was inserted prior to each test.

Kezzsim added 5 commits July 21, 2025 16:14

📝 Register Queries

2fb5666

🤔 Make it work... (with top level keys only)

a4193d2

🧠 Key search using tsvector

8467a67

😉 FTS5 includes key as a column

f50ba82

✅ Finish MVP

0c2562d

Kezzsim marked this pull request as ready for review July 28, 2025 19:19

Kezzsim requested a review from danielballan July 28, 2025 19:19

Merge branch 'bluesky:main' into to_be_or_not_to_be

76232c3

Kezzsim changed the title ~~Add "KeyIn" & "KeyNotIn" queries~~ 🗝️ Add "KeyIn" & "KeyNotIn" queries Jul 29, 2025

↪️ Change the strategy to align with requirements

2a22e66

Kezzsim closed this Jul 31, 2025

Kezzsim reopened this Jul 31, 2025

Kezzsim changed the title ~~🗝️ Add "KeyIn" & "KeyNotIn" queries~~ 🗝️ Add "KeyExists" queries Jul 31, 2025

Kezzsim added 5 commits July 31, 2025 18:00

➰ Reformat API to keyExists()

6209423

➿ Remove references to KeyNotIn

0b96bb8

🏔️ Define tests and fix logic

1949628

🧹 Black / Flake8

353588e

📚️ Finish docs

6fd9ede

danielballan reviewed Aug 4, 2025

View reviewed changes

tiled/catalog/adapter.py Outdated Show resolved Hide resolved

Kezzsim added 2 commits August 12, 2025 15:21

1️⃣ EZMerge step 1 : parity check

05a7706

Merge branch 'bluesky:main' into to_be_or_not_to_be

2d100d5

Kezzsim added 2 commits August 12, 2025 17:09

2️⃣ Complete EZMerge from upstream

bc57e94

Merge branch 'bluesky:main' into to_be_or_not_to_be

e467e1b

FIX: imported name

a0e9eac

🏷️ Rename KeyExists to KeyPresent

03c2dc1

danielballan reviewed Aug 13, 2025

View reviewed changes

tiled/adapters/mapping.py Outdated Show resolved Hide resolved

tiled/catalog/adapter.py Outdated Show resolved Hide resolved

*️⃣ Renaming of interior functions

c7650d7

danielballan changed the title ~~🗝️ Add "KeyExists" queries~~ 🗝️ Add KeyPresent queries Aug 13, 2025

Merge branch 'bluesky:main' into to_be_or_not_to_be

6af4cb1

Kezzsim added 2 commits August 20, 2025 14:12

💲Add missing sqlite dollarsign string

e5253ec

🚥 Experiment with dynamic test data

c862590

danielballan added 2 commits August 25, 2025 10:35

Revert to existing test paradigm for now.

32291fa

Add test cases

b048aa7

Kezzsim merged commit 7158cd6 into bluesky:main Aug 25, 2025
11 checks passed

Conversation

Kezzsim commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kezzsim commented Jul 28, 2025

Uh oh!

danielballan commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cjtitus commented Jul 31, 2025

Uh oh!

cjtitus commented Jul 31, 2025

Uh oh!

Kezzsim commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielballan commented Jul 31, 2025

Uh oh!

cjtitus commented Jul 31, 2025

Uh oh!

Kezzsim commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielballan commented Jul 31, 2025

Uh oh!

Uh oh!

Kezzsim commented Aug 12, 2025

Uh oh!

genematx commented Aug 12, 2025

Uh oh!

danielballan commented Aug 13, 2025

Uh oh!

Uh oh!

Uh oh!

danielballan commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielballan commented Aug 19, 2025

Uh oh!

danielballan commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kezzsim commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Kezzsim commented Jul 21, 2025 •

edited

Loading

danielballan commented Jul 31, 2025 •

edited

Loading

Kezzsim commented Jul 31, 2025 •

edited

Loading

Kezzsim commented Jul 31, 2025 •

edited

Loading

danielballan commented Aug 13, 2025 •

edited

Loading

danielballan commented Aug 19, 2025 •

edited

Loading

Kezzsim commented Aug 20, 2025 •

edited

Loading