Add pandas-style query functionality to `AtomArray` and `AtomArrayStack` objects #815

Croydon-Brixton · 2025-07-20T13:04:24Z

This pull request introduces new query capabilities to the AtomArray class in the biotite.structure module, enabling users to filter, mask, and retrieve indices of atoms based on convenient pandas-like query expressions (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html).

New query functionality in `AtomArray`:

query method: Subsets the AtomArray(Stack) to the atoms that satisfy the query expression (e.g. `atom_array.query("(chain_id == 'A') & (atom_name == 'CA')")`)
mask method: Returns a boolean mask that is True for atoms matching the query expression (e.g. `atom_array.mask("(chain_id == 'A') & (atom_name == 'CA')")`)
idxs method: Returns the indices of atoms that match the query expression (e.g. `atom_array.idxs("(chain_id == 'A') & (atom_name == 'CA')")`)
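As a rough illustration of how the three return values relate to each other, here is a minimal sketch in plain Python. It uses two toy lists in place of the real annotation arrays and does not call the biotite API; the variable names are only for illustration.

```python
# Toy per-atom annotations; a stand-in, not the biotite AtomArray API.
chain_id = ["A", "A", "B", "A"]
atom_name = ["N", "CA", "CA", "CA"]

# "mask": one boolean per atom
mask = [c == "A" and n == "CA" for c, n in zip(chain_id, atom_name)]

# "idxs": positions at which the mask is True
idxs = [i for i, matched in enumerate(mask) if matched]

# "query": the matching atoms themselves
subset = [(chain_id[i], atom_name[i]) for i in idxs]

print(mask)    # [False, True, False, True]
print(idxs)    # [1, 3]
print(subset)  # [('A', 'CA'), ('A', 'CA')]
```

In other words, the index list and the subset can both be derived from the mask, which is why the three methods can share one expression evaluator.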

Pitch:
This allows easy compound queries such as `atom_array.mask("(chain_id in ['A', 'Z']) & (res_name not in ['ALA', 'GLY']) | (b_factor > 0.3)")`, which can be convenient for structural analysis in Jupyter notebooks and for configuring generic, simple tasks via strings in configs (like the YAML configs in Hydra).
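How a compound expression like the one above groups depends on operator precedence. The sketch below uses Python's standard `ast` module to show the grouping under plain Python rules. Whether the parser in this PR follows exactly these rules is an assumption; pandas' own `query` documents a slightly modified precedence for `&` and `|`.

```python
import ast

expression = (
    "(chain_id in ['A', 'Z']) & (res_name not in ['ALA', 'GLY']) "
    "| (b_factor > 0.3)"
)

# Parse only; nothing is evaluated here.
tree = ast.parse(expression, mode="eval").body

# Under Python rules `&` binds tighter than `|`,
# so the expression groups as (X & Y) | Z.
top_is_or = isinstance(tree, ast.BinOp) and isinstance(tree.op, ast.BitOr)
left_is_and = isinstance(tree.left, ast.BinOp) and isinstance(tree.left.op, ast.BitAnd)
print(top_is_or, left_is_and)  # True True

# Comparisons bind more loosely than `|`, which is why each comparison
# is wrapped in parentheses: without them, `0.3 | flag` is grouped first.
unparenthesized = ast.parse("b_factor > 0.3 | flag", mode="eval").body
grouped_wrongly = isinstance(unparenthesized, ast.Compare) and isinstance(
    unparenthesized.comparators[0], ast.BinOp
)
print(grouped_wrongly)  # True
```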

Would love to hear your thoughts @padix-key - do you think this would be useful enough to add as a feature in our main distribution?

codspeed-hq · 2025-07-20T13:25:03Z

CodSpeed Performance Report

Merging #815 will not alter performance

Comparing Croydon-Brixton:feat/queries (6ede054) with main (ac856a2)

Summary

✅ 59 untouched benchmarks

padix-key

Hi, to be honest I am generally not a big fan of string expressions, and this also applies to other software like PyMOL, pandas, MDAnalysis, etc. The reason is that I think that this is the job of actual code with all of its advantages (performance, syntax highlighting, clearer exceptions, etc.).

That being said, I know people actually would like to be able to use string-based selections for the reasons you stated above (yes, I am looking at you @BradyAJohnston 😉). So although I will probably continue to use the 'classical' atom selections myself, I see a clear use case for other people here. And the fact that the syntax is close to the one from pandas helps with getting used to this new selection method. @t0mdavid-m Do you have a strong opinion?

Apart from my conceptual concerns, your code looks really great!
I only have a few discussion points mentioned below.

padix-key · 2025-07-20T15:11:55Z

src/biotite/structure/query.py

+        functions = {
+            "has_nan_coord": lambda: QueryExpression._has_nan_coord(atom_array),
+            "has_bonds": lambda: QueryExpression._has_bonds(atom_array),
+        }


In the current design `has_nan_coord` and `has_bonds` would only be accessible from the new query API, while all the filters from `filter.py` are only accessible from the classical fancy indexing. Could we make all filters accessible here and move `has_nan_coord` and `has_bonds` to `filter.py` (maybe as `filter_unresolved()` and `filter_bonded()`)?
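To make the suggestion concrete, here is a minimal sketch of what such shared filters could look like. It works on plain Python lists rather than on an `AtomArray`; the function signatures and the `QUERY_FUNCTIONS` registry are hypothetical and not part of biotite or of this PR.

```python
import math


def filter_unresolved(coords):
    """Hypothetical filter: True for atoms with at least one NaN coordinate."""
    return [any(math.isnan(value) for value in xyz) for xyz in coords]


def filter_bonded(n_atoms, bonds):
    """Hypothetical filter: True for atoms that take part in at least one bond."""
    bonded = {index for pair in bonds for index in pair}
    return [index in bonded for index in range(n_atoms)]


# Hypothetical registry: the query layer looks functions up by name,
# while regular code can call the same filters directly.
QUERY_FUNCTIONS = {
    "has_nan_coord": filter_unresolved,
    "has_bonds": filter_bonded,
}

coords = [(0.0, 0.0, 0.0), (float("nan"), 1.0, 2.0), (3.0, 4.0, 5.0)]
bonds = [(0, 1)]

print(QUERY_FUNCTIONS["has_nan_coord"](coords))        # [False, True, False]
print(QUERY_FUNCTIONS["has_bonds"](len(coords), bonds))  # [True, True, False]
```

With a single definition per filter, both selection styles would stay in sync automatically.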

padix-key · 2025-07-20T15:17:34Z

src/biotite/structure/atoms.py

+
+        Select atoms without NaN coordinates:
+
+        >>> valid_atoms = atom_array.query("~has_nan_coord()")


It's nice that all these new methods have doctest examples. However, to make users aware of this new feature, we should probably also mention this atom selection alternative in the tutorial. I think `doc/tutorial/structure/filter.rst` would be fitting.

padix-key · 2025-07-20T15:18:40Z

src/biotite/structure/atoms.py

+        --------
+        Select all CA atoms in chain A:
+
+        >>> ca_atoms = atom_array.query("(chain_id == 'A') & (atom_name == 'CA')")


Could you also print the return value here?

BradyAJohnston · 2025-07-20T19:26:21Z

Thanks for the tag @padix-key, you are certainly correct that I would love to see some kind of string selection implemented. I too like the current way of creating selections with biotite, as string selections don't allow for code completion / hints and require memorising a whole new syntax without any help from the code editor.

On that note, the question around this kind of feature is always "what syntax to use?". It might make more sense to instead try and match the syntax of other programs (of most use to me personally would be MDAnalysis, but that's purely selfish).

There is always the option of forging your own path and creating your own syntax, but then there is yet another selection syntax to try and remember when switching between software.

Excited with whatever comes though!

padix-key · 2025-07-21T06:33:51Z

> There is always the option of forging your own path and creating your own syntax, but then there is yet another selection syntax to try and remember when switching between software.

I would argue that the syntax used here is not a new one: it is actually the same as in a pandas selection, with annotation categories instead of columns. This also makes it quite close to what classical fancy indexing looks like.

Example:

Fancy indexing: atoms[atoms.atom_name == 'CA']
Atom selection: atoms.select("atom_name == 'CA'")
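The closeness of the two forms can be sketched without biotite. In the toy example below, `Column` and `select` are hypothetical helpers written only for this illustration: `Column` mimics elementwise comparison on an annotation array, and `select` evaluates the same expression from a string.

```python
class Column:
    """Minimal elementwise column, standing in for a NumPy annotation array."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Elementwise comparison against a scalar, like `array == 'CA'`
        return Column(v == other for v in self.values)

    def __and__(self, other):
        # Elementwise logical AND of two boolean columns
        return Column(a and b for a, b in zip(self.values, other.values))


def select(annotations, expression):
    """Evaluate `expression` with annotation names bound to columns.

    eval() keeps the sketch short; it is not safe for untrusted input.
    """
    return eval(expression, {"__builtins__": {}}, annotations).values


annotations = {
    "chain_id": Column(["A", "A", "B"]),
    "atom_name": Column(["N", "CA", "CA"]),
}

# Fancy-indexing style: the mask is built with ordinary code
direct = (annotations["atom_name"] == "CA").values

# String style: the same expression, passed as text
from_string = select(annotations, "atom_name == 'CA'")

print(direct)                 # [False, True, True]
print(direct == from_string)  # True
```

The string form is the same expression with the array prefix dropped, so anyone familiar with the fancy-indexing form should be able to read it directly.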

BradyAJohnston · 2025-07-21T07:11:56Z

For someone who has never used pandas but has used other molecular programs (like myself) it would be a new syntax, but I agree that it's probably the better way to go for ease of use by the largest number of people.

For Molecular Nodes it'll mean there are multiple selection languages in the same program, but that's a problem for me to figure out not for upstream.

t0mdavid-m · 2025-07-22T08:52:32Z

Generally I agree with @padix-key in the sense that I am personally also not a big fan of these types of queries in general. However, if there is demand for queries I believe it makes sense to support them.

The added value could be greater if we used an established query language that is not captured by the existing fancy indexing. Thus, I would be more in favour of supporting a commonly used DSL like the ones in PyMOL/MDAnalysis.

The already existing integration with ammolite could be a plus for PyMOL's DSL.

Croydon-Brixton added 3 commits July 20, 2025 13:45

Add pandas-style query functionality to AtomArray and `AtomArrayStack` objects (ffd767c)
tests: add further tests (3f4d380)
chore: ruff (6ede054)

Croydon-Brixton requested a review from padix-key July 20, 2025 13:04

Croydon-Brixton added the enhancement label Jul 20, 2025

padix-key reviewed Jul 20, 2025
