@Croydon-Brixton
Contributor

This pull request introduces new query capabilities to the AtomArray class in the biotite.structure module, enabling users to filter, mask, and retrieve indices of atoms based on convenient pandas-like query expressions (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html).

New query functionality in AtomArray:

  • query method: Subsets the AtomArray(Stack) to atoms which satisfy the query expression (e.g. atom_array.query("(chain_id == 'A') & (atom_name == 'CA')"))
  • mask method: Returns a True/False mask with True for atoms that match the query expression (e.g. atom_array.mask("(chain_id == 'A') & (atom_name == 'CA')"))
  • idxs method: Returns the indices of atoms that match a query expression (e.g. atom_array.idxs("(chain_id == 'A') & (atom_name == 'CA')"))

Pitch:
This allows easy compound queries such as atom_array.mask("(chain_id in ['A', 'Z']) & (res_name not in ['ALA', 'GLY']) | (b_factor > 0.3)"), which can be convenient for structural analysis in Jupyter notebooks and for configuring generic, simple tasks via strings in configs (like the YAML configs in Hydra).
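To make the relationship between the three methods concrete, here is a minimal, standard-library-only sketch of how such expressions could be evaluated. It is not the implementation from this PR: the toy `ATOMS` dictionary, the `_eval` helper, and the free functions `mask`, `idxs`, and `query` are all hypothetical stand-ins, and only a small whitelist of syntax (comparisons, `in` / `not in`, `&`, `|`, `~`) is supported.

```python
import ast
import operator

# Hypothetical toy annotations: one list entry per atom (not a real AtomArray).
ATOMS = {
    "chain_id": ["A", "A", "B", "Z"],
    "res_name": ["ALA", "SER", "GLY", "LYS"],
    "atom_name": ["CA", "CB", "CA", "CA"],
    "b_factor": [0.1, 0.5, 0.2, 0.4],
}

# Comparison operators allowed inside an expression.
_COMPARE = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.In: lambda a, b: a in b,
    ast.NotIn: lambda a, b: a not in b,
}


def _eval(node, row):
    """Evaluate a whitelisted AST node against one atom's annotations."""
    if isinstance(node, ast.Expression):
        return _eval(node.body, row)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return row[node.id]  # raises KeyError for an unknown annotation
    if isinstance(node, (ast.List, ast.Tuple)):
        return [_eval(element, row) for element in node.elts]
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        compare = _COMPARE.get(type(node.ops[0]))
        if compare is not None:
            return compare(_eval(node.left, row), _eval(node.comparators[0], row))
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitAnd):
        return bool(_eval(node.left, row)) and bool(_eval(node.right, row))
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitOr):
        return bool(_eval(node.left, row)) or bool(_eval(node.right, row))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Invert):
        return not _eval(node.operand, row)  # '~' is treated as logical NOT
    raise ValueError(f"Unsupported syntax: {ast.dump(node)}")


def mask(atoms, expr):
    """Return one boolean per atom: True where the expression matches."""
    tree = ast.parse(expr, mode="eval")
    n_atoms = len(next(iter(atoms.values())))
    return [
        bool(_eval(tree, {name: values[i] for name, values in atoms.items()}))
        for i in range(n_atoms)
    ]


def idxs(atoms, expr):
    """Return the indices of the atoms matching the expression."""
    return [i for i, hit in enumerate(mask(atoms, expr)) if hit]


def query(atoms, expr):
    """Return the subset of every annotation for the matching atoms."""
    keep = idxs(atoms, expr)
    return {name: [values[i] for i in keep] for name, values in atoms.items()}
```

In this sketch `idxs` and `query` are both derived from `mask`, which mirrors how the three methods are described above. Parsing into an AST and walking only whitelisted nodes (rather than calling `eval` on the string) is one way to keep config-supplied expressions from executing arbitrary code; anything outside the whitelist, such as a function call, raises `ValueError`.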

Would love to hear your thoughts @padix-key - do you think this would be useful enough to add as a feature in our main distribution?

@codspeed-hq

codspeed-hq bot commented Jul 20, 2025

CodSpeed Performance Report

Merging #815 will not alter performance

Comparing Croydon-Brixton:feat/queries (6ede054) with main (ac856a2)

Summary

✅ 59 untouched benchmarks

Member

@padix-key padix-key left a comment

Hi, to be honest I am generally not a big fan of string expressions, and that also applies to other software like PyMOL, Pandas, MDAnalysis, etc. The reason is that I think selection is the job of actual code with all of its advantages (performance, syntax highlighting, clearer exceptions, etc.).

That being said, I know people actually would like to be able to use string-based selections for the reasons you stated above (yes, I am looking at you @BradyAJohnston 😉). So although I will probably continue to use the 'classical' atom selections myself, I see a clear use case for other people here. And the fact that the syntax is close to the one from pandas helps with getting used to this new selection method. @t0mdavid-m Do you have a strong opinion?

Apart from my conceptual concerns, your code looks really great!
I only have a few discussion points mentioned below.

Comment on lines +175 to +178
functions = {
    "has_nan_coord": lambda: QueryExpression._has_nan_coord(atom_array),
    "has_bonds": lambda: QueryExpression._has_bonds(atom_array),
}
Member

In the current design, has_nan_coord and has_bonds would only be accessible from the new query API, while all the filters from filter.py are only accessible from the classical fancy indexing. Could we make all filters accessible here and move has_nan_coord and has_bonds to filter.py (maybe as filter_unresolved() and filter_bonded())?
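A rough sketch of what this suggestion could look like, on toy data and with the standard library only. The names filter_unresolved and filter_bonded come from the comment above; their signatures, the `coord` and `bonds` toy data, and the `QUERY_FUNCTIONS` registry are assumptions made purely for illustration and are not part of biotite or of this PR.

```python
import math


def filter_unresolved(coord):
    """Hypothetical filter: True for atoms with at least one NaN coordinate."""
    return [any(math.isnan(c) for c in xyz) for xyz in coord]


def filter_bonded(n_atoms, bonds):
    """Hypothetical filter: True for atoms taking part in at least one bond."""
    bonded = [False] * n_atoms
    for i, j in bonds:
        bonded[i] = True
        bonded[j] = True
    return bonded


# Toy data (assumed): three atoms, the second one unresolved,
# and a single bond between atoms 0 and 1.
coord = [(0.0, 1.0, 2.0), (float("nan"), 0.0, 0.0), (3.0, 3.0, 3.0)]
bonds = [(0, 1)]

# Classical usage: call the filter directly and apply the resulting mask.
resolved = [xyz for xyz, bad in zip(coord, filter_unresolved(coord)) if not bad]

# Query usage: the same functions are exposed under query-friendly names,
# so both APIs share a single implementation.
QUERY_FUNCTIONS = {
    "has_nan_coord": lambda: filter_unresolved(coord),
    "has_bonds": lambda: filter_bonded(len(coord), bonds),
}
```

The point of the sketch is only the layering: the filters live in one place as ordinary functions, and the query namespace is a thin mapping onto them rather than a second set of private helpers.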

Select atoms without NaN coordinates:
>>> valid_atoms = atom_array.query("~has_nan_coord()")
Member

It's nice that all these new methods have doctest examples. However, to make users aware of this new feature, we should probably also mention this atom selection alternative in the tutorial. I think doc/tutorial/structure/filter.rst would be fitting.

--------
Select all CA atoms in chain A:
>>> ca_atoms = atom_array.query("(chain_id == 'A') & (atom_name == 'CA')")
Member

Could you also print the return value here?

@BradyAJohnston

Thanks for the tag @padix-key, you are certainly correct that I would love to see some kind of string selection implemented. I too like the current way of creating selections with biotite, as string selections don't allow for code completion / hints and require memorising a whole new syntax without any help from the code editor.

On that note, the question around this kind of feature is always "what syntax to use?". It might make more sense to instead try and match the syntax of other programs (of most use to me personally would be MDAnalysis, but that's purely selfish).

There is always the option of forging your own path and creating your own syntax, but then there is yet another selection syntax to try and remember when switching between software.

Excited with whatever comes though!

@padix-key
Member

There is always the option of forging your own path and creating your own syntax, but then there is yet another selection syntax to try and remember when switching between software.

I would argue that the syntax used here is not a new one: it is actually the same as in pandas selection, with annotation categories instead of columns. This also makes it quite close to what classical fancy indexing would look like.

Example:

  • Fancy indexing: atoms[atoms.atom_name == 'CA']
  • Atom selection: atoms.select("atom_name == 'CA'")

@BradyAJohnston

For someone who has never used pandas but has used other molecular programs (myself), it would be a new syntax, but I agree that it's probably the better way to go for ease of use by the largest number of people.

For Molecular Nodes it'll mean there are multiple selection languages in the same program, but that's a problem for me to figure out, not for upstream.

@t0mdavid-m
Member

I agree with @padix-key in the sense that I am personally also not a big fan of these types of queries in general. However, if there is demand for queries, I believe it makes sense to support them.

The added value could be greater if we used an established query language that is not captured by the existing fancy indexing. Thus, I would be more in favour of supporting a commonly used DSL like the ones in PyMOL/MDAnalysis.

The already existing integration with ammolite could be a plus for PyMOL's DSL.


4 participants