-
Notifications
You must be signed in to change notification settings - Fork 126
Add pandas-style query functionality to AtomArray and AtomArrayStack objects
#815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
CodSpeed Performance ReportMerging #815 will not alter performanceComparing Summary
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, to be honest I am generally not a big fan of string expressions, but this is generally true also for other software like PyMOL, Pandas, MDAnalysis, etc. The reason is that I think that this is the job of actual code with all of its advantages (performance, syntax highligthing, clearer exceptions, etc.).
That being said, I know people actually would like to be able to use string-based selections for the reasons you stated above (yes, I am looking at you @BradyAJohnston 😉). So although I probably will personally continue the 'classical' atom selections, I see a clear use case for other people here. And that the syntax is close to the one from pandas helps getting used to this new selection method. @t0mdavid-m Do you have a strong opinion?
Apart of from my conceptual concerns, your code looks really great!
I only have a few discussion points mentioned below.
| functions = { | ||
| "has_nan_coord": lambda: QueryExpression._has_nan_coord(atom_array), | ||
| "has_bonds": lambda: QueryExpression._has_bonds(atom_array), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current design has_nan_coord and has_bonds would only be accessible from the new query API, while all the filters from filter.py are only accessible from the classical fancy indexing. Could make all filters accessible here and move has_nan_coord and has_bonds to filter.py (maybe as filter_unresolved() and filter_bonded()?
| Select atoms without NaN coordinates: | ||
| >>> valid_atoms = atom_array.query("~has_nan_coord()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice that all these new methods have doctest examples. However to make users aware of this new feature we should probably mention this atom selection alternative also in the tutorial. I think doc/tutorial/structure/filter.rst would be fitting.
| -------- | ||
| Select all CA atoms in chain A: | ||
| >>> ca_atoms = atom_array.query("(chain_id == 'A') & (atom_name == 'CA')") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also print the return value here?
|
Thanks for the tag @padix-key you are certainly correct that I would love to see some kind of string selection implemented. I too like the current way of creating selections with On that note, the question around this kind of feature is always "what syntax to use?". It might make more sense to instead try and match the syntax of other programs (of most use to me personally would be MDAnalysis, but that's purely selfish). Always the option of going forward of forging your own path and creating your own syntax, but then there is yet another selection syntax to try and remember when switching between software. Excited with whatever comes though! |
I would argue that the syntax use here is not a new one: It is actually the same as in Example:
|
|
For someone who has never used For Molecular Nodes it'll mean there are multiple selection languages in the same program, but that's a problem for me to figure out not for upstream. |
|
Generally I agree with @padix-key in the sense that I am personally also not a big fan of these types of queries in general. However, if there is demand for queries I believe it makes sense to support them. The added value could be greater if we used an established query language that is not captured by the existing fancy indexing. Thus, I would be more in favour of supporting a commonly used DSL like the ones in PyMol/mdanalysis. The already existing integration with ammolite could be a plus for PyMols DSL. |
1 similar comment
|
Generally I agree with @padix-key in the sense that I am personally also not a big fan of these types of queries in general. However, if there is demand for queries I believe it makes sense to support them. The added value could be greater if we used an established query language that is not captured by the existing fancy indexing. Thus, I would be more in favour of supporting a commonly used DSL like the ones in PyMol/mdanalysis. The already existing integration with ammolite could be a plus for PyMols DSL. |
This pull request introduces new query capabilities to the
AtomArrayclass in thebiotite.structuremodule, enabling users to filter, mask, and retrieve indices of atoms based on convenient pandas-like query expressions (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html).New query functionality in
AtomArray:querymethod: Subsets the AtomArray(Stack) to atoms which satisfy the query expression (e.g.atom_array.query("(chain_id == 'A') & (atom_name == 'CA'))maskmethod: Returns a True/False mask with True for atoms that match the query expression (e.g.atom_array.mask("(chain_id == 'A') & (atom_name == 'CA'))idxsmethod: Returns the indices of atoms that match a query expression (e.g.atom_array.idxs("(chain_id == 'A') & (atom_name == 'CA'))Pitch:
This allows easy compound queries such as
atom_array.mask("(chain_id in ['A', 'Z']) & (res_name not in ['ALA', 'GLY']) | (b_factor > 0.3)")which can be convenient for structural analysis in jupyter notebooks and for configuring generic, simple tasks via strings in configs (like the yaml configs in hydra).Would love to hear your thoughts @padix-key - do you think this would be useful enough to add as a feature in our main distribution?