Pre-computed files need to be regenerated for each set of parameters #16

@shervinea

Description

Context. Real-time PDB parsing with the BioPython package, typically:

self.structure = PDBParser().get_structure(pdb_id.upper(), fullfilename)
is expensive and bottlenecks the training process if done on the fly.

For this reason, we put in place a "precomputation stage"

def check_precomputed(self) -> None:
that takes all enzymes beforehand and stores target volumes in a dedicated folder.

Current limitation. This process is repeated for each set of parameters {weights considered, interpolation level between atoms p, volume size}. This is inefficient in terms of:

  • total computations performed: PDB parsing is identical across all these configurations, yet it is repeated for each of them. The only configuration-dependent operations are relatively cheap, e.g. 2D -> 3D mapping and point interpolation. With a proper implementation, these last steps can easily be done on the fly without becoming a bottleneck.
  • space: the number and size of produced files grow at the same pace as the number of configurations the user tries out (!).
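To illustrate why the configuration-dependent steps are cheap, here is a minimal sketch of depositing weighted atom coordinates into a target volume. The function name `coords_to_volume`, the nearest-voxel assignment, and the parameter names are hypothetical; the repository's actual interpolation level `p` between atoms is not reproduced here.

```python
import numpy as np

def coords_to_volume(coords, weights, volume_size=32, voxel=1.0):
    """Deposit weighted atom coordinates into a cubic volume.

    Hypothetical sketch using nearest-voxel assignment; atoms falling
    outside the grid are simply dropped.
    """
    vol = np.zeros((volume_size,) * 3, dtype=np.float32)
    # Center the point cloud in the grid.
    centered = coords - coords.mean(axis=0)
    idx = np.round(centered / voxel).astype(int) + volume_size // 2
    inside = np.all((idx >= 0) & (idx < volume_size), axis=1)
    for (i, j, k), w in zip(idx[inside], weights[inside]):
        vol[i, j, k] += w
    return vol

coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
weights = np.array([6.0, 8.0])  # e.g. carbon and oxygen atomic numbers
vol = coords_to_volume(coords, weights)
```

On a few thousand atoms this is a sub-millisecond operation, so it can run inside the data loader for any `{weights, p, volume size}` choice without redoing the PDB parse.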

Desired behavior. Coordinates + weights precomputation from PDB files is done only once and produces a parsed version of the data that is:

  1. Light enough so that it can be transformed to target volumes on the fly
  2. Complete enough so that every configuration's data can be derived from it.
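The desired behavior above can be sketched as a one-off cache of coordinates and per-atom weights. The helper names `save_parsed`/`load_parsed` and the `.npz` format are assumptions for illustration, not the repository's actual API; the expensive `PDBParser` step would run once per structure before `save_parsed`.

```python
import os
import tempfile
import numpy as np

def save_parsed(path, coords, weights):
    # One-off cache written after the single PDB parse: light enough to
    # voxelize on the fly, complete enough to derive any configuration.
    np.savez_compressed(path, coords=coords, weights=weights)

def load_parsed(path):
    data = np.load(path)
    return data["coords"], data["weights"]

# Stand-in for coordinates/weights extracted from one parsed structure.
coords = np.random.rand(10, 3).astype(np.float32)
weights = np.ones(10, dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "1ABC.npz")
save_parsed(path, coords, weights)
cached_coords, cached_weights = load_parsed(path)
```

One such file per enzyme replaces the current per-configuration volume folders, so trying a new parameter set costs no additional disk space.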
