How many CPU hours did it cost to generate the pretraining data? #8

Open
@ekiefl

I ran the following experiment:

  • Domain: GB1 (56 AA)
  • Variants: 1000
  • Cores: 12 (all in use)
  • Environment: local macOS

The runtime was 0.87 hours.

That's 12 cores × 0.87 hours = 10.4 CPU hours for 1000 variants, a rate of about 0.01 CPU hours / variant (or 0.6 minutes / variant), which I would consider a best-case scenario, since at 56 AA GB1 is quite a small protein.

You guys generated 20M variants × 148 proteins ≈ 3B total variants. At that rate, we're looking at at least 30M CPU hours... It seems impractical, right? Or does this line up with your runtimes?
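For transparency, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes all 12 cores were fully busy for the whole run and that the per-variant cost measured on GB1 carries over unchanged to the other proteins (larger proteins would likely cost more); the function and variable names are just for illustration.

```python
def cpu_hours_per_variant(wall_hours: float, cores: int, variants: int) -> float:
    """CPU hours spent per variant, assuming every core is busy for the full run."""
    return wall_hours * cores / variants


# Numbers from my GB1 run.
rate = cpu_hours_per_variant(wall_hours=0.87, cores=12, variants=1000)

# Dataset size as computed above: 20M variants x 148 proteins.
total_variants = 20_000_000 * 148

# Naive extrapolation: same per-variant cost for every protein.
total_cpu_hours = rate * total_variants

print(f"rate: {rate:.4f} CPU hours / variant ({rate * 60:.2f} min / variant)")
print(f"total variants: {total_variants / 1e9:.2f}B")
print(f"estimated total: {total_cpu_hours / 1e6:.1f}M CPU hours")
```

Without rounding the rate down to 0.01, the estimate comes out slightly above 30M CPU hours.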

A few considerations:

  • I'm using my own re-implementation of this workflow, since I found this codebase hard to plug into. Happy to share my code if you're interested.
  • One of the biggest differences is that I'm running Rosetta through their linux/amd64 Docker image. But since I'm on an M3 Mac (arm64), maybe the amd64 emulation is slow?
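To make the emulation concern concrete, here is a minimal Python sketch that checks whether a Docker image platform matches the host architecture. The `needs_emulation` helper is hypothetical and only compares architecture names; it says nothing about how large the slowdown is, which would need an actual benchmark on native hardware.

```python
import platform


def needs_emulation(host_machine: str, image_platform: str) -> bool:
    """Return True if the image's CPU architecture differs from the host's.

    host_machine: value such as platform.machine() returns, e.g. "arm64" or "x86_64".
    image_platform: Docker-style "os/arch" string, e.g. "linux/amd64".
    """
    # Map common machine names onto Docker's architecture names.
    aliases = {"x86_64": "amd64", "aarch64": "arm64"}
    host = host_machine.lower()
    host_arch = aliases.get(host, host)
    image_arch = image_platform.split("/")[1]
    return host_arch != image_arch


if __name__ == "__main__":
    # Run on the host (not inside the container) to see if the image is emulated.
    print(needs_emulation(platform.machine(), "linux/amd64"))
```

On an Apple Silicon host this reports a mismatch for a linux/amd64 image, which is the situation described above.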
