
Auto‐update suffix array to newest UniProtKB version

Pieter Verschaffelt edited this page Apr 7, 2025 · 2 revisions

To generate a new version of the Unipept suffix array (based on a new version of the UniProtKB database), you can either manually perform all steps described in the wikis of the unipept-database, unipept-index and unipept-api repositories, or run the update_uniprot.sh script in this repository on one of the API servers to go through the whole pipeline automatically.

The update_uniprot.sh script has two modes. Update mode generates the suffix array and all associated files from scratch. Clone mode copies all files from another server that has already generated a new suffix array. Each mode requires a different set of parameters.

Important

Remember to start this script in a screen session, since it can take quite some time before finishing!
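One way to do this is to start the script inside a named, detached screen session and reattach to it later. The sketch below assumes GNU screen is installed; the session name `uniprot-update` and the two helper functions are illustrative and not part of this repository.

```shell
# Hypothetical session name (an assumption, pick any name you like).
SESSION_NAME="uniprot-update"

# Start update_uniprot.sh in a detached screen session.
# -dmS creates the session detached, under the given name.
# All arguments are passed through to the script unchanged.
start_in_screen() {
    screen -dmS "$SESSION_NAME" ./update_uniprot.sh "$@"
}

# Reattach to the session to check on progress.
# Detach again with Ctrl-a d; the script keeps running.
attach_to_session() {
    screen -r "$SESSION_NAME"
}
```

For example, `start_in_screen update` launches update mode in the background and `attach_to_session` brings it back to the foreground. A session started this way closes as soon as the script exits.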

Usage

Usage: ./update_uniprot.sh <mode> [OPTIONS]

Update mode (update)

In this mode, the whole suffix array construction pipeline is run from start to finish. The script clones all required repositories, builds the required files, and sets up the MariaDB database required by unipept-database. The only step that still needs to be done manually is starting the unipept-api executable so that end users can query the new index.

Required config values

None

Optional config values

  • --scratch-dir: Directory where temporary repositories and executables will be stored (default: ~).
  • --output-dir: Directory where the final output files will be stored (default: /mnt/data).
  • --help: Show the help message and exit.
  • --database-sources: Comma-separated list of database sources (in UniProtKB) that should be downloaded and processed (default: swissprot,trembl).
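To illustrate the shape of the --database-sources value, the sketch below splits a comma-separated list into individual source names. How update_uniprot.sh parses the value internally is not documented on this page, so the `split_sources` helper is a hypothetical stand-in rather than the script's own code.

```shell
# Print each entry of a comma-separated list on its own line.
# Hypothetical helper, not taken from update_uniprot.sh.
split_sources() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Walk over the default value of --database-sources.
for db in $(split_sources "swissprot,trembl"); do
    echo "would process: $db"
done
```

To restrict a run to a single source, pass only that name, for example `--database-sources "swissprot"`.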

Examples

On Unipept API server rick

All default values provided by the script are already the ones we need, so the options in the command below only spell out those defaults explicitly.

./update_uniprot.sh update --scratch-dir "$HOME" --output-dir "/mnt/data" --database-sources "swissprot,trembl"

Clone mode (clone)

In this mode, the script assumes that the suffix array was already constructed on another machine and needs to be transferred to this one. You don't need to wait for the whole suffix array to be constructed again on the new machine; it is simply cloned.

Required config values

  • --local-ssh-key: Path to the private key on this machine used to communicate with the remote server.
  • --remote-address: Address of the remote server from which the suffix array should be cloned.

Optional config values

  • --scratch-dir: Directory where temporary repositories and executables will be stored (default: ~).
  • --output-dir: Directory where the final output files will be stored (default: /mnt/data).
  • --help: Show the help message and exit.
  • --remote-user: Username of the remote server (default: unipept).
  • --remote-port: Port of the remote server available for SCP to communicate over (default: 4840).
  • --remote-output-dir: Directory on the remote server that stores the suffix array and all related files (default: /mnt/data).

Examples

On Unipept API server patty (clone from rick)

./update_uniprot.sh clone --scratch-dir "$HOME" --output-dir "/mnt/ssd" --local-ssh-key "$HOME/.ssh/id_github_tibvdm" --remote-address "rick.ugent.be" --remote-port "4840" --remote-output-dir "/mnt/data"
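
Before starting a long clone, it can help to confirm that the key, port and remote directory actually work. The following is a minimal sketch, assuming the remote server accepts ordinary SSH logins on the same port that is used for SCP. The `check_remote` function is hypothetical and not part of update_uniprot.sh.

```shell
# Return 0 if the remote directory exists and is reachable over SSH.
# Arguments: key path, remote address, remote user, remote port, remote dir.
# Hypothetical helper, not part of update_uniprot.sh.
check_remote() {
    key="$1"; address="$2"; user="$3"; port="$4"; dir="$5"
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    ssh -i "$key" -p "$port" -o BatchMode=yes "$user@$address" "test -d '$dir'"
}
```

With the values from the example above, the check would be `check_remote "$HOME/.ssh/id_github_tibvdm" rick.ugent.be unipept 4840 /mnt/data && echo "remote looks reachable"`.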