@tibvdm (Contributor) commented Apr 9, 2025

This PR restructures the Rust toolset used for processing the UniProtKB proteins.

  • The Rust project was replaced by a Rust workspace. Each executable can be compiled separately, depending on the pipeline.
  • Several tools were deprecated and removed:
    • taxa-by-chunk
    • write-to-chunk
    • filter-taxa
  • The dat-parser and tables-generator were transformed from executables into libraries. Both libraries can now be chained, so we don't need extra IO operations between the two.
  • uniprot-parser combines the dat-parser and tables-generator to generate the uniprot_entries file used by the suffix array and k-mer UMGAP pipelines.
  • uniprot-parser-tryptic combines the dat-parser and tables-generator to generate the uniprot_entries and peptides files used by the tryptic UMGAP pipeline.
  • All executables use long arguments (e.g. --input-files) to be consistent and more readable in the shell scripts.
  • I changed some tool names to better reflect their functionality.
  • The download streams are combined. Both Swiss-Prot and TrEMBL are passed to the same executable. The dat-parser now parses the database type and adds it to the output.
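
To illustrate the chaining idea above, here is a minimal sketch of a parser that yields entries lazily and a table writer that consumes them directly, with no intermediate file in between. It is not the code from this PR: the names Entry, EntryIter, parse_entries and write_table are hypothetical, the two-column output is invented for the example, and deriving the database type from the Reviewed/Unreviewed status on the ID line is an assumption about how the real dat-parser might do it.

```rust
use std::io::{self, BufRead, Lines, Write};

/// Hypothetical parsed entry; the real dat-parser types are not shown in this PR.
#[derive(Debug, PartialEq)]
struct Entry {
    accession: String,
    /// Assumption: "swissprot" for Reviewed entries, "trembl" otherwise.
    database: &'static str,
}

/// Lazily yields one Entry per record of a UniProtKB flat file.
struct EntryIter<R: BufRead> {
    lines: Lines<R>,
}

/// Hypothetical entry point of the parsing library.
fn parse_entries<R: BufRead>(reader: R) -> EntryIter<R> {
    EntryIter { lines: reader.lines() }
}

impl<R: BufRead> Iterator for EntryIter<R> {
    type Item = Entry;

    fn next(&mut self) -> Option<Entry> {
        let mut accession: Option<String> = None;
        let mut database = "trembl";
        for line in self.lines.by_ref() {
            // Stop iterating on a read error (a real parser would report it).
            let line = line.ok()?;
            if line.starts_with("//") {
                // "//" terminates a record in the flat-file format.
                if let Some(accession) = accession.take() {
                    return Some(Entry { accession, database });
                }
                database = "trembl";
            } else if let Some(rest) = line.strip_prefix("ID   ") {
                // ID line layout: entry name, status, sequence length.
                if rest.split_whitespace().nth(1) == Some("Reviewed;") {
                    database = "swissprot";
                }
            } else if let Some(rest) = line.strip_prefix("AC   ") {
                // Keep only the first (primary) accession of the record.
                if accession.is_none() {
                    accession = rest.split(';').next().map(|s| s.trim().to_string());
                }
            }
        }
        None
    }
}

/// Hypothetical table generator: consumes entries straight from the parser
/// and writes one tab-separated row per entry. Returns the number of rows.
fn write_table<W: Write>(entries: impl Iterator<Item = Entry>, out: &mut W) -> io::Result<usize> {
    let mut rows = 0;
    for entry in entries {
        writeln!(out, "{}\t{}", entry.accession, entry.database)?;
        rows += 1;
    }
    Ok(rows)
}
```

A combined executable could then call something like write_table(parse_entries(stdin.lock()), &mut stdout), so both databases flow through one process and each row carries its database type.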

More information is available in issue #71.

The shell scripts are also updated to use the new workspace (old tools are removed). Their names and locations are unchanged, so they should still work in the dev-containers/

@tibvdm requested review from Copilot and pverscha on April 9, 2025, 12:42
Copilot AI left a comment

Copilot reviewed 50 out of 56 changed files in this pull request and generated no comments.

Files not reviewed (6)
  • scripts/generate_sa_tables.sh: Language not supported
  • scripts/generate_tables_helper.sh: Language not supported
  • scripts/generate_umgap_tables.sh: Language not supported
  • scripts/helper_scripts/.gitignore: Language not supported
  • scripts/helper_scripts/filter_taxa.sh: Language not supported
  • scripts/helper_scripts/unipept-database-rs/.gitignore: Language not supported

@pverscha (Member) left a comment

Nice work! I compared the outputs of the old parser and the new parser and found some minor differences. On closer inspection, the old parser was making mistakes when extracting the UniProt accession ID from the dat files. These are fixed in the new parser in this pull request, which is very good.

For future reference: the old parser sometimes extracted IDs such as "PO152_SCHPO Reviewed" (which even contains a field separator), whereas the new parser extracts O94385 for the same entry (the identifier UniProt now uses for PO152_SCHPO).
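
A check along the following lines could flag malformed IDs like the one above when comparing parser outputs. It is a minimal sketch, not code from this PR: the function name is hypothetical, and the pattern is UniProt's documented accession format, [OPQ][0-9][A-Z0-9]{3}[0-9] or [A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}, hand-coded here so the example needs only the standard library.

```rust
/// Hypothetical helper: returns true if `s` has the shape of a UniProtKB
/// accession number (6 or 10 uppercase alphanumeric characters, no spaces).
fn is_uniprot_accession(s: &str) -> bool {
    let b = s.as_bytes();
    let upper_or_digit = |c: u8| c.is_ascii_uppercase() || c.is_ascii_digit();

    // Accessions are exactly 6 or 10 characters long.
    if b.len() != 6 && b.len() != 10 {
        return false;
    }

    // First form: [OPQ][0-9][A-Z0-9]{3}[0-9] (always 6 characters).
    if b.len() == 6 && matches!(b[0], b'O' | b'P' | b'Q') {
        return b[1].is_ascii_digit()
            && b[2..5].iter().all(|&c| upper_or_digit(c))
            && b[5].is_ascii_digit();
    }

    // Second form: [A-NR-Z][0-9] followed by one or two groups of
    // [A-Z][A-Z0-9]{2}[0-9].
    matches!(b[0], b'A'..=b'N' | b'R'..=b'Z')
        && b[1].is_ascii_digit()
        && b[2..].chunks(4).all(|g| {
            g[0].is_ascii_uppercase()
                && upper_or_digit(g[1])
                && upper_or_digit(g[2])
                && g[3].is_ascii_digit()
        })
}
```

With this helper, is_uniprot_accession("O94385") holds, while the value produced by the old parser, "PO152_SCHPO Reviewed", is rejected because of its length and the embedded space.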

@tibvdm merged commit 025e1cf into master on Apr 10, 2025
6 checks passed