Cleanup/rust helper scripts #73
Copilot reviewed 50 out of 56 changed files in this pull request and generated no comments.
Files not reviewed (6)
- scripts/generate_sa_tables.sh: Language not supported
- scripts/generate_tables_helper.sh: Language not supported
- scripts/generate_umgap_tables.sh: Language not supported
- scripts/helper_scripts/.gitignore: Language not supported
- scripts/helper_scripts/filter_taxa.sh: Language not supported
- scripts/helper_scripts/unipept-database-rs/.gitignore: Language not supported
…ept-database into cleanup/rust-helper-scripts
pverscha left a comment
Nice work! I've compared the output of the old parser with that of the new parser and found some minor differences. On closer inspection, the old parser was apparently making mistakes when extracting the UniProt accession ID from the dat-files. The new parser in this pull request fixes these, which is very good.
For future reference: the old parser sometimes extracted IDs such as "PO152_SCHPO Reviewed" (which even contains a field separator), while the new parser extracts O94385 for the same entry (which is the new identifier used by UniProt for PO152_SCHPO).
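To illustrate the difference described above, here is a minimal sketch of reading the primary accession from the `AC` line of a UniProtKB flat-file record instead of from the `ID` line. It is not the parser from this PR: the function name `primary_accession` is hypothetical, and the sample record (including the sequence length) is made up for illustration. It only assumes the standard flat-file layout, in which each line starts with a two-letter line code and `AC` lines list accessions separated by semicolons.

```rust
/// Hypothetical helper (not from this PR): returns the primary (first)
/// accession listed on a UniProtKB flat-file `AC` line, or `None` if the
/// line is not an `AC` line or lists no accession.
fn primary_accession(line: &str) -> Option<&str> {
    // Only lines whose two-letter line code is `AC` carry accessions.
    let rest = line.strip_prefix("AC")?;
    // The line code must be followed by whitespace; this rejects lines
    // that merely happen to start with the letters "AC".
    if !rest.starts_with(char::is_whitespace) {
        return None;
    }
    // Accessions are separated by ';'. The first non-empty one is the primary.
    rest.split(';').map(str::trim).find(|s| !s.is_empty())
}

fn main() {
    // Illustrative record fragment; the sequence length is invented.
    let record = "ID   PO152_SCHPO             Reviewed;         100 AA.\n\
                  AC   O94385;\n";

    // Scan the record line by line and take the first accession found.
    let accession = record.lines().find_map(primary_accession);
    println!("{:?}", accession);
}
```

Because the accession is taken from the `AC` line, the entry name and the "Reviewed" status on the `ID` line never end up in the extracted value.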
This PR reformatted the Rust toolset used for processing the UniProtKB proteins.
- uniprot-parser combines the dat-parser and tables-generator to generate the uniprot_entries file used by the suffix array and k-mer UMGAP pipeline
- uniprot-parser-tryptic combines the dat-parser and tables-generator to generate the uniprot_entries and peptides files used by the tryptic UMGAP pipeline
- … --input-files) to be consistent and more readable in the shell scripts.
More information is available in issue #71
The shell scripts have also been updated to use the new workspace (the old tools are removed). Their names and locations are unchanged, so they should still work in the dev-containers/