Closed
PhyloPhlAn is an integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes.

PhyloPhlAn GitHub: https://github.com/biobakery/phylophlan

The tool suite consists of multiple scripts that need to be wrapped into tools and data managers, but this branch is only concerned with creating the tool wrappers. As of this commit, only phylophlan.xml is largely complete.

phylophlan is the main script for running the Concatenation and Gene-trees pipelines, and it lets the user configure which external tools to use at each analysis step.
- Support for some preconfigured external tools is missing (Opal, UPP and astrid).
- Using aligned markers from StrainPhlAn (--strainphlan) is not supported.
- Code and .loc files to access cached datasets are missing.

phylophlan_assign_sgbs and phylophlan_draw_metagenomic report and visualize the closest species-level genome bins (SGBs) for each bin from a metagenomic assembly analysis. My progress on them has stalled because the scripts rely entirely on the presence of a cached database, and I have not implemented access to it (the same missing code and .loc files listed above). Besides that, they are missing an expanded help section, and the current release of the assign_sgbs script has a bug that limits some functionality (pairwise mash distances of the input).

phylophlan_strain_finder would be a tool for analyzing trees and mutation rate tables built with PhyloPhlAn, but I have not wrapped it (yet?).

The test data were created by cutting down example data from the tutorials on the PhyloPhlAn GitHub.
Collaborator
@neo417 Great addition! Would you like to move this tool to IUC, as we also have MetaPhlAn in that repo?
This is the other tool I have been working on with @Minamehr for my Bachelor's project.
Because of the unexpected complexity and time constraints, we agreed it would be enough for my project to wrap just the main phylophlan script, but for reference I have included all my progress on the other tools.
The main thing missing from all the tools is a way to access cached reference datasets, as I do not understand how to configure and test the .loc files. Your help with that would be much appreciated. Once I know how to add tool-data manually, I would be willing to at least finish up the parts I have started in my free time.
Relatedly, we decided early on that writing data managers for this tool would be out of scope for the project. PhyloPhlAn provides various scripts to download pre-identified core UniRef90 proteins, reference genomes from the GenBank repository, and custom SGB databases. I am not sure whether compatible references can already be downloaded by Galaxy, or whether these data sources are too large to cache with a data manager. I did not continue writing wrappers for them once I realized that even the indices are hundreds of MB in size.
Do you think it would be useful to publish just the main script now and add the remaining tools and data managers later?