Draft PR - still waiting for testdata merge (here |
Looks like these are just Python libraries, so there is no need for a Dockerfile. Easiest to use Seqera containers instead: https://nf-co.re/docs/tutorials/nf-core_components/using_seqera_containers (also, please add an environment.yml file to support conda as well)
Hi, thank you for taking the time to look at my module!
Unfortunately, Stitchr is not available via conda (bioconda, conda-forge), only via pip. I left the environment.yml file out as well, because the instructions for adding a module (https://nf-co.re/docs/guidelines/components/modules, section 7.2) specify that this file is only required if the tool is available via conda. Of course I would be happy to add the (empty) file back in, if that is the preferred style.
The reason for including our own container is that stitchr requires a data download (and further manipulation of this data), handled by an internal wrapper script. This script requires root permission, which is not available in the container (with the given global configs). I've tried to circumvent this issue in several ways, e.g. with a local config, but couldn't find a way to make it work. The external container already has the necessary data baked in. However, if you have an immediate idea on something I might try, I'd be very open to that.
for the first part: you can install pip-only dependencies via the environment.yml see for example
for the second part: why not make the download command a separate module then?
Thanks for the example! I've added conda just now, but without the container, the species data is missing, so the module cannot run with conda in its current state. Sorry, I forgot that this was an additional concern with conda when I first replied.
About the separate module: We discussed this locally and decided against it, because the download wrapper script (stitchrdl) is so closely intertwined with the rest of the tool and is not really something that would be run on its own. Also, we fear that this might simply move the problem: with two modules, we would first need to run stitchrdl inside a download module (which doesn't work due to permission issues - there is no option to determine where data is saved, stitchrdl deposits the data where stitchr expects to find it), and then move the data to the correct place again, this time manually, in the second module (here, moving as root could perhaps be avoided). Maybe there's something I'm missing though, as this is my first contribution.
Jumping in at the request of @mashehu !
In metagenomics modules, we've had a lot of cases of tools coming with databases etc. In the vast majority of cases (actually all, off the top of my head), we've ended up splitting the data from the execution.
This is partly because the databases are very large - and you don't want to repeatedly pull large containers on each execution node (as this can be very slow), and also embedding data inside the container means you then cannot update the data.
Can you maybe go into more detail what the technical issue is (for someone who is not familiar with the tool/the conversation 😅 )?
I don't understand what you mean by 'due to permission issues', for example. Could you just mount the work directory into the container at the location where stitchr wants the data to go, so the download lands there, for example?
Hi, thanks for taking the time :)
That makes sense, I'll split the download off into another submodule.
On the topic of permission errors: The issue is that stitchr provides a wrapper script for data downloading, which is not flexible in terms of the destination folder. This is not a problem when running the script with conda, outside of a container. However, when running the script inside of a container, it produces a permission error when it tries to move the downloaded data to the appropriate folder. From what I understand, this is because it tries to move the data into a folder that is immutable in a container setting (/opt/...)? I'm fairly new to containers, and couldn't find a way to circumvent this - if you know of one, I'm very happy to try!
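One way to check this suspicion is to ask Python, from inside the container, whether the current user can write to site-packages at all. The following is a minimal sketch using only the standard library. The `Data` subfolder name is taken from the traceback below, and the helper name `check_data_dir` is made up for illustration; it is not part of stitchr.

```python
import os
import sysconfig


def check_data_dir(subdir="Data"):
    """Report where site-packages lives and whether this user may write there.

    `subdir` is an assumption based on the path in the traceback
    (.../site-packages/Data/HUMAN); adjust it if the tool stores data elsewhere.
    """
    site_packages = sysconfig.get_paths()["purelib"]
    target = os.path.join(site_packages, subdir)
    # If the data folder does not exist yet, creating it needs
    # write access to its parent directory instead.
    probe = target if os.path.isdir(target) else site_packages
    return {
        "target": target,
        "writable": os.access(probe, os.W_OK),
    }


if __name__ == "__main__":
    print(check_data_dir())
```

If `writable` comes back `False` when run under the docker profile, the container user simply does not own the install location, which would match the PermissionError.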
I just reproduced the error by running stitchrdl with a seqera container and profile docker:
"
OSError: [Errno 18] Invalid cross-device link: 'HUMAN' -> '/opt/conda/lib/python3.14/site-packages/Data/HUMAN'
During handling of the above exception, another exception occurred:
PermissionError: [Errno 13] Permission denied: '/opt/conda/lib/python3.14/site-packages/Data/HUMAN'
"
It would probably not be allowed to create modules that only support conda, and don't provide a container at all?
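On the two errors quoted above: this pair looks like what Python's `shutil.move` produces, although that is an assumption, since the stitchrdl source is not shown in this thread. `shutil.move` first tries `os.rename`; when source and destination are on different filesystems, that fails with Errno 18 and it falls back to copying and then deleting the source. If so, the first error is expected and harmless, and the PermissionError raised by the fallback copy is the actual failure. The sketch below simulates the cross-device failure by patching `os.rename` and shows the move still succeeding when the destination is writable. The function names and the `example.txt` file are hypothetical.

```python
import errno
import os
import shutil
import tempfile
from unittest import mock


def move_across_devices(src, dst):
    """Move src to dst while pretending they live on different filesystems."""
    fake_exdev = OSError(errno.EXDEV, "Invalid cross-device link")
    # Make the os.rename attempt inside shutil.move fail with EXDEV,
    # so that shutil.move has to take its copy-then-delete fallback.
    with mock.patch("os.rename", side_effect=fake_exdev):
        return shutil.move(src, dst)


def demo():
    """Return (moved to expected path, file arrived, source still exists)."""
    with tempfile.TemporaryDirectory() as work, \
            tempfile.TemporaryDirectory() as dest_root:
        src = os.path.join(work, "HUMAN")
        os.mkdir(src)
        with open(os.path.join(src, "example.txt"), "w") as handle:
            handle.write("placeholder\n")

        dst = os.path.join(dest_root, "HUMAN")
        moved_to = move_across_devices(src, dst)
        return (
            moved_to == dst,
            os.path.isfile(os.path.join(dst, "example.txt")),
            os.path.exists(src),
        )


if __name__ == "__main__":
    print(demo())
```

Under that assumption, the thing to fix is write access to the destination (or where the data is expected), not the cross-device link itself.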
New module: stitchr
Stitchr is a tool that generates full nucleotide and amino acid TCR sequences from V-, J- and CDR3 information.
Testdata is still missing, waiting for PR
PR checklist
topic: versions - See version_topics
label
nf-core modules test <MODULE> --profile docker