Commit d9aaea9 (2 parents: 81d26b4 + f8e019b)

Merge pull request #135 from tkchafin/dev: Precomputed BUSCO [WIP]

24 files changed (+342, -102 lines)

.github/workflows/ci.yml

Lines changed: 7 additions & 0 deletions
```diff
@@ -41,3 +41,10 @@ jobs:
         # Remember that you can parallelise this by using strategy.matrix
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
+
+      - name: Run pipeline with test data and precomputed BUSCOs
+        # You can customise CI pipeline run tests as required
+        # For example: adding multiple test runs with different parameters
+        # Remember that you can parallelise this by using strategy.matrix
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} -profile test_nobusco,docker --outdir ./results
```

.github/workflows/linting.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -99,7 +99,7 @@ jobs:

       - name: Upload linting log file artifact
         if: ${{ always() }}
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
        with:
          name: linting-logs
          path: |
```

.github/workflows/sanger_test.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -26,7 +26,7 @@ jobs:
            "use_work_dir_as_temp": true,
          }
          profiles: test,sanger,singularity,cleanup
-      - uses: actions/upload-artifact@v3
+      - uses: actions/upload-artifact@v4
        with:
          name: Tower debug log file
          path: |
```

.github/workflows/sanger_test_full.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -34,7 +34,7 @@ jobs:
            "outdir": "${{ secrets.TOWER_WORKDIR_PARENT }}/results/${{ github.repository }}/results-${{ env.REVISION }}",
          }
          profiles: test_full,sanger,singularity,cleanup
-      - uses: actions/upload-artifact@v3
+      - uses: actions/upload-artifact@v4
        with:
          name: Tower debug log file
          path: |
```

CHANGELOG.md

Lines changed: 10 additions & 1 deletion
```diff
@@ -14,6 +14,7 @@ The pipeline is now considered to be a complete and suitable replacement for the
 - Updated the Blastn settings to allow 7 days runtime at most, since that
   covers 99.7% of the jobs.
 - Allow database inputs to be optionally compressed (`.tar.gz`)
+- Allow `BUSCO` run outputs to be optionally pre-computed and provided with `--busco_output`

 ### Software dependencies

@@ -22,11 +23,19 @@ Note, since the pipeline is using Nextflow DSL2, each process will be run with i
 | Dependency  | Old version       | New version     |
 | ----------- | ----------------- | --------------- |
 | blast       | 2.14.1 and 2.15.0 | only 2.15.0     |
-| blobtoolkit | 4.3.9             | 4.4.0           |
+| blobtoolkit | 4.3.9             | 4.4.4           |
 | busco       | 5.5.0             | 5.7.1           |
 | multiqc     | 1.20 and 1.21     | 1.20 and 1.25.1 |
 | samtools    | 1.18 and 1.19.2   | 1.20 and 1.21   |

+### Parameters
+
+| Old parameter | New parameter  |
+| ------------- | -------------- |
+|               | --busco-output |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
 ## [[0.6.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.6.0)] – Bellsprout – [2024-09-13]

 The pipeline has now been validated for draft (unpublished) assemblies.
```

bin/generate_config.py

Lines changed: 21 additions & 5 deletions
```diff
@@ -47,6 +47,7 @@ def parse_args(args=None):
     parser.add_argument("--blastx", help="Path to the blastx database", required=True)
     parser.add_argument("--blastn", help="Path to the blastn database", required=True)
     parser.add_argument("--taxdump", help="Path to the taxonomy database", required=True)
+    parser.add_argument("--busco_output", action="append", help="Path to BUSCO output directory", required=False)
     parser.add_argument("--version", action="version", version="%(prog)s 2.0")
     return parser.parse_args(args)

@@ -121,20 +122,33 @@ def get_classification(taxon_info: TaxonInfo) -> typing.Dict[str, str]:
     return {r: ancestors[r] for r in RANKS if r in ancestors}


-def get_odb(taxon_info: TaxonInfo, lineage_tax_ids: str, requested_buscos: typing.Optional[str]) -> typing.List[str]:
+def get_odb(
+    taxon_info: TaxonInfo,
+    lineage_tax_ids: str,
+    requested_buscos: typing.Optional[str],
+    pre_computed_buscos: typing.List[str],
+) -> typing.List[str]:
     # Read the mapping between the BUSCO lineages and their taxon_id
     with open(lineage_tax_ids) as file_in:
         lineage_tax_ids_dict: typing.Dict[int, str] = {}
         for line in file_in:
             arr = line.split()
             lineage_tax_ids_dict[int(arr[0])] = arr[1] + "_odb10"

-    if requested_buscos:
+    valid_odbs = set(lineage_tax_ids_dict.values())
+
+    if pre_computed_buscos:
+        # Use pre-computed BUSCO lineages if available
+        odb_arr = pre_computed_buscos
+        for odb in odb_arr:
+            if odb not in valid_odbs:
+                print(f"Invalid pre-computed BUSCO lineage: {odb}", file=sys.stderr)
+                sys.exit(1)
+    elif requested_buscos:
         odb_arr = requested_buscos.split(",")
-        valid_odbs = set(lineage_tax_ids_dict.values())
         for odb in odb_arr:
             if odb not in valid_odbs:
-                print(f"Invalid BUSCO lineage: {odb}", file=sys.stderr)
+                print(f"Invalid requested BUSCO lineage: {odb}", file=sys.stderr)
                 sys.exit(1)
     else:
         # Do the intersection to find the ancestors that have a BUSCO lineage

@@ -327,7 +341,9 @@ def main(args=None):

     taxon_info = fetch_taxon_info(args.taxon_query)
     classification = get_classification(taxon_info)
-    odb_arr = get_odb(taxon_info, args.lineage_tax_ids, args.busco)
+
+    precomputed_busco = [os.path.basename(path).replace("run_", "") for path in (args.busco_output or [])]
+    odb_arr = get_odb(taxon_info, args.lineage_tax_ids, args.busco, precomputed_busco)
     taxon_id = adjust_taxon_id(args.nt, taxon_info)

     if sequence_report:
```
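The new `--busco_output` handling can be exercised in isolation. Below is a minimal standalone sketch (not pipeline code; `VALID_ODBS` is an illustrative subset, since the real set is read from the `lineage_tax_ids` file) of how lineage names are derived from `run_` directory names and validated, mirroring the logic added to `get_odb` and `main`:

```python
import os
import sys

# Illustrative subset of valid lineages; get_odb builds the real set from lineage_tax_ids
VALID_ODBS = {"archaea_odb10", "bacteria_odb10", "eukaryota_odb10", "carnivora_odb10"}


def lineages_from_busco_dirs(busco_output_paths):
    """Derive lineage names from pre-computed BUSCO run directories,
    e.g. "results/run_carnivora_odb10" -> "carnivora_odb10",
    mirroring the basename/replace step added to main()."""
    return [os.path.basename(path).replace("run_", "") for path in (busco_output_paths or [])]


def validate_lineages(odb_arr, valid_odbs):
    """Exit with an error on any lineage not in the valid set, as get_odb does."""
    for odb in odb_arr:
        if odb not in valid_odbs:
            print(f"Invalid pre-computed BUSCO lineage: {odb}", file=sys.stderr)
            sys.exit(1)
    return odb_arr


print(validate_lineages(lineages_from_busco_dirs(["results/run_carnivora_odb10"]), VALID_ODBS))
# prints ['carnivora_odb10']
```

Note that, as in the commit, pre-computed lineages take precedence over `--busco`-requested ones, and directory paths are expected without trailing slashes (`os.path.basename` would otherwise return an empty string).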

conf/test_nobusco.config

Lines changed: 47 additions & 0 deletions
New file:

```groovy
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Defines input files and everything required to run a fast and simple pipeline test.

    Use as follows:
        nextflow run sanger-tol/blobtoolkit -profile test,<docker/singularity> --outdir <OUTDIR>

----------------------------------------------------------------------------------------
*/

params {
    config_profile_name        = 'Test profile'
    config_profile_description = 'Minimal aligned test dataset to check pipeline function'

    // Limit resources so that this can run on GitHub Actions
    max_cpus   = 2
    max_memory = '6.GB'
    max_time   = '6.h'

    // Input test data
    // Specify the paths to your test data
    // Give any required params for the test so that command line flags are not needed
    input = "${projectDir}/assets/test/samplesheet_s3.csv"

    // Fasta references
    fasta     = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/assembly/release/mMelMel3.1_paternal_haplotype/GCA_922984935.2.subset.phiXspike.fasta.gz"
    accession = "GCA_922984935.2"
    taxon     = "Meles meles"

    // Databases
    taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
    busco   = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
    blastp  = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
    blastx  = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
    blastn  = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

    // Precomputed BUSCO outputs
    // busco_output_noArchaea.tar.gz deliberately leaves out archaea_odb10 to test the pipeline's detection and filling of missing lineages
    // Switch to *_busco_output.tar.gz for fully precomputed BUSCOs
    busco_output = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/GCA_922984935.2_busco_output_noArchaea.tar.gz"
    //busco_output = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/GCA_922984935.2_busco_output.tar.gz"

    // Need to be set to avoid overfilling /tmp
    use_work_dir_as_temp = true
}
```
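The test profile above supplies `busco_output` as a `.tar.gz` URL, while the docs also allow a plain directory. A hypothetical helper (not part of the pipeline) that lists the pre-computed lineages in either input form, assuming the `run_[odb]` layout described in docs/usage.md, could look like:

```python
import os
import tarfile


def precomputed_lineages(busco_output):
    """List lineage names (run_* entries) in a pre-computed BUSCO output,
    accepting either a directory or a .tar.gz archive."""
    if os.path.isdir(busco_output):
        names = os.listdir(busco_output)
    elif busco_output.endswith(".tar.gz"):
        with tarfile.open(busco_output, "r:gz") as tar:
            # Collect the run_* directories stored inside the archive
            names = {os.path.basename(m.name.rstrip("/")) for m in tar.getmembers() if m.isdir()}
    else:
        raise ValueError(f"Unsupported --busco_output input: {busco_output}")
    return sorted(n.replace("run_", "") for n in names if n.startswith("run_"))
```

Applied to the `noArchaea` test archive, this would list every lineage except `archaea_odb10`, which the pipeline then detects as missing and runs itself.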

docs/usage.md

Lines changed: 21 additions & 1 deletion
````diff
@@ -54,6 +54,26 @@ An [example samplesheet](assets/test/samplesheet.csv) has been provided with the
 The pipeline can also accept a samplesheet generated by the [nf-core/fetchngs](https://nf-co.re/fetchngs) pipeline (tested with version 1.11.0).
 The pipeline then needs the `--fetchngs_samplesheet true` option _and_ `--align true`, since the data files would all be unaligned.

+### Support for pre-computed `BUSCO` outputs
+
+The pipeline may optionally be run with a set of pre-computed [`BUSCO`](https://busco.ezlab.org) runs, provided with the `--busco_output` parameter. These can be given as either a directory path or a `.tar.gz` compressed archive. The contents should be the `run_` output directories (directly from `BUSCO`), each named `run_[odb_database_name]`:
+
+```
+GCA_922984935.2_busco_output/
+├── run_archaea_odb10
+├── run_bacteria_odb10
+├── run_carnivora_odb10
+├── run_eukaryota_odb10
+├── run_eutheria_odb10
+├── run_laurasiatheria_odb10
+├── run_mammalia_odb10
+├── run_metazoa_odb10
+├── run_tetrapoda_odb10
+└── run_vertebrata_odb10
+```
+
+The pipeline minimally requires outputs for the 'basal' lineages (archaea, eukaryota, and bacteria); any of these not present in the pre-computed outputs will be automatically detected and run.
+
 ## Database parameters

 Configure access to your local databases with the `--busco`, `--blastp`, `--blastx`, `--blastn`, and `--taxdump` parameters.

@@ -272,7 +292,7 @@ List of tools for any given dataset can be fetched from the API, for example htt

 | Dependency        | Snakemake | Nextflow |
 | ----------------- | --------- | -------- |
-| blobtoolkit       | 4.3.2     | 4.4.0    |
+| blobtoolkit       | 4.3.2     | 4.4.4    |
 | blast             | 2.12.0    | 2.14.1   |
 | blobtk            | 0.5.0     | 0.5.1    |
 | busco             | 5.3.2     | 5.5.0    |
````
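Since only missing basal lineages are re-run, it can be handy to check a pre-computed output directory before launching. A hypothetical helper (not part of the pipeline) that reports which of the required basal lineages are absent from an unpacked `--busco_output` directory:

```python
import os

# The three 'basal' lineages the pipeline minimally requires
BASAL_LINEAGES = ("archaea_odb10", "bacteria_odb10", "eukaryota_odb10")


def missing_basal_lineages(busco_output_dir):
    """Return the basal lineages with no run_<odb> subdirectory in busco_output_dir."""
    present = {
        name.replace("run_", "")
        for name in os.listdir(busco_output_dir)
        if name.startswith("run_") and os.path.isdir(os.path.join(busco_output_dir, name))
    }
    return [odb for odb in BASAL_LINEAGES if odb not in present]
```

For the unpacked `noArchaea` test archive used by `conf/test_nobusco.config`, this check would report `['archaea_odb10']`, the lineage the pipeline fills in automatically.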

modules/local/blobtoolkit/chunk.nf

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@ process BLOBTOOLKIT_CHUNK {
     if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) {
         exit 1, "BLOBTOOLKIT_CHUNK module does not support Conda. Please use Docker / Singularity / Podman instead."
     }
-    container "docker.io/genomehubs/blobtoolkit:4.4.0"
+    container "docker.io/genomehubs/blobtoolkit:4.4.4"

     input:
     tuple val(meta) , path(fasta)
```

modules/local/blobtoolkit/countbuscos.nf

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@ process BLOBTOOLKIT_COUNTBUSCOS {
     if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) {
         exit 1, "BLOBTOOLKIT_COUNTBUSCOS module does not support Conda. Please use Docker / Singularity / Podman instead."
     }
-    container "docker.io/genomehubs/blobtoolkit:4.4.0"
+    container "docker.io/genomehubs/blobtoolkit:4.4.4"

     input:
     tuple val(meta), path(table, stageAs: 'dir??/*')
```
