Commit 26392b0: Merge pull request #161 from sanger-tol/dev ("2.1.0 release")
2 parents: d0ec90c + ee78df8

129 files changed: +6071 −758 lines

.github/workflows/ci.yml (1 addition, 1 deletion)

@@ -31,7 +31,7 @@ jobs:
         uses: actions/checkout@v3
 
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v1
+        uses: nf-core/setup-nextflow@v2
         with:
           version: "${{ matrix.NXF_VER }}"

CHANGELOG.md (35 additions, 0 deletions)

@@ -3,6 +3,41 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[2.1.0](https://github.com/sanger-tol/genomenote/releases/tag/2.1.0)] - Pembroke Welsh Corgi [2024-12-11]
+
+### Enhancements & fixes
+
+- New annotation_statistics subworkflow, which runs BUSCO in protein mode and generates basic statistics on the annotated gene set when provided with a GFF3 file of gene annotations via the `--annotation_set` option.
+- The genome_metadata subworkflow now queries Ensembl's GraphQL API to determine whether Ensembl has released gene annotation for the assembly being processed.
+- Module updates and removal of Anaconda channels.
+- Removed the MerquryFK completeness metric.
+
+### Parameters
+
+| Old parameter | New parameter    |
+| ------------- | ---------------- |
+|               | --annotation_set |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present. <br> **NB:** Parameter has been **added** if just the new parameter information is present. <br> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported; `conda` is not supported.
+
+| Dependency  | Old version                              | New version                              |
+| ----------- | ---------------------------------------- | ---------------------------------------- |
+| `agat`      |                                          | 1.4.0                                    |
+| `bedtools`  | 2.30.0                                   | 2.31.1                                   |
+| `busco`     | 5.5.0                                    | 5.7.1                                    |
+| `cooler`    | 0.8.11                                   | 0.9.2                                    |
+| `fastk`     | 427104ea91c78c3b8b8b49f1a7d6bbeaa869ba1c | 666652151335353eef2fcd58880bcef5bc2928e1 |
+| `gffread`   |                                          | 0.12.7                                   |
+| `merquryfk` | d00d98157618f4e8d1a9190026b19b471055b22e |                                          |
+| `multiqc`   | 1.14                                     | 1.25.1                                   |
+| `samtools`  | 1.17                                     | 1.21                                     |
+
+> **NB:** Dependency has been **updated** if both old and new version information is present. <br> **NB:** Dependency has been **added** if just the new version information is present. <br> **NB:** Dependency has been **removed** if version information isn't present.
+
 ## [[2.0.0](https://github.com/sanger-tol/genomenote/releases/tag/2.0.0)] - English Cocker Spaniel [2024-10-10]
 
 ### Enhancements & fixes
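As a usage sketch, the new `--annotation_set` option might be supplied alongside the usual nf-core-style flags (the profile, samplesheet and file names below are illustrative placeholders, not part of this commit):

```shell
nextflow run sanger-tol/genomenote \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    --annotation_set annotation.gff3
```

When `--annotation_set` is omitted, the annotation_statistics subworkflow is skipped entirely.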

CITATION.cff (2 additions, 2 deletions)

@@ -8,8 +8,8 @@ message: >-
   metadata from this file.
 type: software
 authors:
-  - given-names: Sandra
-    family-names: Babiyre
+  - given-names: Sandra Ruth
+    family-names: Babirye
     affiliation: Wellcome Sanger Institute
     orcid: "https://orcid.org/0009-0004-7773-7008"
   - given-names: Tyler

CITATIONS.md (10 additions, 2 deletions)

@@ -12,6 +12,10 @@
 
 ## Pipeline tools
 
+- [AGAT](https://github.com/NBISweden/AGAT)
+
+  > Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v1.4.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
+
 - [BedTools](https://bedtools.readthedocs.io/en/latest/)
 
   > Quinlan, Aaron R., and Ira M. Hall. "BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features." Bioinformatics, vol. 26, no. 6, 2010, pp. 841–842. https://doi.org/10.1093/bioinformatics/btq033

@@ -30,6 +34,10 @@
 
 - [FastK](https://github.com/thegenemyers/FASTK)
 
+- [GFFREAD](https://github.com/gpertea/gffread)
+
+  > Pertea G and Pertea M. "GFF Utilities: GffRead and GffCompare [version 1; peer review: 3 approved]". F1000Research 2020, 9:304. https://doi.org/10.12688/f1000research.23297.1
+
 - [MerquryFK](https://github.com/thegenemyers/MERQURY.FK)
 
 - [MultiQC](https://multiqc.info)

@@ -48,9 +56,9 @@
 
 ## Software packaging/containerisation tools
 
-- [Anaconda](https://anaconda.com)
+- [Conda](https://conda.org/)
 
-  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
+  > conda contributors. conda: A system-level, binary package and environment manager running on all major operating systems and platforms. Computer software. https://github.com/conda/conda
 
 - [Bioconda](https://bioconda.github.io)

README.md (5 additions, 3 deletions)

@@ -4,7 +4,7 @@
 [![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.7949384-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.7949384)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A522.10.1-23aa62.svg)](https://www.nextflow.io/)
-[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=conda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
 [![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/sanger-tol/genomenote)

@@ -13,7 +13,7 @@
 
 ## Introduction
 
-**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and chromosomal grid using Cooler, and display on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, and (4) HiC primary mapped percentage from samtools flagstat.
+**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and a chromosomal grid using Cooler, and displays them on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, (4) HiC primary mapped percentage from samtools flagstat, and optionally (5) annotation statistics from AGAT and BUSCO. The pipeline combines the calculated statistics and collated assembly metadata with a template document to output a genome note document.
 
 <!--![sanger-tol/genomenote workflow](https://raw.githubusercontent.com/sanger-tol/genomenote/main/docs/images/sanger-tol-genomenote_workflow.png)-->

@@ -25,7 +25,9 @@
 6. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
 7. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
 8. Collated summary table ([`createtable`](bin/create_table.py))
-9. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
+9. Optionally calculate annotation statistics and completeness ([`AGAT`](https://github.com/NBISweden/AGAT), [`BUSCO`](https://busco.ezlab.org))
+10. Combine calculated statistics and assembly metadata with a template file to produce a genome note document
+11. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 
 ## Usage
assets/genome_note_template.docx (binary file, 200 bytes changed; not shown)

bin/combine_parsed_data.py (2 additions, 0 deletions)

@@ -21,6 +21,7 @@
     ("COPO_BIOSAMPLE_HIC", "copo_biosample_hic_file"),
     ("COPO_BIOSAMPLE_RNA", "copo_biosample_rna_file"),
     ("GBIF_TAXONOMY", "gbif_taxonomy_file"),
+    ("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
 ]
 
 
@@ -42,6 +43,7 @@ def parse_args(args=None):
     parser.add_argument("--copo_biosample_hic_file", help="Input parsed COPO HiC biosample file.", required=False)
     parser.add_argument("--copo_biosample_rna_file", help="Input parsed COPO RNASeq biosample file.", required=False)
     parser.add_argument("--gbif_taxonomy_file", help="Input parsed GBIF taxonomy file.", required=False)
+    parser.add_argument("--ensembl_annotation_file", help="Input parsed Ensembl annotation file.", required=False)
     parser.add_argument("--out_consistent", help="Output file.", required=True)
     parser.add_argument("--out_inconsistent", help="Output file.", required=True)
     parser.add_argument("--version", action="version", version="%(prog)s 1.0")
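The table-driven pattern in this script (one `(FILE_TYPE, argument_name)` tuple per metadata source, resolved via `getattr`) can be sketched in isolation. This is a minimal illustration, not the pipeline's code; the sources and file names below are hypothetical:

```python
import argparse

# Each (FILE_TYPE, argument_name) pair maps a metadata source to an
# optional CLI flag, so adding a new source (like ENSEMBL_ANNOTATION
# in this commit) only requires one new table row plus one add_argument.
files = [
    ("GBIF_TAXONOMY", "gbif_taxonomy_file"),
    ("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
]

parser = argparse.ArgumentParser()
for _, arg_name in files:
    parser.add_argument(f"--{arg_name}", required=False)

# Simulate a run where only the GBIF file is supplied.
args = parser.parse_args(["--gbif_taxonomy_file", "gbif.csv"])

# Only sources whose file was actually supplied get processed.
provided = [(ftype, getattr(args, arg)) for ftype, arg in files if getattr(args, arg)]
print(provided)  # [('GBIF_TAXONOMY', 'gbif.csv')]
```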

bin/combine_statistics_data.py (14 additions, 4 deletions)

@@ -8,7 +8,8 @@
 
 files = [
     ("CONSISTENT", "in_consistent"),
-    ("STATISITCS", "in_statistics"),
+    ("GENOME_STATISTICS", "in_genome_statistics"),
+    ("ANNOTATION_STATISTICS", "in_annotation_statistics"),
 ]
 
 
@@ -19,7 +20,13 @@ def parse_args(args=None):
     parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
     parser.add_argument("--in_consistent", help="Input consistent params file.", required=True)
     parser.add_argument("--in_inconsistent", help="Input inconsistent params file.", required=True)
-    parser.add_argument("--in_statistics", help="Input parsed genome statistics params file.", required=True)
+    parser.add_argument("--in_genome_statistics", help="Input parsed genome statistics params file.", required=True)
+    parser.add_argument(
+        "--in_annotation_statistics",
+        help="Input parsed annotation statistics params file.",
+        required=False,
+        default=None,
+    )
     parser.add_argument("--out_consistent", help="Output file.", required=True)
     parser.add_argument("--out_inconsistent", help="Output file.", required=True)
     parser.add_argument("--version", action="version", version="%(prog)s 1.0")
@@ -36,7 +43,7 @@ def process_file(file_in, file_type, params, param_sets):
     reader = csv.reader(infile)
 
     for row in reader:
-        if row[0] == "#paramName":
+        if row[0].startswith("#"):
             continue
 
         key = row.pop(0)
@@ -95,7 +102,10 @@ def main(args=None):
     params_inconsistent = {}
 
     for file in files:
-        (params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)
+        if file[0] == "ANNOTATION_STATISTICS" and args.in_annotation_statistics is None:
+            continue
+        else:
+            (params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)
 
     for key in params.keys():
         value_set = {v for v in params[key]}
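The switch from matching `"#paramName"` exactly to `row[0].startswith("#")` matters because the new annotation-statistics file carries per-key `# KEY: description` comment lines before its header, and all of them must be skipped. A minimal sketch of the parse loop (the sample keys and values are hypothetical):

```python
import csv
import io

# Hypothetical sample mirroring the layout written by the new
# annotation-statistics script: "# KEY: description" comment lines,
# then a "#paramName,paramValue" header, then data rows.
sample = """\
# PCG: The number of protein coding genes
# NCG: The number of non-coding genes
#paramName,paramValue
PCG,21412
NCG,3188
"""

params = {}
for row in csv.reader(io.StringIO(sample)):
    if row[0].startswith("#"):  # skips the comment lines AND the header
        continue
    key = row.pop(0)
    params[key] = row

print(params)  # {'PCG': ['21412'], 'NCG': ['3188']}
```

With the old exact-match check, the `# PCG:` and `# NCG:` lines would have fallen through and been treated as parameter rows.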
bin/extract_annotation_statistics_info.py (new file, 154 additions, 0 deletions; filename recovered from the script's usage string)

#!/usr/bin/env python3
import re
import csv
import sys
import argparse
import json


# Extract CDS information from mrna and transcript sections
def extract_cds_info(file):
    # Define regex patterns for different statistics
    patterns = {
        "TRANSC_MRNA": re.compile(r"Number of mrna\s+(\d+)"),
        "PCG": re.compile(r"Number of gene\s+(\d+)"),
        "CDS_PER_GENE": re.compile(r"mean mrnas per gene\s+([\d.]+)"),
        "EXONS_PER_TRANSC": re.compile(r"mean exons per mrna\s+([\d.]+)"),
        "CDS_LENGTH": re.compile(r"mean mrna length \(bp\)\s+([\d.]+)"),
        "EXON_SIZE": re.compile(r"mean exon length \(bp\)\s+([\d.]+)"),
        "INTRON_SIZE": re.compile(r"mean intron in cds length \(bp\)\s+([\d.]+)"),
    }

    # Initialize a dictionary to store content for different sections
    section_content = {"mrna": "", "transcript": ""}

    # Variable to keep track of the current section being processed
    current_section = None

    with open(file, "r") as f:
        lines = f.read().splitlines()  # read all lines in the file

    for line in lines:
        line = line.strip()  # Remove any leading/trailing whitespace including newline characters

        if "---------------------------------- mrna ----------------------------------" in line:
            current_section = "mrna"  # Switch to 'mrna' section
        elif "---------------------------------- transcript ----------------------------------" in line:
            current_section = "transcript"  # Switch to 'transcript' section
        elif "----------" in line:
            current_section = None  # End of current section
        elif current_section:
            section_content[current_section] += (
                line + " "
            )  # Accumulate content for the current section, separate lines by a space

    cds_info = {}

    for label, pattern in patterns.items():
        text_to_search = section_content["mrna"] if label != "EXONS_PER_TRANSC" else section_content["transcript"]
        match = re.search(pattern, text_to_search)
        if match:
            cds_info[label] = match.group(1)

    return cds_info


# Function to extract the number of non-coding genes from the second file
def extract_non_coding_genes(file):
    non_coding_genes = {"ncrna_gene": 0}

    with open(file, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue

            gene_type = parts[0]
            try:
                count = int(parts[1])
            except ValueError:
                continue

            if gene_type in non_coding_genes:
                non_coding_genes[gene_type] += count

    NCG = sum(non_coding_genes.values())
    return {"NCG": NCG}


# Extract the one_line_summary from a BUSCO JSON file
def extract_busco_results(busco_stats_file):
    try:
        with open(busco_stats_file, "r") as file:
            busco_data = json.load(file)
        # Extract the one_line_summary from the results section
        one_line_summary = busco_data.get("results", {}).get("one_line_summary")
        if one_line_summary:
            # Use regex to extract everything after the first colon
            match = re.search(r':\s*"(.*)"', one_line_summary)
            if match:
                one_line_summary = match.group(1)  # Get text after the colon
        return {"BUSCO_PROTEIN_SCORES": one_line_summary} if one_line_summary else {}
    except (json.JSONDecodeError, FileNotFoundError) as e:
        print(f"Error loading BUSCO JSON file: {e}")
        return {}


# Function to write the extracted data to a CSV file
def write_to_csv(data, output_file, busco_stats_file):
    busco_results = extract_busco_results(busco_stats_file)

    descriptions = {
        "TRANSC_MRNA": "The number of transcribed mRNAs",
        "PCG": "The number of protein coding genes",
        "NCG": "The number of non-coding genes",
        "CDS_PER_GENE": "The average number of coding transcripts per gene",
        "EXONS_PER_TRANSC": "The average number of exons per transcript",
        "CDS_LENGTH": "The average length of coding sequence",
        "EXON_SIZE": "The average length of a coding exon",
        "INTRON_SIZE": "The average length of coding intron size",
        "BUSCO_PROTEIN_SCORES": "BUSCO results summary from running BUSCO in protein mode",
    }

    with open(output_file, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)

        # Write descriptions at the top of the CSV file
        for key, description in descriptions.items():
            csvfile.write(f"# {key}: {description}\n")

        # Write the Variable and Value columns header
        writer.writerow(["#paramName", "paramValue"])

        # Write the data
        for key, value in data.items():
            writer.writerow([key, value])

        # Add the BUSCO results summary
        for key, value in busco_results.items():
            writer.writerow([key, value])


# Main function to take input files and output file as arguments
def main():
    Description = "Parse contents of the agat_spstatistics, buscoproteins and agat_sqstatbasic to extract relevant annotation statistics information."
    Epilog = (
        "Example usage: python extract_annotation_statistics_info.py <basic_stats> <other_stats> <busco_stats> <output>"
    )

    parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
    parser.add_argument("basic_stats", help="Input txt file with basic_feature_statistics.")
    parser.add_argument("other_stats", help="Input txt file with other_feature_statistics.")
    parser.add_argument("busco_stats", help="Input JSON file for the BUSCO statistics.")
    parser.add_argument("output", help="Output file.")
    parser.add_argument("--version", action="version", version="%(prog)s 1.0")
    args = parser.parse_args()

    cds_info = extract_cds_info(args.other_stats)
    non_coding_genes = extract_non_coding_genes(args.basic_stats)
    data = {**cds_info, **non_coding_genes}
    write_to_csv(data, args.output, args.busco_stats)


if __name__ == "__main__":
    sys.exit(main())
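The regex in `extract_busco_results` handles the case where the BUSCO `one_line_summary` field wraps the scores in a quoted string after a colon, while falling back to the raw string otherwise. A self-contained sketch of that behaviour (the sample strings are hypothetical, since the exact field shape may vary between BUSCO versions):

```python
import re

# Hypothetical one_line_summary values: one with a quoted wrapper
# after a colon, one already in bare C/S/D/F/M/n form.
wrapped = 'One line summary: "C:98.5%[S:97.8%,D:0.7%],F:0.4%,M:1.1%,n:255"'
plain = "C:98.5%[S:97.8%,D:0.7%],F:0.4%,M:1.1%,n:255"


def strip_wrapper(summary):
    # Same pattern as the script: take the quoted text after the first
    # colon-plus-quote if present, otherwise keep the string unchanged.
    match = re.search(r':\s*"(.*)"', summary)
    return match.group(1) if match else summary


print(strip_wrapper(wrapped))  # C:98.5%[S:97.8%,D:0.7%],F:0.4%,M:1.1%,n:255
print(strip_wrapper(plain))   # C:98.5%[S:97.8%,D:0.7%],F:0.4%,M:1.1%,n:255
```

Either way, the downstream CSV ends up with the bare scores string under the `BUSCO_PROTEIN_SCORES` key.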
