Commit 5eace15

pinin4fjords and claude committed
docs(custom/orfcollapse): clarify GENCODE attribution
The 0.9 amino-acid-similarity threshold and peptide-level dedup are taken from the GENCODE Ribo-seq ORF consolidation (gencode-riboseqORFs collapse_cutoff 0.9); the method here is MMseqs2 sequence-identity clustering, not that tool's longest-shared-string / P-site-overlap metric. State that rather than implying method equivalence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 0427305 commit 5eace15

2 files changed

Lines changed: 17 additions & 10 deletions

modules/nf-core/custom/orfcollapse/meta.yml

Lines changed: 8 additions & 4 deletions
@@ -9,10 +9,14 @@ description: |
 The coordinate-based merge in `custom/orfmerge` only groups ORFs that overlap
 on the genome, so the same micropeptide encoded at several distinct,
 non-overlapping loci (typically repetitive regions) survives as separate rows.
-Following the GENCODE Ribo-seq ORF catalogue convention (Mudge et al. 2022,
-Nat Biotechnol, doi:10.1038/s41587-022-01369-0), small ORFs (orf_class
-"smORF", i.e. aa_length <= 100) are clustered by amino-acid identity upstream
-and this module folds each multi-member cluster down to one representative.
+This adopts the peptide-level deduplication and 0.9 amino-acid-similarity
+threshold of the GENCODE Ribo-seq ORF consolidation (Mudge et al. 2022,
+Nat Biotechnol, doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs
+collapse_cutoff 0.9), implemented here with MMseqs2 sequence-identity
+clustering rather than that tool's longest-shared-string / P-site-overlap
+metric. Small ORFs (orf_class "smORF", i.e. aa_length <= 100) are clustered by
+amino-acid identity upstream and this module folds each multi-member cluster
+down to one representative.
 
 Only smORF rows are collapsed; larger ORFs and transcript-anchored classes are
 passed through untouched. Among the smORF members of a cluster the

modules/nf-core/custom/orfcollapse/templates/orfcollapse.py

Lines changed: 9 additions & 6 deletions
@@ -6,12 +6,15 @@
 The coordinate-based merge in custom/orfmerge groups ORFs that overlap on the
 genome, but the same micropeptide is frequently encoded at several distinct,
 non-overlapping genomic loci (typically repetitive regions), and those copies
-survive as separate catalogue rows. Following the GENCODE Ribo-seq ORF catalogue
-convention (Mudge et al. 2022, Nat Biotechnol, doi:10.1038/s41587-022-01369-0;
-gencode-riboseqORFs collapse_cutoff 0.9), small ORFs (orf_class == "smORF", i.e.
-aa_length <= 100) are clustered by amino-acid sequence identity upstream
-(mmseqs/easycluster) and this module folds each multi-member cluster down to one
-representative.
+survive as separate catalogue rows. This adopts the peptide-level deduplication
+and 0.9 amino-acid-similarity threshold of the GENCODE Ribo-seq ORF
+consolidation (Mudge et al. 2022, Nat Biotechnol,
+doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs collapse_cutoff 0.9),
+implemented here with MMseqs2 sequence-identity clustering (--min-seq-id 0.9)
+rather than that tool's longest-shared-string / P-site-overlap metric. Small
+ORFs (orf_class == "smORF", i.e. aa_length <= 100) are clustered by amino-acid
+identity upstream (mmseqs/easycluster) and this module folds each multi-member
+cluster down to one representative.
 
 Only smORF rows are collapsed; larger ORFs and transcript-anchored classes pass
 through untouched, preserving the deterministic coordinate/transcript merge from
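
For readers unfamiliar with the collapse step the docstring describes, here is a minimal sketch of the idea. It is not the module's code. It assumes the upstream clustering output is a two-column TSV of representative and member IDs (the layout MMseqs2 writes for its cluster TSV), and it assumes the cluster representative is the row that survives; the module's actual selection rule is cut off in this diff. The names `parse_clusters`, `collapse`, and `orf_lengths` are hypothetical.

```python
from collections import defaultdict

SMORF_MAX_AA = 100  # from the docstring: a smORF has aa_length <= 100


def parse_clusters(tsv_lines):
    """Group member IDs by representative from representative<TAB>member lines."""
    clusters = defaultdict(list)
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")
        clusters[rep].append(member)
    return dict(clusters)


def collapse(orf_lengths, clusters):
    """Return the ORF IDs that survive collapsing, in input order.

    orf_lengths maps ORF ID -> amino-acid length (hypothetical input shape).
    ASSUMPTION: the cluster representative is kept; the real module's
    selection rule is not visible in this diff.
    """
    dropped = set()
    for rep, members in clusters.items():
        # Only smORF rows are eligible; larger ORFs always pass through.
        smorfs = [m for m in members
                  if orf_lengths.get(m, SMORF_MAX_AA + 1) <= SMORF_MAX_AA]
        if len(smorfs) < 2:
            continue  # nothing to fold in a single-smORF cluster
        keep = rep if rep in smorfs else smorfs[0]
        dropped.update(m for m in smorfs if m != keep)
    return [orf_id for orf_id in orf_lengths if orf_id not in dropped]


if __name__ == "__main__":
    lengths = {"orfA": 40, "orfB": 40, "orfC": 250, "orfD": 60}
    tsv = ["orfA\torfA", "orfA\torfB", "orfC\torfC", "orfD\torfD"]
    print(collapse(lengths, parse_clusters(tsv)))
```

In this toy input, orfA and orfB share a cluster, so only orfA is kept, while the long orfC and the singleton orfD pass through unchanged.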
