Commit b8dd31e
Author: Gordon J. Köhn
feat: upgrade to SILO input format 0.8.0 - sr2silo v1.2.0
This PR upgrades sr2silo to support SILO input format version 0.8.0, implementing a new JSON schema structure that flattens metadata fields to the root level and restructures genomic segments with explicit sequence, insertions, and offset fields.

Key changes:
- Migrated from nested JSON structure to flat schema with root-level metadata
- Replaced padded alignments with offset-based positioning for better efficiency
- Updated schema validation to distinguish between nucleotide and amino acid segments
- Bumped sr2silo from v1.1.1 to v1.2.0
1 parent 9731a6a commit b8dd31e

16 files changed: +589 -578 lines changed
.gitignore

Lines changed: 3 additions & 0 deletions
@@ -152,3 +152,6 @@ results
 
 # Bioinformatics files
 .bai
+
+# References directory (ignore all untracked references)
+resources/references/

README.md

Lines changed: 63 additions & 50 deletions
@@ -18,47 +18,53 @@
 [![Pyright](https://img.shields.io/badge/type%20checked-pyright-blue.svg)](https://github.com/microsoft/pyright)
 
 ### General Use: Convert Nucleotide Alignment Reads - CIGAR in .BAM to Cleartext JSON
-sr2silo can convert millions of Short-Read nucleotide read in the form of a .bam CIGAR
-alignments to cleartext alignments. Further, it will gracefully extract insertions
-and deletions. Optionally, sr2silo can translate and align each read using [diamond / blastX](https://github.com/bbuchfink/diamond). And again handle insertions and deletions.
+sr2silo can convert millions of Short-Read nucleotide reads in the form of .bam CIGAR
+alignments to cleartext alignments compatible with LAPIS-SILO v0.8.0+. It gracefully extracts insertions
+and deletions. Optionally, sr2silo can translate and align each read using [diamond / blastX](https://github.com/bbuchfink/diamond), handling insertions and deletions in amino acid sequences as well.
 
 Your input `.bam/.sam` with one line as:
-````
-294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
-````
+```text
+294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
+```
 
-sr2silo outputs per read a JSON (mock output):
+sr2silo outputs per read a JSON (compatible with LAPIS-SILO v0.8.0+):
 
-```
+```json
 {
-    "metadata":{
-        "read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
-        ...
-    },
-    "nucleotideInsertions":{
-        "main":[10 : ACTG]
-    },
-    "aminoAcidInsertions":{
-        "E":[],
-        ...
-        "ORF1a":[2323 : TG, 2389 : CA],
-        ...
-        "S":[23 : A]
-    },
-    "alignedNucleotideSequences":
-    {
-        "main":"NNNNNNNNNNNNNNNNNNCGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTGNNNNNNNNNNNNNNNNNNNNNNNN"
-    },
-    "unalignedNucleotideSequences":{
-        "main":"CGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGA...TACAGGTTCGCGACGTGCTCGTGTGAAAGATGGCACTTGTG"
-    },
-    "alignedAminoAcidSequences":{
-        "E":"",
-        ...
-        "ORF1a":"...XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVXXXXXX...",
-        ...
-        "S":""}
-}
+    "read_id": "AV233803:AV044:2411515907:1:10805:5199:3294",
+    "sample_id": "A1_05_2024_10_08",
+    "batch_id": "20241024_2411515907",
+    "sampling_date": "2024-10-08",
+    "location_name": "Lugano (TI)",
+    "read_length": "250",
+    "location_code": "05",
+    "main": {
+        "sequence": "CGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTG",
+        "insertions": ["10:ACTG", "456:TACG"],
+        "offset": 4545
+    },
+    "unaligned_main": "CGGTTTCGTCCGTGTTGCAGCCGATCATCTAGGT...TACAGGTTCGCGACGTGCTCGTGTGAAAGATGGCACTTGTG",
+    "S": {
+        "sequence": "MESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGV",
+        "insertions": ["23:A", "145:KLM"],
+        "offset": 78
+    },
+    "ORF1a": {
+        "sequence": "XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLV",
+        "insertions": ["2323:TG", "2389:CA"],
+        "offset": 678
+    },
+    "E": null,
+    "M": null,
+    "N": null,
+    "ORF1b": null,
+    "ORF3a": null,
+    "ORF6": null,
+    "ORF7a": null,
+    "ORF7b": null,
+    "ORF8": null,
+    "ORF10": null
+}
 ```
 
 The total output is handled in an `.ndjson.zst`.
@@ -75,22 +81,29 @@ For detailed information about resource requirements, especially for cluster env
 
 ### Wrangling Short-Read Genomic Alignments for SILO Database
 
-Originally this was started for wargeling short-read genomic alignments for from wastewater-sampling, into a format for easy import into [Loculus](https://github.com/loculus-project/loculus) and its sequence database SILO.
+Originally this was started for wrangling short-read genomic alignments from wastewater-sampling, into a format for easy import into [Loculus](https://github.com/loculus-project/loculus) and its sequence database SILO.
 
-sr2silo is designed to process a nucliotide alignments from `.bam` files with metadata, translate and align reads in amino acids, gracefully handling all insertions and deletions and upload the results to the backend [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO).
+sr2silo is designed to process nucleotide alignments from `.bam` files with metadata, translate and align reads in amino acids, gracefully handling all insertions and deletions and upload the results to the backend [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+.
 
-For the V-Pipe to Silo implementation we carry through the following metadata:
-```
-"metadata":{
-    "read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
-    "sample_id":"A1_05_2024_10_08",
-    "batch_id":"20241024_2411515907",
-    "sampling_date":"2024-10-08",
-    "location_name":"Lugano (TI)",
-    "read_length":"250",
-    "primer_protocol":"v532",
-    "location_code":"5"
-}
+**New Output Format for LAPIS-SILO v0.8.0+:**
+- Metadata fields are now at the root level (no nested "metadata" object)
+- Genomic segments use a structured format with `sequence`, `insertions`, and `offset` fields
+- The main nucleotide segment is required and contains the primary alignment
+- Gene segments (S, ORF1a, etc.) contain amino acid sequences or `null` if empty
+- Insertions use the format `"position:sequence"` (e.g., `"123:ACGT"`)
+- Unaligned sequences are prefixed with `unaligned_` (e.g., `unaligned_main`)
+
+For the V-Pipe to Silo implementation we include the following metadata fields at the root level:
+```json
+{
+    "read_id": "AV233803:AV044:2411515907:1:10805:5199:3294",
+    "sample_id": "A1_05_2024_10_08",
+    "batch_id": "20241024_2411515907",
+    "sampling_date": "2024-10-08",
+    "location_name": "Lugano (TI)",
+    "read_length": "250",
+    "location_code": "05"
+}
 ```
 
 ### Setting up the repository
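
Reviewer sketch for the README change in this file: the new flat record mixes metadata, segment objects, `null` segments, and `unaligned_` entries at the root level. The snippet below shows one way a consumer might take such a record apart and parse the `"position:sequence"` insertion strings. It is a minimal illustration, not sr2silo or LAPIS-SILO code: `parse_insertion` and `split_record` are hypothetical names, classifying keys by value type is an assumption, and the example sequences are shortened placeholders.

```python
from typing import Any, Dict, Optional, Tuple

# Field names taken from the README example in this commit.
SEGMENT_KEYS = {"sequence", "insertions", "offset"}


def parse_insertion(entry: str) -> Tuple[int, str]:
    """Split a "position:sequence" string such as "123:ACGT"."""
    position, sep, inserted = entry.partition(":")
    if not sep or not inserted:
        raise ValueError(f"malformed insertion: {entry!r}")
    return int(position), inserted


def split_record(
    record: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Optional[Dict[str, Any]]], Dict[str, str]]:
    """Separate a flat record into metadata, segments, and unaligned sequences.

    Assumption: any root-level dict or null value is a segment. A metadata
    field that is legitimately null would be misclassified by this sketch.
    """
    metadata: Dict[str, Any] = {}
    segments: Dict[str, Optional[Dict[str, Any]]] = {}
    unaligned: Dict[str, str] = {}
    for key, value in record.items():
        if key.startswith("unaligned_"):
            unaligned[key[len("unaligned_"):]] = value
        elif value is None:
            segments[key] = None  # empty segment
        elif isinstance(value, dict):
            missing = SEGMENT_KEYS - value.keys()
            if missing:
                raise ValueError(f"segment {key!r} lacks {sorted(missing)}")
            segments[key] = value
        else:
            metadata[key] = value
    return metadata, segments, unaligned


# Shortened, illustrative record in the shape shown in the README diff.
example = {
    "read_id": "AV233803:AV044:2411515907:1:10805:5199:3294",
    "sampling_date": "2024-10-08",
    "main": {"sequence": "CGGTTTCG", "insertions": ["10:ACTG", "456:TACG"], "offset": 4545},
    "unaligned_main": "CGGTTTCGTCC",
    "S": {"sequence": "MESLVPGF", "insertions": ["23:A"], "offset": 78},
    "E": None,
}
metadata, segments, unaligned = split_record(example)
main_insertions = [parse_insertion(e) for e in segments["main"]["insertions"]]
```

Keeping the classification in one function makes the flat layout's main ambiguity visible: nothing in the record itself marks a key as metadata or segment, so a real consumer would more likely rely on the database schema than on value types.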

conda-recipe/meta.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # conda recipe
 {% set name = "sr2silo" %}
-{% set version = "1.1.1" %}
+{% set version = "1.2.0" %}
 
 package:
   name: {{ name|lower }}

pyproject.toml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
[tool.poetry]
22
name = "sr2silo"
3-
version = "1.1.1"
4-
description = "ETL tool for importing short-read sequencing data into SILO database, powering Loculus."
3+
version = "1.2.0"
4+
description = "ETL tool for importing short-read sequencing data into SILO database (v0.8.0+), powering Loculus."
55
authors = ["Gordon Julian Koehn <gordon.koehn@dbsse.ethz.ch>"]
66
readme = "README.md"
77
packages = [{ include = "sr2silo", from = "src" }]

src/sr2silo/process/__init__.py

Lines changed: 0 additions & 2 deletions
@@ -9,7 +9,6 @@
     bam_to_fasta_query,
     bam_to_sam,
     get_gene_set_from_ref,
-    pad_alignment,
     sam_to_bam,
     sort_and_index_bam,
     sort_bam_file,
@@ -37,7 +36,6 @@
     "bam_to_sam",
     "get_gene_set_from_ref",
     "get_gene_set_from_ref",
-    "pad_alignment",
     "sam_to_bam",
     "sort_and_index_bam",
     "sort_bam_file",

src/sr2silo/process/convert.py

Lines changed: 1 addition & 40 deletions
@@ -5,7 +5,7 @@
 import logging
 import re
 from pathlib import Path
-from typing import List, Tuple, Union
+from typing import List, Tuple
 
 import pysam
 
@@ -315,45 +315,6 @@ def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
     ]
 
 
-def pad_alignment(
-    sequence: Union[List[str], str],
-    reference_start: int,
-    reference_length: int,
-    unknown_char: str = "N",
-) -> str:
-    """
-    Pad the sequence to match the reference length.
-
-    This function takes a sequence and pads it with a specified character to align it
-    with a reference sequence of a given length. The padding is added to both the
-    beginning and the end of the sequence as needed.
-
-    Args:
-        sequence (Union[List[str], str]): The sequence to be padded.
-        reference_start (int): The starting position of the reference sequence.
-        reference_length (int): The total length of the reference sequence.
-        unknown_char (str, optional): The character to use for padding. Defaults
-            to "N" for Nucleotides, choose "X" for Amino Acids.
-
-    Returns:
-        str: The padded sequence as a single string.
-    """
-
-    # Combine the aligned sequence
-    aligned_str = "".join(sequence)
-
-    # Calculate the padding needed for the left and right
-    left_padding = unknown_char * reference_start
-    right_padding = unknown_char * (
-        reference_length - len(aligned_str) - reference_start
-    )
-
-    # Pad the aligned sequence
-    padded_alignment = left_padding + aligned_str + right_padding
-
-    return padded_alignment
-
-
 def sam_to_seq_and_indels(
     seq: str, cigar: str
 ) -> Tuple[str, List[Insertion], List[Tuple[int, int]]]:
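
Reviewer sketch for the removal of `pad_alignment`: the commit message describes replacing padded alignments with offset-based positioning. The snippet below illustrates how the two representations relate, assuming `offset` is the 0-based reference start that was previously passed as `reference_start`; the diff shown here does not confirm how sr2silo computes `offset`. Both function names are hypothetical.

```python
from typing import Tuple


def to_padded_form(
    sequence: str, offset: int, reference_length: int, unknown_char: str = "N"
) -> str:
    """Rebuild the old padded string from an offset-based segment."""
    right = reference_length - offset - len(sequence)
    if offset < 0 or right < 0:
        raise ValueError("sequence does not fit inside the reference")
    return unknown_char * offset + sequence + unknown_char * right


def to_offset_form(padded: str, unknown_char: str = "N") -> Tuple[str, int]:
    """Recover (sequence, offset) from a padded alignment.

    Caveat: a genuine unknown_char at either end of the read is
    indistinguishable from padding and is stripped as well.
    """
    without_left = padded.lstrip(unknown_char)
    offset = len(padded) - len(without_left)
    return without_left.rstrip(unknown_char), offset


padded = to_padded_form("ACGT", offset=3, reference_length=10)
sequence, offset = to_offset_form(padded)
```

For a short read against a full genome, the offset form stores only the read's own characters plus one integer, while the padded form stores one character per reference position, which is presumably the efficiency gain the commit message refers to.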