Commit 867505d

Release v1.8.0 (#436)
2 parents b17bb89 + f302d72 commit 867505d

39 files changed, with 940 additions and 354 deletions

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ repos:
     rev: v2.4.1
     hooks:
       - id: codespell
-        args: ["--ignore-words-list", "ARTIC"]
+        args: ["--ignore-words-list", "ARTIC,dependant"]

   - repo: https://github.com/PyCQA/isort
     rev: 6.0.1

README.md

Lines changed: 32 additions & 205 deletions
@@ -1,209 +1,39 @@
-# sr2silo
-## Wrangele BAM nucleotide alignments to cleartext alignments
+<div align="center">
+
 <picture>
-  <source
-    media="(prefers-color-scheme: light)"
-    srcset="resources/graphics/logo.svg">
-  <source
-    media="(prefers-color-scheme: dark)"
-    srcset="resources/graphics/logo_dark_mode.svg">
-  <img alt="Logo" src="resources/logo.svg" width="15%" />
+  <source media="(prefers-color-scheme: light)" srcset="resources/graphics/logo.svg">
+  <source media="(prefers-color-scheme: dark)" srcset="resources/graphics/logo_dark_mode.svg">
+  <img alt="sr2silo logo" src="resources/graphics/logo.svg" width="200px" />
 </picture>

-[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
+# sr2silo
+
+**Convert BAM nucleotide alignments to cleartext alignments for LAPIS-SILO**
+
+[![Status: Public Beta](https://img.shields.io/badge/Status-Public%20Beta-blue)](https://github.com/cbg-ethz/sr2silo)
 [![CI/CD](https://github.com/cbg-ethz/sr2silo/actions/workflows/test.yml/badge.svg)](https://github.com/cbg-ethz/sr2silo/actions/workflows/test.yml)
 [![Pytest](https://img.shields.io/badge/tested%20with-pytest-0A9EDC.svg)](https://docs.pytest.org/en/stable/)
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json)](https://github.com/charliermarsh/ruff)
 [![Pyright](https://img.shields.io/badge/type%20checked-pyright-blue.svg)](https://github.com/microsoft/pyright)

-### General Use: Convert Nucleotide Alignment Reads - CIGAR in .BAM to Cleartext JSON
-sr2silo can convert millions of Short-Read nucleotide reads in the form of `.bam` CIGAR
-alignments to cleartext alignments compatible with LAPIS-SILO v0.8.0+. It gracefully extracts insertions
-and deletions. Optionally, sr2silo can translate and align each read using [diamond / blastX](https://github.com/bbuchfink/diamond), handling insertions and deletions in amino acid sequences as well.
-
-Your input `.bam/.sam` with one line as:
-```text
-294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
-```
-
-sr2silo outputs per read a JSON (compatible with LAPIS-SILO v0.8.0+):
-
-```json
-{
-  "readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
-  "sampleId": "A1_05_2024_10_08",
-  "batchId": "20241024_2411515907",
-  "samplingDate": "2024-10-08",
-  "locationName": "Lugano (TI)",
-  "locationCode": "5",
-  "sr2siloVersion": "1.3.0",
-  "main": {
-    "sequence": "CGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTG",
-    "insertions": ["10:ACTG", "456:TACG"],
-    "offset": 4545
-  },
-  "S": {
-    "sequence": "MESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGV",
-    "insertions": ["23:A", "145:KLM"],
-    "offset": 78
-  },
-  "ORF1a": {
-    "sequence": "XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLV",
-    "insertions": ["2323:TG", "2389:CA"],
-    "offset": 678
-  },
-  "E": null,
-  "M": null,
-  "N": null,
-  "ORF1b": null,
-  "ORF3a": null,
-  "ORF6": null,
-  "ORF7a": null,
-  "ORF7b": null,
-  "ORF8": null,
-  "ORF10": null
-}
-```
-
-The total output is handled in an `.ndjson.zst`.
-
-### Resource Requirements
-
-When running sr2silo, particularly the `process-from-vpipe` command, be aware of memory and storage requirements:
-
-- Standard configuration uses 8GB RAM and one CPU core
-- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
-- Temporary storage needs (especially on clusters) can reach 30-50GB
-
-For detailed information about resource requirements, especially for cluster environments, please refer to the [Resource Requirements documentation](docs/usage/resource_requirements.md).
-
-### Wrangling Short-Read Genomic Alignments for SILO Database
-
-Originally this was started for wrangling short-read genomic alignments from wastewater-sampling, into a format for easy import into [Loculus](https://github.com/loculus-project/loculus) and its sequence database SILO.
-
-sr2silo is designed to process nucleotide alignments from `.bam` files with metadata, translate and align reads in amino acids, gracefully handling all insertions and deletions and upload the results to the backend [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+.
-
-**Output Format for LAPIS-SILO v0.8.0+:**
-- Metadata fields use camelCase naming (e.g., `readId`, `sampleId`, `batchId`) to align with Loculus standards
-- Metadata fields are at the root level (no nested "metadata" object)
-- Genomic segments use a structured format with `sequence`, `insertions`, and `offset` fields
-- The main nucleotide segment is required and contains the primary alignment
-- Gene segments (S, ORF1a, etc.) contain amino acid sequences or `null` if empty
-- Insertions use the format `"position:sequence"` (e.g., `"123:ACGT"`)
-
-**Output Schema Configuration:**
-
-The output schema is defined in `src/sr2silo/silo_read_schema.py` using Pydantic models with field aliases for camelCase output. To modify the metadata fields:
-
-1. Edit `src/sr2silo/silo_read_schema.py` - Add/modify fields in `ReadMetadata` class
-2. Update `resources/silo/database_config.yaml` - Ensure field names match the Pydantic aliases
-3. Run validation: `python tests/test_database_config_validation.py`
-
-The validation ensures your Pydantic schema matches the SILO database configuration.
-
-For the V-Pipe to Silo implementation we include the following metadata fields at the root level:
-```json
-{
-  "readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
-  "sampleId": "A1_05_2024_10_08",
-  "batchId": "20241024_2411515907",
-  "samplingDate": "2024-10-08",
-  "locationName": "Lugano (TI)",
-  "locationCode": "5",
-  "sr2siloVersion": "1.3.0"
-}
-```
-
-### Setting up the repository
-
-To build the package and maintain dependencies, we use [Poetry](https://python-poetry.org/).
-In particular, it's good to install it and become familiar with its basic functionalities by reading the documentation.
-
-### Installation
-
-sr2silo can be installed either from Bioconda or from source.
-
-#### Install from Bioconda
-
-The easiest way to install sr2silo is through the Bioconda channel:
-
-```bash
-# Add necessary channels if you haven't already
-conda config --add channels defaults
-conda config --add channels bioconda
-conda config --add channels conda-forge
-
-# Install sr2silo
-conda install sr2silo
-```
-
-#### Install from Source
+[Documentation](https://cbg-ethz.github.io/sr2silo/) · [Installation](#installation) · [Quick Start](#quick-start)

-For development purposes or to install the latest version, you can install from source using Poetry:
+</div>

-The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the `environments/` directory:
+---

-##### Core Environment Setup
+sr2silo processes short-read nucleotide alignments from `.bam` files, translates and aligns reads in amino acids, and outputs JSON compatible with [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+.

-For basic usage of sr2silo:
-```bash
-make setup
-```
-This creates the core conda environment with essential dependencies and installs the package using Poetry.
-
-##### Development Environment
-
-For development work:
-```bash
-make setup-dev
-```
-This command sets up the development environment with Poetry.
-##### Workflow Environment
-
-For working with the snakemake workflow:
-```bash
-make setup-workflow
-```
-This creates an environment specifically configured for running the sr2silo in snakemake workflows.
-
-##### All Environments
+## Installation

-You can set up all environments at once:
 ```bash
-make setup-all
+conda install -c bioconda sr2silo
 ```

-### Additional Setup for Development
+## Quick Start

-After setting up the development environment:
 ```bash
-conda activate sr2silo-dev
-poetry install --with dev
-poetry run pre-commit install
-```
-
-### Run Tests
-
-```bash
-make test
-```
-or
-```bash
-conda activate sr2silo-dev
-pytest
-```
-
-### Usage
-
-sr2silo follows a two-step workflow:
-
-1. **Process data:** `sr2silo process-from-vpipe --help`
-2. **Submit to Loculus:** `sr2silo submit-to-loculus --help`
-
-#### Quick Start
-
-```bash
-# Process data
+# Process BAM data
 sr2silo process-from-vpipe \
     --input-file input.bam \
     --sample-id SAMPLE_001 \
@@ -216,27 +46,24 @@ sr2silo submit-to-loculus \
     --processed-file output.ndjson.zst
 ```

-**Supported organisms:** `covid`, `rsva` (and others as references are added)
-
-For detailed usage, organism configuration, and environment variables, see the [documentation](docs/usage/).
+## Documentation

-### Multi-Virus Deployment
+Full documentation is available at the [sr2silo documentation site](https://cbg-ethz.github.io/sr2silo/):

-For instructions on deploying the workflow for multiple viruses on a cluster with automatic daily resubmission, see the [Deployment Guide](docs/usage/deployment.md) or `deployments/README.md`.
+- [Configuration](https://cbg-ethz.github.io/sr2silo/usage/configuration/) - Environment variables and CLI options
+- [Multi-Organism Support](https://cbg-ethz.github.io/sr2silo/usage/organisms/) - Supported organisms and adding new ones
+- [Deployment](https://cbg-ethz.github.io/sr2silo/usage/deployment/) - Multi-virus cluster deployment
+- [API Reference](https://cbg-ethz.github.io/sr2silo/api/loculus/) - Python API documentation

-### Environment Variables
-
-sr2silo supports configuration via environment variables (CLI parameters take precedence):
+## Development

 ```bash
-export ORGANISM=covid
-export KEYCLOAK_TOKEN_URL=https://auth.example.com/token
-export BACKEND_URL=https://api.example.com/submit
-export GROUP_ID=123
-export USERNAME=your-username
-export PASSWORD=your-password
-
-sr2silo process-from-vpipe --input-file input.bam --sample-id SAMPLE_001 ...
+make setup-dev
+conda activate sr2silo-dev
+poetry install --with dev
+pytest
 ```

-See [docs/usage/](docs/usage/) for complete environment variable reference.
+## License
+
+See [LICENSE](LICENSE) for details.
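
The README text removed in this commit describes the per-read output: each segment carries `sequence`, `insertions`, and `offset`; insertions are `"position:sequence"` strings; empty gene segments are `null`. As a rough illustration of that description only, here is a minimal Python sketch of reading one such record. The names `Insertion`, `parse_insertion`, and `segment_insertions` are hypothetical and not part of sr2silo's API, and the sample record is abbreviated from the README example.

```python
import json
from typing import NamedTuple


class Insertion(NamedTuple):
    """One insertion, split from a "position:sequence" string (hypothetical helper type)."""

    position: int
    sequence: str


def parse_insertion(text: str) -> Insertion:
    """Split a string such as "123:ACGT" into its position and inserted sequence."""
    position, sequence = text.split(":", 1)
    return Insertion(int(position), sequence)


def segment_insertions(record: dict) -> dict[str, list[Insertion]]:
    """Collect insertions per segment, skipping metadata fields and null segments."""
    result: dict[str, list[Insertion]] = {}
    for name, value in record.items():
        # Segments are objects with sequence/insertions/offset; metadata fields
        # are plain strings at the root level and empty gene segments are null.
        if isinstance(value, dict) and "insertions" in value:
            result[name] = [parse_insertion(item) for item in value["insertions"]]
    return result


# One abbreviated NDJSON line in the shape the README describes.
line = (
    '{"readId": "r1", '
    '"main": {"sequence": "ACGT", "insertions": ["10:ACTG", "456:TACG"], "offset": 4545}, '
    '"E": null}'
)
parsed = segment_insertions(json.loads(line))
print(parsed)
```

Detecting segments by the presence of an `insertions` key, rather than by a fixed list of gene names, keeps the sketch independent of the organism; this is an assumption about what is convenient, not a statement about how sr2silo itself reads records.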

deployments/covid/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ KEYCLOAK_TOKEN_URL: "https://auth.db.wasap.genspectrum.org/realms/loculus/protoc
 BACKEND_URL: "https://api.db.wasap.genspectrum.org/backend"
 GROUP_ID: 1
 ORGANISM: "covid"
-LAPIS_URL: "https://lapis.wasap.genspectrum.org/"
+LAPIS_URL: "https://lapis.wasap.genspectrum.org/covid"

 # Auto-release: automatically approve sequences after submission
 AUTO_RELEASE: true

deployments/rsva/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ KEYCLOAK_TOKEN_URL: "https://auth.db.wasap.genspectrum.org/realms/loculus/protoc
 BACKEND_URL: "https://api.db.wasap.genspectrum.org/backend"
 GROUP_ID: 1
 ORGANISM: "rsva"
-# LAPIS_URL:
+LAPIS_URL: "https://lapis.wasap.genspectrum.org/rsva"

 # Auto-release: automatically approve sequences after submission
 AUTO_RELEASE: true

deployments/submit-daily.sbatch

Lines changed: 3 additions & 3 deletions
@@ -21,7 +21,7 @@ set -euo pipefail
 : "${VIRUS:?Set VIRUS via --export=VIRUS=covid}"

 PROJECT_ROOT="/cluster/project/pangolin/research/W-ASAP"
-CONDA_ENV="sr2silo-workflow"
+CONDA_ENV="base"
 CORES="${SLURM_CPUS_PER_TASK:-20}"

 echo "=== sr2silo daily: $VIRUS (job $SLURM_JOB_ID) ==="
@@ -31,7 +31,7 @@ echo "Node: $SLURM_NODELIST | Cores: $CORES | $(date)"
 module load eth_proxy 2>/dev/null || true

 # Initialize conda (using hook to preserve system PATH including sbatch)
-CONDA_EXE="/cluster/work/bewi/members/koehng/miniconda3/bin/conda"
+CONDA_EXE="/cluster/project/pangolin/resources/miniconda3/bin/conda"
 eval "$("$CONDA_EXE" shell.bash hook)"
 conda activate "$CONDA_ENV"

@@ -60,7 +60,7 @@ echo "Next run resources: CPUS=$NEXT_CPUS, MEM=$NEXT_MEM (MEM_PER_CPU=$MEM_PER_C
 # Run workflow and capture exit code
 cd "$PROJECT_ROOT/sr2silo/workflow"
 set +e # Temporarily disable exit on error
-snakemake --configfile "../deployments/$VIRUS/config.yaml" -j"$CORES" --rerun-incomplete --keep-going
+snakemake --configfile "../deployments/$VIRUS/config.yaml" -j"$CORES" --rerun-incomplete --keep-going --rerun-trigger mtime --conda-frontend conda --conda-prefix "/cluster/project/pangolin/resources/snake-envs" --use-conda
 SNAKEMAKE_EXIT=$?
 set -e # Re-enable exit on error

docs/api/loculus.md

Lines changed: 14 additions & 4 deletions
@@ -1,5 +1,15 @@
-# sr2silo.silo
+# Loculus Integration

-:::sr2silo.loculus.LoculusClient
-:::sr2silo.loculus.Submission
-:::sr2silo.loculus.LapisClient
+Client classes for interacting with Loculus/LAPIS backends.
+
+## LoculusClient
+
+::: sr2silo.loculus.LoculusClient
+
+## Submission
+
+::: sr2silo.loculus.Submission
+
+## LapisClient
+
+::: sr2silo.loculus.LapisClient
