[Project Status: Active](https://www.repostatus.org/#active)
9
+
# sr2silo
10
+
11
+
**Convert BAM nucleotide alignments to cleartext alignments for LAPIS-SILO**
12
+
13
+
[](https://github.com/cbg-ethz/sr2silo)
### General Use: Convert Nucleotide Alignment Reads - CIGAR in .BAM to Cleartext JSON
20
-
sr2silo can convert millions of short-read nucleotide reads in the form of `.bam` CIGAR
21
-
alignments to cleartext alignments compatible with LAPIS-SILO v0.8.0+. It gracefully extracts insertions
22
-
and deletions. Optionally, sr2silo can translate and align each read using [DIAMOND blastx](https://github.com/bbuchfink/diamond), handling insertions and deletions in amino acid sequences as well.
When running sr2silo, particularly the `process-from-vpipe` command, be aware of memory and storage requirements:
73
-
74
-
- Standard configuration uses 8 GB RAM and one CPU core
75
-
- Processing a batch of 100k reads requires ~3 GB RAM, plus ~3 GB for DIAMOND
76
-
- Temporary storage needs (especially on clusters) can reach 30-50 GB
77
-
78
-
For detailed information about resource requirements, especially for cluster environments, please refer to the [Resource Requirements documentation](docs/usage/resource_requirements.md).
79
-
80
-
### Wrangling Short-Read Genomic Alignments for SILO Database
81
-
82
-
sr2silo was originally built to convert short-read genomic alignments from wastewater sampling into a format that is easy to import into [Loculus](https://github.com/loculus-project/loculus) and its sequence database, SILO.
83
-
84
-
sr2silo processes nucleotide alignments from `.bam` files with metadata, translates and aligns the reads in amino acid space, handles all insertions and deletions, and uploads the results to the [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+ backend.
85
-
86
-
**Output Format for LAPIS-SILO v0.8.0+:**
87
-
- Metadata fields use camelCase naming (e.g., `readId`, `sampleId`, `batchId`) to align with Loculus standards
88
-
- Metadata fields are at the root level (no nested "metadata" object)
89
-
- Genomic segments use a structured format with `sequence`, `insertions`, and `offset` fields
90
-
- The main nucleotide segment is required and contains the primary alignment
91
-
- Gene segments (S, ORF1a, etc.) contain amino acid sequences or `null` if empty
92
-
- Insertions use the format `"position:sequence"` (e.g., `"123:ACGT"`)
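
As a rough illustration of the format described in this list, the following Python sketch builds one record and parses an insertion string. The segment key `main`, the gene keys, and all values are placeholders chosen for illustration, and `parse_insertion` is a hypothetical helper, not part of sr2silo.

```python
import json


def parse_insertion(entry: str) -> tuple[int, str]:
    """Split a "position:sequence" insertion string into its two parts."""
    position, sequence = entry.split(":", 1)
    return int(position), sequence


# Hypothetical record; keys for segments and all values are made up.
record = {
    # Metadata sits at the root level, in camelCase.
    "readId": "read_0001",
    "sampleId": "SAMPLE_001",
    "batchId": "batch_01",
    # Assumed key for the required nucleotide segment.
    "main": {"sequence": "ACGTACGT", "insertions": ["123:ACGT"], "offset": 100},
    # Gene segments hold amino acid sequences, or None (JSON null) if empty.
    "S": {"sequence": "MFVF", "insertions": [], "offset": 0},
    "ORF1a": None,
}

line = json.dumps(record)
position, inserted = parse_insertion(record["main"]["insertions"][0])
print(position, inserted)
```

Serializing with `json.dumps` turns `None` into `null`, which matches the convention for empty gene segments.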
93
-
94
-
**Output Schema Configuration:**
95
-
96
-
The output schema is defined in `src/sr2silo/silo_read_schema.py` using Pydantic models with field aliases for camelCase output. To modify the metadata fields:
97
-
98
-
1. Edit `src/sr2silo/silo_read_schema.py` - Add/modify fields in the `ReadMetadata` class
99
-
2. Update `resources/silo/database_config.yaml` - Ensure field names match the Pydantic aliases
100
-
3. Run validation: `python tests/test_database_config_validation.py`
101
-
102
-
The validation ensures your Pydantic schema matches the SILO database configuration.
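
Conceptually, this check amounts to comparing two sets of names. A minimal stdlib-only sketch of that idea follows. It does not use Pydantic or read the real YAML file, and `to_camel`, `find_mismatches`, and the example field lists are hypothetical stand-ins rather than the project's code.

```python
def to_camel(name: str) -> str:
    """Convert a lowercase snake_case field name to its camelCase alias."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def find_mismatches(
    model_fields: list[str], config_fields: list[str]
) -> dict[str, set[str]]:
    """Report aliases missing from the config and config names with no alias."""
    aliases = {to_camel(field) for field in model_fields}
    configured = set(config_fields)
    return {
        "missing_in_config": aliases - configured,
        "unknown_in_config": configured - aliases,
    }


# Hypothetical field lists for illustration only; note the "batchID" typo.
model_fields = ["read_id", "sample_id", "batch_id"]
config_fields = ["readId", "sampleId", "batchID"]

print(find_mismatches(model_fields, config_fields))
```

Both result sets are empty when the schema and the database configuration agree on every field name.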
103
-
104
-
For the V-Pipe to SILO implementation, we include the following metadata fields at the root level:
For development purposes or to install the latest version, you can install from source using Poetry:
21
+
</div>
143
22
144
-
The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the `environments/` directory:
23
+
---
145
24
146
-
##### Core Environment Setup
25
+
sr2silo processes short-read nucleotide alignments from `.bam` files, translates and aligns reads in amino acids, and outputs JSON compatible with [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+.
147
26
148
-
For basic usage of sr2silo:
149
-
```bash
150
-
make setup
151
-
```
152
-
This creates the core conda environment with essential dependencies and installs the package using Poetry.
153
-
154
-
##### Development Environment
155
-
156
-
For development work:
157
-
```bash
158
-
make setup-dev
159
-
```
160
-
This command sets up the development environment with Poetry.
161
-
##### Workflow Environment
162
-
163
-
For working with the Snakemake workflow:
164
-
```bash
165
-
make setup-workflow
166
-
```
167
-
This creates an environment specifically configured for running sr2silo in Snakemake workflows.
2. **Submit to Loculus:** `sr2silo submit-to-loculus --help`
202
-
203
-
#### Quick Start
204
-
205
-
```bash
206
-
# Process data
36
+
# Process BAM data
207
37
sr2silo process-from-vpipe \
208
38
--input-file input.bam \
209
39
--sample-id SAMPLE_001 \
@@ -216,27 +46,24 @@ sr2silo submit-to-loculus \
216
46
--processed-file output.ndjson.zst
217
47
```
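
The `.ndjson.zst` file name suggests zstd-compressed, newline-delimited JSON. Assuming that, and assuming the file has already been decompressed (for example with the `zstd` command-line tool), the records could be read one line at a time as sketched below. `iter_records` is a hypothetical helper and the sample lines are made up.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a decompressed output file; values are made up.
sample = io.StringIO(
    '{"readId": "read_0001", "sampleId": "SAMPLE_001"}\n'
    '{"readId": "read_0002", "sampleId": "SAMPLE_001"}\n'
)

read_ids = [record["readId"] for record in iter_records(sample)]
print(read_ids)
```

Reading line by line keeps memory use flat, which matters when a file holds millions of reads.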
218
48
219
-
**Supported organisms:** `covid`, `rsva` (and others as references are added)
220
-
221
-
For detailed usage, organism configuration, and environment variables, see the [documentation](docs/usage/).
49
+
## Documentation
222
50
223
-
### Multi-Virus Deployment
51
+
Full documentation is available at the [sr2silo documentation site](https://cbg-ethz.github.io/sr2silo/):
224
52
225
-
For instructions on deploying the workflow for multiple viruses on a cluster with automatic daily resubmission, see the [Deployment Guide](docs/usage/deployment.md) or `deployments/README.md`.
53
+
- [Configuration](https://cbg-ethz.github.io/sr2silo/usage/configuration/) - Environment variables and CLI options
54
+
- [Multi-Organism Support](https://cbg-ethz.github.io/sr2silo/usage/organisms/) - Supported organisms and adding new ones