cbg-ethz
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 32 additions & 205 deletions
diff --git a/deployments/covid/config.yaml b/deployments/covid/config.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/deployments/rsva/config.yaml b/deployments/rsva/config.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/deployments/submit-daily.sbatch b/deployments/submit-daily.sbatch
Lines changed: 3 additions & 3 deletions
diff --git a/docs/api/loculus.md b/docs/api/loculus.md
Lines changed: 14 additions & 4 deletions
@@ -20,7 +20,7 @@ repos:
     rev: v2.4.1
     hooks:
       - id: codespell
-        args: ["--ignore-words-list", "ARTIC"]
+        args: ["--ignore-words-list", "ARTIC,dependant"]
 
   - repo: https://github.com/PyCQA/isort
     rev: 6.0.1
 
@@ -1,209 +1,39 @@
-# sr2silo
-## Wrangele BAM nucleotide alignments to cleartext alignments
+<div align="center">
+
 <picture>
-  <source
-    media="(prefers-color-scheme: light)"
-    srcset="resources/graphics/logo.svg">
-  <source
-    media="(prefers-color-scheme: dark)"
-    srcset="resources/graphics/logo_dark_mode.svg">
-  <img alt="Logo" src="resources/logo.svg" width="15%" />
+  <source media="(prefers-color-scheme: light)" srcset="resources/graphics/logo.svg">
+  <source media="(prefers-color-scheme: dark)" srcset="resources/graphics/logo_dark_mode.svg">
+  <img alt="sr2silo logo" src="resources/graphics/logo.svg" width="200px" />
 </picture>
 
-[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
+# sr2silo
+
+**Convert BAM nucleotide alignments to cleartext alignments for LAPIS-SILO**
+
+[![Status: Public Beta](https://img.shields.io/badge/Status-Public%20Beta-blue)](https://github.com/cbg-ethz/sr2silo)
 [![CI/CD](https://github.com/cbg-ethz/sr2silo/actions/workflows/test.yml/badge.svg)](https://github.com/cbg-ethz/sr2silo/actions/workflows/test.yml)
 [![Pytest](https://img.shields.io/badge/tested%20with-pytest-0A9EDC.svg)](https://docs.pytest.org/en/stable/)
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json)](https://github.com/charliermarsh/ruff)
 [![Pyright](https://img.shields.io/badge/type%20checked-pyright-blue.svg)](https://github.com/microsoft/pyright)
 
-### General Use: Convert Nucleotide Alignment Reads - CIGAR in .BAM to Cleartext JSON
-sr2silo can convert millions of Short-Read nucleotide reads in the form of `.bam` CIGAR
-alignments to cleartext alignments compatible with LAPIS-SILO v0.8.0+. It gracefully extracts insertions
-and deletions. Optionally, sr2silo can translate and align each read using [diamond / blastX](https://github.com/bbuchfink/diamond), handling insertions and deletions in amino acid sequences as well.
-
-Your input `.bam/.sam` with one line as:
-```text
-294 163 NC_045512.2 79  60  31S220M =   197 400 CTCTTGTAGAT FGGGHHHHLMM ...
-```
-
-sr2silo outputs per read a JSON (compatible with LAPIS-SILO v0.8.0+):
-
-```json
-{
-  "readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
-  "sampleId": "A1_05_2024_10_08",
-  "batchId": "20241024_2411515907",
-  "samplingDate": "2024-10-08",
-  "locationName": "Lugano (TI)",
-  "locationCode": "5",
-  "sr2siloVersion": "1.3.0",
-  "main": {
-    "sequence": "CGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTG",
-    "insertions": ["10:ACTG", "456:TACG"],
-    "offset": 4545
-  },
-  "S": {
-    "sequence": "MESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGV",
-    "insertions": ["23:A", "145:KLM"],
-    "offset": 78
-  },
-  "ORF1a": {
-    "sequence": "XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLV",
-    "insertions": ["2323:TG", "2389:CA"],
-    "offset": 678
-  },
-  "E": null,
-  "M": null,
-  "N": null,
-  "ORF1b": null,
-  "ORF3a": null,
-  "ORF6": null,
-  "ORF7a": null,
-  "ORF7b": null,
-  "ORF8": null,
-  "ORF10": null
-}
-```
-
-The total output is handled in an `.ndjson.zst`.
-
-### Resource Requirements
-
-When running sr2silo, particularly the `process-from-vpipe` command, be aware of memory and storage requirements:
-
-- Standard configuration uses 8GB RAM and one CPU core
-- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
-- Temporary storage needs (especially on clusters) can reach 30-50GB
-
-For detailed information about resource requirements, especially for cluster environments, please refer to the [Resource Requirements documentation](docs/usage/resource_requirements.md).
-
-### Wrangling Short-Read Genomic Alignments for SILO Database
-
-Originally this was started for wrangling short-read genomic alignments from wastewater-sampling, into a format for easy import into [Loculus](https://github.com/loculus-project/loculus) and its sequence database SILO.
-
-sr2silo is designed to process nucleotide alignments from `.bam` files with metadata, translate and align reads in amino acids, gracefully handling all insertions and deletions and upload the results to the backend [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+.
-
-**Output Format for LAPIS-SILO v0.8.0+:**
-- Metadata fields use camelCase naming (e.g., `readId`, `sampleId`, `batchId`) to align with Loculus standards
-- Metadata fields are at the root level (no nested "metadata" object)
-- Genomic segments use a structured format with `sequence`, `insertions`, and `offset` fields
-- The main nucleotide segment is required and contains the primary alignment
-- Gene segments (S, ORF1a, etc.) contain amino acid sequences or `null` if empty
-- Insertions use the format `"position:sequence"` (e.g., `"123:ACGT"`)
-
-**Output Schema Configuration:**
-
-The output schema is defined in `src/sr2silo/silo_read_schema.py` using Pydantic models with field aliases for camelCase output. To modify the metadata fields:
-
-1. Edit `src/sr2silo/silo_read_schema.py` - Add/modify fields in `ReadMetadata` class
-2. Update `resources/silo/database_config.yaml` - Ensure field names match the Pydantic aliases
-3. Run validation: `python tests/test_database_config_validation.py`
-
-The validation ensures your Pydantic schema matches the SILO database configuration.
-
-For the V-Pipe to Silo implementation we include the following metadata fields at the root level:
-```json
-{
-  "readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
-  "sampleId": "A1_05_2024_10_08",
-  "batchId": "20241024_2411515907",
-  "samplingDate": "2024-10-08",
-  "locationName": "Lugano (TI)",
-  "locationCode": "5",
-  "sr2siloVersion": "1.3.0"
-}
-```
-
-### Setting up the repository
-
-To build the package and maintain dependencies, we use [Poetry](https://python-poetry.org/).
-In particular, it's good to install it and become familiar with its basic functionalities by reading the documentation.
-
-### Installation
-
-sr2silo can be installed either from Bioconda or from source.
-
-#### Install from Bioconda
-
-The easiest way to install sr2silo is through the Bioconda channel:
-
-```bash
-# Add necessary channels if you haven't already
-conda config --add channels defaults
-conda config --add channels bioconda
-conda config --add channels conda-forge
-
-# Install sr2silo
-conda install sr2silo
-```
-
-#### Install from Source
+[Documentation](https://cbg-ethz.github.io/sr2silo/) · [Installation](#installation) · [Quick Start](#quick-start)
 
-For development purposes or to install the latest version, you can install from source using Poetry:
+</div>
 
-The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the `environments/` directory:
+---
 
-##### Core Environment Setup
+sr2silo processes short-read nucleotide alignments from `.bam` files, translates and aligns reads in amino acids, and outputs JSON compatible with [LAPIS-SILO](https://github.com/GenSpectrum/LAPIS-SILO) v0.8.0+.
 
-For basic usage of sr2silo:
-```bash
-make setup
-```
-This creates the core conda environment with essential dependencies and installs the package using Poetry.
-
-##### Development Environment
-
-For development work:
-```bash
-make setup-dev
-```
-This command sets up the development environment with Poetry.
-##### Workflow Environment
-
-For working with the snakemake workflow:
-```bash
-make setup-workflow
-```
-This creates an environment specifically configured for running the sr2silo in snakemake workflows.
-
-##### All Environments
+## Installation
 
-You can set up all environments at once:
 ```bash
-make setup-all
+conda install -c bioconda sr2silo
 ```
 
-### Additional Setup for Development
+## Quick Start
 
-After setting up the development environment:
 ```bash
-conda activate sr2silo-dev
-poetry install --with dev
-poetry run pre-commit install
-```
-
-### Run Tests
-
-```bash
-make test
-```
-or
-```bash
-conda activate sr2silo-dev
-pytest
-```
-
-### Usage
-
-sr2silo follows a two-step workflow:
-
-1. **Process data:** `sr2silo process-from-vpipe --help`
-2. **Submit to Loculus:** `sr2silo submit-to-loculus --help`
-
-#### Quick Start
-
-```bash
-# Process data
+# Process BAM data
 sr2silo process-from-vpipe \
     --input-file input.bam \
     --sample-id SAMPLE_001 \
@@ -216,27 +46,24 @@ sr2silo submit-to-loculus \
     --processed-file output.ndjson.zst
 ```
 
-**Supported organisms:** `covid`, `rsva` (and others as references are added)
-
-For detailed usage, organism configuration, and environment variables, see the [documentation](docs/usage/).
+## Documentation
 
-### Multi-Virus Deployment
+Full documentation is available at the [sr2silo documentation site](https://cbg-ethz.github.io/sr2silo/):
 
-For instructions on deploying the workflow for multiple viruses on a cluster with automatic daily resubmission, see the [Deployment Guide](docs/usage/deployment.md) or `deployments/README.md`.
+- [Configuration](https://cbg-ethz.github.io/sr2silo/usage/configuration/) - Environment variables and CLI options
+- [Multi-Organism Support](https://cbg-ethz.github.io/sr2silo/usage/organisms/) - Supported organisms and adding new ones
+- [Deployment](https://cbg-ethz.github.io/sr2silo/usage/deployment/) - Multi-virus cluster deployment
+- [API Reference](https://cbg-ethz.github.io/sr2silo/api/loculus/) - Python API documentation
 
-### Environment Variables
-
-sr2silo supports configuration via environment variables (CLI parameters take precedence):
+## Development
 
 ```bash
-export ORGANISM=covid
-export KEYCLOAK_TOKEN_URL=https://auth.example.com/token
-export BACKEND_URL=https://api.example.com/submit
-export GROUP_ID=123
-export USERNAME=your-username
-export PASSWORD=your-password
-
-sr2silo process-from-vpipe --input-file input.bam --sample-id SAMPLE_001 ...
+make setup-dev
+conda activate sr2silo-dev
+poetry install --with dev
+pytest
 ```
 
-See [docs/usage/](docs/usage/) for complete environment variable reference.
+## License
+
+See [LICENSE](LICENSE) for details.
@@ -29,7 +29,7 @@ KEYCLOAK_TOKEN_URL: "https://auth.db.wasap.genspectrum.org/realms/loculus/protoc
 BACKEND_URL: "https://api.db.wasap.genspectrum.org/backend"
 GROUP_ID: 1
 ORGANISM: "covid"
-LAPIS_URL: "https://lapis.wasap.genspectrum.org/"
+LAPIS_URL: "https://lapis.wasap.genspectrum.org/covid"
 
 # Auto-release: automatically approve sequences after submission
 AUTO_RELEASE: true
 
@@ -29,7 +29,7 @@ KEYCLOAK_TOKEN_URL: "https://auth.db.wasap.genspectrum.org/realms/loculus/protoc
 BACKEND_URL: "https://api.db.wasap.genspectrum.org/backend"
 GROUP_ID: 1
 ORGANISM: "rsva"
-# LAPIS_URL:
+LAPIS_URL: "https://lapis.wasap.genspectrum.org/rsva"
 
 # Auto-release: automatically approve sequences after submission
 AUTO_RELEASE: true
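
Review note: the two `config.yaml` hunks above make `LAPIS_URL` end in the same value as `ORGANISM` (`/covid`, `/rsva`). Whether sr2silo itself enforces this is not visible in the diff, so the following is only a minimal sketch of a sanity check one might run over the deployment configs. The function name `lapis_url_matches_organism` is hypothetical, and the "last path segment equals organism" rule is an assumption inferred from these two hunks alone.

```python
from urllib.parse import urlparse


def lapis_url_matches_organism(config: dict) -> bool:
    """Return True if the last path segment of LAPIS_URL equals ORGANISM.

    Hypothetical helper: the rule is inferred from the diff, not from sr2silo.
    """
    url = config.get("LAPIS_URL")
    organism = config.get("ORGANISM")
    if not url or not organism:
        # A missing or commented-out LAPIS_URL (the old rsva state) fails the check.
        return False
    # Drop empty segments so ".../covid/" and ".../covid" are treated alike.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return bool(segments) and segments[-1] == organism


# New-side values copied from the two hunks above.
covid = {"ORGANISM": "covid", "LAPIS_URL": "https://lapis.wasap.genspectrum.org/covid"}
rsva = {"ORGANISM": "rsva", "LAPIS_URL": "https://lapis.wasap.genspectrum.org/rsva"}

# Old-side values: base URL without an organism segment, and no URL at all.
old_covid = {"ORGANISM": "covid", "LAPIS_URL": "https://lapis.wasap.genspectrum.org/"}
old_rsva = {"ORGANISM": "rsva"}

for name, cfg in [("covid", covid), ("rsva", rsva),
                  ("old covid", old_covid), ("old rsva", old_rsva)]:
    print(name, lapis_url_matches_organism(cfg))
```

In practice the dictionaries would come from loading each `deployments/<virus>/config.yaml`; they are written inline here to keep the sketch free of third-party dependencies.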
 
@@ -21,7 +21,7 @@ set -euo pipefail
 : "${VIRUS:?Set VIRUS via --export=VIRUS=covid}"
 
 PROJECT_ROOT="/cluster/project/pangolin/research/W-ASAP"
-CONDA_ENV="sr2silo-workflow"
+CONDA_ENV="base"
 CORES="${SLURM_CPUS_PER_TASK:-20}"
 
 echo "=== sr2silo daily: $VIRUS (job $SLURM_JOB_ID) ==="
@@ -31,7 +31,7 @@ echo "Node: $SLURM_NODELIST | Cores: $CORES | $(date)"
 module load eth_proxy 2>/dev/null || true
 
 # Initialize conda (using hook to preserve system PATH including sbatch)
-CONDA_EXE="/cluster/work/bewi/members/koehng/miniconda3/bin/conda"
+CONDA_EXE="/cluster/project/pangolin/resources/miniconda3/bin/conda"
 eval "$("$CONDA_EXE" shell.bash hook)"
 conda activate "$CONDA_ENV"
 
@@ -60,7 +60,7 @@ echo "Next run resources: CPUS=$NEXT_CPUS, MEM=$NEXT_MEM (MEM_PER_CPU=$MEM_PER_C
 # Run workflow and capture exit code
 cd "$PROJECT_ROOT/sr2silo/workflow"
 set +e  # Temporarily disable exit on error
-snakemake --configfile "../deployments/$VIRUS/config.yaml" -j"$CORES" --rerun-incomplete --keep-going
+snakemake --configfile "../deployments/$VIRUS/config.yaml" -j"$CORES" --rerun-incomplete --keep-going --rerun-trigger mtime --conda-frontend conda --conda-prefix "/cluster/project/pangolin/resources/snake-envs" --use-conda
 SNAKEMAKE_EXIT=$?
 set -e  # Re-enable exit on error
 
 
@@ -1,5 +1,15 @@
-# sr2silo.silo
+# Loculus Integration
 
-:::sr2silo.loculus.LoculusClient
-:::sr2silo.loculus.Submission
-:::sr2silo.loculus.LapisClient
+Client classes for interacting with Loculus/LAPIS backends.
+
+## LoculusClient
+
+::: sr2silo.loculus.LoculusClient
+
+## Submission
+
+::: sr2silo.loculus.Submission
+
+## LapisClient
+
+::: sr2silo.loculus.LapisClient