seqr-helm

Helm charts for the seqr platform

Overview

This repo consists of helm charts defining the seqr platform. Helm is a package manager for Kubernetes, an open source system for automating deployment and management of containerized applications.

  1. The seqr application chart consists of deployments for the seqr application, the redis cache, the postgresql relational database and the clickhouse columnar analytics database. The redis and postgresql services may be disabled if seqr is running in a cloud environment with access to managed services. Note that this deployment does not include support for elasticsearch.
  2. The pipeline-runner application chart contains the multiple services that make up the seqr loading pipeline. This chart also runs the luigi scheduler user interface to view running pipeline tasks.
  3. A lib library chart for resources shared between the other charts.
  4. The seqr-platform umbrella chart that bundles the component charts into a single installable.

Instructions for Initial Deployment

The Kubernetes ecosystem contains many standardized and custom solutions across a wide range of cloud and on-premises environments. To avoid the complexity of a full-fledged production environment and to achieve parity with the existing docker-compose, we recommend setting up a simple local Kubernetes cluster on an on-premises server or a cloud Virtual Machine with at least 32GB of memory and 500GB of disk space. While there is no requirement for the minimum number of CPUs, having more available will significantly speed up data loading and some searches.

Install the four required kubernetes infrastructure components:

  1. The docker container engine.
  2. The kubectl command line client.
  3. The kind local cluster manager.
  4. The helm package manager.

Then:

  1. Create a local /var/seqr directory to be mounted into the Kubernetes cluster. This will host all seqr application data:
    sudo mkdir -p /var/seqr
    sudo chmod 777 /var/seqr 
    
  2. Start a kind cluster:
    curl https://raw.githubusercontent.com/broadinstitute/seqr-helm/refs/heads/main/kind.yaml > kind.yaml
    kind create cluster --config kind.yaml
    
    Note that kubernetes can have unexpected behavior when run with sudo. Make sure to run this and all other kubectl/kind/helm commands without it.
  3. Create the Required Secrets in your cluster using kubectl.
  4. Migrate any existing application data.
  5. Install the seqr-platform chart with any override values:
    helm repo add seqr-helm https://broadinstitute.github.io/seqr-helm
    helm install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform
    

After install, you should expect to see something like:

helm install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform 
NAME: YOUR_INSTITUTION_NAME-seqr
LAST DEPLOYED: Wed Oct 16 14:50:22 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

The first deployment will include a download of all of the genomic reference data (~350GB as of 6/2025 but variable). It is likely to be slow, but can be monitored by checking the contents of /var/seqr/seqr-reference-data. Additionally, you may check the status of the services with:

kubectl get pods
NAME                                        READY   STATUS      RESTARTS      AGE
seqr-clickhouse-shard0-0                    4/4     Running     0             22m
pipeline-runner-api-5557bbc7-vrtcj          2/2     Running     0             22m
pipeline-runner-ui-749c94468f-62rtv         1/1     Running     0             22m
seqr-68d7b855fb-bjppn                       1/1     Running     0             22m
seqr-check-new-samples-job-28818190-vlhxj   0/1     Completed   0             22m
seqr-postgresql-0                           1/1     Running     0             22m
seqr-redis-master-0                         1/1     Running     0             22m

While the reference data is downloading, the pipeline-runner-api pod should be in the Init state:

pipeline-runner-api-5557bbc7-vrtcj        0/2     Init:0/4    0             8m51s

Once services are healthy, you may create a seqr admin user using the pod name from the above output:

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py createsuperuser

Required Secrets

The seqr application expects a few secrets to be defined for the services to start. The default expected secrets are declared in the default values.yaml file of the seqr application chart. You should create these secrets in your kubernetes cluster prior to attempting to install the chart.

  1. A secret containing a password field for the postgres database password. By default this secret is named postgres-secrets.
  2. A secret containing a django_key field for the django security key. By default this secret is named seqr-secrets.
  3. A secret containing admin_password, writer_password, and reader_password fields for the clickhouse database passwords. By default this secret is named clickhouse-secrets. Note that due to some complexity in how we handle the ClickHouse password internally, you should use alphanumeric characters only!

Here's how you might create the secrets:

kubectl create secret generic postgres-secrets \
  --from-literal=password='super-secure-password'

kubectl create secret generic seqr-secrets \
  --from-literal=django_key='securely-generated-key'

kubectl create secret generic clickhouse-secrets \
  --from-literal=admin_password='clickhouseadminpassword' \
  --from-literal=reader_password='clickhousereaderpassword' \
  --from-literal=writer_password='clickhousewriterpassword'
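If you need to generate values for these secrets, one minimal sketch is below. The gen_secret helper is hypothetical (not part of seqr-helm); it emits alphanumeric strings, which also satisfies the ClickHouse alphanumeric-only requirement above.

```shell
# Hypothetical helper: emit a 32-character alphanumeric secret.
# Alphanumeric-only output is safe for the clickhouse-secrets fields.
gen_secret() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32
  echo
}

# Example usage (uncomment to create the secret directly):
# kubectl create secret generic clickhouse-secrets \
#   --from-literal=admin_password="$(gen_secret)" \
#   --from-literal=reader_password="$(gen_secret)" \
#   --from-literal=writer_password="$(gen_secret)"
gen_secret
```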

Alternatively, you can use your preferred method for defining secrets in kubernetes. For example, you might use External Secrets to synchronize secrets from your cloud provider into your kubernetes cluster.

Values/Environment Overrides

All default values in the seqr-platform chart may be overridden with helm's Values file functionality. For example, to disable the postgresql deployment, you might create a file my-values.yaml with the contents:

seqr:
  postgresql:
    enabled: false

This is also the recommended pattern for overriding any seqr environment variables:

seqr:
  environment:
    GUNICORN_WORKER_THREADS: "8"

A more comprehensive example of what this may look like, and how the different values are formatted in practice, is found in the seqr unit tests.
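For instance, a single my-values.yaml can carry both of the overrides above; pass it to helm install or helm upgrade with -f my-values.yaml. The specific values here are illustrative, not recommendations:

```yaml
# my-values.yaml: combined overrides (illustrative values)
seqr:
  postgresql:
    enabled: false
  environment:
    GUNICORN_WORKER_THREADS: "8"
```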

Migrating Application Data from docker-compose.yaml

  • If you wish to preserve your existing application state in postgresql, you may move your existing ./data/postgres to /var/seqr/postgresql-data. You should see:
cat /var/seqr/postgresql-data/PG_VERSION
12
  • If the locations of your IGV or alignment files have changed as part of the migration, you may update the saved paths with:
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py update_igv_location old_prefix new_prefix

Note that you do not need to migrate any elasticsearch data; however, to continue to perform searches, all data will need to be reloaded from VCF using the standard data loading process.

After re-loading all the data that will continue to be used in search, we recommend running the following one-time command to ensure that previously saved variants stay in sync with the latest available annotations going forward:

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py set_saved_variant_key

Updating seqr

To fetch the latest versions of the helm infrastructure and seqr application code, you may run:

helm repo update
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform

To update reference data in seqr, such as OMIM, HPO, etc., run the following. By default, this will be run automatically as a cron job.

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py update_all_reference_data

To update the ClinVar reference data used in search, run the following. By default, this will be run automatically as a cron job.

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py reload_clinvar_all_variants

Migrating seqr from the hail-search backend to the clickhouse backend

The seqr-platform update from 1.45.0-hail-search-final to 2.0.0 is breaking and requires manual intervention: you may need to update an environment variable, and you must migrate the search data. Here is the full sequence of steps:

  1. Update HAIL_SEARCH_DATA_DIR to PIPELINE_DATA_DIR. The HAIL_SEARCH_DATA_DIR environment variable has been deprecated in favor of a PIPELINE_DATA_DIR variable shared between the application and pipeline. If you have not altered your HAIL_SEARCH_DATA_DIR and wish to continue using the defaults, you should rename your HAIL_SEARCH_DATA_DIR to the default PIPELINE_DATA_DIR:
sudo mv /var/seqr/seqr-hail-search-data /var/seqr/pipeline-data

and proceed to step #2.

If you have altered the default HAIL_SEARCH_DATA_DIR you should set in your override my-values.yaml:

global:
  seqr:
    environment:
      PIPELINE_DATA_DIR: # current value of HAIL_SEARCH_DATA_DIR
  2. Upgrade your helm installation:
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform -f my-values.yaml

This step should remove the hail-search pod and create a clickhouse pod.

  3. Export the hail-search tables to the clickhouse-ingestable format. Run the following commands:
# Get the POD-ID of the pipeline-runner pod
$ kubectl get pods | grep pipeline-runner-api
pipeline-runner-api-POD-ID            2/2     Running     0          119m

# Login to the pipeline-runner sidecar
$ kubectl exec pipeline-runner-api-POD-ID -c pipeline-runner-api-sidecar -it -- bash

# Run the migration script
$ uv run python3 -m 'v03_pipeline.bin.migrate_all_projects_to_clickhouse'

The migration is fully supported whether or not you have configured your environment to run the loading pipeline on GCP Dataproc, and it will run in the same environment as data loading. It is also idempotent, so it can safely be run multiple times in case of failures.

The migration should take a few minutes per project, substantially less than loading directly from VCF. To check the status of the migration and to debug if required:

  • Each project hail table is exported into the format produced by the loading pipeline as if it were a new run. For each of your loaded projects, you should expect a directory to be created:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}
  • Once the hail tables for the project have been successfully converted to parquet, you should expect a new file:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}/_SUCCESS
  • Once the run has been successfully loaded into clickhouse, you should expect a new file:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}/_CLICKHOUSE_LOAD_SUCCESS
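The marker files above can be tallied to watch overall progress. A minimal sketch, assuming PIPELINE_DATA_DIR points at your pipeline data directory (defaulting here to /var/seqr/pipeline-data):

```shell
# Count migration runs at each stage; run directories follow the
# hail_search_to_clickhouse_migration-* naming described above.
runs="${PIPELINE_DATA_DIR:-/var/seqr/pipeline-data}"
echo "exported to parquet:    $(find "$runs" -path '*hail_search_to_clickhouse_migration*' -name _SUCCESS 2>/dev/null | wc -l)"
echo "loaded into clickhouse: $(find "$runs" -path '*hail_search_to_clickhouse_migration*' -name _CLICKHOUSE_LOAD_SUCCESS 2>/dev/null | wc -l)"
```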

After all migrations have successfully completed, run the following command to ensure that previously saved variants stay in sync with the latest available annotations going forward:

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py set_saved_variant_key

Migrating seqr from the annotations and transcripts schema to the reference_data, variants, and variants/details schema (2.x.x -> 3.x.x breaking release)

The seqr-platform update from 2.19.3-annotations-final to 3.0.0 is breaking and requires several manual interventions. If your first install is version 3.x.x or later, you may ignore these instructions.

Here is the full sequence of steps:

  1. If you have an HGMD license, you must now supply the VCFs via environment variables. You may supply cloud storage links in your helm installation as seqr.environment env vars:
seqr:
  environment:
    HGMD_GRCH37_URL: 'https://storage.googleapis.com/YOUR_BUCKET_NAME/GRCh37/HGMD/HGMD_Pro_2023.1_hg19.vcf.gz'
    HGMD_GRCH38_URL: 'https://storage.googleapis.com/YOUR_BUCKET_NAME/GRCh38/HGMD/HGMD_Pro_2023.1_hg38.vcf.gz'
  2. ClinGen Allele Registry support has also been moved from the pipeline into a CronJob that runs from the seqr pod. The secrets should be moved to the seqr.additionalSecrets section of your helm overrides:
seqr:
  additionalSecrets:
    - name: CLINGEN_ALLELE_REGISTRY_LOGIN
      valueFrom:
        secretKeyRef:
          name: pipeline-secrets
          key: clingen_allele_registry_login
    - name: CLINGEN_ALLELE_REGISTRY_PASSWORD
      valueFrom:
        secretKeyRef:
          name: pipeline-secrets
          key: clingen_allele_registry_password
  3. Update your installation to the final supported 2.x.x version and run the migration process.

    1. Run the helm upgrade to release the seqr-platform-2.19.3-annotations-final version.

      helm repo update
      helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform --version seqr-platform-2.19.3-annotations-final
    2. Login to the pipeline-runner pod:

      # Get the POD-ID of the pipeline-runner pod
      kubectl get pods | grep pipeline-runner-api
      
      # Login to the pipeline-runner sidecar
      kubectl exec pipeline-runner-api-POD-ID -c pipeline-runner-api-sidecar -it -- bash
    3. Run the provided migration:

      uv run python3 -m v03_pipeline.bin.migrate_variants_tables

At a high level, this process:

  • Drops all reference data from your hail annotations table.
  • Exports the annotations table to the new variants.parquet and the new variant_details.parquet.
  • Loads those into ClickHouse.
  • Triggers a refresh of the processes that join each reference dataset against seqr variants.

Note that the schema for the tables themselves, and the asynchronous process that builds the all_variants tables, are managed automatically by the Django migrations. The migration is expected to take a couple of hours. It is idempotent and can safely be run multiple times.

  4. Delete your existing pipeline reference data _SUCCESS files. This will trigger an updated sync with a reduced file set.
rm -rf $REFERENCE_DATASETS_DIR/GRCh37/_SUCCESS 
rm -rf $REFERENCE_DATASETS_DIR/GRCh38/_SUCCESS

OR, if you've set REFERENCE_DATASETS_DIR to a Google Cloud Storage bucket:

gsutil rm -rf $REFERENCE_DATASETS_DIR/GRCh37/_SUCCESS 
gsutil rm -rf $REFERENCE_DATASETS_DIR/GRCh38/_SUCCESS
  5. Upgrade your installation to the latest version.
helm repo update
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform

Debugging FAQ

  • How do I uninstall seqr and remove all application data?
helm uninstall YOUR_INSTITUTION_NAME-seqr
kind delete cluster
rm -rf /var/seqr/*
rm -rf /var/seqr/.user_scripts_initialized # an additional dotfile left by the bitnami postgresql container
  • How do I view seqr's disk utilization? You may check the size of each of the on-disk components with:
du -sh /var/seqr/*
  • How do I tail logs? To tail the logs of the pipeline worker after you have started a pipeline run, for example:
kubectl get pods -o name | grep pipeline-runner-api
pipeline-runner-api-5557bbc7-vrtcj
kubectl logs pipeline-runner-api-5557bbc7-vrtcj -c pipeline-runner-api-sidecar
2024-10-16 18:24:27 - pipeline_worker - INFO - Waiting for work
2024-10-16 18:24:28 - pipeline_worker - INFO - Waiting for work
2024-10-16 18:24:29 - pipeline_worker - INFO - Waiting for work
....
base_hail_table - INFO - UpdatedCachedReferenceDatasetQuery(reference_genome=GRCh37, dataset_type=SNV_INDEL, crdq=CLINVAR_PATH_VARIANTS) start
[Stage 42:========>
