Helm charts for the seqr platform
This repo consists of helm charts defining the seqr platform. Helm is a package manager for Kubernetes, an open source system for automating deployment and management of containerized applications.
- The `seqr` application chart consists of deployments for the seqr application, the `redis` cache, the `postgresql` relational database, and the `clickhouse` columnar analytics database. The `redis` and `postgresql` services may be disabled if `seqr` is running in a cloud environment with access to managed services. Note that this deployment does not include support for `elasticsearch`.
- The `pipeline-runner` application chart contains the multiple services that make up the seqr loading pipeline. This chart also runs the `luigi` scheduler user interface to view running pipeline tasks.
- A `lib` library chart for resources shared between the other charts.
- The `seqr-platform` umbrella chart that bundles the composing charts into a single installable.
The Kubernetes ecosystem contains many standardized and custom solutions across a wide range of cloud and on-premises environments. To avoid the complexity of a full-fledged production environment and to achieve parity with the existing docker-compose, we recommend setting up a simple local Kubernetes cluster on an on-premises server or a cloud Virtual Machine with at least 32GB of memory and 500GB of disk space. While there is no requirement for the minimum number of CPUs, having more available will significantly speed up data loading and some searches.
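As a quick pre-flight check, you can confirm a Linux host meets these minimums (a sketch; adjust the `df` target to wherever `/var/seqr` will live):

```shell
# Rough check against the recommended 32GB memory / 500GB disk minimums (Linux).
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
disk_gb=$(df -BG --output=avail /var | tail -1 | tr -dc '0-9')
echo "memory: ${mem_gb}GB (want >= 32), available on /var: ${disk_gb}GB (want >= 500)"
```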
Install the four required Kubernetes infrastructure components:
- The `docker` container engine.
  - If running `Docker Desktop` on a laptop, make sure to set your CPU and Memory limits under Settings > Resources > Advanced.
  - If running on linux, make sure docker can be run without `sudo` (https://docs.docker.com/engine/install/linux-postinstall/).
- The `kubectl` command line client.
- The `kind` local cluster manager.
- The `helm` package manager.
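Once installed, a quick way to verify that all four clients are on your `PATH` (a simple sketch; paths and versions will vary by machine):

```shell
# Report which of the required clients are installed and reachable.
for tool in docker kubectl kind helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT FOUND - install it before proceeding"
  fi
done
```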
Then:
- Create a local `/var/seqr` directory to be mounted into the Kubernetes cluster. This will host all seqr application data:

  ```shell
  sudo mkdir -p /var/seqr
  sudo chmod 777 /var/seqr
  ```

- Start a `kind` cluster:

  ```shell
  curl https://raw.githubusercontent.com/broadinstitute/seqr-helm/refs/heads/main/kind.yaml > kind.yaml
  kind create cluster --config kind.yaml
  ```

  Note that kubernetes can have unexpected behavior when run with `sudo`. Make sure to run this and all other `kubectl`/`kind`/`helm` commands without it.

- Create the Required Secrets in your cluster using `kubectl`.
- Migrate any existing application data.
- Install the `seqr-platform` chart with any override values:

  ```shell
  helm repo add seqr-helm https://broadinstitute.github.io/seqr-helm
  helm install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform
  ```
After install, you should expect to see something like:
helm install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform
NAME: YOUR_INSTITUTION_NAME-seqr
LAST DEPLOYED: Wed Oct 16 14:50:22 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
The first deployment will include a download of all of the genomic reference data (~350GB as of 6/2025 but variable). It is likely to be slow, but can be monitored by checking the contents of /var/seqr/seqr-reference-data. Additionally, you may check the status of the services with:
kubectl get pods
NAME READY STATUS RESTARTS AGE
seqr-clickhouse-shard0-0 4/4 Running 0 22m
pipeline-runner-api-5557bbc7-vrtcj 2/2 Running 0 22m
pipeline-runner-ui-749c94468f-62rtv 1/1 Running 0 22m
seqr-68d7b855fb-bjppn 1/1 Running 0 22m
seqr-check-new-samples-job-28818190-vlhxj 0/1 Completed 0 22m
seqr-postgresql-0 1/1 Running 0 22m
seqr-redis-master-0 1/1 Running 0 22m
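The reference data download itself can be checked at any time with a one-shot size query (a sketch assuming the default `/var/seqr` layout; wrap it in `watch -n 60` to poll continuously):

```shell
# Print the current size of the reference data directory, or a notice if the
# download has not started yet.
du -sh /var/seqr/seqr-reference-data 2>/dev/null || echo "seqr-reference-data not created yet"
```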
While the reference data is downloading, the `pipeline-runner-api` pod should be in the `Init` state:
pipeline-runner-api-5557bbc7-vrtcj 0/2 Init:0/4 0 8m51s
Once services are healthy, you may create a seqr admin user using the pod name from the above output:
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py createsuperuser
The seqr application expects a few secrets to be defined for the services to start. The default expected secrets are declared in the default values.yaml file of the seqr application chart. You should create these secrets in your kubernetes cluster prior to attempting to install the chart.
- A secret containing a `password` field for the postgres database password. By default this secret is named `postgres-secrets`.
- A secret containing a `django_key` field for the django security key. By default this secret is named `seqr-secrets`.
- A secret containing `admin_password`, `writer_password`, and `reader_password` fields for the clickhouse database passwords. By default this secret is named `clickhouse-secrets`. Note that due to some complexity in how we handle the ClickHouse password internally, you should use alphanumeric characters only!
Here's how you might create the secrets:
kubectl create secret generic postgres-secrets \
--from-literal=password='super-secure-password'
kubectl create secret generic seqr-secrets \
--from-literal=django_key='securely-generated-key'
kubectl create secret generic clickhouse-secrets \
--from-literal=admin_password='clickhouseadminpassword' \
--from-literal=reader_password='clickhousereaderpassword' \
--from-literal=writer_password='clickhousewriterpassword'

Alternatively, you can use your preferred method for defining secrets in kubernetes. For example, you might use External Secrets to synchronize secrets from your cloud provider into your kubernetes cluster.
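For instance, an `ExternalSecret` manifest synchronizing the postgres password might look like the following sketch (assumes the External Secrets Operator is installed; the `SecretStore` name and remote key are hypothetical and depend on your provider setup):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: my-secret-store      # hypothetical SecretStore configured for your cloud provider
    kind: SecretStore
  target:
    name: postgres-secrets     # the secret name the seqr chart expects
  data:
    - secretKey: password
      remoteRef:
        key: seqr-postgres-password   # hypothetical key in your provider's secret manager
```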
All default values in the seqr-platform chart may be overridden with helm's Values file functionality. For example, to disable the postgresql deployment, you might create a file `my-values.yaml` with the contents:
seqr:
  postgresql:
    enabled: false
This is also the recommended pattern for overriding any seqr environment variables:
seqr:
  environment:
    GUNICORN_WORKER_THREADS: "8"
A more comprehensive example of what this may look like, and how the different values are formatted in practice, is found in the seqr unit tests.
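Putting the pieces together, a single `my-values.yaml` can carry all of your overrides (the values here are illustrative only):

```yaml
seqr:
  postgresql:
    enabled: false              # e.g. use a managed postgres instead of the bundled deployment
  environment:
    GUNICORN_WORKER_THREADS: "8"
```

Passing the file on every install or upgrade, e.g. `helm upgrade --install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform -f my-values.yaml`, keeps your overrides from being lost between upgrades.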
- If you wish to preserve your existing application state in `postgresql`, you may move your existing `./data/postgres` to `/var/seqr/postgresql-data`. You should see:

  ```shell
  cat /var/seqr/postgresql-data/PG_VERSION
  12
  ```

- To migrate static files, you may move your existing `./data/seqr_static_files` to `/var/seqr/seqr-static-media`.
- To migrate `readviz`, you may move your existing `./data/readviz` directory to `/var/seqr/seqr-static-media` and additionally run the `update_igv_location` `manage.py` command:
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py update_igv_location old_prefix new_prefix
Note that you do not need to migrate any elasticsearch data; however, in order to continue to perform searches, all data will need to be reloaded from VCF using the standard data loading process.
After re-loading all the data that will continue to be used in search, we recommend running the following one-time command to ensure that previously saved variants stay in sync with the latest available annotations going forward:
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py set_saved_variant_key
To fetch the latest versions of the helm infrastructure and seqr application code, you may run:
helm repo update
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform
To update reference data in seqr, such as OMIM, HPO, etc., run the following. By default, this will be run automatically as a cron job.
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py update_all_reference_data

To update the ClinVar reference data used in search, run the following. By default, this will be run automatically as a cron job.
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py reload_clinvar_all_variants

The seqr-platform update from the `1.45.0-hail-search-final` to the `2.0.0` release is breaking and requires manual intervention to potentially update an environment variable and to migrate the search data. Here is the full sequence of steps:
- Update `HAIL_SEARCH_DATA_DIR` to `PIPELINE_DATA_DIR`. The `HAIL_SEARCH_DATA_DIR` environment variable has been deprecated in favor of a `PIPELINE_DATA_DIR` variable shared between the application and pipeline. If you have not altered your `HAIL_SEARCH_DATA_DIR` and wish to continue using the defaults, you should rename your `HAIL_SEARCH_DATA_DIR` directory to the default `PIPELINE_DATA_DIR`:

  ```shell
  sudo mv /var/seqr/seqr-hail-search-data /var/seqr/pipeline-data
  ```

  and proceed to step #2.
If you have altered the default `HAIL_SEARCH_DATA_DIR`, you should set the new variable in your override `my-values.yaml`:
global:
  seqr:
    environment:
      PIPELINE_DATA_DIR: # current value of HAIL_SEARCH_DATA_DIR
- Upgrade your `helm` installation:
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform -f my-values.yaml
This step should remove the hail-search pod and create a clickhouse pod.
- Export the `hail-search` tables to the `clickhouse`-ingestable format. Run the following commands:
# Get the POD-ID of the pipeline-runner pod
$ kubectl get pods | grep pipeline-runner-api
pipeline-runner-api-POD-ID 2/2 Running 0 119m
# Login to the pipeline-runner sidecar
$ kubectl exec pipeline-runner-api-POD-ID -c pipeline-runner-api-sidecar -it -- bash
# Run the migration script
$ uv run python3 -m 'v03_pipeline.bin.migrate_all_projects_to_clickhouse'
The migration is fully supported whether or not you have configured your environment to run the loading pipeline on GCP dataproc and will run in the same environment as data loading. It is also idempotent, so can safely be run multiple times in case of failures.
The migration should take a few minutes per project, substantially less than loading directly from VCF. To check the status of the migration and to debug if required:
- Each project hail table is exported into the format produced by the loading pipeline as if it were a new run. For each of your loaded projects, you should expect a directory to be created:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}
- Once the hail tables for the project have been successfully converted to parquet, you should expect a new file:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}/_SUCCESS
- Once the run has been successfully loaded into `clickhouse`, you should expect a new file:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}/_CLICKHOUSE_LOAD_SUCCESS
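A small helper to tally migration progress across all projects might look like this (a sketch; it assumes `PIPELINE_DATA_DIR` is set in your shell, defaulting here to the standard `/var/seqr/pipeline-data` mount):

```shell
# Count migration runs that have exported parquet vs. fully loaded into ClickHouse.
PIPELINE_DATA_DIR=${PIPELINE_DATA_DIR:-/var/seqr/pipeline-data}
exported=$(find "$PIPELINE_DATA_DIR" -name '_SUCCESS' 2>/dev/null | wc -l)
loaded=$(find "$PIPELINE_DATA_DIR" -name '_CLICKHOUSE_LOAD_SUCCESS' 2>/dev/null | wc -l)
echo "runs exported to parquet: $exported, loaded into clickhouse: $loaded"
```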
After all migrations have successfully completed, run the following command to ensure that previously saved variants stay in sync with the latest available annotations going forward:
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py set_saved_variant_key
Migrating seqr from the annotations and transcripts schema to the `reference_data`, `variants`, and `variants/details` schema (2.x.x -> 3.x.x breaking release).
The seqr-platform update from the `2.19.3-annotations-final` to the `3.0.0` release is breaking and requires several manual interventions. If your first install is ~3.x.x you may ignore these instructions. Here is the full sequence of steps:
- If you have an HGMD licence, you must now supply the VCFs as environment variables. You may supply cloud storage links in your helm installation as `seqr.environment` env vars.
seqr:
  environment:
    HGMD_GRCH37_URL: 'https://storage.googleapis.com/YOUR_BUCKET_NAME/GRCh37/HGMD/HGMD_Pro_2023.1_hg19.vcf.gz'
    HGMD_GRCH38_URL: 'https://storage.googleapis.com/YOUR_BUCKET_NAME/GRCh38/HGMD/HGMD_Pro_2023.1_hg38.vcf.gz'
- Clingen Allele Registry support has also been moved from the pipeline into a CronJob that runs from the seqr pod. The secrets should be moved to the `seqr.additionalSecrets` section of your helm overrides.
seqr:
  additionalSecrets:
    - name: CLINGEN_ALLELE_REGISTRY_LOGIN
      valueFrom:
        secretKeyRef:
          name: pipeline-secrets
          key: clingen_allele_registry_login
    - name: CLINGEN_ALLELE_REGISTRY_PASSWORD
      valueFrom:
        secretKeyRef:
          name: pipeline-secrets
          key: clingen_allele_registry_password
- Update your installation to the final supported 2.x.x version and run the migration process.
- Run the helm upgrade to release the `seqr-platform-2.19.3-annotations-final` version:

  ```shell
  helm repo update
  helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform --version seqr-platform-2.19.3-annotations-final
  ```
- Login to the `pipeline-runner` pod:

  ```shell
  # Get the POD-ID of the pipeline-runner pod
  kubectl get pods | grep pipeline-runner-api
  # Login to the pipeline-runner sidecar
  kubectl exec pipeline-runner-api-POD-ID -c pipeline-runner-api-sidecar -it -- bash
  ```
- Run the provided migration:

  ```shell
  uv run python3 -m v03_pipeline.bin.migrate_variants_tables
  ```
- At a high level, this process:
  - Drops all reference data from your hail annotations table.
  - Exports the annotations table to the new `variants.parquet` and the new `variant_details.parquet`.
  - Loads those into ClickHouse.
  - Triggers a refresh of the processes that join each reference dataset against seqr variants.
Note that the schema for the tables themselves, and the asynchronous process that builds the all_variants tables, is
managed automatically by the Django migrations. The migration is expected to take a couple of hours. It is idempotent
and can safely be run multiple times.
- Delete your existing pipeline reference data _SUCCESS files. This will trigger an updated sync with a reduced file set.
rm -rf $REFERENCE_DATASETS_DIR/GRCh37/_SUCCESS
rm -rf $REFERENCE_DATASETS_DIR/GRCh38/_SUCCESS
OR (if you've set the environment to a google cloud storage bucket)
gsutil rm -rf $REFERENCE_DATASETS_DIR/GRCh37/_SUCCESS
gsutil rm -rf $REFERENCE_DATASETS_DIR/GRCh38/_SUCCESS
- Upgrade your installation to the latest version.
helm repo update
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform
- How do I uninstall `seqr` and remove all application data?
helm uninstall YOUR_INSTITUTION_NAME-seqr
kind delete cluster
rm -rf /var/seqr/*
rm -rf /var/seqr/.user_scripts_initialized # an additional dotfile left by the bitnami postgresql container
- How do I view `seqr`'s disk utilization? You may check the size of each of the on-disk components with:
du -sh /var/seqr/*
- How do I tail logs? To tail the logs of the pipeline worker after you have started a pipeline run, for example:
kubectl get pods -o name | grep pipeline-runner-api
pipeline-runner-api-5557bbc7-vrtcj
kubectl logs pipeline-runner-api-5557bbc7-vrtcj -c pipeline-runner-api-sidecar
2024-10-16 18:24:27 - pipeline_worker - INFO - Waiting for work
2024-10-16 18:24:28 - pipeline_worker - INFO - Waiting for work
2024-10-16 18:24:29 - pipeline_worker - INFO - Waiting for work
....
base_hail_table - INFO - UpdatedCachedReferenceDatasetQuery(reference_genome=GRCh37, dataset_type=SNV_INDEL, crdq=CLINVAR_PATH_VARIANTS) start
[Stage 42:========>