seqr-helm

Helm charts for the seqr platform

Overview

This repo consists of helm charts defining the seqr platform. Helm is a package manager for Kubernetes, an open source system for automating deployment and management of containerized applications.

  1. The seqr application chart consists of deployments for the seqr application, the redis cache, the postgresql relational database and the clickhouse columnar analytics database. The redis and postgresql services may be disabled if seqr is running in a cloud environment with access to managed services. Note that this deployment does not include support for elasticsearch.
  2. The pipeline-runner application chart contains the multiple services that make up the seqr loading pipeline. This chart also runs the luigi scheduler user interface to view running pipeline tasks.
  3. A lib library chart for resources shared between the other charts.
  4. The seqr-platform umbrella chart that bundles the component charts into a single installable.

Instructions for Initial Deployment

The Kubernetes ecosystem contains many standardized and custom solutions across a wide range of cloud and on-premises environments. To avoid the complexity of a full-fledged production environment and to achieve parity with the existing docker-compose, we recommend setting up a simple local Kubernetes cluster on an on-premises server or a cloud Virtual Machine with at least 32GB of memory and 500GB of disk space. While there is no requirement for the minimum number of CPUs, having more available will significantly speed up data loading and some searches.

Install the four required kubernetes infrastructure components:

  1. The docker container engine.
  2. The kubectl command line client.
  3. The kind local cluster manager.
  4. The helm package manager.

Then:

  1. Create a local /var/seqr directory to be mounted into the Kubernetes cluster. This will host all seqr application data:
    sudo mkdir -p /var/seqr
    sudo chmod 777 /var/seqr 
    
  2. Start a kind cluster:
    curl https://raw.githubusercontent.com/broadinstitute/seqr-helm/refs/heads/main/kind.yaml > kind.yaml
    kind create cluster --config kind.yaml
    
    Note that kubernetes can have unexpected behavior when run with sudo. Make sure to run this and all other kubectl/kind/helm commands without it.
  3. Create the Required Secrets in your cluster using kubectl.
  4. Migrate any existing application data.
  5. Install the seqr-platform chart with any override values:
    helm repo add seqr-helm https://broadinstitute.github.io/seqr-helm
    helm install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform
    

After install, you should expect to see something like:

helm install YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform 
NAME: YOUR_INSTITUTION_NAME-seqr
LAST DEPLOYED: Wed Oct 16 14:50:22 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

The first deployment will include a download of all of the genomic reference data (~350GB as of 6/2025 but variable). It is likely to be slow, but can be monitored by checking the contents of /var/seqr/seqr-reference-data. Additionally, you may check the status of the services with:

kubectl get pods
NAME                                        READY   STATUS      RESTARTS      AGE
seqr-clickhouse-shard0-0                    4/4     Running     0             22m
pipeline-runner-api-5557bbc7-vrtcj          2/2     Running     0             22m
pipeline-runner-ui-749c94468f-62rtv         1/1     Running     0             22m
seqr-68d7b855fb-bjppn                       1/1     Running     0             22m
seqr-check-new-samples-job-28818190-vlhxj   0/1     Completed   0             22m
seqr-postgresql-0                           1/1     Running     0             22m
seqr-redis-master-0                         1/1     Running     0             22m

While the reference data is downloading, the pipeline-runner-api pod should be in the Init state:

pipeline-runner-api-5557bbc7-vrtcj        0/2     Init:0/4    0             8m51s

Once services are healthy, you may create a seqr admin user using the pod name from the above output:

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py createsuperuser

Required Secrets

The seqr application expects a few secrets to be defined for the services to start. The default expected secrets are declared in the default values.yaml file of the seqr application chart. You should create these secrets in your kubernetes cluster prior to attempting to install the chart.

  1. A secret containing a password field for the postgres database password. By default this secret is named postgres-secrets.
  2. A secret containing a django_key field for the django security key. By default this secret is named seqr-secrets.
  3. A secret containing admin_password, writer_password, and reader_password fields for the clickhouse database passwords. By default this secret is named clickhouse-secrets. Note that due to some complexity in how we handle the ClickHouse password internally, you should use alphanumeric characters only!

Here's how you might create the secrets:

kubectl create secret generic postgres-secrets \
  --from-literal=password='super-secure-password'

kubectl create secret generic seqr-secrets \
  --from-literal=django_key='securely-generated-key'

kubectl create secret generic clickhouse-secrets \
  --from-literal=admin_password='clickhouseadminpassword' \
  --from-literal=reader_password='clickhousereaderpassword' \
  --from-literal=writer_password='clickhousewriterpassword'
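If you need to generate values for these secrets, one minimal sketch is below. The gen_secret helper is hypothetical (not part of seqr-helm); it emits alphanumeric strings, which also satisfies the ClickHouse alphanumeric-only requirement above.

```shell
# Hypothetical helper: emit a 32-character alphanumeric secret.
# Alphanumeric-only output is safe for the clickhouse-secrets fields.
gen_secret() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32
  echo
}

# Example usage (uncomment to create the secret directly):
# kubectl create secret generic clickhouse-secrets \
#   --from-literal=admin_password="$(gen_secret)" \
#   --from-literal=reader_password="$(gen_secret)" \
#   --from-literal=writer_password="$(gen_secret)"
gen_secret
```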

Alternatively, you can use your preferred method for defining secrets in kubernetes. For example, you might use External Secrets to synchronize secrets from your cloud provider into your kubernetes cluster.

Values/Environment Overrides

All default values in the seqr-platform chart may be overridden with helm's Values file functionality. For example, to disable the postgresql deployment, you might create a file my-values.yaml with the contents:

seqr:
  postgresql:
    enabled: false

This is also the recommended pattern for overriding any seqr environment variables:

seqr:
  environment:
    GUNICORN_WORKER_THREADS: "8"

A more comprehensive example of what this may look like, and how the different values are formatted in practice, is found in the seqr unit tests.
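For instance, a single my-values.yaml can carry both of the overrides above; pass it to helm install or helm upgrade with -f my-values.yaml. The specific values here are illustrative, not recommendations:

```yaml
# my-values.yaml: combined overrides (illustrative values)
seqr:
  postgresql:
    enabled: false
  environment:
    GUNICORN_WORKER_THREADS: "8"
```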

Migrating Application Data from docker-compose.yaml

  • If you wish to preserve your existing application state in postgresql, you may move your existing ./data/postgres to /var/seqr/postgresql-data. You should see:
cat /var/seqr/postgresql-data/PG_VERSION
12
  • If the locations of your IGV or alignment files have changed as part of the migration, you may update the saved paths with:
kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py update_igv_location old_prefix new_prefix

Note that you do not need to migrate any elasticsearch data; however, to continue to perform searches, all data will need to be reloaded from VCF using the standard data loading process.

After re-loading all the data that will continue to be used in search, we recommend running the following one-time command to ensure that previously saved variants stay in sync with the latest available annotations going forward:

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py set_saved_variant_key

Updating seqr

To fetch the latest versions of the helm infrastructure and seqr application code, you may run:

helm repo update
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform

To update reference data in seqr, such as OMIM, HPO, etc., run the following. By default, this will be run automatically as a cron job.

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py update_all_reference_data

To update the ClinVar reference data used in search, run the following. By default, this will be run automatically as a cron job.

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py reload_clinvar_all_variants

Migrating seqr from the hail-search backend to the clickhouse backend

The seqr-platform update from 1.45.0-hail-search-final to 2.0.0 is breaking and requires manual intervention: you may need to update an environment variable, and you must migrate the search data. Here is the full sequence of steps:

  1. Update HAIL_SEARCH_DATA_DIR to PIPELINE_DATA_DIR. The HAIL_SEARCH_DATA_DIR environment variable has been deprecated in favor of a PIPELINE_DATA_DIR variable shared between the application and pipeline. If you have not altered your HAIL_SEARCH_DATA_DIR and wish to continue using the defaults, you should rename your HAIL_SEARCH_DATA_DIR to the default PIPELINE_DATA_DIR:
sudo mv /var/seqr/seqr-hail-search-data /var/seqr/pipeline-data

and proceed to step #2.

If you have altered the default HAIL_SEARCH_DATA_DIR you should set in your override my-values.yaml:

global:
  seqr:
    environment:
      PIPELINE_DATA_DIR: # current value of HAIL_SEARCH_DATA_DIR
  2. Upgrade your helm installation:
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform -f my-values.yaml

This step should remove the hail-search pod and create a clickhouse pod.

  3. Export the hail-search tables to the clickhouse-ingestable format. Run the following commands:
# Get the POD-ID of the pipeline-runner pod
$ kubectl get pods | grep pipeline-runner-api
pipeline-runner-api-POD-ID            2/2     Running     0          119m

# Login to the pipeline-runner sidecar
$ kubectl exec pipeline-runner-api-POD-ID -c pipeline-runner-api-sidecar -it -- bash

# Run the migration script
$ uv run python3 -m 'v03_pipeline.bin.migrate_all_projects_to_clickhouse'

The migration is fully supported whether or not you have configured your environment to run the loading pipeline on GCP Dataproc, and it will run in the same environment as data loading. It is also idempotent, so it can safely be run multiple times in case of failures.

The migration should take a few minutes per project, substantially less than loading directly from VCF. To check the status of the migration and to debug if required:

  • Each project hail table is exported into the format produced by the loading pipeline as if it were a new run. For each of your loaded projects, you should expect a directory to be created:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}
  • Once the hail tables for the project have been successfully converted to parquet, you should expect a new file:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}/_SUCCESS
  • Once the run has been successfully loaded into clickhouse, you should expect a new file:
$PIPELINE_DATA_DIR/{ReferenceGenome}/{DatasetType}/runs/hail_search_to_clickhouse_migration-{random_str}_{SampleType}_{project_guid}/_CLICKHOUSE_LOAD_SUCCESS
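The marker files above can be tallied to watch overall progress. A minimal sketch, assuming PIPELINE_DATA_DIR points at your pipeline data directory (defaulting here to /var/seqr/pipeline-data):

```shell
# Count migration runs at each stage; run directories follow the
# hail_search_to_clickhouse_migration-* naming described above.
runs="${PIPELINE_DATA_DIR:-/var/seqr/pipeline-data}"
echo "exported to parquet:    $(find "$runs" -path '*hail_search_to_clickhouse_migration*' -name _SUCCESS 2>/dev/null | wc -l)"
echo "loaded into clickhouse: $(find "$runs" -path '*hail_search_to_clickhouse_migration*' -name _CLICKHOUSE_LOAD_SUCCESS 2>/dev/null | wc -l)"
```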

After all migrations have successfully completed, run the following command to ensure that previously saved variants stay in sync with the latest available annotations going forward:

kubectl exec seqr-POD-ID -c seqr -it -- bash
python3 /seqr/manage.py set_saved_variant_key

Migrating seqr from the annotations and transcripts schema to the reference_data, variants, and variants/details schema (2.x.x -> 3.x.x breaking release)

The seqr-platform update from 2.19.3-annotations-final to 3.0.0 is breaking and requires several manual interventions. If your first install is version 3.x.x or later, you may ignore these instructions.

Here is the full sequence of steps:

  1. If you have an HGMD license, you must now supply the VCFs via environment variables. You may supply cloud storage links in your helm installation as seqr.environment env vars:
seqr:
  environment:
    HGMD_GRCH37_URL: 'https://storage.googleapis.com/YOUR_BUCKET_NAME/GRCh37/HGMD/HGMD_Pro_2023.1_hg19.vcf.gz'
    HGMD_GRCH38_URL: 'https://storage.googleapis.com/YOUR_BUCKET_NAME/GRCh38/HGMD/HGMD_Pro_2023.1_hg38.vcf.gz'
  2. ClinGen Allele Registry support has also been moved from the pipeline into a CronJob that runs from the seqr pod. The secrets should be moved to the seqr.additionalSecrets section of your helm overrides:
seqr:
  additionalSecrets:
    - name: CLINGEN_ALLELE_REGISTRY_LOGIN
      valueFrom:
        secretKeyRef:
          name: pipeline-secrets
          key: clingen_allele_registry_login
    - name: CLINGEN_ALLELE_REGISTRY_PASSWORD
      valueFrom:
        secretKeyRef:
          name: pipeline-secrets
          key: clingen_allele_registry_password
  3. Update your installation to the final supported 2.x.x version and run the migration process.

    1. Run the helm upgrade to release the seqr-platform-2.19.3-annotations-final version.

      helm repo update
      helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform --version seqr-platform-2.19.3-annotations-final
    2. Login to the pipeline-runner pod:

      # Get the POD-ID of the pipeline-runner pod
      kubectl get pods | grep pipeline-runner-api
      
      # Login to the pipeline-runner sidecar
      kubectl exec pipeline-runner-api-POD-ID -c pipeline-runner-api-sidecar -it -- bash
    3. Run the provided migration:

      uv run python3 -m v03_pipeline.bin.migrate_variants_tables

At a high level, this process:

  • Drops all reference data from your hail annotations table.
  • Exports the annotations table to the new variants.parquet and the new variant_details.parquet.
  • Loads those into ClickHouse.
  • Triggers a refresh of the processes that join each reference dataset against seqr variants.

Note that the schema for the tables themselves, and the asynchronous process that builds the all_variants tables, are managed automatically by the Django migrations. The migration is expected to take a couple of hours. It is idempotent and can safely be run multiple times.

  4. Delete your existing pipeline reference data _SUCCESS files. This will trigger an updated sync with a reduced file set.
rm -rf $REFERENCE_DATASETS_DIR/GRCh37/_SUCCESS 
rm -rf $REFERENCE_DATASETS_DIR/GRCh38/_SUCCESS

OR, if you've set REFERENCE_DATASETS_DIR to a Google Cloud Storage bucket:

gsutil rm -rf $REFERENCE_DATASETS_DIR/GRCh37/_SUCCESS 
gsutil rm -rf $REFERENCE_DATASETS_DIR/GRCh38/_SUCCESS
  5. Upgrade your installation to the latest version.
helm repo update
helm upgrade YOUR_INSTITUTION_NAME-seqr seqr-helm/seqr-platform

Debugging FAQ

  • How do I uninstall seqr and remove all application data?
helm uninstall YOUR_INSTITUTION_NAME-seqr
kind delete cluster
rm -rf /var/seqr/*
rm -rf /var/seqr/.user_scripts_initialized # an additional dotfile left by the bitnami postgresql container
  • How do I view seqr's disk utilization? You may check the size of each of the on-disk components with:
du -sh /var/seqr/*
  • How do I tail logs? To tail the logs of the pipeline worker after you have started a pipeline run, for example:
kubectl get pods -o name | grep pipeline-runner-api
pipeline-runner-api-5557bbc7-vrtcj
kubectl logs pipeline-runner-api-5557bbc7-vrtcj -c pipeline-runner-api-sidecar
2024-10-16 18:24:27 - pipeline_worker - INFO - Waiting for work
2024-10-16 18:24:28 - pipeline_worker - INFO - Waiting for work
2024-10-16 18:24:29 - pipeline_worker - INFO - Waiting for work
....
base_hail_table - INFO - UpdatedCachedReferenceDatasetQuery(reference_genome=GRCh37, dataset_type=SNV_INDEL, crdq=CLINVAR_PATH_VARIANTS) start
[Stage 42:========>
