This pipeline aligns or realigns genomic sequencing data (from FASTQ or CRAM files) using the DRAGEN pipeline on the Illumina Connected Analytics (ICA) platform.
It manages data upload to ICA, submission and monitoring of the DRAGEN pipeline, and download of results (CRAMs, gVCFs, QC metrics) back to Google Cloud Storage (GCS). It also performs subsequent QC steps, including Somalier fingerprinting and MultiQC report generation.
The workflow performs the following main steps:
- Prepare ICA: Creates analysis folders within the ICA project.
- Input Data Handling (Conditional):
  - If `reads_type = "fastq"`:
    - Submits a separate pipeline in ICA to calculate MD5 checksums for all FASTQ files.
    - Downloads the results and validates them against the manifest file.
    - Generates a `fastq_list.csv` file for DRAGEN and uploads it to ICA.
  - If `reads_type = "cram"`:
    - Uploads the CRAM file from GCS to ICA.
- Run DRAGEN: Submits the main DRAGEN alignment pipeline to ICA and monitors its progress until completion, failure, or cancellation.
- Run MLR: Submits and monitors the DRAGEN MLR (Machine Learning Recalibration) pipeline.
- Download Results: Downloads the key outputs (CRAMs, gVCFs, and QC metrics) from ICA back to GCS.
- Post-Processing QC:
  - Runs `somalier extract` on the newly generated CRAM file to create a genomic fingerprint.
  - Aggregates all QC metrics (from DRAGEN and Somalier) into a single MultiQC report.
- Cleanup (Optional): After checking that all outputs are correct, deletes the data from the ICA platform to reduce storage costs.
- Metamist Cohort: You must have a cohort created in Metamist containing the sequencing groups you wish to process.
- Configuration File: You must create a TOML configuration file. A reference can be found at `config/dragen_align_pa_defaults.toml`.
Your TOML configuration file must specify the following key options:
- `[workflow]`:
  - `input_cohorts`: A list of Metamist cohort IDs to process (e.g., `['COH0001']`).
  - `sequencing_type`: Must be set (e.g., `"genome"`).
  - `reads_type`: Critical. Must be set to either `"fastq"` or `"cram"`.
  - `last_stages`: A list of the final stages to run. To run the full pipeline including QC, use `['RunMultiQc']`. To also delete data from ICA afterwards, use `['DeleteDataInIca']`.
  - `skip_stages`: (Optional) A list of stages to skip, e.g., `['DeleteDataInIca']`.
- If `reads_type = "cram"`, `[ica.cram_references]`:
  - `old_cram_reference`: Must be set to match one of the keys in this section (e.g., `"dragmap"`). This tells the pipeline which reference genome file (already in ICA) was used to generate the original CRAM file.
- `[ica.data_prep]`:
  - `upload_folder`: The folder name to create in ICA for uploading data (e.g., `"my-cram-uploads"`).
  - `output_folder`: The base folder name to create in ICA for pipeline outputs (e.g., `"my-dragen-results"`).
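
As an illustration only, a minimal configuration covering these options might look like the sketch below. The cohort ID, folder names, and reference key are placeholder values; `config/dragen_align_pa_defaults.toml` remains the authoritative reference for available keys.

```toml
[workflow]
input_cohorts = ['COH0001']        # Metamist cohort ID(s) to process
sequencing_type = "genome"
reads_type = "cram"                # or "fastq"
last_stages = ['RunMultiQc']       # use ['DeleteDataInIca'] to also clean up ICA afterwards
skip_stages = ['DeleteDataInIca']  # optional

# Only required when reads_type = "cram"; the available reference keys
# (e.g. "dragmap") are defined in this section in the defaults file.
[ica.cram_references]
old_cram_reference = "dragmap"

[ica.data_prep]
upload_folder = "my-cram-uploads"
output_folder = "my-dragen-results"
```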
If you set `reads_type = "fastq"`, the pipeline queries Metamist for an analysis of type `manifest` that was registered against the cohort during ingestion of the FASTQ data. It checks for `production_manifest` in the file name, in case the control manifest is also registered. The manifest file provides metadata about the FASTQ files; this information is used both for MD5 validation and to construct the DRAGEN-specific RGID (Read Group Identifier). An illustrative `fastq_list.csv` sketch follows the example manifest below.
The required column headers are: Filenames, Checksum, Sample ID, Lane, Machine ID, and Flow cell.
| Filenames | Checksum | Sample ID | Lane | Machine ID | Flow cell |
|---|---|---|---|---|---|
| CPG0001_R1.fastq.gz | d41d8cd98f00b204e9800998ecf8427e | CPG0001 | 1 | M0001 | AABBCC |
| CPG0001_R2.fastq.gz | 9800998ecf8427e1d8cd98f00b204e98 | CPG0001 | 1 | M0001 | AABBCC |
| CPG0002_R1.fastq.gz | 1234567890abcdef1234567890abcdef | CPG0002 | 1 | M0001 | AABBCC |
| CPG0002_R2.fastq.gz | fedcba0987654321fedcba0987654321 | CPG0002 | 1 | M0001 | AABBCC |
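
For orientation, a `fastq_list.csv` generated from the manifest above might look like the following sketch. The column layout is the standard DRAGEN one; the exact RGID string this pipeline builds from the Machine ID, Flow cell, and Lane fields is an assumption here, and in practice the read columns would point at the uploaded ICA locations rather than bare filenames.

```csv
RGID,RGSM,RGLB,Lane,Read1File,Read2File
AABBCC.1,CPG0001,CPG0001,1,CPG0001_R1.fastq.gz,CPG0001_R2.fastq.gz
AABBCC.1,CPG0002,CPG0002,1,CPG0002_R1.fastq.gz,CPG0002_R2.fastq.gz
```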
The pipeline is launched using analysis-runner.
Example Invocation:

```bash
analysis-runner \
  --dataset <your-dataset> \
  --access test \
  --config <path/to/your-config.toml> \
  --output-dir '' \
  --description "DRAGEN alignment for <your-cohort>" \
  --image "australia-southeast1-docker.pkg.dev/cpg-common/images-dev/dragen_align_pa:<image-tag>" \
  dragen_align_pa
```

- `--dataset`: The Metamist dataset associated with your cohort.
- `--config`: The path to your local TOML configuration file.
- `--output-dir`: Required by `analysis-runner` but not used by this pipeline; you can leave it as `''`.
- `--image`: The full path to the pipeline's Docker image. The example uses a `-dev` image; production runs should use the production image (i.e. without the `-dev` suffix).
The `ManageDragenPipeline` stage submits a job to the separate ICA platform. The `analysis-runner` job then waits and polls ICA for the status.
If your analysis-runner job is interrupted (e.g., it fails or is stopped) while the DRAGEN pipeline is still running in ICA, you can resume monitoring.
The pipeline writes a `{sequencing_group}_pipeline_id_and_arguid.json` file to GCS upon submission.
To resume monitoring: simply re-launch the pipeline with the exact same `analysis-runner` command. The `ManageDragenPipeline` stage will detect the existing `.json` file, read the pipeline ID from it, and begin monitoring that job instead of submitting a new one.
If you need to cancel a pipeline that is running in ICA:
1. Stop the `analysis-runner` job in Hail Batch.
2. In your TOML configuration file, set `ica.management.cancel_cohort_run = true` (see the snippet after this list).
3. Re-launch the pipeline using the same `analysis-runner` command.
4. The `ManageDragenPipeline` stage will detect the `cancel_cohort_run` flag, read the pipeline ID from the `.json` file, and send an "abort" request to the ICA API.
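
For example, the cancellation flag can be expressed in your TOML file as follows (shown here as a table; the dotted-key form named above is equivalent):

```toml
[ica.management]
cancel_cohort_run = true  # ask ManageDragenPipeline to abort the tracked ICA run
```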
When successful, the pipeline downloads all results to your dataset's GCS bucket. Key outputs are organized as follows:
- Realigned CRAMs: `gs://{BUCKET}/ica/{DRAGEN_VERSION}/output/cram/`
- gVCFs:
  - `gs://{BUCKET}/ica/{DRAGEN_VERSION}/output/base_gvcf/` (from the base DRAGEN run)
  - `gs://{BUCKET}/ica/{DRAGEN_VERSION}/output/recal_gvcf/` (from the MLR run)
- Raw QC Metrics and all Other Files: `gs://{BUCKET}/ica/{DRAGEN_VERSION}/output/dragen_metrics/`
- Somalier Fingerprints: `gs://{BUCKET}/ica/{DRAGEN_VERSION}/output/somalier/`
- Aggregated QC Report: `gs://{BUCKET}/ica/{DRAGEN_VERSION}/qc/{cohort_name}_multiqc_report.html`
This README was generated in part by Gemini 2.5 Pro.