Document how to use Dataflow Flexible Resource Scheduling to save on Cloud cost

For loading VCF data into BigQuery, Variant Transforms uses Cloud Dataflow. Dataflow now provides a flag that reduces the cost of batch jobs, and it has been demonstrated to work well with Variant Transforms.

Details about Flexible Resource Scheduling (FlexRS) can be found here:

https://cloud.google.com/dataflow/docs/guides/flexrs

Users will likely want to read the doc carefully to understand how FlexRS works, whether they need to update any quotas, and what to expect overall (including the likelihood that job start will be delayed).

At a high level:

> FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and currently a combination of preemptible virtual machine (VM) instances and regular VMs.

FlexRS can be used with Variant Transforms by updating your `COMMAND` from:

```
COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"
```

to:

```
COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery \
  --flexrs_goal=COST_OPTIMIZED \
  --runner DataflowRunner"
```

We used this successfully to load variants for over 9,000 joint-genotyped WGS samples. In these particular tests, the cost dropped by about half. A few important things to note:

- I needed the fix in #657
- Cost savings were better using `n1-standard-2` workers instead of `n1-highmem-16` (though runtime was longer)
- There was no need to specify a `--disk_size_gb` value (Dataflow takes care of disk allocation automatically with FlexRS)
- I also included a fix for #658

The fix for #658 wasn't strictly necessary, but using `COST_OPTIMIZED` delays the start of the (unnecessary) `merge_headers` Dataflow job. On a few occasions, this added a couple of hours to the overall `vcf_to_bq` runtime.
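
Putting the notes above together, the command might look roughly like the sketch below. It assumes the worker type is selected with the standard Dataflow pipeline option `--worker_machine_type`, which is not shown in the commands above; the bucket and table names are hypothetical placeholders, so substitute your own.

```shell
# Hypothetical placeholder values; replace with your own bucket and table.
INPUT_PATTERN="gs://my-bucket/vcfs/*.vcf"
OUTPUT_TABLE="my-project:my_dataset.my_table"

# Settings taken from the notes above.
FLEXRS_GOAL="COST_OPTIMIZED"
# Assumption: the machine type is passed via --worker_machine_type.
WORKER_MACHINE_TYPE="n1-standard-2"

# --disk_size_gb is deliberately omitted; disk allocation is left to Dataflow.
COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery \
  --flexrs_goal=${FLEXRS_GOAL} \
  --worker_machine_type ${WORKER_MACHINE_TYPE} \
  --runner DataflowRunner"

# Print the assembled command for inspection before running it.
echo "${COMMAND}"
```

Keeping the goal and machine type in variables makes it easy to compare a FlexRS run against a regular run by changing a single line.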


