VS-1780 - Set sample_info.is_loaded for parquet ingest#9320
Conversation
* VS-1570 - Adding new validation for the VAT
* For VS-1644. Increase disk for ExcludeSitesFromSitesOnlyVcf task, making it a task parameter so it can be overridden.
* VS-1743. Modify pgen .pvar file to use a new ID field format. Specifically, 'chr:pos:ref'
* VS-1748. Have GvsExtractCohortFromSampleNames.wdl default to preparing tables with timestamped names (to avoid collisions)
* Update the code to check for the presence of already prepared tables and don't run extract prep if the prep tables already exist
…lset. (#9307) * Add option to pad input interval list for Prepare Ranges. * Set default interval_list_padding.
Pull request overview
This PR implements setting the sample_info.is_loaded flag for parquet-based ingestion workflows. The change introduces a new task SetIsLoadedColumnForParquetIngest that updates the is_loaded column after parquet loading verification, ensuring proper tracking of sample loading status in the parquet ingest path.
Changes:
- Added `SetIsLoadedColumnForParquetIngest` task to update the `is_loaded` flag after parquet verification
- Conditionally skipped the original `SetIsLoadedColumn` task when using parquet ingest
- Updated task dependencies to use `LoadData.done` instead of `SetIsLoadedColumn.done`
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/variantstore/wdl/GvsImportGenomes.wdl | Added new task for setting is_loaded column in parquet ingest flow and updated task dependencies |
| scripts/variantstore/wdl/GvsUtils.wdl | Reverted docker image version from 2026-01-27 to 2026-01-26 |
| scripts/variantstore/wdl/GvsJointVariantCalling.wdl | Removed comment line |
| scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl | Added comment line |
| .dockstore.yml | Updated branch tracking configuration |
# Note that we tried modifying CreateVariantIngestFiles to UPDATE sample_info.is_loaded on a per-sample basis.
# The major issue that was found is that BigQuery allows only 20 such concurrent DML statements. Considered using
# an exponential backoff, but at the number of samples that are being loaded this would introduce significant delays
# in workflow processing. So this method is used to set *all* of the saple_info.is_loaded flags at one time.
Corrected spelling of 'saple_info' to 'sample_info'.
- # in workflow processing. So this method is used to set *all* of the saple_info.is_loaded flags at one time.
+ # in workflow processing. So this method is used to set *all* of the sample_info.is_loaded flags at one time.
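The comment quoted above motivates the design: one set-based UPDATE instead of one DML statement per sample, since BigQuery caps concurrent DML statements. A minimal in-memory sketch of that idea follows; the row dictionaries and `set_is_loaded_bulk` helper are stand-ins invented for illustration, not the real BigQuery tables or the PR's actual code.

```python
# Hypothetical sketch: flip is_loaded for all loaded samples in one pass,
# analogous to a single UPDATE ... WHERE sample_id IN (...) statement,
# rather than issuing one UPDATE per sample (which BigQuery would throttle).

def set_is_loaded_bulk(sample_info, loaded_sample_ids):
    """Mark every sample in loaded_sample_ids as loaded, in one sweep."""
    for row in sample_info:
        if row["sample_id"] in loaded_sample_ids:
            row["is_loaded"] = True
    return sample_info

sample_info = [
    {"sample_id": 1, "is_loaded": False},
    {"sample_id": 2, "is_loaded": False},
    {"sample_id": 3, "is_loaded": False},
]
result = set_is_loaded_bulk(sample_info, {1, 3})
print([r["is_loaded"] for r in result])  # [True, False, True]
```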
String project_id
String dataset_name
String set_is_loaded_done
Array[String] load_done
The parameter name 'load_done' is ambiguous when typed as Array[String]. Consider renaming to 'load_done_statuses' or 'load_completion_markers' to clarify that it's a collection rather than a single completion signal.
# bq query --max_rows check: ok update
bq --apilog=false --project_id=~{project_id} query --format=csv --use_legacy_sql=false ~{bq_labels} \
'UPDATE `~{dataset_name}.sample_info` SET is_loaded = true
WHERE sample_id IN (SELECT CAST(partition_id AS INT64)
I don't understand this... a sample id is being compared to a partition id?
Ah ok... I tinkered with this query in the console and I think I see how this works. But I'm wondering if this is still going to return correct results for vet tables > 001? Wouldn't the partitions in vet_002 start at 1 again?
I actually copied the bit about the partition from the 'normal' SetIsLoadedColumn method - I had thought it had been put in there to avoid some of the weirdly named vet and ref_ranges tables that were created during foxtrot? Very possible I misunderstood that.
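The query being discussed selects partition ids from `INFORMATION_SCHEMA.PARTITIONS`, filtering out pseudo-partitions, empty partitions, and non-`vet_NNN` tables, then casts the remaining partition ids to INT64 and treats them as sample ids. A sketch of that filter logic in plain Python, for reasoning about the thread above; the rows are invented examples, and `NOT LIKE "__%"` is interpreted here as excluding pseudo-partitions such as `__NULL__`:

```python
import re

# Hypothetical rows mimicking INFORMATION_SCHEMA.PARTITIONS; the field
# names match the query, but the data values are invented for illustration.
partitions = [
    {"table_name": "vet_001", "partition_id": "17", "total_logical_bytes": 1024},
    {"table_name": "vet_001", "partition_id": "__NULL__", "total_logical_bytes": 0},
    {"table_name": "ref_ranges_001", "partition_id": "17", "total_logical_bytes": 2048},
    {"table_name": "vet_002", "partition_id": "4001", "total_logical_bytes": 512},
]

def loaded_sample_ids(rows):
    """Mirror the query's filters: skip names starting with "__",
    skip empty partitions and non-vet tables, then CAST to INT64."""
    return {
        int(r["partition_id"])
        for r in rows
        if not r["partition_id"].startswith("__")
        and r["total_logical_bytes"] > 0
        and re.fullmatch(r"vet_[0-9]+", r["table_name"])
    }

print(sorted(loaded_sample_ids(partitions)))  # [17, 4001]
```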
'UPDATE `~{dataset_name}.sample_info` SET is_loaded = true
WHERE sample_id IN (SELECT CAST(partition_id AS INT64)
from `~{dataset_name}.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id NOT LIKE "__%" AND total_logical_bytes > 0 AND REGEXP_CONTAINS(table_name, "^vet_[0-9]+$")) OR
not sure I understand why this big OR block is here
shouldn't there be a ref_ranges version of the AND logic here?
The big OR logic I added (starting here at 622) tries to verify that there is a sample_name (as extracted from the file_path) in parquet_load_status both for a vet parquet file creation and for a ref_ranges parquet file creation.
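The check described in the comment above can be sketched as: a sample counts as loaded only when parquet_load_status records both a vet and a ref_ranges parquet file for it. A minimal Python sketch under that assumption; the file_path layout, `status_rows` data, and `fully_loaded_samples` helper are all invented for illustration:

```python
import re

# Hypothetical parquet_load_status rows; the file_path values are invented
# examples whose path layout encodes the table kind and sample name.
status_rows = [
    {"file_path": "gs://bucket/parquet/vet/sample_A.parquet"},
    {"file_path": "gs://bucket/parquet/ref_ranges/sample_A.parquet"},
    {"file_path": "gs://bucket/parquet/vet/sample_B.parquet"},
]

def fully_loaded_samples(rows):
    """Return sample names that have BOTH a vet and a ref_ranges
    parquet file recorded, mirroring the intent of the OR block."""
    vet, ref = set(), set()
    for r in rows:
        m = re.search(r"/(vet|ref_ranges)/([^/]+)\.parquet$", r["file_path"])
        if m:
            (vet if m.group(1) == "vet" else ref).add(m.group(2))
    return vet & ref

# sample_B has no ref_ranges entry, so only sample_A qualifies.
print(sorted(fully_loaded_samples(status_rows)))  # ['sample_A']
```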
GitHub Actions tests reported job failures from actions build 21732753862
Successful Bulk Ingest run here