VS-1780 - Set sample_info.is_loaded for parquet ingest#9320

Draft
gbggrant wants to merge 40 commits into VS-1736 from gg_VS-1780

@gbggrant gbggrant commented Feb 2, 2026

Successful Bulk Ingest run here

gbggrant and others added 30 commits September 23, 2025 10:18
* VS-1570 - Adding new validation for the VAT
* For VS-1644. Increase disk for ExcludeSitesFromSitesOnlyVcf task.
Making it a task parameter so potentially overrideable.
* VS-1743. Modify pgen .pvar file to use a new ID field format.
Specifically, 'chr:pos:ref'
* VS-1748. Have GvsExtractCohortFromSampleNames.wdl default to preparing tables with timestamped names (to avoid collisions)
* Update the code to check for the presence of already prepared tables and don't run extract if the prep tables already exist
…lset. (#9307)

* Add option to pad input interval list for Prepare Ranges.
* Set default interval_list_padding.
@gbggrant gbggrant requested review from Copilot and mcovarr February 2, 2026 18:32

Copilot AI left a comment

Pull request overview

This PR implements setting the sample_info.is_loaded flag for parquet-based ingestion workflows. The change introduces a new task SetIsLoadedColumnForParquetIngest that updates the is_loaded column after parquet loading verification, ensuring proper tracking of sample loading status in the parquet ingest path.

Changes:

  • Added SetIsLoadedColumnForParquetIngest task to update is_loaded flag after parquet verification
  • Conditionally skipped the original SetIsLoadedColumn task when using parquet ingest
  • Updated task dependencies to use LoadData.done instead of SetIsLoadedColumn.done

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/variantstore/wdl/GvsImportGenomes.wdl | Added new task for setting is_loaded column in parquet ingest flow and updated task dependencies |
| scripts/variantstore/wdl/GvsUtils.wdl | Reverted docker image version from 2026-01-27 to 2026-01-26 |
| scripts/variantstore/wdl/GvsJointVariantCalling.wdl | Removed comment line |
| scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl | Added comment line |
| .dockstore.yml | Updated branch tracking configuration |

# Note that we tried modifying CreateVariantIngestFiles to UPDATE sample_info.is_loaded on a per-sample basis.
# The major issue that was found is that BigQuery allows only 20 such concurrent DML statements. Considered using
# an exponential backoff, but at the number of samples that are being loaded this would introduce significant delays
# in workflow processing. So this method is used to set *all* of the saple_info.is_loaded flags at one time.

Copilot AI Feb 2, 2026

Corrected spelling of 'saple_info' to 'sample_info'.

Suggested change
- # in workflow processing. So this method is used to set *all* of the saple_info.is_loaded flags at one time.
+ # in workflow processing. So this method is used to set *all* of the sample_info.is_loaded flags at one time.
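
As a rough illustration of the concurrency concern in the code comment under review: if every sample issued its own UPDATE and only 20 such statements could run at once (the figure quoted in the comment), the updates would have to proceed in sequential waves. The sketch below is a back-of-the-envelope model only. The function names, the sample count, and the per-statement duration are hypothetical, and it ignores any additional delay from retries or backoff.

```python
import math


def dml_waves(num_statements: int, max_concurrent: int = 20) -> int:
    """Sequential 'waves' needed when at most max_concurrent statements run at once."""
    return math.ceil(num_statements / max_concurrent)


def estimated_seconds(num_statements: int, seconds_per_statement: float,
                      max_concurrent: int = 20) -> float:
    """Lower-bound wall-clock estimate: every wave takes one statement's duration."""
    return dml_waves(num_statements, max_concurrent) * seconds_per_statement


# Hypothetical numbers: 10,000 samples, 5 seconds per UPDATE statement.
per_sample_updates = estimated_seconds(10_000, 5.0)  # one UPDATE per sample
single_bulk_update = estimated_seconds(1, 5.0)       # one UPDATE for all samples

print(dml_waves(10_000))      # 500
print(per_sample_updates)     # 2500.0
print(single_bulk_update)     # 5.0
```

Under these assumed numbers the per-sample approach needs 500 waves, which is the kind of delay the single bulk UPDATE avoids.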

String project_id
String dataset_name
String set_is_loaded_done
Array[String] load_done

Copilot AI Feb 2, 2026

The parameter name 'load_done' is ambiguous when typed as Array[String]. Consider renaming to 'load_done_statuses' or 'load_completion_markers' to clarify that it's a collection rather than a single completion signal.

# bq query --max_rows check: ok update
bq --apilog=false --project_id=~{project_id} query --format=csv --use_legacy_sql=false ~{bq_labels} \
'UPDATE `~{dataset_name}.sample_info` SET is_loaded = true
WHERE sample_id IN (SELECT CAST(partition_id AS INT64)

@mcovarr mcovarr Feb 2, 2026

I don't understand this... a sample id is being compared to a partition id?

@mcovarr mcovarr Feb 2, 2026

Ah ok... I tinkered with this query in the console and I think I see how this works. But I'm wondering if this is still going to return correct results for vet tables > 001? Wouldn't the partitions in vet_002 start at 1 again?

Collaborator Author

I actually copied the bit about the partition from the 'normal' SetIsLoadedColumn method - I had thought it had been put in there to avoid some of the weirdly named vet and ref_ranges tables that were created during foxtrot? Very possible I misunderstood that.
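
One way to reason about the vet_002 question in this thread: if each vet table is integer-range partitioned on sample_id with a step of 1, and each table covers a consecutive block of sample IDs, then a partition's ID is the sample ID itself and would not restart at 1 in later tables. Neither assumption is confirmed in this conversation. The block size of 4000 and the helper names below are hypothetical, so treat this as a sketch of the arithmetic rather than a description of the actual schema.

```python
# ASSUMPTION (not confirmed in this thread): each vet_NNN table holds a
# consecutive block of sample IDs and is range-partitioned on sample_id, step 1.
SAMPLES_PER_TABLE = 4000  # hypothetical block size


def vet_table_for(sample_id: int) -> str:
    """Name of the vet table that would hold this sample under the assumption above."""
    if sample_id < 1:
        raise ValueError("sample_id must be a positive integer")
    table_number = (sample_id - 1) // SAMPLES_PER_TABLE + 1
    return f"vet_{table_number:03d}"


def partition_range_for(table_number: int) -> tuple[int, int]:
    """Inclusive (first, last) sample_id covered by vet_<table_number>."""
    first = (table_number - 1) * SAMPLES_PER_TABLE + 1
    return first, first + SAMPLES_PER_TABLE - 1


print(vet_table_for(4000))     # vet_001
print(vet_table_for(4001))     # vet_002
print(partition_range_for(2))  # (4001, 8000)
```

If the assumption holds, casting partition_id to INT64 yields a globally unique sample_id regardless of which vet table the partition belongs to. If the tables are partitioned differently, the comparison in the query would need a closer look.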

'UPDATE `~{dataset_name}.sample_info` SET is_loaded = true
WHERE sample_id IN (SELECT CAST(partition_id AS INT64)
from `~{dataset_name}.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id NOT LIKE "__%" AND total_logical_bytes > 0 AND REGEXP_CONTAINS(table_name, "^vet_[0-9]+$")) OR
Collaborator

not sure I understand why this big OR block is here

Collaborator

shouldn't there be a ref_ranges version of the AND logic here?

Collaborator Author

The big OR logic I added (the part starting here at 622) checks that parquet_load_status contains, for a given sample_name (as extracted from the file_path), one entry for a vet parquet file creation and also one for a ref_ranges parquet file creation.
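
The intent described here (a sample counts as loaded only when both a vet entry and a ref_ranges entry exist) can be sketched outside SQL as a simple grouping check. This is an illustration of the logic only, not the query in the PR. It assumes the sample name and table kind have already been parsed out of each file_path; the function name and the sample names are hypothetical.

```python
from collections import defaultdict

REQUIRED_KINDS = {"vet", "ref_ranges"}


def fully_loaded_samples(status_rows):
    """Return sample names that have both a vet and a ref_ranges status row.

    status_rows: iterable of (sample_name, table_kind) pairs, assumed to be
    already extracted from the file_path column (extraction not shown here).
    """
    kinds_by_sample = defaultdict(set)
    for sample_name, table_kind in status_rows:
        kinds_by_sample[sample_name].add(table_kind)
    # Keep only samples whose recorded kinds cover both required kinds.
    return {name for name, kinds in kinds_by_sample.items()
            if REQUIRED_KINDS <= kinds}


# Hypothetical rows: sample_a has both files, sample_b only has its vet file.
rows = [
    ("sample_a", "vet"),
    ("sample_a", "ref_ranges"),
    ("sample_b", "vet"),
]
print(fully_loaded_samples(rows))  # {'sample_a'}
```

A sample with only one of the two entries, like sample_b above, is left out, which matches the requirement that both the vet and the ref_ranges parquet files be recorded.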

gatk-bot commented Feb 6, 2026

Github actions tests reported job failures from actions build 21732753862
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| conda | 17.0.6+10 | 21732753862.3 | logs |
