VS-1780 - Set sample_info.is_loaded for parquet ingest#9320
Conversation
* VS-1570 - Adding new validation for the VAT
* For VS-1644. Increase disk for ExcludeSitesFromSitesOnlyVcf task, making it a task parameter so it can be overridden.
* VS-1743. Modify pgen .pvar file to use a new ID field format. Specifically, 'chr:pos:ref'
* VS-1748. Have GvsExtractCohortFromSampleNames.wdl default to preparing tables with timestamped names (to avoid collisions)
* Update the code to check for the presence of already prepared tables and don't run extract prep if the prep tables already exist
…lset. (#9307) * Add option to pad input interval list for Prepare Ranges. * Set default interval_list_padding.
Pull request overview
This PR implements setting the sample_info.is_loaded flag for parquet-based ingestion workflows. The change introduces a new task SetIsLoadedColumnForParquetIngest that updates the is_loaded column after parquet loading verification, ensuring proper tracking of sample loading status in the parquet ingest path.
Changes:
- Added `SetIsLoadedColumnForParquetIngest` task to update the `is_loaded` flag after parquet verification
- Conditionally skipped the original `SetIsLoadedColumn` task when using parquet ingest
- Updated task dependencies to use `LoadData.done` instead of `SetIsLoadedColumn.done`
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/variantstore/wdl/GvsImportGenomes.wdl | Added new task for setting is_loaded column in parquet ingest flow and updated task dependencies |
| scripts/variantstore/wdl/GvsUtils.wdl | Reverted docker image version from 2026-01-27 to 2026-01-26 |
| scripts/variantstore/wdl/GvsJointVariantCalling.wdl | Removed comment line |
| scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl | Added comment line |
| .dockstore.yml | Updated branch tracking configuration |
# Note that we tried modifying CreateVariantIngestFiles to UPDATE sample_info.is_loaded on a per-sample basis.
# The major issue that was found is that BigQuery allows only 20 such concurrent DML statements. Considered using
# an exponential backoff, but at the number of samples that are being loaded this would introduce significant delays
# in workflow processing. So this method is used to set *all* of the saple_info.is_loaded flags at one time.
Corrected spelling of 'saple_info' to 'sample_info'.
- # in workflow processing. So this method is used to set *all* of the saple_info.is_loaded flags at one time.
+ # in workflow processing. So this method is used to set *all* of the sample_info.is_loaded flags at one time.
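The comment quoted above motivates the design: one set-based UPDATE instead of one DML statement per sample, since BigQuery caps concurrent DML statements. A minimal in-memory sketch of that idea follows; the row dictionaries and `set_is_loaded_bulk` helper are stand-ins invented for illustration, not the real BigQuery tables or the PR's actual code.

```python
# Hypothetical sketch: flip is_loaded for all loaded samples in one pass,
# analogous to a single UPDATE ... WHERE sample_id IN (...) statement,
# rather than issuing one UPDATE per sample (which BigQuery would throttle).

def set_is_loaded_bulk(sample_info, loaded_sample_ids):
    """Mark every sample in loaded_sample_ids as loaded, in one sweep."""
    for row in sample_info:
        if row["sample_id"] in loaded_sample_ids:
            row["is_loaded"] = True
    return sample_info

sample_info = [
    {"sample_id": 1, "is_loaded": False},
    {"sample_id": 2, "is_loaded": False},
    {"sample_id": 3, "is_loaded": False},
]
result = set_is_loaded_bulk(sample_info, {1, 3})
print([r["is_loaded"] for r in result])  # [True, False, True]
```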
String project_id
String dataset_name
String set_is_loaded_done
Array[String] load_done
The parameter name 'load_done' is ambiguous when typed as Array[String]. Consider renaming to 'load_done_statuses' or 'load_completion_markers' to clarify that it's a collection rather than a single completion signal.
# bq query --max_rows check: ok update
bq --apilog=false --project_id=~{project_id} query --format=csv --use_legacy_sql=false ~{bq_labels} \
'UPDATE `~{dataset_name}.sample_info` SET is_loaded = true
WHERE sample_id IN (SELECT CAST(partition_id AS INT64)
I don't understand this... a sample id is being compared to a partition id?
Ah ok... I tinkered with this query in the console and I think I see how this works. But I'm wondering if this is still going to return correct results for vet tables > 001? Wouldn't the partitions in vet_002 start at 1 again?
I actually copied the bit about the partition from the 'normal' SetIsLoadedColumn method - I had thought it had been put in there to avoid some of the weirdly named vet and ref_ranges tables that were created during foxtrot? Very possible I misunderstood that.
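The query being discussed selects partition ids from `INFORMATION_SCHEMA.PARTITIONS`, filtering out pseudo-partitions, empty partitions, and non-`vet_NNN` tables, then casts the remaining partition ids to INT64 and treats them as sample ids. A sketch of that filter logic in plain Python, for reasoning about the thread above; the rows are invented examples, and `NOT LIKE "__%"` is interpreted here as excluding pseudo-partitions such as `__NULL__`:

```python
import re

# Hypothetical rows mimicking INFORMATION_SCHEMA.PARTITIONS; the field
# names match the query, but the data values are invented for illustration.
partitions = [
    {"table_name": "vet_001", "partition_id": "17", "total_logical_bytes": 1024},
    {"table_name": "vet_001", "partition_id": "__NULL__", "total_logical_bytes": 0},
    {"table_name": "ref_ranges_001", "partition_id": "17", "total_logical_bytes": 2048},
    {"table_name": "vet_002", "partition_id": "4001", "total_logical_bytes": 512},
]

def loaded_sample_ids(rows):
    """Mirror the query's filters: skip names starting with "__",
    skip empty partitions and non-vet tables, then CAST to INT64."""
    return {
        int(r["partition_id"])
        for r in rows
        if not r["partition_id"].startswith("__")
        and r["total_logical_bytes"] > 0
        and re.fullmatch(r"vet_[0-9]+", r["table_name"])
    }

print(sorted(loaded_sample_ids(partitions)))  # [17, 4001]
```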
'UPDATE `~{dataset_name}.sample_info` SET is_loaded = true
WHERE sample_id IN (SELECT CAST(partition_id AS INT64)
from `~{dataset_name}.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id NOT LIKE "__%" AND total_logical_bytes > 0 AND REGEXP_CONTAINS(table_name, "^vet_[0-9]+$")) OR
not sure I understand why this big OR block is here
shouldn't there be a ref_ranges version of the AND logic here?
The big OR logic I added (starting here at 622) tries to verify that there is a sample_name (as extracted from the file_path) in parquet_load_status both for a vet parquet file creation and for a ref_ranges parquet file creation.
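The check described in the comment above can be sketched as: a sample counts as loaded only when parquet_load_status records both a vet and a ref_ranges parquet file for it. A minimal Python sketch under that assumption; the file_path layout, `status_rows` data, and `fully_loaded_samples` helper are all invented for illustration:

```python
import re

# Hypothetical parquet_load_status rows; the file_path values are invented
# examples whose path layout encodes the table kind and sample name.
status_rows = [
    {"file_path": "gs://bucket/parquet/vet/sample_A.parquet"},
    {"file_path": "gs://bucket/parquet/ref_ranges/sample_A.parquet"},
    {"file_path": "gs://bucket/parquet/vet/sample_B.parquet"},
]

def fully_loaded_samples(rows):
    """Return sample names that have BOTH a vet and a ref_ranges
    parquet file recorded, mirroring the intent of the OR block."""
    vet, ref = set(), set()
    for r in rows:
        m = re.search(r"/(vet|ref_ranges)/([^/]+)\.parquet$", r["file_path"])
        if m:
            (vet if m.group(1) == "vet" else ref).add(m.group(2))
    return vet & ref

# sample_B has no ref_ranges entry, so only sample_A qualifies.
print(sorted(fully_loaded_samples(status_rows)))  # ['sample_A']
```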
GitHub Actions tests reported job failures from actions build 21732753862
Successful Bulk Ingest run here