Add unrestricted_use_only and surveillance_use_only constructor params #724

Draft: wants to merge 21 commits into base: master

leehart
Collaborator

@leehart leehart commented Feb 7, 2025

Re: issue #716

@leehart leehart changed the title Support unrestricted_use_only and surveillance_use_only Support unrestricted_use_only and surveillance_use_only constructor params Feb 14, 2025
@leehart leehart marked this pull request as ready for review February 20, 2025 15:36
@leehart
Collaborator Author

leehart commented Feb 20, 2025

Opening this for a WIP review, to potentially avoid going too far down the wrong path.

@alimanfoo @cclarkson @ahernank @jonbrenas I'm trying to identify other functions that we need to filter results for, according to:

  • unrestricted_use_only, which I've so far applied to sample_sets(), and
  • surveillance_use_only, which I've so far applied to sample_metadata().

Those seem like the main ones, which some other functions use, but I want to make sure we plug all potential leaks.
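To make the intended behaviour concrete, here is a minimal sketch of the two filters, assuming toy in-memory records. The class name `FilteredApi` and the `unrestricted_use` field are hypothetical stand-ins; only the constructor param names and the `is_surveillance` flag (mentioned later in this thread) come from this PR, and the real code reads its data from release metadata files.

```python
class FilteredApi:
    """Hypothetical stand-in for a data resource class such as Ag3."""

    def __init__(self, unrestricted_use_only=False, surveillance_use_only=False):
        self._unrestricted_use_only = unrestricted_use_only
        self._surveillance_use_only = surveillance_use_only
        # Toy records; field names other than is_surveillance are assumptions.
        self._sample_set_rows = [
            {"sample_set": "set-A", "unrestricted_use": True},
            {"sample_set": "set-B", "unrestricted_use": False},
        ]
        self._sample_rows = [
            {"sample_id": "s1", "sample_set": "set-A", "is_surveillance": True},
            {"sample_id": "s2", "sample_set": "set-A", "is_surveillance": False},
            {"sample_id": "s3", "sample_set": "set-B", "is_surveillance": True},
        ]

    def sample_sets(self):
        rows = self._sample_set_rows
        if self._unrestricted_use_only:
            rows = [r for r in rows if r["unrestricted_use"]]
        return rows

    def sample_metadata(self):
        # Only samples from the (possibly filtered) sample sets are visible.
        allowed = {r["sample_set"] for r in self.sample_sets()}
        rows = [r for r in self._sample_rows if r["sample_set"] in allowed]
        if self._surveillance_use_only:
            rows = [r for r in rows if r["is_surveillance"]]
        return rows
```

With both flags set, only `set-A` and sample `s1` remain visible. Any function that reads sample data without going through one of these two methods would bypass the filters, which is the kind of leak being looked for here.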

@leehart
Collaborator Author

leehart commented Feb 25, 2025

  • To do: add _prep_sample_query_param()

@leehart leehart marked this pull request as draft March 18, 2025 09:23
@leehart
Collaborator Author

leehart commented Mar 21, 2025

  • To do: check that all public functions honour the unrestricted_use_only and surveillance_use_only constructor params.

@leehart
Collaborator Author

leehart commented Apr 8, 2025

It looks like Ag3 currently has about 137 public methods... 🤔

@leehart
Collaborator Author

leehart commented Apr 8, 2025

It also looks like about 119 of Ag3's public methods cannot be called without specifying params (cannot rely on defaults), which makes the testing of these constructor params somewhat difficult to automate. This also smells a lot like a "god object" anti-pattern https://en.wikipedia.org/wiki/God_object

We should probably consider re-organising all of those methods, despite the inconvenience, but that will probably have to wait. In the meantime, I hope to be able to figure out which functions are vulnerable to leaking unfiltered data, relating to either the surveillance-only or unrestricted-use-only flags.
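For reference, counts like the ones above can be reproduced with the standard `inspect` module. The following is a sketch only: the `Example` class is hypothetical, and the helper names `public_methods` and `requires_args` are introduced here for illustration rather than taken from the codebase.

```python
import inspect


def public_methods(cls):
    """Return {name: function} for public (non-underscore) methods of cls."""
    return {
        name: member
        for name, member in inspect.getmembers(cls, predicate=inspect.isfunction)
        if not name.startswith("_")
    }


def requires_args(func):
    """True if func has a parameter (other than self) with no default value."""
    params = list(inspect.signature(func).parameters.values())[1:]  # drop self
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return any(
        p.default is inspect.Parameter.empty and p.kind not in variadic
        for p in params
    )


class Example:
    """Hypothetical class used only to demonstrate the helpers."""

    def sample_sets(self, release=None):
        return []

    def snp_calls(self, region, sample_query=None):
        return []

    def _private(self):
        return None


methods = public_methods(Example)
needs_params = sorted(name for name, f in methods.items() if requires_args(f))
print(len(methods), needs_params)
```

Note that `inspect.isfunction` skips properties, so public properties would need to be counted separately.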

@leehart
Collaborator Author

leehart commented Apr 8, 2025

@ahernank @jonbrenas Checkmarks indicate whether the function gets its data from an upstream public function, or file, or param, or otherwise looks covered. Unchecked functions indicate some doubt and require further investigation, discussion or coding.

  • aa_allele_frequencies - gets its data from snp_allele_frequencies
  • aa_allele_frequencies_advanced - gets its data from snp_allele_frequencies_advanced
  • add_extra_metadata - checks data from sample_metadata
  • aim_calls - uses _prep_sample_query_param and sample_metadata
  • aim_metadata - gets its data from samples.species_aim.csv
  • aim_variants - gets its data from aim_defs_{analysis}/{aims}.zarr
  • average_fst - uses snp_allele_counts
  • biallelic_diplotype_pairwise_distances - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • biallelic_diplotypes - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • biallelic_snp_calls - uses snp_allele_counts and snp_calls
  • biallelic_snps_to_plink - uses biallelic_snp_calls
  • clear_extra_metadata - is simply self._extra_metadata = []
  • cnv_coverage_calls - TO INVESTIGATE: uses _cnv_coverage_calls_dataset
  • cnv_discordant_read_calls - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_discordant_read_calls_dataset
  • cnv_hmm - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_hmm_dataset
  • cohort_diversity_stats - uses sample_metadata and snp_allele_counts
  • cohorts - gets its data from cohorts_{cohort_set}.csv
  • cohorts_metadata - gets its data from samples.cohorts.csv
  • count_samples - uses sample_metadata
  • cross_metadata - gets its data from crosses.fam
  • diplotype_pairwise_distances - uses _prep_sample_query_param
  • diversity_stats - TO INVESTIGATE: uses _setup_cohort_queries and cohort_diversity_stats not _prep_sample_query_param
  • fst_gwss - uses _prep_sample_query_param
  • g123_calibration - uses _prep_sample_query_param
  • g123_gwss - uses _prep_sample_query_param
  • gene_cnv - TO INVESTIGATE: uses _gene_cnv not _prep_sample_query_param
  • gene_cnv_frequencies - TO INVESTIGATE: uses _gene_cnv_frequencies not _prep_sample_query_param
  • gene_cnv_frequencies_advanced - TO INVESTIGATE: uses _gene_cnv_frequencies_advanced not _prep_sample_query_param
  • general_metadata - TO CONSIDER: gets its data from samples.meta.csv (NOT FILTERED)
  • geneset - uses genome_features
  • genome_feature_children - TO INVESTIGATE: uses _genome_features
  • genome_features - TO INVESTIGATE: uses _genome_features_for_contig and _genome_features
  • genome_sequence - TO INVESTIGATE: uses _genome_sequence_for_contig
  • h12_calibration - uses _prep_sample_query_param
  • h12_gwss - uses _prep_sample_query_param
  • h1x_gwss - uses _prep_sample_query_param
  • haplotype_pairwise_distances - uses _prep_sample_query_param
  • haplotype_sites - uses _haplotype_sites_for_region
  • haplotypes - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _haplotypes_for_contig
  • haplotypes_frequencies - uses sample_metadata and locate_cohorts and haplotypes and haplotype_frequencies
  • haplotypes_frequencies_advanced - uses sample_metadata and haplotypes and haplotype_frequencies
  • igv - uses _igv_config and igv_notebook (LOOKS COVERED)
  • ihs_gwss - TO INVESTIGATE: uses _prep_sample_query_param but also _ihs_gwss
  • is_accessible - uses genome_sequence and snp_sites and _site_filters_for_region
  • karyotype - TO INVESTIGATE: uses load_inversion_tags and snp_calls but not _prep_sample_query_param
  • load_inversion_tags - gets its data from self._inversion_tag_path
  • lookup_release - uses sample_sets
  • lookup_sample - uses sample_metadata
  • lookup_study - uses sample_sets
  • lookup_study_info - uses sample_sets
  • lookup_terms_of_use_info - uses sample_sets
  • njt - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • open_cnv_coverage_calls - gets its data from coverage_calls/{analysis}/zarr
  • open_cnv_discordant_read_calls - gets its data from cnv/{sample_set}/{calls_version}/zarr
  • open_cnv_hmm - gets its data from cnv/{sample_set}/hmm/zarr
  • open_file - simply self._fs.open(f"{self._base_path}/{path}")
  • open_genome - gets its data from {self._base_path}/{self._genome_zarr_path}
  • open_haplotype_sites - gets its data from snp_haplotypes/sites/{analysis}/zarr
  • open_haplotypes - gets its data from snp_haplotypes/{sample_set}/{analysis}/zarr
  • open_site_annotations - gets its data from {self._base_path}/{self._site_annotations_zarr_path}
  • open_site_filters - gets its data from site_filters/{self._site_filters_analysis}/{mask_prepped}/
  • open_snp_genotypes - gets its data from snp_genotypes/all/{sample_set}/
  • open_snp_sites - gets its data from snp_genotypes/all/sites/
  • pairwise_average_fst - TO INVESTIGATE: uses _setup_cohort_queries not _prep_sample_query_param
  • pca - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • plot_aim_heatmap - uses aim_calls and go_make_subplots
  • plot_cnv_hmm_coverage - uses plot_cnv_hmm_coverage_track and plot_genes
  • plot_cnv_hmm_coverage_track - uses cnv_hmm
  • plot_cnv_hmm_heatmap - uses plot_cnv_hmm_heatmap_track and plot_genes
  • plot_cnv_hmm_heatmap_track - uses cnv_hmm
  • plot_diplotype_clustering - uses sample_metadata and diplotype_pairwise_distances
  • plot_diplotype_clustering_advanced - TO INVESTIGATE: uses plot_diplotype_clustering but also _dipclust_het_bar_trace and _dipclust_cnv_bar_trace and _insert_dipclust_snp_trace and _dipclust_concat_subplots
  • plot_diversity_stats - gets its data from df_stats param
  • plot_frequencies_heatmap - gets its data from df param
  • plot_frequencies_interactive_map - gets its data from ds param
  • plot_frequencies_map_markers - gets its data from ds param
  • plot_frequencies_time_series - gets its data from ds param
  • plot_fst_gwss - uses plot_fst_gwss_track and plot_genes
  • plot_fst_gwss_track - uses fst_gwss
  • plot_g123_calibration - uses g123_calibration
  • plot_g123_gwss - uses plot_g123_gwss_track and plot_genes
  • plot_g123_gwss_track - uses g123_gwss
  • plot_genes - TO INVESTIGATE: uses genome_sequence and _plot_genes_setup_data
  • plot_h12_calibration - uses h12_calibration
  • plot_h12_gwss - uses plot_h12_gwss_track and plot_genes
  • plot_h12_gwss_multi_overlay - uses plot_h12_gwss_multi_overlay_track and plot_genes
  • plot_h12_gwss_multi_overlay_track - TO INVESTIGATE: uses h12_gwss and _setup_cohort_queries not _prep_sample_query_param
  • plot_h12_gwss_multi_panel - TO INVESTIGATE: uses plot_h12_gwss_track and plot_genes and _setup_cohort_queries not _prep_sample_query_param
  • plot_h12_gwss_track - uses h12_gwss
  • plot_h1x_gwss - uses plot_h1x_gwss_track and plot_genes
  • plot_h1x_gwss_track - uses h1x_gwss
  • plot_haplotype_clustering - TO INVESTIGATE: uses sample_metadata and haplotype_pairwise_distances but also _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly and plot_dendrogram
  • plot_haplotype_network - uses haplotypes and sample_metadata and median_joining_network and mjn_graph and plotly_discrete_legend (I expect _setup_sample_colors_plotly is OK)
  • plot_heterozygosity - uses plot_heterozygosity_track and plot_genes
  • plot_heterozygosity_track - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track
  • plot_ihs_gwss - uses plot_ihs_gwss_track and plot_genes
  • plot_ihs_gwss_track - uses ihs_gwss
  • plot_njt - uses njt and sample_metadata (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
  • plot_pairwise_average_fst - gets its data from fst_df param
  • plot_pca_coords - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
  • plot_pca_coords_3d - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
  • plot_pca_variance - gets its data from evr param (simple bar plot)
  • plot_roh - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track and _roh_hmm_predict and plot_roh_track and plot_genes
  • plot_roh_track - gets its data from df_roh param
  • plot_sample_location_geo - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_sample_location_mapbox - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_samples_bar - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_samples_interactive_map - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_snps - uses plot_snps_track and plot_genes
  • plot_snps_track - uses snp_allele_counts and snp_variants and genome_sequence
  • plot_transcript - uses genome_features and genome_feature_children
  • plot_xpehh_gwss - uses plot_xpehh_gwss_track and plot_genes
  • plot_xpehh_gwss_track - uses xpehh_gwss
  • read_files - gets its data from its paths param
  • results_cache_get - gets its data from results.zarr.zip or results.npz
  • results_cache_set - gets its data from its params and results params and writes params.json and results.zarr.zip
  • roh_hmm - TO INVESTIGATE: uses _sample_count_het and _roh_hmm_predict
  • sample_metadata - uses _prep_sample_query_param and general_metadata and sequence_qc_metadata and surveillance_flags and aim_metadata and cohorts_metadata
  • sample_sets - uses _read_sample_sets and _unrestricted_use_only
  • sequence_qc_metadata - gets its data from sequence_qc_stats.csv
  • site_annotations - uses _site_annotations_raw and snp_sites
  • site_filters - uses _site_filters_for_region
  • snp_allele_counts - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • snp_allele_frequencies - uses sample_metadata and snp_calls
  • snp_allele_frequencies_advanced - uses sample_metadata and snp_calls
  • snp_calls - uses _prep_sample_query_param and _snp_calls
  • snp_dataset - uses snp_calls
  • snp_effects - uses snp_variants and _snp_df_melt and _snp_effect_annotator
  • snp_genotype_allele_counts - uses snp_calls
  • snp_genotypes - uses _prep_sample_query_param and _snp_genotypes_for_contig
  • snp_sites - uses _snp_sites_for_region and site_filters
  • snp_variants - uses _snp_variants_for_contig
  • surveillance_flags - gets its data from surveillance.flags.csv
  • view_alignments - uses _igv_view_alignments_tracks and igv
  • wgs_data_catalog - TO CONSIDER: gets its data from wgs_snp_data.csv (NOT FILTERED)
  • wgs_run_accessions - TO CONSIDER: gets its data from wgs_accession_data.csv (NOT FILTERED)
  • xpehh_gwss - uses _prep_sample_query_param and _xpehh_gwss
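A first pass over a list like this could be partly automated with the standard `ast` module, by collecting the `self.<name>(...)` calls made in each method body. This is a sketch under assumptions: the `Toy` class is hypothetical, only direct calls within a single class body are detected (not transitive calls or methods inherited from other classes), and the set of "trusted" methods has to be chosen by hand.

```python
import ast
import textwrap


def self_calls(class_source):
    """Map each method name to the set of self.<name>(...) calls in its body."""
    tree = ast.parse(textwrap.dedent(class_source))
    class_def = next(n for n in ast.walk(tree) if isinstance(n, ast.ClassDef))
    calls = {}
    for item in class_def.body:
        if isinstance(item, ast.FunctionDef):
            calls[item.name] = {
                node.func.attr
                for node in ast.walk(item)
                if isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "self"
            }
    return calls


def uncovered(calls, trusted):
    """Public methods that never directly call a trusted (filtered) method."""
    return sorted(
        name
        for name, used in calls.items()
        if not name.startswith("_")
        and name not in trusted
        and not (used & trusted)
    )


SOURCE = '''
class Toy:
    def sample_metadata(self):
        return self._prep_sample_query_param()

    def count_samples(self):
        return len(self.sample_metadata())

    def wgs_data_catalog(self):
        return self.read_files()
'''

calls = self_calls(SOURCE)
print(uncovered(calls, trusted={"sample_metadata", "_prep_sample_query_param"}))
```

For a real class, the source string could come from `inspect.getsource`. The output is only a list of candidates for manual review, not proof that the remaining methods are safe.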

@leehart leehart changed the title Support unrestricted_use_only and surveillance_use_only constructor params Add unrestricted_use_only and surveillance_use_only constructor params Apr 24, 2025
@leehart
Collaborator Author

leehart commented Apr 24, 2025

Now investigating unexpected test failures after merging with master branch.

Local pytest:

446 passed, 38 warnings, 224 errors

CI pytest:

474 passed, 36 warnings, 224 errors

@jonbrenas
Collaborator

Thanks @leehart. For IGV to work, it needed to have access to a 'public' version of the reference genomes, e.g., at gs://vo_anoph_temp_us_central1/vo_afun_release/reference/genome/idAnoFuneDA-416_04. This, in turn, needed to be passed to the AnophelesBase constructor. The tests use simulated data so the link needs to be specified (but None should work).

@leehart
Collaborator Author

leehart commented Apr 24, 2025

Following discussion, the plan is now to reduce the scope of honouring these constructor params to only currently "documented" public functions (as per docs/source/Ag3.rst, Af1 and Amin1) and to change other undocumented functions to private functions, e.g. open_cnv_coverage_calls to _open_cnv_coverage_calls.

@leehart
Collaborator Author

leehart commented Apr 24, 2025

List of "documented" public functions/properties to check:

Note: I have crossed off functions/properties that should be OK once their sub-functions are OK, to help us focus on the root issues, but this doesn't imply that they are themselves "safe", only that any potential issues appear to be confined to their sub-functions.

Basic data access

  • releases - Do we want to filter these accordingly? E.g. only return releases that are unrestricted or have surveillance data. Note: this PR currently uses a new relevant_releases property to get the filtered set of releases, for tests, etc.
  • sample_sets - Honours unrestricted_use_only. Do we want to filter these according to surveillance_use_only? E.g. only return sample sets that have at least one surveillance sample when surveillance_use_only=True.
  • lookup_release - Only uses sample_sets, so when sample_sets is resolved, this should be too.
  • lookup_study - Only uses sample_sets, so when sample_sets is resolved, this should be too.
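If we did decide to filter sample sets and releases by surveillance status, one possible rule is "keep a sample set if it has at least one surveillance sample, and keep a release if it has at least one kept sample set". A minimal sketch of that rule, assuming plain dict records; the function names and the `release` field are hypothetical, and this is not how the relevant_releases property in this PR is implemented:

```python
def surveillance_sample_sets(samples):
    """Ids of sample sets containing at least one surveillance sample."""
    return {s["sample_set"] for s in samples if s["is_surveillance"]}


def filter_releases(sample_sets, keep):
    """Releases that still contain at least one sample set after filtering."""
    return sorted({r["release"] for r in sample_sets if r["sample_set"] in keep})


# Toy data for illustration only.
samples = [
    {"sample_id": "s1", "sample_set": "set-A", "is_surveillance": True},
    {"sample_id": "s2", "sample_set": "set-A", "is_surveillance": False},
    {"sample_id": "s3", "sample_set": "set-B", "is_surveillance": False},
]
sample_sets = [
    {"sample_set": "set-A", "release": "1.0"},
    {"sample_set": "set-B", "release": "2.0"},
]

keep = surveillance_sample_sets(samples)
print(keep, filter_releases(sample_sets, keep))
```

Here `set-B` has no surveillance samples, so both it and release `2.0` would be hidden when surveillance_use_only=True.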

Reference genome data access

  • contigs - This basically returns self.config["CONTIGS"] and doesn't look connected.
  • genome_sequence - This uses parse_single_region and _genome_sequence_for_contig and doesn't look connected.
  • genome_features - This uses _prep_gff_attributes, parse_multi_region, _genome_features_for_contig, _genome_features. _genome_features reads from f"{self._base_path}/{self._geneset_gff3_path}". _genome_features_for_contig uses genome_sequence, which looks unconnected. This doesn't look connected.
  • plot_transcript - This uses genome_features, which looks unconnected, and genome_feature_children. genome_feature_children uses _genome_features, which reads from f"{self._base_path}/{self._geneset_gff3_path}". This doesn't look connected.
  • plot_genes - This uses genome_sequence, which doesn't look connected, and _plot_genes_setup_data, which uses genome_features, which also doesn't look connected.

Sample metadata access

  • sample_metadata - This uses _prep_sample_query_param, which should honour surveillance_use_only. This also uses general_metadata, sequence_qc_metadata, surveillance_flags, aim_metadata, cohorts_metadata. There is an additional mechanism in place to restrict samples according to surveillance_use_only. Since this function uses either the sample sets returned by sample_sets() or the given param sample_sets, this should be resolved when sample_sets() is resolved too. (See above.)
  • add_extra_metadata - Uses sample_metadata to match against the given data param. If the user has set surveillance_use_only=True but provides data for sample ids that are not surveillance samples, I expect they would want only the surveillance samples from the provided data to be used, and no other samples to be included. It appears that sample_metadata() uses a left join in its merge with _extra_metadata, which occurs (crucially) after the samples have already been filtered by is_surveillance. So it shouldn't be possible to accidentally (or purposely) add in incompatible samples via add_extra_metadata(), but that is probably worth checking. Similarly, we also want to ensure that samples that contradict the unrestricted_use_only setting can't be added this way.
  • clear_extra_metadata - This is just self._extra_metadata = [], so looks unconnected.
  • cross_metadata - This currently reads from f"{self._base_path}/v3/metadata/crosses/crosses.fam" with no filtering, i.e. without honouring either unrestricted_use_only or surveillance_use_only.
  • count_samples - This uses sample_metadata and looks like it should be resolved whenever that is resolved.
  • lookup_sample - This uses sample_metadata and looks like it should be resolved whenever that is resolved.
  • plot_samples_bar - This uses sample_metadata and looks like it should be resolved whenever that is resolved.
  • plot_samples_interactive_map - This uses sample_metadata and looks like it should be resolved whenever that is resolved.
  • plot_sample_location_mapbox - This uses sample_metadata and looks like it should be resolved whenever that is resolved.
  • plot_sample_location_geo - This uses sample_metadata and looks like it should be resolved whenever that is resolved.
  • wgs_data_catalog - This reads f"{self._base_path}/{release_path}/metadata/general/{sample_set}/wgs_snp_data.csv" with no filtering, i.e. without honouring either unrestricted_use_only or surveillance_use_only.
  • cohorts - This reads f"{major_version_path[:2]}_cohorts/cohorts_{cohorts_analysis}/cohorts_{cohort_set}.csv", where none of those data specify particular samples or sample_sets. There is a cohort_size number though, which is fixed irrespective of unrestricted_use_only or surveillance_use_only. It looks like these data are connected to the samples relating to each cohort_id via the samples.cohorts.tsv file for each sample set in GCS. So it looks like this function is connected but "insulated". Perhaps the cohort_size could be calculated on the fly rather than hard-coded, if we need to update that number depending on unrestricted_use_only or surveillance_use_only.
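The left-join reasoning for add_extra_metadata above can be checked with a small pandas example. This is a toy demonstration of `DataFrame.merge(how="left")` semantics with made-up columns, not a test of the actual sample_metadata() code path:

```python
import pandas as pd

# Metadata after the is_surveillance filter has already been applied.
df_samples = pd.DataFrame({"sample_id": ["s1", "s3"], "site": ["x", "y"]})

# User-supplied extra metadata, including a sample (s2) that was filtered out.
df_extra = pd.DataFrame({"sample_id": ["s1", "s2"], "my_label": ["a", "b"]})

# A left join keeps only keys present in the left frame; s2 is not re-introduced.
df_merged = df_samples.merge(df_extra, on="sample_id", how="left")
print(df_merged)
```

The merged frame contains only `s1` and `s3`, with a missing `my_label` for `s3`. A left join can still duplicate left rows if the extra data has repeated sample ids, so that case would be worth covering in a real test too.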

SNP data access

  • site_mask_ids - This basically just returns the "SITE_MASK_IDS" entry from self.config, so looks unconnected.

  • snp_calls - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _snp_calls, which uses sample_metadata and looks like it should be resolved when that's resolved too.

  • snp_allele_counts - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _snp_allele_counts, which uses snp_calls and looks like it should be resolved when that's resolved too.

  • plot_snps - This uses plot_snps_track and plot_genes. plot_snps_track uses snp_allele_counts, snp_variants and genome_sequence. snp_variants uses _snp_variants_for_contig, which uses genome_sequence, open_snp_sites and open_site_filters. open_snp_sites reads f"{self._base_path}/{self._major_version_path}/snp_genotypes/all/sites/" and open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which both look unconnected. It looks like this function should be resolved when its other subfunctions are resolved, i.e. plot_genes, snp_allele_counts and genome_sequence.

  • site_annotations - This uses (1) _site_annotations_raw and (2) snp_sites.

    • (1) _site_annotations_raw uses (1a) open_site_annotations, which reads f"{self._base_path}/{self._site_annotations_zarr_path}", which looks unconnected.
    • (2) snp_sites uses (2a) _snp_sites_for_region and (2b) site_filters.
      • (2a) _snp_sites_for_region uses (2a.i.) _snp_sites_for_contig.
        • (2a.i.) _snp_sites_for_contig uses (2a.i.1.) genome_sequence.
      • (2b) site_filters uses (2b.i.) _site_filters_for_region, which uses (2b.i.1.) _site_filters_for_contig and (2b.i.2.) _snp_sites_for_contig.
        • (2b.i.1.) _site_filters_for_contig uses (2b.i.1.a.) open_site_filters.
          • (2b.i.1.a.) open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which looks unconnected.
        • (2b.i.2.) _snp_sites_for_contig uses (2b.i.2.a.) genome_sequence.
    • It looks like this function should be resolved when genome_sequence is resolved.
  • is_accessible - This uses (1) genome_sequence, (2) snp_sites and (3) _site_filters_for_region.

    • (2) snp_sites uses (2a) _snp_sites_for_region and (2b) site_filters.
      • (2a) _snp_sites_for_region uses (2a.i) _snp_sites_for_contig
        • (2a.i) _snp_sites_for_contig uses genome_sequence.
      • (2b) site_filters uses (2b.i) _site_filters_for_region.
        • (2b.i) _site_filters_for_region uses (2b.i.1.) _site_filters_for_contig and (2b.i.2) _snp_sites_for_contig.
          • (2b.i.1.) _site_filters_for_contig uses (2b.i.1.a.) open_site_filters.
            • (2b.i.1.a.) open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which looks unconnected.
          • (2b.i.2.) _snp_sites_for_contig uses (2b.i.2.a.) genome_sequence.
    • (3) _site_filters_for_region uses (3a) _site_filters_for_contig and (3b) _snp_sites_for_contig.
      • (3a) _site_filters_for_contig uses (3a.i.) open_site_filters.
        • (3a.i.) open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which looks unconnected.
      • (3b) _snp_sites_for_contig uses (3b.i.) genome_sequence.
    • It looks like this function should be resolved when genome_sequence is resolved.
  • biallelic_snp_calls - This uses snp_allele_counts, snp_calls. It looks like this function should be resolved when those are resolved.

  • biallelic_diplotypes - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _biallelic_diplotypes, which uses biallelic_snp_calls. It looks like this function should be resolved when biallelic_snp_calls is resolved.

  • biallelic_snps_to_plink - This uses biallelic_snp_calls, and looks like it should be resolved when that's resolved.

Haplotype data access

  • phasing_analysis_ids - This property basically just returns PHASING_ANALYSIS_IDS from the config as a tuple.
  • haplotypes - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _haplotypes_for_contig and sample_metadata. _haplotypes_for_contig uses genome_sequence. This function looks like it should be resolved when sample_metadata and genome_sequence are resolved.
  • haplotype_sites - This uses _haplotype_sites_for_region, which uses _haplotype_sites_for_contig, which uses genome_sequence. This function looks like it should be resolved when genome_sequence is resolved.

AIM data access

  • aim_ids - This basically just returns self._aim_ids.
  • aim_variants - This basically just reads f"{self._base_path}/reference/aim_defs_{analysis}/{aims}.zarr", which doesn't look connected to specific samples.
  • aim_calls - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _aim_calls_dataset and sample_metadata. _aim_calls_dataset reads from f"{self._base_path}/{release_path}/aim_calls_{analysis}/{sample_set}/{aims}.zarr". Although the data from _aim_calls_dataset contains data for all samples in the sample set, it looks like the code selects only the samples and sample metadata according to the sample query, via ds = ds.isel(samples=loc_samples). So this should be resolved when sample_metadata is resolved, but it is probably worth checking that the subselection is working as expected, i.e. that only surveillance_use_only samples are returned when required. In any case, there is no filtering for unrestricted_use_only, so it would currently be possible to get data (AIM calls) for terms-restricted sample sets regardless of the unrestricted_use_only setting. It might make sense to add filtering for unrestricted_use_only to the _prep_sample_sets_param function. We would have to handle cases where the user specifies sample sets that contradict the unrestricted_use_only setting, e.g. as errors or warnings.
  • plot_aim_heatmap - This uses aim_calls and looks like it should be resolved when that is resolved.
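To illustrate the suggestion above of filtering for unrestricted_use_only when preparing the sample sets param, here is one possible shape for that check. It is a sketch only: `prep_sample_sets` is a hypothetical standalone function, not the existing _prep_sample_sets_param (which may accept other forms of input), and raising an error rather than a warning is just one of the options mentioned.

```python
def prep_sample_sets(requested, available, unrestricted, unrestricted_use_only):
    """Resolve a sample_sets param, rejecting terms-restricted sets if required.

    requested: None, a single sample set id, or a list of ids.
    available: all known sample set ids.
    unrestricted: the subset of ids with unrestricted terms of use.
    """
    if requested is None:
        prepped = list(available)
    elif isinstance(requested, str):
        prepped = [requested]
    else:
        prepped = list(requested)

    unknown = [s for s in prepped if s not in available]
    if unknown:
        raise ValueError(f"Unknown sample sets: {unknown}")

    if unrestricted_use_only:
        restricted = [s for s in prepped if s not in unrestricted]
        if requested is None:
            # No explicit request: quietly narrow the default selection.
            prepped = [s for s in prepped if s in unrestricted]
        elif restricted:
            # Explicit request that contradicts the constructor setting.
            raise ValueError(
                f"Sample sets {restricted} are not unrestricted-use, "
                "but unrestricted_use_only=True."
            )
    return prepped
```

Narrowing the default quietly while rejecting explicit requests is a design choice: the default stays usable, and a user who names a restricted sample set finds out why it is unavailable instead of getting a silently smaller result.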

CNV data access

  • coverage_calls_analysis_ids - This basically just returns COVERAGE_CALLS_ANALYSIS_IDS from the config as a tuple.
  • cnv_hmm - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _cnv_hmm_dataset and sample_metadata. _cnv_hmm_dataset uses open_cnv_hmm, which reads from f"{self._base_path}/{release_path}/cnv/{sample_set}/hmm/zarr".

Similar to the issue with aim_calls, the data from _cnv_hmm_dataset contains data for all samples in the sample set, but it looks like the code selects only the samples and sample metadata according to the sample query, via ds = ds.isel(samples=loc_query_samples). So this should be resolved when sample_metadata is resolved, but it is probably worth checking that the subselection is working as expected, i.e. that only surveillance_use_only samples are returned when required.

Again, as with aim_calls, there is no filtering for unrestricted_use_only, so it would currently be possible to get data (CNV HMM data) for terms-restricted sample sets regardless of the unrestricted_use_only setting.

Again, this might be resolved by adding filtering for unrestricted_use_only to the _prep_sample_sets_param function, and we would have to handle cases where the user specifies sample sets that contradict the unrestricted_use_only setting, e.g. as errors or warnings.

  • cnv_coverage_calls - This uses _cnv_coverage_calls_dataset, which uses open_cnv_coverage_calls, which reads from f"{self._base_path}/{release_path}/cnv/{sample_set}/coverage_calls/{analysis}/zarr". There is no filtering for either surveillance_use_only or unrestricted_use_only, so it would currently be possible to get data (CNV coverage calls) for samples regardless of those settings.

Since the function takes a sample_set param (singular) rather than sample_sets (list or singular str), there isn't currently a _prep_sample_sets_param being used either. So we will need a plan for handling this, e.g. raising errors or warnings in cases where the user specifies a sample set that contradicts the unrestricted_use_only setting.

  • cnv_discordant_read_calls - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _cnv_discordant_read_calls_dataset and sample_metadata. _cnv_discordant_read_calls_dataset uses open_cnv_discordant_read_calls, which reads from f"{self._base_path}/{release_path}/cnv/{sample_set}/{calls_version}/zarr".

This function has the same kind of issues as aim_calls and cnv_hmm.

  1. We would want to check that the subselection via ds = ds.isel(samples=loc_query_samples) is working as expected when surveillance_use_only=True.
  2. There is no filtering for unrestricted_use_only, so it would currently be possible to get data (CNV DRC data) for terms-restricted sample sets regardless of the unrestricted_use_only setting.
  3. We might want to add filtering for unrestricted_use_only to the _prep_sample_sets_param function, and handle cases where the user specifies sample sets that contradict the unrestricted_use_only setting.
  • plot_cnv_hmm_coverage - This uses plot_cnv_hmm_coverage_track and plot_genes. plot_cnv_hmm_coverage_track uses cnv_hmm. It looks like this function should be resolved when plot_genes and cnv_hmm are resolved.
  • plot_cnv_hmm_heatmap - This uses plot_cnv_hmm_heatmap_track and plot_genes. plot_cnv_hmm_heatmap_track uses cnv_hmm. It looks like this function should be resolved when plot_genes and cnv_hmm are resolved.
  • gene_cnv - This uses _gene_cnv, which uses genome_features and cnv_hmm. It looks like this function should be resolved when genome_features and cnv_hmm are resolved.

Integrative genomics viewer (IGV)

  • igv - This uses _igv_config, which basically just creates the IGV config and doesn't look directly connected to sample data.
  • view_alignments - This uses _igv_view_alignments_tracks and igv. _igv_view_alignments_tracks uses sample_metadata, wgs_data_catalog and _igv_site_filters_tracks.
  1. sample_metadata is used to get the sample_set for the given sample id string, which is then used with wgs_data_catalog to get the alignments_bam and snp_genotypes_vcf URLs.
  2. _igv_site_filters_tracks reads from f"{self._public_url}{self._major_version_path}/site_filters/{self._site_filters_analysis}/vcf/{site_mask}/{contig}_sitefilters.vcf.gz", which I don't believe is connected directly to sample data.

The end user (or other code) could try to pass a sample id to this function that was incompatible with the surveillance_use_only or unrestricted_use_only settings. The sample_metadata function (in this PR) uses _prep_sample_query_param and an additional mechanism to filter its data according to is_surveillance. However, there is currently no mechanism to honour the unrestricted_use_only setting. We might want to use the _prep_sample_sets_param function, which is also used by sample_metadata, to restrict sample_sets_prepped to those that honour the unrestricted_use_only setting.

In this case, the incompatible sample id would be provided to _igv_view_alignments_tracks but I expect we would want the self.sample_metadata().set_index("sample_id").loc[sample] part to fail, rather than to succeed regardless of the constructor params, and that scenario ought to be handled gracefully.
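As a sketch of what "handled gracefully" could mean here, the lookup could convert a missing sample id into a descriptive error instead of surfacing a bare KeyError. The function below is hypothetical, and the plain dict is only a stand-in for the already-filtered sample metadata indexed by sample_id:

```python
def lookup_sample_set(sample, visible_samples):
    """Return the sample set for a sample id, or raise a descriptive error.

    visible_samples: mapping of sample_id -> sample_set, standing in for the
    filtered sample metadata indexed by sample_id.
    """
    try:
        return visible_samples[sample]
    except KeyError:
        raise ValueError(
            f"Sample {sample!r} not found. It may not exist, or it may be "
            "excluded by surveillance_use_only or unrestricted_use_only."
        ) from None


# Toy data: s2 has been removed by the constructor filters.
visible = {"s1": "set-A"}
print(lookup_sample_set("s1", visible))
```

The error message deliberately does not say which of the reasons applies, so it reveals nothing about whether a filtered-out sample exists.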

SNP and CNV frequency analysis

  • snp_allele_frequencies - This uses sample_metadata, snp_calls, _snp_df_melt, _snp_effect_annotator and _transcript_to_parent_name.
  1. _transcript_to_parent_name uses genome_features and doesn't appear to be directly connected to sample data.
  2. _snp_effect_annotator uses open_genome and genome_features, and doesn't appear to be directly connected to sample data.
  3. _snp_df_melt basically just processes the data given to it via ds_snp and doesn't appear to be directly connected to sample data.

It looks like this function should be resolved when sample_metadata and snp_calls are resolved.

  • snp_allele_frequencies_advanced - Similar to above, this uses sample_metadata, snp_calls, _snp_effect_annotator, _transcript_to_parent_name. It looks like this function should be resolved when sample_metadata and snp_calls are resolved.
  • aa_allele_frequencies - This uses snp_allele_frequencies and _transcript_to_parent_name. _transcript_to_parent_name uses genome_features and doesn't appear to be directly connected to sample data. It looks like this function should be resolved when snp_allele_frequencies is resolved.
  • aa_allele_frequencies_advanced - This uses snp_allele_frequencies_advanced and _transcript_to_parent_name. _transcript_to_parent_name uses genome_features and doesn't appear to be directly connected to sample data. It looks like this function should be resolved when snp_allele_frequencies_advanced is resolved.
  • gene_cnv_frequencies - This uses _gene_cnv_frequencies, which uses gene_cnv and sample_metadata. It looks like this function should be resolved when sample_metadata and gene_cnv are resolved.
  • gene_cnv_frequencies_advanced - This uses _gene_cnv_frequencies_advanced, which uses gene_cnv and sample_metadata. It looks like this function should be resolved when sample_metadata and gene_cnv are resolved.
  • haplotypes_frequencies - This uses sample_metadata and haplotypes. It looks like this function should be resolved when sample_metadata and haplotypes are resolved.
  • haplotypes_frequencies_advanced - This uses sample_metadata and haplotypes. It looks like this function should be resolved when sample_metadata and haplotypes are resolved.
  • plot_frequencies_heatmap - This uses the DataFrame passed to it via its df param, and doesn't look directly connected to sample data.
  • plot_frequencies_time_series - This uses the Dataset passed to it via its ds param, and doesn't look directly connected to sample data.
  • plot_frequencies_interactive_map - This uses the Dataset passed to it via its ds param. This function also uses plot_frequencies_map_markers, which also uses the Dataset passed to it via its ds param. This function doesn't look directly connected to sample data.

Principal components analysis (PCA)

  • pca - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _pca and sample_metadata. _pca uses biallelic_diplotypes, which also uses _prep_sample_selection_cache_params as well as _biallelic_diplotypes. _biallelic_diplotypes uses biallelic_snp_calls, which uses snp_allele_counts and snp_calls.

This function should be partly resolved when snp_allele_counts, snp_calls and sample_metadata are resolved. However, as with other functions, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting. The use of _prep_sample_query_param only enforces the surveillance_use_only setting.

Since the _prep_sample_selection_cache_params function uses both _prep_sample_query_param and _prep_sample_sets_param, it looks like amending the _prep_sample_sets_param function to honour the unrestricted_use_only setting might be the way to go.

  • plot_pca_variance - This uses the explained variance data passed to it via its evr param to basically just produce a bar plot. This function doesn't look directly connected to sample data.
  • plot_pca_coords - This uses the data passed to it via its data param. This function also uses _setup_sample_symbol, _setup_sample_colors_plotly and _setup_sample_hover_data_plotly. This function and the functions it uses don't look directly connected to sample data.
  • plot_pca_coords_3d - This uses the data passed to it via its data param. This function also uses _setup_sample_symbol, _setup_sample_colors_plotly and _setup_sample_hover_data_plotly. This function and the functions it uses don't look directly connected to sample data.

Genetic distance and neighbour-joining trees (NJT)

  • plot_njt - This uses njt and sample_metadata. This function also uses _setup_sample_symbol, _setup_sample_colors_plotly and _setup_sample_hover_data_plotly, which don't look directly connected to sample data. It looks like this function should be resolved when sample_metadata and njt are resolved.

  • njt - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _njt, which uses biallelic_diplotype_pairwise_distances.

This function should be partly resolved when biallelic_diplotype_pairwise_distances is resolved. However, as with other functions, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting. The use of _prep_sample_query_param only enforces the surveillance_use_only setting.

  • biallelic_diplotype_pairwise_distances - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _biallelic_diplotype_pairwise_distances, which uses biallelic_diplotypes.

This function should be partly resolved when biallelic_diplotypes is resolved. However, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting.

I propose that we look at changing the _prep_sample_sets_param function to honour the unrestricted_use_only setting. That change would then be picked up by _prep_sample_selection_cache_params, and the function could potentially be used alongside _prep_sample_query_param elsewhere, wherever the sample_sets param appears.

  • honour unrestricted_use_only in _prep_sample_sets_param
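As a rough sketch of that proposal, assuming the list of allowed sample sets already reflects unrestricted_use_only: the function name prep_sample_sets, its signature and the error wording are illustrative only, and the real _prep_sample_sets_param handles more cases (such as release identifiers) than shown here.

```python
def prep_sample_sets(requested, available):
    """Normalise a sample_sets argument against the allowed sample sets.

    `available` stands in for the sample set ids that remain once
    unrestricted_use_only has been applied (assumption for this sketch).
    """
    if requested is None:
        # No explicit request: fall back to everything that is allowed.
        return list(available)
    if isinstance(requested, str):
        requested = [requested]
    allowed = set(available)
    for sample_set in requested:
        if sample_set not in allowed:
            raise ValueError(
                f"Sample set {sample_set!r} not found. This sample set might "
                "be unavailable or irrelevant with respect to settings."
            )
    # Preserve the requested order while dropping duplicates.
    return list(dict.fromkeys(requested))


# Hypothetical ids for illustration.
print(prep_sample_sets("set-1", ["set-1", "set-2"]))
```

Raising, rather than silently dropping the restricted sample set, keeps the contradiction between the call and the constructor settings visible to the user.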

By the way, I suspect that we will probably need a smarter mechanism for _prep_sample_query_param, which currently just blindly tags on f"{sample_query} and is_surveillance == True", which could cause an absurd but otherwise benign accumulation of duplicated query criteria.

  • avoid duplications caused by _prep_sample_query_param
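One possible shape for a less blind version is sketched below, under the assumption that the criterion is only ever appended at the end of the query. The name add_surveillance_criterion is hypothetical, and the suffix check is a heuristic: it will not notice the criterion if it appears mid-query.

```python
SURVEILLANCE_CRITERION = "is_surveillance == True"


def add_surveillance_criterion(sample_query):
    """Append the surveillance criterion unless the query already ends with it."""
    if sample_query is None or not sample_query.strip():
        return SURVEILLANCE_CRITERION
    query = sample_query.strip()
    already_present = (
        query == SURVEILLANCE_CRITERION
        or query.endswith(f" and {SURVEILLANCE_CRITERION}")
    )
    if already_present:
        return query
    # Parenthesise the user query so that an "or" inside it cannot
    # change how the appended criterion is grouped.
    return f"({query}) and {SURVEILLANCE_CRITERION}"


once = add_surveillance_criterion("country == 'Ghana'")
twice = add_surveillance_criterion(once)
print(once)
print(once == twice)
```

Applying the function repeatedly leaves the query unchanged after the first call, which is the property we want when several nested functions each prepare the same query.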

Heterozygosity analysis

  • plot_heterozygosity - This uses plot_heterozygosity_track and plot_genes. plot_heterozygosity_track uses (1) _sample_count_het and (2) _plot_heterozygosity_track.

    • (1) _sample_count_het uses (1.a.) lookup_sample and (1.b.) snp_calls.
      • (1.a.) lookup_sample uses sample_metadata.
    • (2) _plot_heterozygosity_track uses genome_sequence.

It looks like this function should be resolved when plot_genes, snp_calls and genome_sequence are resolved.

This function can take sample (actually samples) and sample_set (singular), so needs to respond appropriately if the settings given to the constructor contradict them.

  • roh_hmm - This uses (1) _sample_count_het and (2) _roh_hmm_predict.

    • (1) _sample_count_het uses (1.a.) lookup_sample and (1.b.) snp_calls.
      • (1.a.) lookup_sample uses sample_metadata.
    • (2) _roh_hmm_predict is passed a sample_id but otherwise doesn't look directly connected to sample data. Since this is already a private function, I don't think we need to throw an error if the given sample_id isn't compatible with the constructor params.

It looks like this function should be resolved when snp_calls and sample_metadata are resolved.

  • plot_roh - This uses (1) _sample_count_het, (2) _plot_heterozygosity_track, (3) _roh_hmm_predict, (4) plot_roh_track and (5) plot_genes.

    • (1) _sample_count_het uses (1.a.) lookup_sample and (1.b.) snp_calls.
      • (1.a.) lookup_sample uses sample_metadata.
    • (2) _plot_heterozygosity_track uses genome_sequence.
    • (3) _roh_hmm_predict is passed a sample_id but is private, as mentioned above.
    • (4) plot_roh_track uses genome_sequence.

It looks like this function should be resolved when plot_genes, snp_calls, sample_metadata and genome_sequence are resolved.

Diversity analysis

  • cohort_diversity_stats - This uses sample_metadata, snp_allele_counts and _block_jackknife_cohort_diversity_stats. _block_jackknife_cohort_diversity_stats doesn't look directly connected to sample data.

It looks like this function should be resolved when sample_metadata and snp_allele_counts are resolved.

  • diversity_stats - This uses _setup_cohort_queries and cohort_diversity_stats. _setup_cohort_queries uses sample_metadata, but also uses sample_query in other ways that look like they need closer inspection.

This function should be partly resolved when cohort_diversity_stats is resolved. However, _setup_cohort_queries does not use _prep_sample_query_param directly and needs to be checked more thoroughly for potential leaks. Plus, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting.

  • Check _setup_cohort_queries more thoroughly.

  • plot_diversity_stats - This uses _setup_sample_colors_plotly, which does not look directly connected to sample data. This function also uses the DataFrame passed to it via its df_stats param, and doesn't look directly connected to sample data.

Genome-wide selection scans

  • h12_calibration - This uses _prep_sample_query_param, which should honour surveillance_use_only. However, we might also want to modify _prep_sample_sets_param to honour unrestricted_use_only too. This function also uses _h12_calibration, which uses haplotypes. It looks like this function should be resolved when haplotypes and _prep_sample_sets_param are resolved.

  • plot_h12_calibration - This uses h12_calibration, and looks like it should be resolved when that is resolved.

  • h12_gwss - This uses _prep_sample_sets_param and _prep_sample_query_param. This function also uses _h12_gwss, which uses haplotypes. It looks like this function should be resolved when haplotypes and _prep_sample_sets_param are resolved.

  • plot_h12_gwss - This uses plot_h12_gwss_track and plot_genes. plot_h12_gwss_track uses h12_gwss. It looks like this function should be resolved when plot_genes and h12_gwss are resolved.

  • plot_h12_gwss_multi_panel - This uses (1) _setup_cohort_queries, (2) plot_h12_gwss_track and (3) plot_genes.

    • (1) _setup_cohort_queries uses sample_metadata but also requires closer inspection.
    • (2) plot_h12_gwss_track uses (2.a.) h12_gwss and (2.b.) _bokeh_style_genome_xaxis.
      • (2.b.) _bokeh_style_genome_xaxis doesn't look like it's directly connected to sample data.

It looks like this function should be resolved when _setup_cohort_queries, plot_genes, sample_metadata and h12_gwss are resolved.

  • plot_h12_gwss_multi_overlay - This uses plot_h12_gwss_multi_overlay_track and plot_genes. plot_h12_gwss_multi_overlay_track uses _setup_cohort_queries, h12_gwss and _bokeh_style_genome_xaxis. _bokeh_style_genome_xaxis doesn't look like it's directly connected to sample data.

This function should be partly resolved when plot_genes is resolved. However, as mentioned previously, _setup_cohort_queries does not use _prep_sample_query_param directly and needs to be checked more thoroughly for potential leaks. Plus, there is currently no protection against passing in sample set identifiers (via its sample_sets param) that contradict the unrestricted_use_only setting.

  • h1x_gwss - This uses _prep_sample_query_param and _prep_sample_sets_param. This function also uses _h1x_gwss, which uses haplotypes. It looks like this function should be resolved when haplotypes and _prep_sample_sets_param are resolved.

  • plot_h1x_gwss - This uses plot_h1x_gwss_track and plot_genes. plot_h1x_gwss_track uses h1x_gwss and _bokeh_style_genome_xaxis. _bokeh_style_genome_xaxis doesn't look like it's directly connected to sample data. It looks like this function should be resolved when plot_genes and h1x_gwss are resolved.

  • g123_calibration - This uses _prep_sample_sets_param and _prep_sample_query_param. _prep_optional_site_mask_param doesn't look connected to sample data. This function also uses _g123_calibration, which uses _load_data_for_g123, which uses snp_calls and haplotype_sites. It looks like this function should be resolved when _prep_sample_sets_param, snp_calls and haplotype_sites are resolved.

  • plot_g123_calibration - This uses g123_calibration and it looks like this should be resolved when that is resolved.

  • g123_gwss - This uses _prep_sample_sets_param and _prep_sample_query_param. This function also uses _g123_gwss, which uses _load_data_for_g123, which uses snp_calls and haplotype_sites. It looks like this function should be resolved when _prep_sample_sets_param, snp_calls and haplotype_sites are resolved.

  • plot_g123_gwss - This uses plot_g123_gwss_track and plot_genes. plot_g123_gwss_track uses g123_gwss and _bokeh_style_genome_xaxis. It looks like this function should be resolved when plot_genes and g123_gwss are resolved.

  • ihs_gwss - This uses _prep_sample_sets_param and _prep_sample_query_param. This function also uses _prep_phasing_analysis_param, which doesn't look directly connected to sample data. This function also uses _ihs_gwss, which uses haplotypes. It looks like this function should be resolved when _prep_sample_sets_param and haplotypes are resolved.

  • plot_ihs_gwss - This uses plot_ihs_gwss_track and plot_genes. plot_ihs_gwss_track uses ihs_gwss and _bokeh_style_genome_xaxis. It looks like this function should be resolved when plot_genes and ihs_gwss are resolved.

  • xpehh_gwss - This uses _prep_phasing_analysis_param, _prep_sample_sets_param, _prep_sample_query_param and _xpehh_gwss. _xpehh_gwss uses haplotypes. It looks like this function should be resolved when _prep_sample_sets_param and haplotypes are resolved.

  • plot_xpehh_gwss - This uses plot_xpehh_gwss_track and plot_genes. plot_xpehh_gwss_track uses xpehh_gwss and _bokeh_style_genome_xaxis. It looks like this function should be resolved when plot_genes and xpehh_gwss are resolved.

Haplotype clustering and network analysis

  • plot_haplotype_clustering - This uses sample_metadata, haplotype_pairwise_distances, _setup_sample_symbol, _setup_sample_colors_plotly, _setup_sample_hover_data_plotly and plot_dendrogram. plot_dendrogram, _setup_sample_symbol and _setup_sample_colors_plotly don't look directly connected to sample data. _setup_sample_hover_data_plotly doesn't look directly connected to sample data, although it does have the string sample_id hard-coded. It looks like this function should be resolved when sample_metadata and haplotype_pairwise_distances are resolved.
  • plot_haplotype_network - This uses haplotypes, sample_metadata, median_joining_network, _setup_sample_colors_plotly, mjn_graph and plotly_discrete_legend. median_joining_network, _setup_sample_colors_plotly, mjn_graph and plotly_discrete_legend don't look directly connected to sample data. It looks like this function should be resolved when haplotypes and sample_metadata are resolved.
  • haplotype_pairwise_distances - This uses _prep_sample_sets_param, _prep_sample_query_param and _prep_region_cache_param. This function also uses _haplotype_pairwise_distances, which uses haplotypes. It looks like this function should be resolved when _prep_sample_sets_param and haplotypes are resolved.

Diplotype clustering

  • plot_diplotype_clustering - This uses (1) sample_metadata, (2) diplotype_pairwise_distances, (3) _setup_sample_symbol, (4) _setup_sample_colors_plotly, (5) _setup_sample_hover_data_plotly and (6) plot_dendrogram.
    • (2) diplotype_pairwise_distances uses (2.a.) _prep_sample_sets_param, (2.b.) _prep_sample_query_param and (2.c.) _diplotype_pairwise_distances.
      • (2.c.) _diplotype_pairwise_distances uses snp_calls.
    • (3) _setup_sample_symbol and (4) _setup_sample_colors_plotly don't look directly connected to sample data. (5) _setup_sample_hover_data_plotly doesn't look directly connected to sample data, although it does have the string sample_id hard-coded.
    • (6) plot_dendrogram doesn't look directly connected to sample data.

It looks like this function should be resolved when sample_metadata, _prep_sample_sets_param and snp_calls are resolved.

  • plot_diplotype_clustering_advanced - This uses plot_diplotype_clustering_advanced (itself), (1) _dipclust_het_bar_trace, (2) _dipclust_cnv_bar_trace, (3) _insert_dipclust_snp_trace and (4) _dipclust_concat_subplots.
    • (1) _dipclust_het_bar_trace uses snp_calls.
    • (2) _dipclust_cnv_bar_trace uses gene_cnv.
    • (3) _insert_dipclust_snp_trace uses (3.a.) _dipclust_snp_trace.
      • (3.a.) _dipclust_snp_trace uses (3.a.i.) snp_genotype_allele_counts.
        • (3.a.i.) snp_genotype_allele_counts uses (3.a.i.1.) snp_calls, (3.a.i.2) _snp_df_melt and (3.a.i.3.) _snp_effect_annotator.
          • (3.a.i.2) _snp_df_melt uses the Dataset from its ds_snp param and doesn't look directly connected to sample data.
          • (3.a.i.3.) _snp_effect_annotator uses open_genome and genome_features, and doesn't appear to be directly connected to sample data.
    • (4) _dipclust_concat_subplots has params sample_sets and sample_query but only uses those for titles and doesn't look directly connected to sample data.

It looks like this function should be resolved when snp_calls and gene_cnv are resolved.

Fst analysis

  • average_fst - This uses snp_allele_counts and looks like it should be resolved when that's resolved.
  • pairwise_average_fst - This uses _setup_cohort_queries and average_fst. _setup_cohort_queries is already on the list for further investigation. It looks like this function should be resolved when _setup_cohort_queries and average_fst are resolved.
  • plot_pairwise_average_fst - This uses the DataFrame from its fst_df param and doesn't look directly connected to sample data.
  • fst_gwss - This uses _prep_sample_query_param, _prep_sample_sets_param, _prep_optional_site_mask_param and _fst_gwss. _fst_gwss uses snp_allele_counts and snp_sites. It looks like this function should be resolved when _prep_sample_sets_param, snp_allele_counts and snp_sites are resolved.
  • plot_fst_gwss - This uses plot_fst_gwss_track and plot_genes. plot_fst_gwss_track uses fst_gwss. It looks like this function should be resolved when plot_genes and fst_gwss are resolved.

Inversion karyotypes

  • karyotype - This uses load_inversion_tags and snp_calls. load_inversion_tags basically just reads from self._inversion_tag_path. It looks like this function should be resolved when snp_calls is resolved.

@leehart
Collaborator Author

leehart commented Apr 29, 2025

In summary, the following publicly-documented functions (about 20, out of about 100) need further investigation, discussion, decision or resolution, with regard to compliance with the two new constructor params:

  • releases - Reprogrammed accordingly.

  • sample_sets - Reprogrammed accordingly.

  • add_extra_metadata - We wanted to check this to ensure that it was not possible to add samples (via the data param) that are incompatible with either the surveillance_use_only or the unrestricted_use_only settings.

    • Each set of _extra_metadata is added during sample_metadata through a merge using a left join, so it should only be possible to add columns to samples that already exist in the df_samples DataFrame at that point. However, df_samples is based on general_metadata and the merge happens before the _prep_sample_query_param is applied.
    • general_metadata uses _prep_sample_sets_param, which uses _relevant_sample_sets (in this PR), so that should honour both unrestricted_use_only and surveillance_use_only with respect to sample sets. At the sample level, the application of _prep_sample_query_param should apply the is_surveillance == True injected query criterion.
    • It is still possible (in this PR) to add non-surveillance samples to the _extra_metadata mechanism when using surveillance_use_only, but those will still be filtered out when using functions such as sample_metadata, via the _prep_sample_query_param mechanism.
    • When using unrestricted_use_only, attempting to add metadata for samples from a sample set that is restricted (via add_extra_metadata) will raise a ValueError, as per existing code, because none of the samples in the restricted set will match the samples returned by sample_metadata. However, if there is just one "unrestricted" sample in that set of data, then this error will not be raised. Instead, in this PR, the _prep_sample_sets_param, which is used in functions such as sample_metadata, will prevent irrelevant sample sets and their samples being included in the returned data, via use of _relevant_sample_sets.
  • cross_metadata - This used to return data for crosses regardless of unrestricted_use_only or surveillance_use_only, even though the crosses are (judging by AG1000G-X/surveillance.flags.csv) not for surveillance use.

    • For when _unrestricted_use_only is set True, I've added code to check whether the AG1000G-X sample set is marked for unrestricted use, and return an empty DataFrame if it isn't.
    • For when _surveillance_use_only is set True, I've added code to use the surveillance flags for the AG1000G-X sample set to only return data for samples that have is_surveillance set True.
  • wgs_data_catalog - This currently just reads wgs_snp_data.csv without filtering. To be consistent, I reckon we want to filter out any samples or sample sets that don't match the given unrestricted_use_only or surveillance_use_only settings.

    • For when _unrestricted_use_only is set True, I've added code to check whether the specified sample set is marked for unrestricted use, and return an empty DataFrame if it isn't. This is currently an extra safeguard, since wgs_data_catalog uses lookup_release(sample_set), which uses sample_sets, which already filters on unrestricted_use.
    • For when _surveillance_use_only is set True, I've added code to use the surveillance flags for the specified sample set to only return data for samples that have is_surveillance set True.
  • cohorts - Although the cohorts_{cohort_set}.csv returned doesn't specify particular samples or sample sets, there is a cohort_size hard-coded into the data file, which I guess might be wrong, depending on the specified settings.

When _unrestricted_use_only is set True, or when _surveillance_use_only is set True, it seems (ideally) that the reported cohort_size should change accordingly, e.g. to not count samples that have is_surveillance as False, and/or to not count samples from sets that have unrestricted_use as False.

Since the cohorts function isn't used internally for anything (outside of tests), and changing this would require a re-release of all of the cohorts data, with a re-engineering or removal of the cohort_size values, both the risk of keeping this and the benefit of changing it seem low. However, we might want to consider adding a caveat to the description for the cohort_size, which currently simply reads "the number of samples in the cohort".

  • Discuss with team and make decision.

  • aim_calls - This needs checking to make sure that AIM data read from {sample_set}/{aims}.zarr is not returned with data records for samples that don't honour surveillance_use_only or unrestricted_use_only appropriately. Also, this function might honour surveillance_use_only by way of _prep_sample_query_param or sample_metadata, but we should also filter for unrestricted_use_only, e.g. only return AIM calls for sample sets that are unrestricted when unrestricted_use_only is set.

    • Scenario 1: When surveillance_use_only is set True and aim_calls is called specifying sample sets that have no surveillance samples.

      • The use of _prep_sample_query_param in aim_calls can cause a ValueError to be raised when there are no relevant samples, i.e. ValueError: No samples found for query 'is_surveillance == True'. This behaviour is probably to be expected, because the user has constructed the object for surveillance-only data but has then explicitly specified a sample set that has no such samples.
    • Scenario 2: When surveillance_use_only is set True and aim_calls is called specifying sample sets that have a mixture of surveillance and non-surveillance samples.

      • This would currently raise an IndexError because the code tries to use ds.isel(samples=loc_samples) but loc_samples comes from sample_metadata with a query applied, whereas ds comes from the Zarr via _aim_calls_dataset without filters, so the indexes won't always match. To resolve this, I've modified the code to select from the AIM calls Dataset based on the relevant sample ids corresponding to the sample_metadata.
    • Scenario 3: Normally, with neither surveillance_use_only nor unrestricted_use_only set, calling aim_calls will (in the current release and in this PR) return data for any samples, regardless of whether they are restricted or non-surveillance. For example, if restricted sample sets with no surveillance samples are specified via sample_sets, then those data will be returned.

    • Scenario 4: When unrestricted_use_only is set True, aim_calls will not (in this PR) return data for samples that belong to restricted sample sets. For example, if no sample sets are specified, then aim_calls will (in this PR) return data for all relevant sample sets, i.e. only unrestricted sample sets.

      • Currently (in this PR), to check this behaviour, we can only use the sample ids returned by aim_calls, since aim_calls doesn't currently return sample_set or release ids. The sample ids can be related to their corresponding sample sets via sample_metadata. I've checked that the sample ids returned by aim_calls in this scenario match the expected list of unrestricted sample sets.
      • Currently (in this PR) an error will be raised, as might be expected, if the user tries to specify a restricted sample set when calling aim_calls after having set unrestricted_use_only to True. For example: "Sample set '1270-VO-MULTI-PAMGEN-VMF00244' not found. This sample set might be unavailable or irrelevant with respect to settings."
    • Scenario 5: Likewise, when surveillance_use_only is set True, aim_calls will not (in this PR) return data for non-surveillance samples. If no sample sets are specified, then aim_calls will attempt to return data for all relevant samples, in this case all surveillance samples.

      • I've checked this behaviour (in this PR) by verifying that the sample ids returned by aim_calls match the surveillance sample ids, i.e. all the samples with is_surveillance set to True from the sample_metadata function. I also checked that the sample_metadata function only returns the surveillance sample ids when surveillance_use_only is set True.
      • Currently (in this PR) an error will be raised, as might be expected, if the user tries to specify a sample set with no surveillance samples when calling aim_calls after having set surveillance_use_only to True. For example: "No samples found for query 'is_surveillance == True'". This mechanism was already in place to cover situations where the user provided a sample_query that yielded no samples, but in this case _prep_sample_query_param has added the is_surveillance == True query criterion.
  • cnv_hmm - Similar to aim_calls, this should honour surveillance_use_only by way of _prep_sample_query_param, but should be checked for that. For this function to be compliant, we'll need sample_metadata to be resolved (by way of sample_sets, etc.), and I reckon we also want to add filtering for unrestricted_use_only to _prep_sample_sets_param. If the user specifies sample sets that contradict, I suppose there should be a warning or error raised.

    • I rewrote the code in the part that handles the sample_query, which includes the injected is_surveillance == True, so that it is identical to (consistent with) the aim_calls logic and error messages.
    • Note that cnv_hmm includes an additional filter on the samples it returns based on the max_coverage_variance parameter and each sample's sample_coverage_variance.
  • cnv_coverage_calls - I think we want to filter coverage_calls/{analysis}/zarr according to surveillance_use_only or unrestricted_use_only, as appropriate. Unusually, this function takes a sample_set (singular) param, rather than sample_sets, so it doesn't use _prep_sample_sets_param, which we could otherwise use to intercept. Maybe we'll want something like _is_surveilance_use_only(sample_sets) to easily determine this.

  • cnv_discordant_read_calls - Similar issues to aim_calls and cnv_hmm. We want to check that the subselection is actually working as expected. We want to filter appropriately for unrestricted_use_only, and I think we want to update the _prep_sample_sets_param function to filter based on unrestricted_use_only.

  • view_alignments (perhaps depends on _prep_sample_sets_param amendment) - I think we want to check that the surveillance_use_only setting is honoured and maybe use the _prep_sample_sets_param to honour the unrestricted_use_only setting. If an incompatible sample id is passed in to this function, then we'll need a way to handle that.

  • pca - This depends on _prep_sample_sets_param amendment. I think we want _prep_sample_query_param to enforce surveillance_use_only and _prep_sample_sets_param to enforce unrestricted_use_only.

  • njt - As above, I think we want _prep_sample_query_param to enforce surveillance_use_only and _prep_sample_sets_param to enforce unrestricted_use_only.

  • biallelic_diplotype_pairwise_distances - As above, I think we want _prep_sample_query_param to enforce surveillance_use_only and _prep_sample_sets_param to enforce unrestricted_use_only.

  • _prep_sample_sets_param - We want to honour unrestricted_use_only. This function uses sample_sets as its authority on the list of relevant sample sets, so this should be compliant when sample_sets is compliant.

  • _prep_sample_query_param - We want to avoid duplications of the is_surveillance criterion in the query. I believe this has been mitigated by simply not appending another instance of the is_surveillance criterion if the query string already ends with it. However, this will not cover all situations where the is_surveillance criterion has been added, e.g. mid-query. Although convenient, generally, it doesn't seem right to be re-purposing (polluting or hijacking) custom user queries with automatic data filtering, so perhaps a separate method is required instead, although that would require more thought and re-engineering.

  • diversity_stats - We need to check _setup_cohort_queries doesn't leak and protect against sample sets that contradict unrestricted_use_only.

  • _setup_cohort_queries - Needs checking more thoroughly. Might need amending to honour surveillance_use_only and unrestricted_use_only.

  • plot_h12_gwss_multi_overlay - This depends on _setup_cohort_queries scrutiny.

Many of the remaining functions (about 80) depend on the functions above, and will only behave according to plan once those have been updated.
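For the aim_calls mixed-sample-set case (Scenario 2 above), the idea of selecting by sample id rather than by position can be sketched as follows. This uses plain lists in place of the Zarr-backed Dataset, and the function name select_calls_by_sample_id and the example ids are hypothetical.

```python
def select_calls_by_sample_id(all_sample_ids, calls, wanted_sample_ids):
    """Pick call records by sample id rather than by positional index.

    `all_sample_ids` and `calls` are parallel sequences standing in for the
    unfiltered calls data; `wanted_sample_ids` stands in for the sample ids
    that remain in the filtered sample metadata.
    """
    position = {sample_id: i for i, sample_id in enumerate(all_sample_ids)}
    missing = [s for s in wanted_sample_ids if s not in position]
    if missing:
        raise ValueError(f"Samples not present in calls data: {missing}")
    return [calls[position[s]] for s in wanted_sample_ids]


# Hypothetical data: three samples in the calls data, two pass the filter.
ids = ["s1", "s2", "s3"]
calls = ["calls-for-s1", "calls-for-s2", "calls-for-s3"]
print(select_calls_by_sample_id(ids, calls, ["s1", "s3"]))
```

Because the selection goes through the ids, it no longer matters that the filtered metadata and the unfiltered calls data have different lengths or orderings.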


codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 93.67089% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.06%. Comparing base (0197957) to head (23e9012).
Report is 41 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| malariagen_data/anoph/sample_metadata.py | 90.00% | 4 Missing ⚠️ |
| malariagen_data/anoph/base.py | 92.85% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #724      +/-   ##
==========================================
- Coverage   96.13%   96.06%   -0.08%     
==========================================
  Files          47       47              
  Lines        4683     4749      +66     
==========================================
+ Hits         4502     4562      +60     
- Misses        181      187       +6     

leehart commented May 6, 2025

Currently trying to resolve a recursion error relating to the sample_sets, releases and surveillance_flags functions.

For example, sample_sets needs releases, which needs surveillance_flags, which needs _parse_metadata_paths, which needs _prep_sample_sets_param, which in turn needs sample_sets, and so on ad infinitum.
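
The call chain described above can be written down as a small dependency graph and checked for cycles. This is only a sketch for reasoning about the problem: the edges are taken from the comment, while `find_cycle` and the idea of removing one edge are illustrative, not the fix adopted in this PR.

```python
from typing import Dict, List, Optional, Set

# "X needs Y" edges, as listed in the comment above.
CALLS: Dict[str, List[str]] = {
    "sample_sets": ["releases"],
    "releases": ["surveillance_flags"],
    "surveillance_flags": ["_parse_metadata_paths"],
    "_parse_metadata_paths": ["_prep_sample_sets_param"],
    "_prep_sample_sets_param": ["sample_sets"],
}


def find_cycle(graph: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Depth-first search; return the first call cycle reachable from start."""
    path: List[str] = []    # current chain of calls
    on_path: Set[str] = set()
    done: Set[str] = set()  # nodes fully explored without finding a cycle

    def visit(node: str) -> Optional[List[str]]:
        if node in on_path:
            # We came back to a function that is still on the call chain.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for callee in graph.get(node, []):
            cycle = visit(callee)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


cycle = find_cycle(CALLS, "sample_sets")
print(" -> ".join(cycle))

# Removing any single edge breaks the loop, e.g. if the lowest-level helper
# no longer called the filtered sample_sets (hypothetical change).
BROKEN = dict(CALLS, _prep_sample_sets_param=[])
print(find_cycle(BROKEN, "sample_sets"))
```

Since every function appears once in the loop, any one of the five dependencies is a candidate for the cut; which one is appropriate depends on which function can safely work from unfiltered data.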

leehart commented May 8, 2025

@ahernank @cclarkson I suspect there might be something wrong with the surveillance data, or something I don't understand:

  • AG1000G-X sample metadata has 298 lines (for 297 samples)
    • vo_agam_release_master_us_central1/v3/metadata/general/AG1000G-X/samples.meta.csv
  • AG1000G-X surveillance flags has 808 lines (for 807 samples)
    • vo_agam_release_master_us_central1/v3/metadata/general/AG1000G-X/surveillance.flags.csv

Shouldn't there be a one-to-one correspondence between these two files?

  • 1324-VO-ET-GOLASSA-VMF00257 metadata has 421 lines
    • vo_agam_release_master_us_central1/v3.13/metadata/general/1324-VO-ET-GOLASSA-VMF00257/samples.meta.csv
  • 1324-VO-ET-GOLASSA-VMF00257 surveillance flags has 426 lines
    • vo_agam_release_master_us_central1/v3.13/metadata/general/1324-VO-ET-GOLASSA-VMF00257/surveillance.flags.csv

I suspect the surveillance flags data includes all samples, rather than just the samples we released after QC.
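
A quick way to confirm this kind of mismatch is to compare the sample identifiers in the two files rather than the line counts. The sketch below uses only the standard library and invented toy data. The `sample_id` and `is_surveillance_use` column names are assumptions and may not match the real CSV headers.

```python
import csv
import io
from typing import Dict, Set


def sample_ids(csv_text: str, id_column: str = "sample_id") -> Set[str]:
    """Return the set of identifiers found in one CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column] for row in reader}


def compare_ids(metadata_csv: str, flags_csv: str, id_column: str = "sample_id") -> Dict[str, object]:
    """Report which identifiers appear in only one of the two files."""
    meta = sample_ids(metadata_csv, id_column)
    flags = sample_ids(flags_csv, id_column)
    return {
        "only_in_metadata": sorted(meta - flags),
        "only_in_flags": sorted(flags - meta),
        "in_both": len(meta & flags),
    }


# Toy data, invented for illustration only.
METADATA = "sample_id,country\nS1,Ghana\nS2,Mali\n"
FLAGS = "sample_id,is_surveillance_use\nS1,True\nS2,False\nS3,True\n"

report = compare_ids(METADATA, FLAGS)
print(report)
```

If the flags file was staged without QC filtering, the expectation would be an empty `only_in_metadata` list and a non-empty `only_in_flags` list, as in the toy example.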

Collaborator

ahernank commented May 8, 2025

Thanks @leehart. Yup, absolutely, these were staged directly without the QC filtering. This ties in with other bits that it would be good to address before moving forward, so I've opened https://github.com/malariagen/vector-ops/issues/2485. Should we tackle this over there?
