Add `unrestricted_use_only` and `surveillance_use_only` constructor params #724

leehart · 2025-02-07T10:51:19Z

Re: issue #716

…_releases property.

leehart · 2025-02-20T15:43:40Z

Opening this for a WIP review, to potentially avoid going too far down the wrong path.

@alimanfoo @cclarkson @ahernank @jonbrenas I'm trying to identify other functions that we need to filter results for, according to:

unrestricted_use_only, which I've so far applied to sample_sets(), and
surveillance_use_only, which I've so far applied to sample_metadata().

Those seem like the main ones, which some other functions use, but I want to make sure we plug all potential leaks.
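For reference, a minimal sketch of the intended behaviour, using plain Python records rather than the real DataFrames. The class name `ToyResource`, the toy data and the `unrestricted_use` field are assumptions for illustration only; `is_surveillance` is the flag used in the sample query elsewhere in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResource:
    """Hypothetical stand-in for the data resource class; not the real API."""

    unrestricted_use_only: bool = False
    surveillance_use_only: bool = False
    # Toy records; the field names are assumptions for illustration.
    _sample_sets: list = field(default_factory=lambda: [
        {"sample_set": "set-A", "unrestricted_use": True},
        {"sample_set": "set-B", "unrestricted_use": False},
    ])
    _samples: list = field(default_factory=lambda: [
        {"sample_id": "s1", "sample_set": "set-A", "is_surveillance": True},
        {"sample_id": "s2", "sample_set": "set-A", "is_surveillance": False},
        {"sample_id": "s3", "sample_set": "set-B", "is_surveillance": True},
    ])

    def sample_sets(self):
        rows = self._sample_sets
        if self.unrestricted_use_only:
            rows = [r for r in rows if r["unrestricted_use"]]
        return rows

    def sample_metadata(self):
        # Build on sample_sets() so the terms-of-use filter is inherited.
        allowed = {r["sample_set"] for r in self.sample_sets()}
        rows = [r for r in self._samples if r["sample_set"] in allowed]
        if self.surveillance_use_only:
            rows = [r for r in rows if r["is_surveillance"]]
        return rows
```

Any function that reads sample data through these two methods would inherit both filters; anything that reads files directly would not, which is the kind of leak to look for.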

leehart · 2025-02-25T14:33:13Z

To do: add _prep_sample_query_param()

…use_only

leehart · 2025-03-21T10:43:33Z

To do: check that all public functions honour the unrestricted_use_only and surveillance_use_only constructor params.

leehart · 2025-04-08T11:52:40Z

It looks like Ag3 currently has about 137 public methods... 🤔

leehart · 2025-04-08T12:11:43Z

It also looks like about 119 of Ag3's public methods cannot be called without specifying params (cannot rely on defaults), which makes the testing of these constructor params somewhat difficult to automate. This also smells a lot like a "god object" anti-pattern https://en.wikipedia.org/wiki/God_object

We should probably consider re-organising all of those methods, despite the inconvenience, but that will probably have to wait. In the meantime, I hope to be able to figure out which functions are vulnerable to leaking unfiltered data, relating to either the surveillance-only or unrestricted-use-only flags.
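Counts like these can be reproduced with the standard `inspect` module. The sketch below is one possible way to do it, shown on a hypothetical `Example` class rather than Ag3 itself; the helper names are made up for illustration.

```python
import inspect


def public_methods(cls):
    """Return {name: callable} for public callables on cls, including inherited ones."""
    return {
        name: member
        for name, member in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    }


def requires_arguments(func):
    """True if func has a parameter (other than self) without a default value."""
    for p in inspect.signature(func).parameters.values():
        if p.name == "self":
            continue
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            continue
        if p.default is inspect.Parameter.empty:
            return True
    return False


class Example:
    """Hypothetical class standing in for the real API."""

    def sample_sets(self, release=None): ...

    def snp_calls(self, region, sample_sets=None): ...

    def _private(self): ...


methods = public_methods(Example)
needs_args = sorted(n for n, f in methods.items() if requires_arguments(f))
```

Properties are not callable, so they are excluded by this approach and would need to be counted separately.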

leehart · 2025-04-08T14:01:47Z

@ahernank @jonbrenas Checkmarks indicate whether the function gets its data from an upstream public function, or file, or param, or otherwise looks covered. Unchecked functions indicate some doubt and require further investigation, discussion or coding.

aa_allele_frequencies - gets its data from snp_allele_frequencies
aa_allele_frequencies_advanced - gets its data from snp_allele_frequencies_advanced
add_extra_metadata - checks data from sample_metadata
aim_calls - uses _prep_sample_query_param and sample_metadata
aim_metadata - gets its data from samples.species_aim.csv
aim_variants - gets its data from aim_defs_{analysis}/{aims}.zarr
average_fst - uses snp_allele_counts
biallelic_diplotype_pairwise_distances - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
biallelic_diplotypes - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
biallelic_snp_calls - uses snp_allele_counts and snp_calls
biallelic_snps_to_plink - uses biallelic_snp_calls
clear_extra_metadata - is simply self._extra_metadata = []
cnv_coverage_calls - TO INVESTIGATE: uses _cnv_coverage_calls_dataset
cnv_discordant_read_calls - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_discordant_read_calls_dataset
cnv_hmm - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_hmm_dataset
cohort_diversity_stats - uses sample_metadata and snp_allele_counts
cohorts - gets its data from cohorts_{cohort_set}.csv
cohorts_metadata - gets its data from samples.cohorts.csv
count_samples - uses sample_metadata
cross_metadata - gets its data from crosses.fam
diplotype_pairwise_distances - uses _prep_sample_query_param
diversity_stats - TO INVESTIGATE: uses _setup_cohort_queries and cohort_diversity_stats not _prep_sample_query_param
fst_gwss - uses _prep_sample_query_param
g123_calibration - uses _prep_sample_query_param
g123_gwss - uses _prep_sample_query_param
gene_cnv - TO INVESTIGATE: uses _gene_cnv not _prep_sample_query_param
gene_cnv_frequencies - TO INVESTIGATE: uses _gene_cnv_frequencies not _prep_sample_query_param
gene_cnv_frequencies_advanced - TO INVESTIGATE: uses _gene_cnv_frequencies_advanced not _prep_sample_query_param
general_metadata - TO CONSIDER: gets its data from samples.meta.csv (NOT FILTERED)
geneset - uses genome_features
genome_feature_children - TO INVESTIGATE: uses _genome_features
genome_features - TO INVESTIGATE: uses _genome_features_for_contig and _genome_features
genome_sequence - TO INVESTIGATE: uses _genome_sequence_for_contig
h12_calibration - uses _prep_sample_query_param
h12_gwss - uses _prep_sample_query_param
h1x_gwss - uses _prep_sample_query_param
haplotype_pairwise_distances - uses _prep_sample_query_param
haplotype_sites - uses _haplotype_sites_for_region
haplotypes - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _haplotypes_for_contig
haplotypes_frequencies - uses sample_metadata and locate_cohorts and haplotypes and haplotype_frequencies
haplotypes_frequencies_advanced - uses sample_metadata and haplotypes and haplotype_frequencies
igv - uses _igv_config and igv_notebook (LOOKS COVERED)
ihs_gwss - TO INVESTIGATE: uses _prep_sample_query_param but also _ihs_gwss
is_accessible - uses genome_sequence and snp_sites and _site_filters_for_region
karyotype - TO INVESTIGATE: uses load_inversion_tags and snp_calls but not _prep_sample_query_param
load_inversion_tags - gets its data from self._inversion_tag_path
lookup_release - uses sample_sets
lookup_sample - uses sample_metadata
lookup_study - uses sample_sets
lookup_study_info - uses sample_sets
lookup_terms_of_use_info - uses sample_sets
njt - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
open_cnv_coverage_calls - gets its data from coverage_calls/{analysis}/zarr
open_cnv_discordant_read_calls - gets its data from cnv/{sample_set}/{calls_version}/zarr
open_cnv_hmm - gets its data from cnv/{sample_set}/hmm/zarr
open_file - simply self._fs.open(f"{self._base_path}/{path}")
open_genome - gets its data from {self._base_path}/{self._genome_zarr_path}
open_haplotype_sites - gets its data from snp_haplotypes/sites/{analysis}/zarr
open_haplotypes - gets its data from snp_haplotypes/{sample_set}/{analysis}/zarr
open_site_annotations - gets its data from {self._base_path}/{self._site_annotations_zarr_path}
open_site_filters - gets its data from site_filters/{self._site_filters_analysis}/{mask_prepped}/
open_snp_genotypes - gets its data from snp_genotypes/all/{sample_set}/
open_snp_sites - gets its data from snp_genotypes/all/sites/
pairwise_average_fst - TO INVESTIGATE: uses _setup_cohort_queries not _prep_sample_query_param
pca - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
plot_aim_heatmap - uses aim_calls and go_make_subplots
plot_cnv_hmm_coverage - uses plot_cnv_hmm_coverage_track and plot_genes
plot_cnv_hmm_coverage_track - uses cnv_hmm
plot_cnv_hmm_heatmap - uses plot_cnv_hmm_heatmap_track and plot_genes
plot_cnv_hmm_heatmap_track - uses cnv_hmm
plot_diplotype_clustering - uses sample_metadata and diplotype_pairwise_distances
plot_diplotype_clustering_advanced - TO INVESTIGATE: uses plot_diplotype_clustering but also _dipclust_het_bar_trace and _dipclust_cnv_bar_trace and _insert_dipclust_snp_trace and _dipclust_concat_subplots
plot_diversity_stats - gets its data from df_stats param
plot_frequencies_heatmap - gets its data from df param
plot_frequencies_interactive_map - gets its data from ds param
plot_frequencies_map_markers - gets its data from ds param
plot_frequencies_time_series - gets its data from ds param
plot_fst_gwss - uses plot_fst_gwss_track and plot_genes
plot_fst_gwss_track - uses fst_gwss
plot_g123_calibration - uses g123_calibration
plot_g123_gwss - uses plot_g123_gwss_track and plot_genes
plot_g123_gwss_track - uses g123_gwss
plot_genes - TO INVESTIGATE: uses genome_sequence and _plot_genes_setup_data
plot_h12_calibration - uses h12_calibration
plot_h12_gwss - uses plot_h12_gwss_track and plot_genes
plot_h12_gwss_multi_overlay - uses plot_h12_gwss_multi_overlay_track and plot_genes
plot_h12_gwss_multi_overlay_track - TO INVESTIGATE: uses h12_gwss and _setup_cohort_queries not _prep_sample_query_param
plot_h12_gwss_multi_panel - TO INVESTIGATE: uses plot_h12_gwss_track and plot_genes and _setup_cohort_queries not _prep_sample_query_param
plot_h12_gwss_track - uses h12_gwss
plot_h1x_gwss - uses plot_h1x_gwss_track and plot_genes
plot_h1x_gwss_track - uses h1x_gwss
plot_haplotype_clustering - TO INVESTIGATE: uses sample_metadata and haplotype_pairwise_distances but also _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly and plot_dendrogram
plot_haplotype_network - uses haplotypes and sample_metadata and median_joining_network and mjn_graph and plotly_discrete_legend (I expect _setup_sample_colors_plotly is OK)
plot_heterozygosity - uses plot_heterozygosity_track and plot_genes
plot_heterozygosity_track - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track
plot_ihs_gwss - uses plot_ihs_gwss_track and plot_genes
plot_ihs_gwss_track - uses ihs_gwss
plot_njt - uses njt and sample_metadata (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
plot_pairwise_average_fst - gets its data from fst_df param
plot_pca_coords - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
plot_pca_coords_3d - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
plot_pca_variance - gets its data from evr param (simple bar plot)
plot_roh - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track and _roh_hmm_predict and plot_roh_track and plot_genes
plot_roh_track - gets its data from df_roh param
plot_sample_location_geo - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_sample_location_mapbox - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_samples_bar - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_samples_interactive_map - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_snps - uses plot_snps_track and plot_genes
plot_snps_track - uses snp_allele_counts and snp_variants and genome_sequence
plot_transcript - uses genome_features and genome_feature_children
plot_xpehh_gwss - uses plot_xpehh_gwss_track and plot_genes
plot_xpehh_gwss_track - uses xpehh_gwss
read_files - gets its data from its paths param
results_cache_get - gets its data from results.zarr.zip or results.npz
results_cache_set - gets its data from its params and results params and writes params.json and results.zarr.zip
roh_hmm - TO INVESTIGATE: uses _sample_count_het and _roh_hmm_predict
sample_metadata - uses _prep_sample_query_param and general_metadata and sequence_qc_metadata and surveillance_flags and aim_metadata and cohorts_metadata
sample_sets - uses _read_sample_sets and _unrestricted_use_only
sequence_qc_metadata - gets its data from sequence_qc_stats.csv
site_annotations - uses _site_annotations_raw and snp_sites
site_filters - uses _site_filters_for_region
snp_allele_counts - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
snp_allele_frequencies - uses sample_metadata and snp_calls
snp_allele_frequencies_advanced - uses sample_metadata and snp_calls
snp_calls - uses _prep_sample_query_param and _snp_calls
snp_dataset - uses snp_calls
snp_effects - uses snp_variants and _snp_df_melt and _snp_effect_annotator
snp_genotype_allele_counts - uses snp_calls
snp_genotypes - uses _prep_sample_query_param and _snp_genotypes_for_contig
snp_sites - uses _snp_sites_for_region and site_filters
snp_variants - uses _snp_variants_for_contig
surveillance_flags - gets its data from surveillance.flags.csv
view_alignments - uses _igv_view_alignments_tracks and igv
wgs_data_catalog - TO CONSIDER: gets its data from wgs_snp_data.csv (NOT FILTERED)
wgs_run_accessions - TO CONSIDER: gets its data from wgs_accession_data.csv (NOT FILTERED)
xpehh_gwss - uses _prep_sample_query_param and _xpehh_gwss
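The list above is effectively a manual walk of the call graph. As a rough sketch, the same walk could be done mechanically from a hand-maintained mapping of function to callees: any function that transitively reaches a helper not yet reviewed is flagged. The mapping below is a small excerpt transcribed from the list, and the contents of the `reviewed` set are an assumption for illustration.

```python
def unreviewed_leaves(name, calls, reviewed, _seen=None):
    """Return the unreviewed leaf dependencies reachable from name.

    calls: dict mapping a function name to the names it uses.
    reviewed: names already judged to honour the constructor params.
    """
    seen = set() if _seen is None else _seen
    if name in seen or name in reviewed:
        return set()
    seen.add(name)
    callees = calls.get(name)
    if not callees:
        # No recorded callees: this is a leaf that still needs a decision.
        return {name}
    found = set()
    for callee in callees:
        found |= unreviewed_leaves(callee, calls, reviewed, seen)
    return found


calls = {
    "aa_allele_frequencies": ["snp_allele_frequencies"],
    "snp_allele_frequencies": ["sample_metadata", "snp_calls"],
    "snp_calls": ["_prep_sample_query_param", "_snp_calls"],
    "_snp_calls": ["sample_metadata"],
    "cnv_coverage_calls": ["_cnv_coverage_calls_dataset"],
    "plot_cnv_hmm_coverage_track": ["cnv_hmm"],
    "cnv_hmm": ["_prep_sample_query_param", "sample_metadata", "_cnv_hmm_dataset"],
}
reviewed = {"sample_metadata", "_prep_sample_query_param"}

to_investigate = {
    name: sorted(leaves)
    for name in calls
    if (leaves := unreviewed_leaves(name, calls, reviewed))
}
```

Note that reaching a reviewed function does not prove coverage on its own; it only narrows the set of helpers that still need a closer look.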

leehart · 2025-04-24T09:14:10Z

Now investigating unexpected test failures after merging with master branch.

Local pytest:

446 passed, 38 warnings, 224 errors

CI pytest:

474 passed, 36 warnings, 224 errors

jonbrenas · 2025-04-24T09:33:19Z

Thanks @leehart. For IGV to work, it needed to have access to a 'public' version of the reference genomes, e.g., at gs://vo_anoph_temp_us_central1/vo_afun_release/reference/genome/idAnoFuneDA-416_04. This, in turn, needed to be passed to the AnophelesBase constructor. The tests use simulated data so the link needs to be specified (but None should work).

leehart · 2025-04-24T10:35:54Z

Following discussion, the plan is now to reduce the scope of honouring these constructor params to only currently "documented" public functions (as per docs/source/Ag3.rst, Af1 and Amin1) and to change other undocumented functions to private functions, e.g. open_cnv_coverage_calls to _open_cnv_coverage_calls.
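A rough sketch of how the rename candidates could be listed, assuming the .rst files list one method name per indented line (for example under an autosummary directive). That layout is an assumption, and both helper names are hypothetical.

```python
import re


def documented_names(rst_text):
    """Collect indented lines that consist of a single identifier."""
    names = set()
    for line in rst_text.splitlines():
        match = re.fullmatch(r"\s+([A-Za-z_][A-Za-z0-9_]*)", line)
        if match:
            names.add(match.group(1))
    return names


def rename_candidates(public_names, rst_text):
    """Map each undocumented public name to its proposed private name."""
    documented = documented_names(rst_text)
    return {n: f"_{n}" for n in sorted(public_names) if n not in documented}


example_rst = """Basic data access
-----------------
.. autosummary::
    :toctree: generated/

    releases
    sample_sets
"""

candidates = rename_candidates(
    {"sample_sets", "open_cnv_coverage_calls"}, example_rst
)
```

Directive options such as `:toctree:` contain colons and so are not mistaken for method names.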

leehart · 2025-04-24T11:42:44Z

List of "documented" public functions/properties to check:

Note: I have crossed off functions/properties that should be OK once their sub-functions are OK, to help us focus on the root issues, but this doesn't imply that they are themselves "safe", only that any potential issues appear to be confined to their sub-functions.

Basic data access

releases - Do we want to filter these accordingly? E.g. only return releases that are unrestricted or have surveillance data. Note: this PR currently uses a new relevant_releases property to get the filtered set of releases, for tests, etc.
sample_sets - Honours unrestricted_use_only. Do we want to filter these according to surveillance_use_only? E.g. only return sample sets that have at least one surveillance sample when surveillance_use_only=True.
lookup_release - Only uses sample_sets, so when sample_sets is resolved, this should be too.
lookup_study - Only uses sample_sets, so when sample_sets is resolved, this should be too.
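To make the two open questions above concrete, here is a minimal sketch of one possible rule: keep a sample set only if it has at least one surveillance sample, then derive the releases from whatever sample sets remain. The record layout (`sample_set`, `release`, `is_surveillance` keys) and both function names are assumptions for illustration; the actual relevant_releases property in this PR may work differently.

```python
def sample_sets_with_surveillance(sample_sets, samples):
    """Keep sample sets that contain at least one surveillance sample."""
    keep = {s["sample_set"] for s in samples if s["is_surveillance"]}
    return [r for r in sample_sets if r["sample_set"] in keep]


def releases_of(sample_sets):
    """Releases that still have a sample set, in first-seen order."""
    return list(dict.fromkeys(r["release"] for r in sample_sets))


sample_sets = [
    {"sample_set": "set-A", "release": "3.0"},
    {"sample_set": "set-B", "release": "3.0"},
    {"sample_set": "set-C", "release": "3.1"},
]
samples = [
    {"sample_id": "s1", "sample_set": "set-A", "is_surveillance": True},
    {"sample_id": "s2", "sample_set": "set-B", "is_surveillance": False},
    {"sample_id": "s3", "sample_set": "set-C", "is_surveillance": False},
]

kept = sample_sets_with_surveillance(sample_sets, samples)
```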

Reference genome data access

contigs - This basically returns self.config["CONTIGS"] and doesn't look connected.
genome_sequence - This uses parse_single_region and _genome_sequence_for_contig and doesn't look connected.
genome_features - This uses _prep_gff_attributes, parse_multi_region, _genome_features_for_contig and _genome_features. _genome_features reads from f"{self._base_path}/{self._geneset_gff3_path}". _genome_features_for_contig uses genome_sequence, which looks unconnected. This doesn't look connected.
plot_transcript - This uses genome_features, which looks unconnected, and genome_feature_children. genome_feature_children uses _genome_features, which reads from f"{self._base_path}/{self._geneset_gff3_path}". This doesn't look connected.
plot_genes - This uses genome_sequence, which doesn't look connected, and _plot_genes_setup_data, which uses genome_features, which also doesn't look connected.

Sample metadata access

SNP data access

site_mask_ids - This basically just returns self.config.get("SITE_MASK_IDS"), so looks unconnected.
snp_calls - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _snp_calls, which uses sample_metadata and looks like it should be resolved when that's resolved too.
snp_allele_counts - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _snp_allele_counts, which uses snp_calls and looks like it should be resolved when that's resolved too.
plot_snps - This uses plot_snps_track and plot_genes. plot_snps_track uses snp_allele_counts, snp_variants and genome_sequence. snp_variants uses _snp_variants_for_contig, which uses genome_sequence, open_snp_sites and open_site_filters. open_snp_sites reads f"{self._base_path}/{self._major_version_path}/snp_genotypes/all/sites/" and open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which both look unconnected. It looks like this function should be resolved when its other subfunctions are resolved, i.e. plot_genes, snp_allele_counts and genome_sequence.
site_annotations - This uses (1) _site_annotations_raw and (2) snp_sites.
- (1) _site_annotations_raw uses (1b) open_site_annotations, which reads f"{self._base_path}/{self._site_annotations_zarr_path}", which looks unconnected.
- (2) snp_sites uses (2a) _snp_sites_for_region and (2b) site_filters.
  - (2a) _snp_sites_for_region uses (2a.i.) _snp_sites_for_contig.
    - (2a.i.) _snp_sites_for_contig uses (2a.i.1.) genome_sequence
  - (2b) site_filters uses (2b.i.) _site_filters_for_region, which uses (2b.i.1.) _site_filters_for_contig and (2b.i.2.) _snp_sites_for_contig.
    - (2b.i.1.) _site_filters_for_contig uses (2b.i.1.a.) open_site_filters.
      - (2b.i.1.a.) open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which looks unconnected.
    - (2b.i.2.) _snp_sites_for_contig uses (2b.i.2.i.) genome_sequence.
- It looks like this function should be resolved when genome_sequence is resolved.
is_accessible - This uses (1) genome_sequence, (2) snp_sites and (3) _site_filters_for_region.
- (2) snp_sites uses (2a) _snp_sites_for_region and (2b) site_filters.
  - (2a) _snp_sites_for_region uses (2a.i) _snp_sites_for_contig
    - (2a.i) _snp_sites_for_contig uses genome_sequence.
  - (2b) site_filters uses (2b.i) _site_filters_for_region.
    - (2b.i) _site_filters_for_region uses (2b.i.1.) _site_filters_for_contig and (2b.i.2) _snp_sites_for_contig.
      - (2b.i.1.) _site_filters_for_contig uses (2b.i.1.a.) open_site_filters.
        - (2b.i.1.a.) open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which looks unconnected.
      - (2b.i.2.) _snp_sites_for_contig uses (2b.i.2.a.) genome_sequence.
- (3) _site_filters_for_region uses (3a) _site_filters_for_contig and (3b) _snp_sites_for_contig.
  - (3a) _site_filters_for_contig uses (3a.i.) open_site_filters.
    - (3a.i.) open_site_filters reads f"{self._base_path}/{self._major_version_path}/site_filters/{self._site_filters_analysis}/{mask_prepped}/", which looks unconnected.
  - (3b) _snp_sites_for_contig uses (3b.i.) genome_sequence.
- It looks like this function should be resolved when genome_sequence is resolved.
biallelic_snp_calls - This uses snp_allele_counts, snp_calls. It looks like this function should be resolved when those are resolved.
biallelic_diplotypes - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _biallelic_diplotypes, which uses biallelic_snp_calls. It looks like this function should be resolved when biallelic_snp_calls is resolved.
biallelic_snps_to_plink - This uses biallelic_snp_calls, and looks like it should be resolved when that's resolved.

Haplotype data access

phasing_analysis_ids - This property basically just returns PHASING_ANALYSIS_IDS from the config as a tuple.
haplotypes - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _haplotypes_for_contig and sample_metadata. _haplotypes_for_contig uses genome_sequence. This function looks like it should be resolved when sample_metadata and genome_sequence are resolved.
haplotype_sites - This uses _haplotype_sites_for_region, which uses _haplotype_sites_for_contig, which uses genome_sequence. This function looks like it should be resolved when genome_sequence is resolved.

AIM data access

aim_ids - This basically just returns self._aim_ids.
aim_variants - This basically just reads f"{self._base_path}/reference/aim_defs_{analysis}/{aims}.zarr", which doesn't look connected to specific samples.
aim_calls - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _aim_calls_dataset and sample_metadata. _aim_calls_dataset reads from f"{self._base_path}/{release_path}/aim_calls_{analysis}/{sample_set}/{aims}.zarr". Although the data from _aim_calls_dataset contains data for all samples in the sample set, it looks like the code selects only the samples and sample metadata according to the sample query, via ds = ds.isel(samples=loc_samples). So this should be resolved when sample_metadata is resolved, but it is probably worth checking that the subselection is working as expected, i.e. that only surveillance_use_only samples are returned when required. In any case, there is no filtering for unrestricted_use_only, so it would currently be possible to get data (AIM calls) for terms-restricted sample sets regardless of the unrestricted_use_only setting. It might make sense to add filtering for unrestricted_use_only to the _prep_sample_sets_param function. We would have to handle cases where the user specifies sample sets that contradict the unrestricted_use_only setting, e.g. as errors or warnings.
plot_aim_heatmap - This uses aim_calls and looks like it should be resolved when that is resolved.

CNV data access

coverage_calls_analysis_ids - This basically just returns COVERAGE_CALLS_ANALYSIS_IDS from the config as a tuple.
cnv_hmm - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _cnv_hmm_dataset and sample_metadata. _cnv_hmm_dataset uses open_cnv_hmm, which reads from f"{self._base_path}/{release_path}/cnv/{sample_set}/hmm/zarr".

Similar to the issue with aim_calls, the data from _cnv_hmm_dataset contains data for all samples in the sample set, but it looks like the code selects only the samples and sample metadata according to the sample query, via ds = ds.isel(samples=loc_query_samples). So this should be resolved when sample_metadata is resolved, but it is probably worth checking that the subselection is working as expected, i.e. that only surveillance_use_only samples are returned when required.

Again, as with aim_calls, there is no filtering for unrestricted_use_only, so it would currently be possible to get data (CNV HMM data) for terms-restricted sample sets regardless of the unrestricted_use_only setting.

Again, this might be resolved by adding filtering for unrestricted_use_only to the _prep_sample_sets_param function, and we would have to handle cases where the user specifies sample sets that contradict the unrestricted_use_only setting, e.g. as errors or warnings.

cnv_coverage_calls - This uses _cnv_coverage_calls_dataset, which uses open_cnv_coverage_calls, which reads from f"{self._base_path}/{release_path}/cnv/{sample_set}/coverage_calls/{analysis}/zarr". There is no filtering for either surveillance_use_only or unrestricted_use_only, so it would currently be possible to get data (CNV coverage calls) for samples regardless of those settings.

Since the function takes a sample_set param (singular) rather than sample_sets (list or singular str), there isn't currently a _prep_sample_sets_param being used either. So we will need a plan for handling this, in particular for cases where the user specifies a sample set that contradicts the unrestricted_use_only setting, e.g. by raising errors or warnings.

cnv_discordant_read_calls - This uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _cnv_discordant_read_calls_dataset and sample_metadata. _cnv_discordant_read_calls_dataset uses open_cnv_discordant_read_calls, which reads from f"{self._base_path}/{release_path}/cnv/{sample_set}/{calls_version}/zarr".

This function has the same kind of issues as aim_calls and cnv_hmm.

We would want to check that the subselection via ds = ds.isel(samples=loc_query_samples) is working as expected when surveillance_use_only.
There is no filtering for unrestricted_use_only, so it would currently be possible to get data (CNV DRC data) for terms-restricted sample sets regardless of the unrestricted_use_only setting.
We might want to add filtering for unrestricted_use_only to the _prep_sample_sets_param function, and handle cases where the user specifies sample sets that contradict the unrestricted_use_only setting.
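The subselection check mentioned for aim_calls, cnv_hmm and cnv_discordant_read_calls could be expressed as a small test helper. This is only a sketch with plain lists standing in for the dataset's samples axis; in the real code the positions would be the ones passed to ds.isel(samples=...). Both function names are hypothetical.

```python
def locate_samples(dataset_sample_ids, allowed_sample_ids):
    """Positions on the dataset's sample axis whose ids are allowed."""
    allowed = set(allowed_sample_ids)
    return [i for i, sid in enumerate(dataset_sample_ids) if sid in allowed]


def leaked_samples(dataset_sample_ids, loc_samples, allowed_sample_ids):
    """Ids that a positional selection would return but the filter forbids."""
    allowed = set(allowed_sample_ids)
    return [
        dataset_sample_ids[i]
        for i in loc_samples
        if dataset_sample_ids[i] not in allowed
    ]


# The dataset holds every sample in the sample set, in its own order.
dataset_ids = ["s1", "s2", "s3", "s4"]
# Ids remaining in the filtered metadata (order may differ from the dataset).
surveillance_ids = ["s3", "s1"]

loc = locate_samples(dataset_ids, surveillance_ids)
```

A test along these lines would assert that `leaked_samples(...)` is empty for the positions each function actually uses when surveillance_use_only=True.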

plot_cnv_hmm_coverage - This uses plot_cnv_hmm_coverage_track and plot_genes. plot_cnv_hmm_coverage_track uses cnv_hmm. It looks like this function should be resolved when plot_genes and cnv_hmm are resolved.
plot_cnv_hmm_heatmap - This uses plot_cnv_hmm_heatmap_track and plot_genes. plot_cnv_hmm_heatmap_track uses cnv_hmm. It looks like this function should be resolved when plot_genes and cnv_hmm are resolved.
gene_cnv - This uses _gene_cnv, which uses genome_features and cnv_hmm. It looks like this function should be resolved when genome_features and cnv_hmm are resolved.

Integrative genomics viewer (IGV)

igv - This uses _igv_config, which basically just creates the IGV config and doesn't look directly connected to sample data.
view_alignments - This uses _igv_view_alignments_tracks and igv. _igv_view_alignments_tracks uses sample_metadata, wgs_data_catalog and _igv_site_filters_tracks.

sample_metadata is used to get the sample_set for the given sample id string, which is then used with wgs_data_catalog to get the alignments_bam and snp_genotypes_vcf URLs.
_igv_site_filters_tracks reads from f"{self._public_url}{self._major_version_path}/site_filters/{self._site_filters_analysis}/vcf/{site_mask}/{contig}_sitefilters.vcf.gz", which I don't believe is connected directly to sample data.

The end user (or other code) could try to pass a sample id to this function that was incompatible with the surveillance_use_only or unrestricted_use_only settings. The sample_metadata function (in this PR) uses _prep_sample_query_param and an additional mechanism to filter its data according to is_surveillance. However, there is currently no mechanism to honour the unrestricted_use_only setting. We might want to use the _prep_sample_sets_param function, which is also used by sample_metadata, to restrict sample_sets_prepped to those that honour the unrestricted_use_only setting.

In this case, the incompatible sample id would be provided to _igv_view_alignments_tracks but I expect we would want the self.sample_metadata().set_index("sample_id").loc[sample] part to fail, rather than to succeed regardless of the constructor params, and that scenario ought to be handled gracefully.
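One way to handle that scenario gracefully would be to replace the bare .loc lookup with an explicit check that raises an informative error. The sketch below uses plain records instead of a DataFrame, and the function name and error wording are assumptions.

```python
def sample_set_for(sample, filtered_metadata, active_filters):
    """Return the sample set for a sample id, or raise a descriptive error.

    filtered_metadata: records already filtered by the constructor params.
    active_filters: dict of constructor settings, used only in the message.
    """
    by_id = {row["sample_id"]: row["sample_set"] for row in filtered_metadata}
    if sample not in by_id:
        filters = ", ".join(f"{k}={v}" for k, v in active_filters.items())
        raise ValueError(
            f"Sample {sample!r} is not available with the current settings "
            f"({filters})."
        )
    return by_id[sample]


metadata = [{"sample_id": "s1", "sample_set": "set-A"}]
settings = {"surveillance_use_only": True, "unrestricted_use_only": False}
```

With this approach the caller sees which settings excluded the sample, rather than a bare KeyError from the index lookup.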

SNP and CNV frequency analysis

snp_allele_frequencies - This uses sample_metadata, snp_calls, _snp_df_melt, _snp_effect_annotator and _transcript_to_parent_name.

_transcript_to_parent_name uses genome_features and doesn't appear to be directly connected to sample data.
_snp_effect_annotator uses open_genome and genome_features, and doesn't appear to be directly connected to sample data.
_snp_df_melt basically just processes the data given to it via ds_snp and doesn't appear to be directly connected to sample data.

It looks like this function should be resolved when sample_metadata and snp_calls are resolved.

Principal components analysis (PCA)

pca - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _pca and sample_metadata. _pca uses biallelic_diplotypes, which also uses _prep_sample_selection_cache_params as well as _biallelic_diplotypes. _biallelic_diplotypes uses biallelic_snp_calls, which uses snp_allele_counts and snp_calls.

This function should be partly resolved when snp_allele_counts, snp_calls and sample_metadata are resolved. However, as with other functions, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting. The use of _prep_sample_query_param only enforces the surveillance_use_only setting.

Since the _prep_sample_selection_cache_params function uses both _prep_sample_query_param and _prep_sample_sets_param, it looks like amending the _prep_sample_sets_param function to honour the unrestricted_use_only setting might be the way to go.

plot_pca_variance - This uses the explained variance data passed to it via its evr param to basically just produce a bar plot. This function doesn't look directly connected to sample data.
plot_pca_coords - This uses the data passed to it via its data param. This function also uses _setup_sample_symbol, _setup_sample_colors_plotly and _setup_sample_hover_data_plotly. This function and the functions it uses don't look directly connected to sample data.
plot_pca_coords_3d - This uses the data passed to it via its data param. This function also uses _setup_sample_symbol, _setup_sample_colors_plotly and _setup_sample_hover_data_plotly. This function and the functions it uses don't look directly connected to sample data.

Genetic distance and neighbour-joining trees (NJT)

plot_njt - This uses njt and sample_metadata. This function also uses _setup_sample_symbol, _setup_sample_colors_plotly and _setup_sample_hover_data_plotly, which don't look directly connected to sample data. It looks like this function should be resolved when sample_metadata and njt are resolved.
njt - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _njt, which uses biallelic_diplotype_pairwise_distances.

This function should be partly resolved when biallelic_diplotype_pairwise_distances is resolved. However, as with other functions, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting. The use of _prep_sample_query_param only enforces the surveillance_use_only setting.

biallelic_diplotype_pairwise_distances - This uses _prep_sample_selection_cache_params, which uses _prep_sample_query_param, which should honour surveillance_use_only. This function also uses _biallelic_diplotype_pairwise_distances, which uses biallelic_diplotypes.

This function should be partly resolved when biallelic_diplotypes is resolved. However, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting.

I propose that we look at changing the _prep_sample_sets_param function to honour the unrestricted_use_only setting, which will then be used by _prep_sample_selection_cache_params, and it can potentially be used alongside _prep_sample_query_param elsewhere, wherever the sample_sets param appears.

honour unrestricted_use_only in _prep_sample_sets_param
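A minimal sketch of what that could look like, as a standalone function. The signature, the `on_conflict` option and the error wording are assumptions, and the sketch ignores release identifiers, which the real function may also accept.

```python
import warnings


def prep_sample_sets_param(sample_sets, available, unrestricted, *,
                           unrestricted_use_only=False, on_conflict="error"):
    """Normalise sample_sets to a list and apply the terms-of-use restriction.

    available: all known sample set identifiers, in order.
    unrestricted: identifiers whose terms of use are unrestricted.
    on_conflict: "error" or "warn" when an explicit request is blocked.
    """
    if on_conflict not in ("error", "warn"):
        raise ValueError("on_conflict must be 'error' or 'warn'")

    explicit = sample_sets is not None
    if sample_sets is None:
        requested = list(available)
    elif isinstance(sample_sets, str):
        requested = [sample_sets]
    else:
        requested = list(sample_sets)

    unknown = [s for s in requested if s not in available]
    if unknown:
        raise ValueError(f"Unknown sample sets: {unknown}")

    if not unrestricted_use_only:
        return requested

    blocked = [s for s in requested if s not in unrestricted]
    if blocked and explicit:
        message = (
            "Sample sets not available with unrestricted_use_only=True: "
            f"{blocked}"
        )
        if on_conflict == "error":
            raise ValueError(message)
        warnings.warn(message, stacklevel=2)
    # Defaults are filtered silently; explicit requests were handled above.
    return [s for s in requested if s not in blocked]
```

Because a single string is accepted as well as a list, the same check could also serve functions that take a singular sample_set param, such as cnv_coverage_calls.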

By the way, I suspect that we will probably need a smarter mechanism for _prep_sample_query_param, which currently just blindly tags on f"{sample_query} and is_surveillance == True", which could cause an absurd but otherwise benign accumulation of duplicated query criteria.

avoid duplications caused by _prep_sample_query_param
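One possible mechanism, sketched below, is to mark a query as already prepared using a `str` subclass, so that repeated calls become no-ops without any string parsing. The `PreparedQuery` type is hypothetical. The sketch also wraps the user's query in parentheses, since `and` binds more tightly than `or` and an unparenthesised `a or b and is_surveillance == True` would not restrict the `a` branch.

```python
SURVEILLANCE_CLAUSE = "is_surveillance == True"


class PreparedQuery(str):
    """Marker type: a query that already includes the surveillance clause."""


def prep_sample_query_param(sample_query, surveillance_use_only):
    """Append the surveillance clause at most once."""
    if not surveillance_use_only or isinstance(sample_query, PreparedQuery):
        return sample_query
    if sample_query is None or not sample_query.strip():
        return PreparedQuery(SURVEILLANCE_CLAUSE)
    return PreparedQuery(f"({sample_query}) and {SURVEILLANCE_CLAUSE}")


once = prep_sample_query_param("country == 'Ghana' or year > 2015", True)
twice = prep_sample_query_param(once, True)
```

The marker is lost if the query is serialised and reloaded as a plain string (for example via cached params), so this would only deduplicate within a single call chain.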

Heterozygosity analysis

plot_heterozygosity - This uses plot_heterozygosity_track and plot_genes. plot_heterozygosity_track uses (1) _sample_count_het and (2) _plot_heterozygosity_track.
- (1) _sample_count_het uses (1.a.) lookup_sample and (1.b.) snp_calls.
  - (1.a.) lookup_sample uses sample_metadata.
- (2) _plot_heterozygosity_track uses genome_sequence.

It looks like this function should be resolved when plot_genes, snp_calls and genome_sequence are resolved.

This function can take sample (actually samples) and sample_set (singular), so needs to respond appropriately if the settings given to the constructor contradict them.

roh_hmm - This uses (1) _sample_count_het and (2) _roh_hmm_predict.
- (1) _sample_count_het uses (1.a.) lookup_sample and (1.b.) snp_calls.
  - (1.a.) lookup_sample uses sample_metadata.
- (2) _roh_hmm_predict is passed a sample_id but otherwise doesn't look directly connected to sample data. Since this is already a private function, I don't think we need to throw an error if the given sample_id isn't compatible with the constructor params.

It looks like this function should be resolved when snp_calls and sample_metadata are resolved.

plot_roh - This uses (1) _sample_count_het, (2) _plot_heterozygosity_track, (3) _roh_hmm_predict, (4) plot_roh_track and (5) plot_genes.
- (1) _sample_count_het uses (1.a.) lookup_sample and (1.b.) snp_calls.
  - (1.a.) lookup_sample uses sample_metadata.
- (2) _plot_heterozygosity_track uses genome_sequence.
- (3) _roh_hmm_predict is passed a sample_id but is private, as mentioned above.
- (4) plot_roh_track uses genome_sequence.

It looks like this function should be resolved when plot_genes, snp_calls, sample_metadata and genome_sequence are resolved.

Diversity analysis

cohort_diversity_stats - This uses sample_metadata, snp_allele_counts and _block_jackknife_cohort_diversity_stats. _block_jackknife_cohort_diversity_stats doesn't look directly connected to sample data.

It looks like this function should be resolved when sample_metadata and snp_allele_counts are resolved.

diversity_stats - This uses _setup_cohort_queries and cohort_diversity_stats. _setup_cohort_queries uses sample_metadata, but also uses sample_query in other ways that look like they need closer inspection.

This function should be partly resolved when cohort_diversity_stats is resolved. However, _setup_cohort_queries does not use _prep_sample_query_param directly and needs to be checked more thoroughly for potential leaks. Plus, there is currently no protection against passing in sample set identifiers that contradict the unrestricted_use_only setting.

Check _setup_cohort_queries more thoroughly.
plot_diversity_stats - This uses _setup_sample_colors_plotly, which does not look directly connected to sample data. This function also uses the DataFrame passed to it via its df_stats param, and doesn't look directly connected to sample data.

Genome-wide selection scans

h12_calibration - This uses _prep_sample_query_param, which should honour surveillance_use_only. However, we might also want to modify _prep_sample_sets_param to honour unrestricted_use_only too. This function also uses _h12_calibration, which uses haplotypes. It looks like this function should be resolved when haplotypes and _prep_sample_sets_param are resolved.
plot_h12_calibration - This uses h12_calibration, and looks like it should be resolved when that is resolved.
h12_gwss - This uses _prep_sample_sets_param and _prep_sample_query_param. This function also uses _h12_gwss, which uses haplotypes. It looks like this function should be resolved when haplotypes and _prep_sample_sets_param are resolved.
plot_h12_gwss - This uses plot_h12_gwss_track and plot_genes. plot_h12_gwss_track uses h12_gwss. It looks like this function should be resolved when plot_genes and h12_gwss are resolved.
plot_h12_gwss_multi_panel - This uses (1) _setup_cohort_queries, (2) plot_h12_gwss_track and (3) plot_genes.
- (1) _setup_cohort_queries uses sample_metadata but also requires closer inspection.
- (2) plot_h12_gwss_track uses (2.a.) h12_gwss and (2.b.) _bokeh_style_genome_xaxis.
  - (2.b.) _bokeh_style_genome_xaxis doesn't look like it's directly connected to sample data.

It looks like this function should be resolved when _setup_cohort_queries, plot_genes, sample_metadata and h12_gwss are resolved.

plot_h12_gwss_multi_overlay - This uses plot_h12_gwss_multi_overlay_track and plot_genes. plot_h12_gwss_multi_overlay_track uses _setup_cohort_queries, h12_gwss and _bokeh_style_genome_xaxis. _bokeh_style_genome_xaxis doesn't look like it's directly connected to sample data.

This function should be partly resolved when plot_genes is resolved. However, as mentioned previously, _setup_cohort_queries does not use _prep_sample_query_param directly and needs to be checked more thoroughly for potential leaks. Plus, there is currently no protection against passing in sample set identifiers (via its sample_sets param) that contradict the unrestricted_use_only setting.
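
One possible form of the missing protection is to reject contradictory sample set identifiers outright, rather than silently dropping them. This is only a sketch: check_sample_sets_allowed and its arguments are hypothetical, and whether raising is the right response is still an open design question here.

```python
def check_sample_sets_allowed(requested, unrestricted_sets, unrestricted_use_only):
    """Hypothetical guard: reject sample sets that contradict the setting.

    requested: sample set identifiers passed by the caller.
    unrestricted_sets: identifiers flagged for unrestricted use.
    """
    if not unrestricted_use_only:
        # Nothing to enforce.
        return list(requested)
    disallowed = [s for s in requested if s not in unrestricted_sets]
    if disallowed:
        raise ValueError(
            "Sample sets not available with unrestricted_use_only=True: "
            + ", ".join(disallowed)
        )
    return list(requested)
```

Raising makes the contradiction visible to the user, at the cost of breaking calls that would otherwise return a reduced result.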

Haplotype clustering and network analysis

plot_haplotype_clustering - This uses sample_metadata, haplotype_pairwise_distances, _setup_sample_symbol, _setup_sample_colors_plotly, _setup_sample_hover_data_plotly and plot_dendrogram. plot_dendrogram, _setup_sample_symbol and _setup_sample_colors_plotly don't look directly connected to sample data. _setup_sample_hover_data_plotly doesn't look directly connected to sample data, although it does have the string sample_id hard-coded. It looks like this function should be resolved when sample_metadata and haplotype_pairwise_distances are resolved.
plot_haplotype_network - This uses haplotypes, sample_metadata, median_joining_network, _setup_sample_colors_plotly, mjn_graph and plotly_discrete_legend. median_joining_network, _setup_sample_colors_plotly, mjn_graph and plotly_discrete_legend don't look directly connected to sample data. It looks like this function should be resolved when haplotypes and sample_metadata are resolved.
haplotype_pairwise_distances - This uses _prep_sample_sets_param, _prep_sample_query_param and _prep_region_cache_param. This function also uses _haplotype_pairwise_distances, which uses haplotypes. It looks like this function should be resolved when _prep_sample_sets_param and haplotypes are resolved.

Diplotype clustering

plot_diplotype_clustering - This uses (1) sample_metadata, (2) diplotype_pairwise_distances, (3) _setup_sample_symbol, (4) _setup_sample_colors_plotly, (5) _setup_sample_hover_data_plotly and (6) plot_dendrogram.
- (2) diplotype_pairwise_distances uses (2.a.) _prep_sample_sets_param, (2.b.) _prep_sample_query_param and (2.c.) _diplotype_pairwise_distances.
  - (2.c.) _diplotype_pairwise_distances uses snp_calls.
- (3) _setup_sample_symbol and (4) _setup_sample_colors_plotly don't look directly connected to sample data. (5) _setup_sample_hover_data_plotly doesn't look directly connected to sample data, although it does have the string sample_id hard-coded.
- (6) plot_dendrogram doesn't look directly connected to sample data.

It looks like this function should be resolved when sample_metadata, _prep_sample_sets_param and snp_calls are resolved.

plot_diplotype_clustering_advanced - This uses plot_diplotype_clustering_advanced (itself), (1) _dipclust_het_bar_trace, (2) _dipclust_cnv_bar_trace, (3) _insert_dipclust_snp_trace and (4) _dipclust_concat_subplots.
- (1) _dipclust_het_bar_trace uses snp_calls.
- (2) _dipclust_cnv_bar_trace uses gene_cnv.
- (3) _insert_dipclust_snp_trace uses (3.a.) _dipclust_snp_trace.
  - (3.a.) _dipclust_snp_trace uses (3.a.i.) snp_genotype_allele_counts.
    - (3.a.i.) snp_genotype_allele_counts uses (3.a.i.1.) snp_calls, (3.a.i.2) _snp_df_melt and (3.a.i.3.) _snp_effect_annotator.
      - (3.a.i.2) _snp_df_melt uses the Dataset from its ds_snp param and doesn't look directly connected to sample data.
      - (3.a.i.3.) _snp_effect_annotator uses open_genome and genome_features, and doesn't appear to be directly connected to sample data.
- (4) _dipclust_concat_subplots has params sample_sets and sample_query but only uses those for titles and doesn't look directly connected to sample data.

It looks like this function should be resolved when snp_calls and gene_cnv are resolved.

Fst analysis

average_fst - This uses snp_allele_counts and looks like it should be resolved when that's resolved.
pairwise_average_fst - This uses _setup_cohort_queries and average_fst. _setup_cohort_queries is already on the list for further investigation. It looks like this function should be resolved when _setup_cohort_queries and average_fst are resolved.
plot_pairwise_average_fst - This uses the DataFrame from its fst_df param and doesn't look directly connected to sample data.
fst_gwss - This uses _prep_sample_query_param, _prep_sample_sets_param, _prep_optional_site_mask_param and _fst_gwss. _fst_gwss uses snp_allele_counts and snp_sites. It looks like this function should be resolved when _prep_sample_sets_param, snp_allele_counts and snp_sites are resolved.
plot_fst_gwss - This uses plot_fst_gwss_track and plot_genes. plot_fst_gwss_track uses fst_gwss. It looks like this function should be resolved when plot_genes and fst_gwss are resolved.

Inversion karyotypes

karyotype - This uses load_inversion_tags and snp_calls. load_inversion_tags basically just reads from self._inversion_tag_path. It looks like this function should be resolved when snp_calls is resolved.

leehart · 2025-04-29T13:39:26Z

In summary, the following publicly-documented functions (about 20, out of about 100) need further investigation, discussion, decision or resolution, with regards to compliance with the two new constructor params:

releases - Reprogrammed accordingly.
sample_sets - Reprogrammed accordingly.
add_extra_metadata - We wanted to check this to ensure that it was not possible to add samples (via the data param) that are incompatible with either the surveillance_use_only or the unrestricted_use_only settings.
- Each set of _extra_metadata is added during sample_metadata through a merge using a left join, so it should only be possible to add columns to samples that already exist in the df_samples DataFrame at that point. However, df_samples is based on general_metadata and the merge happens before the _prep_sample_query_param is applied.
- general_metadata uses _prep_sample_sets_param, which uses _relevant_sample_sets (in this PR), so that should honour both unrestricted_use_only and surveillance_use_only with respect to sample sets. At the sample level, the application of _prep_sample_query_param should apply the is_surveillance == True injected query criterion.
- It is still possible (in this PR) to add non-surveillance samples to the _extra_metadata mechanism when using surveillance_use_only, but those will still be filtered out when using functions such as sample_metadata, via the _prep_sample_query_param mechanism.
- When using unrestricted_use_only, attempting to add metadata for samples from a sample set that is restricted (via add_extra_metadata) will raise a ValueError, as per existing code, because none of the samples in the restricted set will match the samples returned by sample_metadata. However, if there is just one "unrestricted" sample in that set of data, then this error will not be raised. Instead, in this PR, the _prep_sample_sets_param, which is used in functions such as sample_metadata, will prevent irrelevant sample sets and their samples being included in the returned data, via use of _relevant_sample_sets.
cross_metadata - This used to return data for crosses regardless of unrestricted_use_only or surveillance_use_only, even though the crosses are (judging by AG1000G-X/surveillance.flags.csv) not for surveillance use.
- For when _unrestricted_use_only is set True, I've added code to check whether the AG1000G-X sample set is marked for unrestricted use, and return an empty DataFrame if it isn't.
- For when _surveillance_use_only is set True, I've added code to use the surveillance flags for the AG1000G-X sample set to only return data for samples that have is_surveillance set True.
wgs_data_catalog - This currently just reads wgs_snp_data.csv without filtering. To be consistent, I reckon we want to filter out any samples or sample sets that don't match the given unrestricted_use_only or surveillance_use_only settings.
- For when _unrestricted_use_only is set True, I've added code to check whether the specified sample set is marked for unrestricted use, and return an empty DataFrame if it isn't. This is currently an extra safeguard, since wgs_data_catalog uses lookup_release(sample_set), which uses sample_sets, which already filters on unrestricted_use.
- For when _surveillance_use_only is set True, I've added code to use the surveillance flags for the specified sample set to only return data for samples that have is_surveillance set True.
cohorts - Although the cohorts_{cohort_set}.csv returned doesn't specify particular samples or sample sets, there is a cohort_size hard-coded into the data file, which I guess might be wrong, depending on the specified settings.

When _unrestricted_use_only is set True, or when _surveillance_use_only is set True, it seems (ideally) that the reported cohort_size should change accordingly, e.g. to not count samples that have is_surveillance as False, and/or to not count samples from sets that have unrestricted_use as False.

Since the cohorts function isn't used internally for anything (outside of tests), and changing this would require a re-release of all of the cohorts data, with a re-engineering or removal of the cohort_size values, the risk of keeping this and the benefit to changing this seem low. However, we might want to consider adding a caveat to the description for the cohort_size, which currently simply reads "the number of samples in the cohort".
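
To show why a hard-coded cohort_size can disagree with the filtered data, here is a small sketch that recomputes the size from sample-level rows. The DataFrame contents are made up, and the cohort column name and the cohort_sizes helper are assumptions for illustration.

```python
import pandas as pd

# Made-up sample-level data: one row per sample.
df_samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4", "s5"],
        "cohort": ["c1", "c1", "c1", "c2", "c2"],
        "is_surveillance": [True, False, True, True, True],
    }
)


def cohort_sizes(df, surveillance_use_only=False):
    """Count samples per cohort, optionally keeping surveillance samples only."""
    if surveillance_use_only:
        df = df[df["is_surveillance"]]
    return df.groupby("cohort").size().rename("cohort_size")


sizes_all = cohort_sizes(df_samples)
sizes_surveillance = cohort_sizes(df_samples, surveillance_use_only=True)
```

In this toy data, cohort c1 has 3 samples overall but only 2 once non-surveillance samples are excluded, which is the kind of mismatch a caveat on cohort_size would need to cover.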

Many of the remaining functions (about 80) depend on the functions above being updated before they will behave according to plan.
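
The add_extra_metadata reasoning in the summary above rests on how a left join behaves, so a small pandas sketch may help. The data is made up, and the column names other than sample_id and is_surveillance are assumptions.

```python
import pandas as pd

# Stand-in for df_samples as built from general_metadata.
df_samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2"],
        "is_surveillance": [True, False],
    }
)

# Stand-in for user-supplied extra metadata, including an unknown sample s3.
df_extra = pd.DataFrame(
    {
        "sample_id": ["s2", "s3"],
        "my_label": ["x", "y"],
    }
)

# A left join keeps exactly the rows of df_samples, so s3 is not added.
df_merged = df_samples.merge(df_extra, on="sample_id", how="left")

# The surveillance criterion is applied after the merge.
df_filtered = df_merged.query("is_surveillance == True")
```

The merge can only add columns to existing rows, and the later query removes s2 even though extra metadata was supplied for it.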

codecov · 2025-04-29T16:46:05Z

Codecov Report

Attention: Patch coverage is 93.67089% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.06%. Comparing base (0197957) to head (23e9012).
Report is 41 commits behind head on master.

Files with missing lines	Patch %	Lines
malariagen_data/anoph/sample_metadata.py	90.00%	4 Missing ⚠️
malariagen_data/anoph/base.py	92.85%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #724      +/-   ##
==========================================
- Coverage   96.13%   96.06%   -0.08%     
==========================================
  Files          47       47              
  Lines        4683     4749      +66     
==========================================
+ Hits         4502     4562      +60     
- Misses        181      187       +6


leehart · 2025-05-06T11:32:01Z

Currently trying to resolve a recursion error relating to the sample_sets, releases and surveillance_flags functions.

Basically, for example, sample_sets needs releases, which needs surveillance_flags, which needs _parse_metadata_paths, which needs _prep_sample_sets_param, which needs sample_sets, ad infinitum.
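
One generic way to break a cycle like this is to have the lower-level loaders depend on a private, unfiltered listing instead of the public, filtered function. The toy class below only models the shape of the problem; ToyApi and all of its methods are hypothetical and this is not necessarily how the PR resolves it.

```python
class ToyApi:
    """Toy model of the dependency chain; not the real class."""

    def __init__(self, unrestricted_use_only=False):
        self._unrestricted_use_only = unrestricted_use_only
        # Stand-in for a manifest read from storage: set -> unrestricted_use flag.
        self._manifest = {"set-A": True, "set-B": False}

    def _raw_sample_sets(self):
        # Private and unfiltered: reads the manifest and calls nothing else,
        # so it can sit at the bottom of the call chain.
        return list(self._manifest)

    def _prep_sample_sets(self, sample_sets=None):
        # Validates against the raw list instead of the public sample_sets(),
        # which is what removes the loop.
        known = self._raw_sample_sets()
        if sample_sets is None:
            return known
        unknown = [s for s in sample_sets if s not in known]
        if unknown:
            raise ValueError(f"Unknown sample sets: {unknown}")
        return list(sample_sets)

    def _flags(self, sample_sets=None):
        # Lower-level loader: depends only on _prep_sample_sets.
        return {s: self._manifest[s] for s in self._prep_sample_sets(sample_sets)}

    def sample_sets(self):
        # Public and filtered: applies the constructor setting on top.
        flags = self._flags()
        if self._unrestricted_use_only:
            return [s for s, ok in flags.items() if ok]
        return list(flags)
```

The trade-off is that filtering then has to be applied explicitly at the public layer, since the private helpers no longer inherit it.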

leehart · 2025-05-08T15:40:47Z

@ahernank @cclarkson I suspect there might be something wrong with the surveillance data, or something I don't understand:

AG1000G-X sample metadata has 298 lines (for 297 samples)
- vo_agam_release_master_us_central1/v3/metadata/general/AG1000G-X/samples.meta.csv
AG1000G-X surveillance flags has 808 lines (for 807 samples)
- vo_agam_release_master_us_central1/v3/metadata/general/AG1000G-X/surveillance.flags.csv

Shouldn't there be a one-to-one correspondence between these two files?

1324-VO-ET-GOLASSA-VMF00257 metadata has 421 lines
- vo_agam_release_master_us_central1/v3.13/metadata/general/1324-VO-ET-GOLASSA-VMF00257/samples.meta.csv
1324-VO-ET-GOLASSA-VMF00257 surveillance flags has 426 lines
- vo_agam_release_master_us_central1/v3.13/metadata/general/1324-VO-ET-GOLASSA-VMF00257/surveillance.flags.csv

I suspect the surveillance flags data is including all samples instead of just the samples we released after QC.
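
A quick way to test that suspicion is an outer merge with an indicator column, which lists the samples present in only one of the two files. The sketch uses small inline CSV strings rather than the real files, and it assumes both files share a sample_id column.

```python
import io

import pandas as pd

# Inline stand-ins for samples.meta.csv and surveillance.flags.csv.
samples_csv = "sample_id,country\ns1,Ghana\ns2,Mali\n"
flags_csv = "sample_id,is_surveillance\ns1,True\ns2,False\ns3,True\n"

df_meta = pd.read_csv(io.StringIO(samples_csv))
df_flags = pd.read_csv(io.StringIO(flags_csv))

# indicator=True adds a _merge column: left_only, right_only or both.
df_check = df_meta.merge(df_flags, on="sample_id", how="outer", indicator=True)

only_in_flags = df_check.loc[df_check["_merge"] == "right_only", "sample_id"].tolist()
only_in_meta = df_check.loc[df_check["_merge"] == "left_only", "sample_id"].tolist()
```

If the flags file includes samples that were dropped at QC, they would show up in only_in_flags, while a true one-to-one correspondence would leave both lists empty.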

ahernank · 2025-05-08T19:02:37Z

Thanks @leehart. Yup, absolutely, these were staged directly without the QC filtering. This ties in with other bits that it would be good to address to move forward. I've opened https://github.com/malariagen/vector-ops/issues/2485 -- should we tackle this over there?

WIP: dev support for unrestricted_use_only, surveillance_use_only

fc6c2ba

leehart changed the title ~~Support unrestricted_use_only and surveillance_use_only~~ Support unrestricted_use_only and surveillance_use_only constructor params Feb 14, 2025

leehart added 4 commits February 18, 2025 14:40

Add test sample sets for Af1 with unrestricted_use_only. Add relevant…

dfdd4e2

…_releases property.

Update comment re skipping test due to lack of relevant fixtures

1b6cbb9

Add surveillance flags to sample_metadata(). Add tests.

02921c9

Merge branch 'master' into GH716_add_constructor_params

d4af40a

leehart marked this pull request as ready for review February 20, 2025 15:36

leehart requested review from alimanfoo, cclarkson, ahernank and jonbrenas February 20, 2025 15:43

leehart marked this pull request as draft March 18, 2025 09:23

leehart added 3 commits March 18, 2025 17:32

Merge branch 'master' into GH716_add_constructor_params

d4e7e70

WIP: add _prep_sample_query_param() stub where _prep_sample_set_param()

19902b0

Add logic to _prep_sample_query_param() to honour self._surveillance_…

d7b8383

…use_only

leehart added 4 commits March 21, 2025 16:25

Merge branch 'master' into GH716_add_constructor_params

de0daf8

Allow _prep_sample_query_param() to return None

435e8a7

Return consistent data type from _prep_sample_query_param()

bde3d4e

Merge branch 'master' into GH716_add_constructor_params

bfed3f4

Merge branch 'master' into GH716_add_constructor_params

78d26d1

leehart changed the title ~~Support unrestricted_use_only and surveillance_use_only constructor params~~ Add unrestricted_use_only and surveillance_use_only constructor params Apr 24, 2025

Merge branch 'master' into GH716_add_constructor_params

6396126

Add new public_url param to sample_metadata tests

50b3f5c

Merge branch 'master' into GH716_add_constructor_params

23e9012

Merge branch 'master' into GH716_add_constructor_params

ea950fc

jonbrenas mentioned this pull request May 20, 2025

Fix/fake sample count het #779

Open

leehart added 4 commits May 23, 2025 12:33

WIP: dev support for unrestricted_use_only, surveillance_use_only params

fdebfd4

Merge branch 'master' into GH716_add_constructor_params

62a848e

WIP: amend data types

d125707

Add doc for _surveillance_flags sample_sets param

a9f44c4


