mouse_STR

Relative Path: analyses/data/mouse_STR Date Added: <2026-04-07 Tue>

Using NGS for STR profiling isn’t the norm, and information on the standard 18 mouse loci (specifically the STR motifs and genomic coordinates) isn’t readily accessible. The files below were collected from the following sources (genome assembly in parentheses):
1. mouse_STR_genomic_locations_patent.csv: Fig. 1 of USRE49835E1 - Mouse cell line authentication - Google Patents (GRCm38)
2. mouse_STR_loci_info.csv: Table 3 of Authentication of Human and Mouse Cell Lines by Short Tandem Repeat (STR) DNA Genotype Analysis - Assay Guidance Manual (GRCm39)
3. mouse_STR_primers.csv: Table 11 of Interlaboratory study to validate a STR profiling method for intraspecies identification of mouse cell lines (GRCm38)
4. mouse_STR_sequencing_primers.csv: Table S2 of Short tandem repeat profiling via next-generation sequencing for cell line authentication (GRCm38)
5. mouse_STR_multiplex_pcr_primers.csv: Table 7 of the Assay Guidance Manual (GRCm39)

tcr_imgt_c

Relative Path: analyses/data/tcr_imgt_c Date Added: <2026-03-18 Wed> Source: IMGT

IMGT reference sequences for human TCR C chains to use for TCR constructs
- For TRAC, the C-REGION (which contains all of the cDNA) was extracted from the X02592 accession and concatenated with a stop codon
- No cDNA pages were available for TRBC1 or TRBC2, so for each allele the exon sequences EX1-EX4 were obtained and concatenated in order, followed by a stop codon
Links
- TRAC
- TRBC1
- TRBC2
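
The exon concatenation described above can be sketched in a few lines of Python. This is a minimal illustration, not the script actually used: the function name `build_constant_cds`, the toy exon strings, and the choice of `TGA` as the stop codon are all assumptions. No reading-frame check is made, since the constant region may not begin on a codon boundary.

```python
def build_constant_cds(exons, stop_codon="TGA"):
    """Join exon sequences in order and append a stop codon.

    `exons` is an ordered list of nucleotide strings (e.g. EX1..EX4).
    The stop codon default is an assumption for illustration only.
    """
    cds = "".join(seq.strip().upper() for seq in exons)
    unexpected = set(cds) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return cds + stop_codon


# Toy input, not real IMGT sequence
print(build_constant_cds(["acg", "TT"]))
```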

TCR_VDJ_families

Relative Path: analyses/data/TCR_VDJ_families Date Added: <2026-03-12 Thu> Source: In-house

Amino acid sequences of codon-optimized TCR V and constant chains

TFLink_Homo_sapiens_interactions_SS_mitab_v1.0.tsv

Relative Path: analyses/data/TFLink_Homo_sapiens_interactions_SS_mitab_v1.0.tsv Date Added: <2026-02-25 Wed> Source: Homo sapiens small-scale interaction table

In MITAB format (see here). All entries were verified with small-scale experimental evidence

human_core_ regulator data

Relative Path: analyses/data/hcc_reference/human_core_{Regulator_Gene,TF_Target}.txt Date Added: <2026-02-25 Wed> Source: Core data

Regulatory interactions for core human genes

signalink_liver_network.csv

Relative Path: analyses/data/hcc_reference/signalink_liver_network.cys Date Added: <2026-02-24 Tue> Source: Signalink v3.1 Liver

Obtained by selecting…
- species: “human”
- all pathways
- Pathway regulators, pathway members, and transcriptional regulators
- liver localization

cell_markers.yaml, gene_sets.yaml

Relative Path: analyses/data/hcc_reference/cell_markers.yaml Date Added: <2026-02-11 Wed> Source: Custom

Cell markers and gene sets used for annotating HCC scRNA-seq data. Generated with analyses/scrnaseq_hcc/get_markers.R

disco_cell_types.csv

Relative Path: analyses/data/disco_cholangiocyte.csv Date Added: <2026-02-11 Wed> Source: Cell types from DISCO

construct_sequences

Relative Path: analyses/pdac_tcr/construct_sequences Date Added: <2026-02-05 Thu> Source: External construct sequences

The following sequences were obtained from supplementary table 3 of the linked paper
- furin_linker.fasta
- murinized_trac.fasta
- murinized_trbc.fasta
- gibson_assembly_3p.fasta
- gibson_assembly_5p.fasta

u133a_mapping.tsv

Relative Path: analyses/data/u133a_mapping.tsv Date Added: <2025-11-12 Wed> Source: Biomart

Mapping between Ensembl IDs, HUGO gene symbols, NCBI IDs, and Affymetrix U133A probe IDs. Obtained from Ensembl BioMart (version 115) with the GRCh38.p14 assembly

mammaprint_candidate_genes.tsv

Relative Path: analyses/data/mammaprint_candidate_genes.tsv Date Added: <2025-11-12 Wed> Source: Gene expression profiling predicts clinical outcome of breast cancer

List of 231 prognosis reporter genes obtained by van 't Veer et al. (Supplementary Table 2 of the linked paper). The 70 genes belonging to the final prognostic signature (used in the MammaPrint test) are indicated by the final_list column. These are the 70 genes with the highest correlation coefficients

IEDB_data.tsv

Relative Path: analyses/data/IEDB_data.tsv Date Added: <2025-10-06 Mon> Source: https://downloads.iedb.org/misc/TCRMatch/IEDB_data.tsv

IEDB data formatted for TCRMatch

CEDAR_data.tsv

Relative Path: analyses/data/CEDAR_data.tsv Date Added: <2025-10-06 Mon> Source: https://github.com/IEDB/TCRMatch

Epitope data from the CEDAR database as provided in the 1.3 release of TCRMatch

vdjdb.h5ad

Relative Path: analyses/data/vdjdb.h5ad Date Added: <2025-09-26 Fri> Source: scirpy.datasets.vdjdb (scirpy)

Complete VDJdb database (see the homepage here), last updated 30 July, 2025

McPAS-TCR.csv

Relative Path: analyses/data/McPAS-TCR.csv Date Added: <2025-09-26 Fri> Source: Download page

Complete McPAS-TCR database (last updated September 10, 2022). “McPAS-TCR is a manually curated catalogue of T cell receptor (TCR) sequences that were found in T cells associated with various pathological conditions in humans and in mice. It is meant to link TCR sequences to their antigen target or to the pathology and organ with which they are associated”

MGI_Exome_Capture_V5_lifted.bed

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/exome_kits/MGI_V5/MGI_Exome_Capture_V5_lifted.bed Date added: <2025-03-03 Mon> Source: Custom

MGI_Exome_Capture_V5.bed file with coordinates lifted over from the hg19 assembly to hg38. The liftover was carried out with the UCSC LiftOver web interface using default parameters. Regions that failed to lift over are in “liftover_failures.bed” in the same directory

MGI_Exome_Capture_V5.bed

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/exome_kits/MGI/_V5/MGI_Exome_Capture_V5.bed Date added: <2025-03-03 Mon> Source: MGIEasy Exome Capture V5 capture bed file

Adult_Human_Skin.pkl

Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/Adult_Human_Skin.pkl Date added: <2025-02-11 Tue> Source: CellTypist Adult human skin model

9_healthy_reference_AP_large_intestine_finalmodel.pkl

Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/9_healthy_reference_AP_large_intestine_finalmodel.pkl Date added: <2025-02-11 Tue> Source: Large intestine

CellTypist model of large intestine cells collected from adult/paediatric samples (Pan-GI cell atlas)

CellMarker2_human.csv

Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/CellMarker2_human.csv Date added: <2025-02-11 Tue> Source: CellMarker 2.0 Human Cell markers

Human cell markers from CellMarker 2.0, used for marker-based annotation (e.g. with CellAssign)

Homo_sapiens.GRCh38.113.ref_flat

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/Homo_sapiens.GRCh38.113.ref_flat Date added: <2025-01-31 Fri> Source: Custom

refFlat file of the GRCh38 assembly, generated from the Ensembl GTF file for use with GATK’s CollectRnaSeqMetrics. Converted using the following command (taken from broadinstitute/picard#805):

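# Convert the Ensembl GTF to genePredExt, then reorder the columns into refFlat
# order: gene name (column 12) first, followed by genePred columns 1-10.
# !{params.gtf} is the GTF path placeholder (apparently a Nextflow variable).
gtfToGenePred \
   -genePredExt \
   -geneNameAsName2 \
   -ignoreGroupsWithoutExons \
   !{params.gtf} \
   /dev/stdout | \
   awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}'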
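
For readers less familiar with awk, the column reordering in the last step of the command can be sketched in Python. This is only an illustration of what the awk expression does to each line; the function name `genepred_ext_to_refflat` and the toy transcript row are made up. One difference to note: this sketch splits strictly on tabs, whereas awk's default field splitting also splits on spaces.

```python
def genepred_ext_to_refflat(line):
    """Reorder one genePredExt line into refFlat column order.

    refFlat = name2 (gene name, column 12) followed by the first ten
    genePred columns (name, chrom, strand, txStart, txEnd, cdsStart,
    cdsEnd, exonCount, exonStarts, exonEnds).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    return "\t".join([fields[11]] + fields[:10])


# Toy genePredExt row (15 columns), not taken from the real annotation
row = "\t".join([
    "ENST0001", "1", "+", "100", "900", "150", "850", "2",
    "100,500,", "300,900,", "0", "GENE1", "cmpl", "cmpl", "0,1,",
])
print(genepred_ext_to_refflat(row))
```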

GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/tool_specific/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw Date added: <2025-01-20 Mon> Source: https://s3.amazonaws.com/purecn/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw

100-kmer mappability file for hg38, intended for use with PureCN, prepared by the Waldron lab.

TCGA_HCC

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/public_data/TCGA_HCC Date added: <2025-01-15 Wed> Source: TCGA Hepatocellular Carcinoma data from (available on GDC)

STAR raw read counts of hepatocellular carcinoma (both normal and tumor) samples downloaded from the Genomic Data Portal. The data are from the TCGA project - Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma

2024-06-18_IntOGen-Drivers

Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/2024-06-18_IntOGen-Drivers Date added: <2025-01-09 Thu> Source: IntOGen

IntOGen driver gene data

census.csv

Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/census.csv Date added: <2025-01-10 Fri> Source: Cancer Gene Census

COSMIC cancer gene census

Homo_sapiens.GRCh38.cdna.all.fa.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/transcriptomes/Homo_sapiens.GRCh38.cdna.all.fa.gz Date added: <2025-01-09 Thu> Source: Ensembl cDNA

All cDNA sequences corresponding to Ensembl human genes (genome build GRCh38), excluding ncRNA

kallisto_GRCh38_ensembl.idx

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/transcriptomes/Homo_sapiens.GRCh38.cdna.all.fa.gz Date added: <2025-01-09 Thu> Source: Custom

Kallisto index (kallisto index -i ... ) of Ensembl cDNA sequences (Homo_sapiens.GRCh38.cdna.all.fa.gz) converted into RNA with seqkit seq --dna2rna

Homo_sapiens.GRCh38.113.gtf.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/Homo_sapiens.GRCh38.113.gtf.gz Date added: <2025-01-09 Thu> Source: Ensembl annotation data

Ensembl gene annotation data for human genome GRCh38, version 113

Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz Date added: <2025-01-09 Thu> Source: Ensembl genomes

Ensembl’s primary assembly of the GRCh38 genome, added into the directory on 2024-8-15

SureSelectHumanAllExonV6Hg38

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/exome_kits/SureSelectHumanAllExonV6Hg38 Date added: <2024-11-22 Thu> Source: Agilent SureDesign

Download package for the SureSelectHumanAllExon V6 Hg38 capture kit. Following this thread, the *_Covered.bed file is taken to be the baits and the *_Regions.bed file the targets

GRCh38_gencode_v44_CTAT_lib_Oct292023.plug-n-play.tar.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/GRCh38_gencode_v44_CTAT_lib_Oct292023.plug-n-play.tar.gz Date added: <2025-01-01 Wed> Source: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/

Reference genome and protein-coding gene annotation set, for use in fusion detection with STAR-Fusion

GCF_000001405.40_GRCh38.p14_genomic.sdf

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/GCF_000001405.40_GRCh38.p14_genomic.sdf Date added: <2024-12-30 Mon> Source: Custom

RTG Sequence Data File (SDF) of the GRCh38 reference genome (see the docs). <2024-12-30 Mon> Generated for use with vcfeval

cosmicv3.4_signatures.csv

Absolute path: /home/shannc/Bio_SDD/chula-stem/nextflow/config/cosmicv3.4_signatures.csv Date added: <2024-12-25 Wed> Source: Custom

Contains metadata about COSMIC signatures (v3.4) to aid in reporting the results of SigProfiler

Signature,Class,Proposed_aetiology,Studies,COSMIC_link

Proposed aetiologies are taken directly from the COSMIC page

Clingen-Gene-Disease-Summary-2024-12-19.csv

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/therapeutics/Clingen-Dosage-Sensitivity-2024-12-16.csv Date added: <2024-12-19 Thu> Source: ClinGen

A summary of gene-disease validity curations by ClinGen. Used primarily to identify potentially important genes that are affected by called repetitive regions

aggregated_msi.tsv

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/aggregated_msi.tsv Date added: <2024-12-18 Wed> Source: Custom

Description

dbVar and ClinVar repetitive element data, combined using a custom script. Intended to help identify when repetitive elements observed in the data (mainly microsatellites and tandem duplications) have already been found in other experiments. Includes somatic tandem duplications from dbVar and microsatellites from ClinVar

chr\tstart\tstop\tgenes\tsource\ttype\taccession

aggregated_cnv.tsv

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/aggregated_cnv.tsv Date added: <2024-12-18 Wed> Source: Custom

Description

dbVar and gnomAD CNV data, combined using a custom script. Intended to help identify when CNVs observed in the data have already been found in other experiments. Includes somatic CNVs from dbVar and gnomAD v4.1 non-neuro CNVs

chr\tstart\tstop\tgenes\tsource\ttype\taccession
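
Both aggregated tables share the column layout shown above, so checking whether an observed region has been reported before comes down to an interval-overlap lookup. The sketch below is one way to do that, under several assumptions: the file has a header row with exactly those column names, coordinates are treated as closed intervals, and the names `Region`, `load_regions`, and `find_overlaps` as well as the toy rows are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    stop: int
    genes: str
    source: str
    kind: str
    accession: str


def load_regions(handle):
    """Read a tab-separated table with a chr/start/stop/... header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        Region(row["chr"], int(row["start"]), int(row["stop"]), row["genes"],
               row["source"], row["type"], row["accession"])
        for row in reader
    ]


def find_overlaps(regions, chrom, start, stop):
    """Return known regions overlapping [start, stop] on `chrom`."""
    return [r for r in regions
            if r.chrom == chrom and r.start <= stop and start <= r.stop]


# Toy table for illustration; the values are invented
toy = io.StringIO(
    "chr\tstart\tstop\tgenes\tsource\ttype\taccession\n"
    "chr1\t100\t500\tGENE1\tdbVar\tcopy number gain\tacc1\n"
    "chr2\t100\t500\tGENE2\tgnomAD\tDEL\tacc2\n"
)
known = load_regions(toy)
print(find_overlaps(known, "chr1", 450, 800))
```

A linear scan like this is fine for spot checks; for whole call sets an interval tree or a tool such as bedtools would likely be a better fit.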

gnomADv4.1.0_Exomes

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/gnomADv4.1.0_Exomes Date added: <2024-11-6> Source: https://gnomad.broadinstitute.org/downloads#v4

Description

gnomAD exome data for all available chromosomes

Cosmic_CompleteCNA_v101_GRCh38.tsv

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/Cosmic_CompleteCNA_v101_GRCh38.tsv Date added: <2024-12-13 Fri> Source: https://cancer.sanger.ac.uk/cosmic/download/cosmic/v101/completecna

Description

COSMIC copy number data

gnomADv4.1.0/random

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/gnomADv4.1.0_Exomes/random Date added: <2024-12-18 Wed> Source: Custom

Description

Randomly sampled subsets of all the gnomAD exome data, taking 10% of the original variants. To resolve memory issues with running on the complete gnomAD set
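
The exact sampling procedure is not recorded here, so the following is only a sketch of one simple way to take roughly 10% of the variants from a VCF stream: header lines are always kept and each record is kept independently with probability `fraction`. The function name `subsample_vcf_lines`, the seed, and the toy lines are assumptions. Note that this yields approximately, not exactly, the requested fraction.

```python
import random


def subsample_vcf_lines(lines, fraction=0.1, seed=0):
    """Yield all header lines and a random subset of the record lines."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    for line in lines:
        if line.startswith("#") or rng.random() < fraction:
            yield line


# Toy VCF content for illustration
toy_vcf = ["##fileformat=VCFv4.1\n", "#CHROM\tPOS\tID\tREF\tALT\n"]
toy_vcf += [f"chr1\t{pos}\t.\tA\tG\n" for pos in range(1, 1001)]

kept = list(subsample_vcf_lines(toy_vcf, fraction=0.1))
n_records = sum(1 for line in kept if not line.startswith("#"))
print(f"kept {n_records} of 1000 records")
```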

gatk_resources

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/gatk_resources Date added: <2024-12-17 Tue> Source: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle

Description

Contains 1000G project data, and Mills gold standard indels. Recommended for use with BQSR, as described in this faq

gnomad.v4.1.cnv.non_neuro.vcf.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/gnomad.v4.1.cnv.non_neuro.vcf.gz Date added: <2024-12-17 Tue> Source: https://datasetgnomad.blob.core.windows.net/dataset/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz.tbi

Description

gnomAD copy number data, obtained the same way as the SV data

gnomad.v4.1.sv.sites.vcf.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/gnomad.v4.1.sv.sites.vcf.gz Date added: FILL Source: https://datasetgnomad.blob.core.windows.net/dataset/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz.tbi

Description

Structural variants from gnomAD, see this for an overview of the data generation

##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=CNV,Description="Copy Number Polymorphism">
##CPX_TYPE_INS_iDEL="Insertion with deletion at insertion site."
##CPX_TYPE_INVdel="Complex inversion with 3' flanking deletion."
##CPX_TYPE_INVdup="Complex inversion with 3' flanking duplication."
##CPX_TYPE_dDUP="Dispersed duplication."
##CPX_TYPE_dDUP_iDEL="Dispersed duplication with deletion at insertion site."
##CPX_TYPE_delINV="Complex inversion with 5' flanking deletion."
##CPX_TYPE_delINVdel="Complex inversion with 5' and 3' flanking deletions."
##CPX_TYPE_delINVdup="Complex inversion with 5' flanking deletion and 3' flanking duplication."
##CPX_TYPE_dupINV="Complex inversion with 5' flanking duplication."
##CPX_TYPE_dupINVdel="Complex inversion with 5' flanking duplication and 3' flanking deletion."
##CPX_TYPE_dupINVdup="Complex inversion with 5' and 3' flanking duplications."
##CPX_TYPE_piDUP_FR="Palindromic inverted tandem duplication, forward-reverse orientation."
##CPX_TYPE_piDUP_RF="Palindromic inverted tandem duplication, reverse-forward orientation."
##FILTER=<ID=FAIL_MANUAL_REVIEW,Description="Low-quality variant that did not pass manual review of supporting evidence">
##FILTER=<ID=HIGH_NCR,Description="Unacceptably high rate of no-call GTs">
##FILTER=<ID=IGH_MHC_OVERLAP,Description="SVs that are overlapped by over 50% by IGH or MCH regions, these variants are of low confidence">
##FILTER=<ID=LOWQUAL_WHAM_SR_DEL,Description="deletions under1Kb that are uniquely from wham and have SR-only support">
##FILTER=<ID=MULTIALLELIC,Description="Multiallelic site">
##FILTER=<ID=OUTLIER_SAMPLE_ENRICHED,Description="SVs that are enriched for non-reference genotypes in outlier samples, likely indicating noisy or unreliable genotypes">
##FILTER=<ID=REDUNDANT_LG_CNV,Description="Multiple large CNVs called at the same locus likely indicates unreliable clustering and/or low-quality multiallelic locus">
##FILTER=<ID=REFERENCE_ARTIFACT,Description="Likely reference artifact sites that are homozygous alternative in over 99% of the samples">
##FILTER=<ID=UNRESOLVED,Description="Variant is unresolved">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Predicted copy state">
##FORMAT=<ID=CNQ,Number=1,Type=Integer,Description="Read-depth genotype quality">
##FORMAT=<ID=EV,Number=.,Type=String,Description="Classes of evidence supporting final genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MANUAL,Number=1,Type=String,Description="Reason for a failure from manual review">
##FORMAT=<ID=PE_GQ,Number=1,Type=Integer,Description="Paired-end genotype quality">
##FORMAT=<ID=PE_GT,Number=1,Type=Integer,Description="Paired-end genotype">
##FORMAT=<ID=RD_CN,Number=1,Type=Integer,Description="Predicted copy state">
##FORMAT=<ID=RD_GQ,Number=1,Type=Integer,Description="Read-depth genotype quality">
##FORMAT=<ID=SR_GQ,Number=1,Type=Integer,Description="Split read genotype quality">
##FORMAT=<ID=SR_GT,Number=1,Type=Integer,Description="Split-read genotype">
##INFO=<ID=ALGORITHMS,Number=.,Type=String,Description="Source algorithms">
##INFO=<ID=BOTHSIDES_SUPPORT,Number=0,Type=Flag,Description="Variant has read-level support for both sides of breakpoint.Indicates higher-confidence variants.">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate">
##INFO=<ID=CPX_INTERVALS,Number=.,Type=String,Description="Genomic intervals constituting complex variant.">
##INFO=<ID=CPX_TYPE,Number=1,Type=String,Description="Class of complex variant.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=END2,Number=1,Type=Integer,Description="End position of the structural variant on CHR2">
##INFO=<ID=EVIDENCE,Number=.,Type=String,Description="Classes of random forest support.">
##INFO=<ID=LOW_CONFIDENCE_REPETITIVE_LARGE_DUP,Number=0,Type=Flag,Description="Duplications over 5Kb that are overlapped by segmental duplicates or simple repeats, these variants are not as high confident as others because of the affect by the repetitive sequences">
And more...

clinvar.vcf.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/clinvar.vcf.gz Date added: 2024-12-12 Source: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20241208.vcf.gz

Description

All VCF data from ClinVar for GRCh38

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=2024-12-08
##source=ClinVar
##reference=GRCh38
##ID=<Description="ClinVar Variation ID">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status of germline classification for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Aggregate germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Germline classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence">
##INFO=<ID=ONCDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDB">
##INFO=<ID=ONCDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDBINCL">
##INFO=<ID=ONCDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for oncogenicity classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=ONCDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for oncogenicity classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=ONC,Number=.,Type=String,Description="Aggregate oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ONCINCL,Number=.,Type=String,Description="Oncogenicity classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=ONCREVSTAT,Number=.,Type=String,Description="ClinVar review status of oncogenicity classification for the Variation ID">
##INFO=<ID=ONCCONF,Number=.,Type=String,Description="Conflicting oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=SCIDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDB">
##INFO=<ID=SCIDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDBINCL">
##INFO=<ID=SCIDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for somatic clinial impact classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=SCIDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for somatic clinical impact classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=SCIREVSTAT,Number=.,Type=String,Description="ClinVar review status of somatic clinical impact for the Variation ID">
##INFO=<ID=SCI,Number=.,Type=String,Description="Aggregate somatic clinical impact for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=SCIINCL,Number=.,Type=String,Description="Somatic clinical impact classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
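
The ORIGIN field in the header above is described as a sum of flag values, so a single integer can encode several allele origins. A small sketch of decoding it follows; the flag table is copied from the header description, while the function name `decode_origin` is made up for illustration.

```python
# Flag values as listed in the ORIGIN header description
ORIGIN_FLAGS = {
    1: "germline",
    2: "somatic",
    4: "inherited",
    8: "paternal",
    16: "maternal",
    32: "de-novo",
    64: "biparental",
    128: "uniparental",
    256: "not-tested",
    512: "tested-inconclusive",
    1073741824: "other",
}


def decode_origin(value):
    """Split a summed ORIGIN value into its individual labels."""
    value = int(value)
    if value == 0:
        return ["unknown"]
    return [label for bit, label in ORIGIN_FLAGS.items() if value & bit]


print(decode_origin("3"))
```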

nstd102_clinical_sv.csv

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/nstd102_clinical_sv.csv Date added: 2024-12-17 Source: https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd102/

Description

“Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as “submitted” and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).”

Study ID,Variant ID,Variant Region type,Variant Call type,Sampleset ID,Method,Analysis ID,Validation,Variant Samples,Subject Phenotype,Clinical Interpretation,Assembly Name,Chromosome Accession,Chromosome,Outer Start,Start,Inner Start,Inner End,End,Outer End,Placement Type,Remap Score

somatic_sv.bed

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/somatic_sv.bed Date added: 2024-12-16 Source: https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/dbvarhub/hg38/somatic_sv.bed

Description

Somatic structural variants from dbVar

# First line
chr1    61734   248930189       nssv459313      1000    .       61734   248930189       0,0,255      SNP array       copy number gain        Over 1MB        Gene(s) affected: Multiple, Position: chr1:61735-248930189, Size: 248868455, Type: copy number gain, Study: Walter et al 2009, Method: SNP array  Walter et al 2009       90 to 100       nssv15156970

Column explanation

string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of feature on chromosome"
uint chromEnd; "End position of feature on chromosome"
string name; "dbVar Variant Accession"
uint score; "Score"
char[1] strand; "+ or - for strand"
uint thickStart; "Start position of feature on chromosome"
uint thickEnd; "End position of feature on chromosome"
uint reserved; "Colors indicate Variant Type and Clinical Significance."
string method; "Discovery Method type"
string type; "Variant Type"
string length; "Variant Length Range"
string label; "Gene, Position, Size, Type, Study, Method"
string study; "dbVar Study"
string overlap; "Range of Reciprocal Overlap with Pathogenic Variant"
string pathogenic_acc; "dbVar Pathogenic Variant Accession with highest reciprocal overlap"
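
The 16-column layout above can be mapped onto named fields when reading the file. The sketch below assumes the file is tab-delimited (the example row is shown with spaces, but fields such as the label contain spaces themselves); the class name `SomaticSV` and the function `parse_somatic_sv` are hypothetical. The sample row reuses the values from the first line shown earlier.

```python
from typing import NamedTuple


class SomaticSV(NamedTuple):
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    reserved: str  # RGB colour string such as "0,0,255"
    method: str
    type: str
    length: str
    label: str
    study: str
    overlap: str
    pathogenic_acc: str


def parse_somatic_sv(line):
    """Parse one tab-delimited row into a SomaticSV record."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 16:
        raise ValueError(f"expected 16 columns, got {len(f)}")
    return SomaticSV(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                     int(f[6]), int(f[7]), f[8], *f[9:])


row = "\t".join([
    "chr1", "61734", "248930189", "nssv459313", "1000", ".", "61734",
    "248930189", "0,0,255", "SNP array", "copy number gain", "Over 1MB",
    "Gene(s) affected: Multiple, Position: chr1:61735-248930189, "
    "Size: 248868455, Type: copy number gain, Study: Walter et al 2009, "
    "Method: SNP array",
    "Walter et al 2009", "90 to 100", "nssv15156970",
])
record = parse_somatic_sv(row)
print(record.name, record.type, record.chrom_end - record.chrom_start)
```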

Contains data from:
- estd192 (COSMIC), Date: 2017-03-29
- nstd102 (Clinical_Structural_Variants), Date: 2023-09-30
- nstd202 (Ghazali_et_al_2021), Date: 2021-04-08
- nstd94 (Helman_et_al_2014), Date: 2014-05-12
- nstd11 (Walter_et_al_2009), Date: 2009-10-13
- nstd125 (Wills_et_al_2016), Date: 2016-09-20

dbvar.GRCh38.variant_call.somatic.vcf.gz

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/dbvar.GRCh38.variant_call.somatic.vcf.gz Date added: FILL Source: dbVar

Description

All somatic variant calls on GRCh38 from dbVar

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20231030
##reference=GCF_000001405.39
##ALT=<ID=<CNV>,Description="Copy number variable region">
##ALT=<ID=<DEL>,Description="Deletion relative to the reference">
##ALT=<ID=<DUP>,Description="Region of elevated copy number relative to the reference">
##ALT=<ID=<INS>,Description="Insertion of sequence relative to the reference">
##ALT=<ID=<INV>,Description="Inversion of reference sequence">
##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in dbVar">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=DESC,Number=1,Type=String,Description="Any additional information about this call (free text, enclose in double quotes)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=String,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Second (To) Chromosome in a translocation pair">
##INFO=<ID=REGIONID,Number=.,Type=String,Description="The parent variant region accession(s)">
##INFO=<ID=EXPERIMENT,Number=1,Type=Integer,Description="The experiment_id (from EXPERIMENTS tab) of the experiment that was used to generate this call">
##INFO=<ID=EVENT,Number=.,Type=String,Description="The parent variant region accession of a mutation event">
##INFO=<ID=LINKS,Number=.,Type=String,Description="Link(s) to external database(s) - see LINKS tab of dbVar submission template for examples">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
##INFO=<ID=CLNACC,Number=.,Type=String,Description="Accessions and version numbers assigned by ClinVar">
##INFO=<ID=clinical_source,Number=1,Type=String,Description="Source of clinical significance">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates that the record is a somatic mutation. NOT for clinical assertions, i.e. cancer. See also ORIGIN.">
##INFO=<ID=ORIGIN,Number=1,Type=String,Description="Origin of allele, if known; should be one of (biparental, de novo, germline, inherited, maternal, not applicable, not provided, not-reported, paternal, tested-inconclusive, uniparental, unknown, see ClinVar for details). See also SOMATIC">
##INFO=<ID=PHENO,Number=.,Type=String,Description="Phenotype(s) thought to associated with this call. NOT for clinical assertions (submit to ClinVar). (free text, enclose in double quotes)">
##INFO=<ID=SAMPLE,Number=1,Type=String,Description="sample_id from dbVar submission; every call must have SAMPLE or SAMPLESET, but NOT BOTH">
##INFO=<ID=SAMPLESET,Number=1,Type=Integer,Description="sampleset_id from dbVar submission; every call must have SAMPLESET or SAMPLE but NOT BOTH">
##INFO=<ID=VALIDATED,Number=0,Type=Flag,Description="Validated by follow-up experiment">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Variation sequence">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Global Allele count">
##INFO=<ID=AF,Number=.,Type=Float,Description="Global Allele frequency">
##INFO=<ID=AN,Number=.,Type=String,Description="Global Allele name">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
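
The INFO keys declared in the header above appear in each record as a semicolon-separated list of `KEY=value` pairs, with flag-type keys (such as IMPRECISE or SOMATIC) given without a value. A minimal parser sketch is shown below; the function name `parse_info` and the example INFO string are invented for illustration, and values are left as strings rather than converted to the declared types.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags map to True."""
    result = {}
    if info == ".":  # "." marks an empty INFO column
        return result
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


# Hypothetical INFO string using keys from the header
print(parse_info("DBVARID=nssv1;SVTYPE=CNV;END=200;IMPRECISE"))
```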

ENCFF356LFX.bed

Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/blacklists/ENCFF356LFX.bed Date added: ??? Source: Encode project

Description

Blacklisted regions provided by the ENCODE project for the GRCh38 assembly

Clingen-Dosage-Sensitivity-2024-12-16.csv

Description

Contains curated assignments of dosage sensitivity to genes and regions

"GENE/REGION","HGNC/ISCA","GRCh37","GRCh38","HAPLOINSUFFICIENCY","TRIPLOSENSITIVITY","ONLINE REPORT","DATE"
