Relative Path: analyses/data/mouse_STR
Date Added: <2026-04-07 Tue>
- Using NGS for STR profiling isn’t the norm, and information for the standard 18 mouse loci (specifically STR motif AND genomic coordinates) isn’t readily accessible.
- mouse_STR_genomic_locations_patent.csv: Fig 1. USRE49835E1 - Mouse cell line authentication - Google Patents (mm38)
- mouse_STR_loci_info.csv Table 3 of the Authentication of Human and Mouse Cell Lines by Short Tandem Repeat (STR) DNA Genotype Analysis - Assay Guidance Manual (mm39)
- mouse_STR_primers.csv Table 11 of Interlaboratory study to validate a STR profiling method for intraspecies identification of mouse cell lines (mm38)
- mouse_STR_sequencing_primers.csv Table S2. of Short tandem repeat profiling via next-generation sequencing for cell line authentication (mm38)
- mouse_STR_multiplex_pcr_primers.csv Table 7. of the Assay Guidance Manual (mm39)
Relative Path: analyses/data/tcr_imgt_c
Date Added: <2026-03-18 Wed>
Source: IMGT
- IMGT reference sequences for human TCR C chains to use for TCR constructs
- For TRAC, the C-REGION (which contains all of the cDNA) was extracted from the X02592 accession and concatenated with a stop codon
- No cDNA pages were available for TRBC[1,2] so instead for each allele, sequences EX1-4 (exons) were obtained and concatenated together with a stop codon
- Links
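The concatenation described above (exons joined in order, followed by a stop codon) can be sketched as below. This is a minimal illustration, not the script actually used: the function name `build_c_region`, the placeholder sequences, and the default `TGA` stop codon are all assumptions, since the notes do not say which stop codon was appended.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def build_c_region(exons, stop_codon="TGA"):
    """Join exon sequences in order and append a stop codon.

    `exons` is an ordered list of nucleotide strings (e.g. EX1..EX4).
    The default stop codon is an assumption, not taken from the notes.
    """
    if stop_codon not in STOP_CODONS:
        raise ValueError(f"not a stop codon: {stop_codon}")
    # Remove whitespace left over from FASTA line wrapping and normalise case
    cleaned = ["".join(exon.split()).upper() for exon in exons]
    return "".join(cleaned) + stop_codon


# Placeholder sequences only; the real exons come from the IMGT entries
print(build_c_region(["atg gcc", "AAA", "ccc", "GGG"]))
```

For TRAC the same helper would be called with a single-element list holding the C-REGION sequence.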
Relative Path: analyses/data/TCR_VDJ_families
Date Added: <2026-03-12 Thu>
Source: In-house
Amino acid sequences of codon-optimized TCR V and constant chains
Relative Path: analyses/data/TFLink_Homo_sapiens_interactions_SS_mitab_v1.0.tsv
Date Added: <2026-02-25 Wed>
Source: Homo sapiens small-scale interaction table
In MITAB format (see here). All entries were verified with small-scale experimental evidence
Relative Path: analyses/data/hcc_reference/human_core_{Regulator_Gene,TF_Target}.txt
Date Added: <2026-02-25 Wed>
Source: Core data
Regulatory interactions for core human genes
Relative Path: analyses/data/hcc_reference/signalink_liver_network.cys
Date Added: <2026-02-24 Tue>
Source: Signalink v3.1 Liver
- Obtained by selecting…
- species: “human”
- all pathways
- Pathway regulators, pathway members, and transcriptional regulators
- liver localization
Relative Path: analyses/data/hcc_reference/cell_markers.yaml
Date Added: <2026-02-11 Wed>
Source: Custom
Cell markers and gene sets used for annotating HCC scRNA-seq data. Generated with analyses/scrnaseq_hcc/get_markers.R
Relative Path: analyses/data/disco_cholangiocyte.csv
Date Added: <2026-02-11 Wed>
Source: Cell types from DISCO
Relative Path: analyses/pdac_tcr/construct_sequences
Date Added: <2026-02-05 Thu>
Source: External construct sequences
- The following sequences were obtained from supplementary table 3 of the linked paper
- furin_linker.fasta
- murinized_trac.fasta
- murinized_trbc.fasta
- gibson_assembly_3p.fasta
- gibson_assembly_5p.fasta
Relative Path: analyses/data/u133a_mapping.tsv
Date Added: <2025-11-12 Wed>
Source: Biomart
Mapping between Ensembl IDs, HUGO gene symbols, NCBI IDs, and Affymetrix U133A probe ids. Obtained from Ensembl biomart (version 115) with the GRCh38.p14 assembly
Relative Path: analyses/data/mammaprint_candidate_genes.tsv
Date Added: <2025-11-12 Wed>
Source: Gene expression profiling predicts clinical outcome of breast cancer
List of 231 prognosis reporter genes obtained by van ’t Veer et al. (Supplementary Table 2 of the linked paper). The 70 genes belonging to the final prognostic signature (used in the MammaPrint test) are indicated by the final_list column. These are the 70 genes with the highest correlation coefficients.
Relative Path: analyses/data/IEDB_data.tsv
Date Added: <2025-10-06 Mon>
Source: https://downloads.iedb.org/misc/TCRMatch/IEDB_data.tsv
IEDB data formatted for TCRMatch
Relative Path: analyses/data/CEDAR_data.tsv
Date Added: <2025-10-06 Mon>
Source: https://github.com/IEDB/TCRMatch
Epitope data from the CEDAR database as provided in the 1.3 release of TCRMatch
Relative Path: analyses/data/vdjdb.h5ad
Date Added: <2025-09-26 Fri>
Source: scirpy.datasets.vdjdb — scirpy
Complete VDJdb database (see the homepage here), last updated 30 July, 2025
Relative Path: analyses/data/McPAS-TCR.csv
Date Added: <2025-09-26 Fri>
Source: Download page
Complete McPAS-TCR database (last updated September 10, 2022).
“McPAS-TCR is a manually curated catalogue of T cell receptor (TCR) sequences that were found in T cells associated with various pathological conditions in humans and in mice. It is meant to link TCR sequences to their antigen target or to the pathology and organ with which they are associated”
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/exome_kits/MGI_V5/MGI_Exome_Capture_V5_lifted.bed
Date added: <2025-03-03 Mon>
Source: Custom
MGI_Exome_Capture_V5.bed file with coordinates lifted over from the hg19 assembly to hg38.
Liftover process carried out with the web interface of UCSC LiftOver with default parameters. Failed regions are in “liftover_failures.bed” in the same directory
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/exome_kits/MGI_V5/MGI_Exome_Capture_V5.bed
Date added: <2025-03-03 Mon>
Source: MGIEasy Exome Capture V5 capture bed file
Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/Adult_Human_Skin.pkl
Date added: <2025-02-11 Tue>
Source: CellTypist Adult human skin model
Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/9_healthy_reference_AP_large_intestine_finalmodel.pkl
Date added: <2025-02-11 Tue>
Source: Large intestine
CellTypist model of large intestine cells collected from adult/paediatric samples (Pan-GI cell atlas)
Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/CellMarker2_human.csv
Date added: <2025-02-11 Tue>
Source: CellMarker 2.0 Human Cell markers
Human cell markers from CellMarker 2 used for marker-based annotation e.g. CellAssign
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/Homo_sapiens.GRCh38.113.ref_flat
Date added: <2025-01-31 Fri>
Source: Custom
refFlat file of the GRCh38 assembly, generated from the Ensembl GTF file for use with GATK’s CollectRnaSeqMetrics. Converted using the following command (taken from broadinstitute/picard#805):
gtfToGenePred \
-genePredExt \
-geneNameAsName2 \
-ignoreGroupsWithoutExons \
!{params.gtf} \
/dev/stdout | \
awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}'
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/tool_specific/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw
Date added: <2025-01-20 Mon>
Source: https://s3.amazonaws.com/purecn/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw
100-kmer mappability file for hg38, intended for use with PureCN, prepared by the Waldron lab.
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/public_data/TCGA_HCC
Date added: <2025-01-15 Wed>
Source: TCGA Hepatocellular Carcinoma data (available on GDC)
STAR raw read counts of hepatocellular carcinoma samples (both normal and tumor) downloaded from the Genomic Data Commons (GDC) Data Portal. The data are from the TCGA project - Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma
Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/2024-06-18_IntOGen-Drivers
Date added: <2025-01-09 Thu>
Source: Intogen
Intogen driver gene data
Absolute path: /home/shannc/Bio_SDD/chula-stem/analyses/data/census.csv
Date added: <2025-01-10 Fri>
Source: Cancer Gene Census
COSMIC cancer gene census
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/transcriptomes/Homo_sapiens.GRCh38.cdna.all.fa.gz
Date added: <2025-01-09 Thu>
Source: Ensembl cDNA
All cDNA sequences corresponding to Ensembl human genes (genome build GRCh38), excluding ncRNA
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/transcriptomes/Homo_sapiens.GRCh38.cdna.all.fa.gz
Date added: <2025-01-09 Thu>
Source: Custom
Kallisto index (kallisto index -i ... ) of Ensembl cDNA sequences (Homo_sapiens.GRCh38.cdna.all.fa.gz) converted into RNA with seqkit seq --dna2rna
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/Homo_sapiens.GRCh38.113.gtf.gz
Date added: <2025-01-09 Thu>
Source: Ensembl annotation data
Ensembl gene annotation data for human genome GRCh38, version 113
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
Date added: <2025-01-09 Thu>
Source: Ensembl genomes
Ensembl’s primary assembly of the GRCh38 genome, added into the directory on 2024-8-15
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/exome_kits/SureSelectHumanAllExonV6Hg38
Date added: <2024-11-22 Thu>
Source: Agilent SureDesign
Download package for the SureSelectHumanAllExon V6 Hg38 capture kit.
The *_Covered.bed files are taken to be baits, and the *_Regions.bed files to be targets
From this thread
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/GRCh38_gencode_v44_CTAT_lib_Oct292023.plug-n-play.tar.gz
Date added: <2025-01-01 Wed>
Source: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/
Reference genome and protein-coding gene annotation set, for use in fusion detection with STAR-Fusion
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/genomes/GCF_000001405.40_GRCh38.p14_genomic.sdf
Date added: <2024-12-30 Mon>
Source: Custom
RTG Sequence Data File (SDF) of the GRCh38 reference genome (see the docs). Generated for use with vcfeval
Absolute path: /home/shannc/Bio_SDD/chula-stem/nextflow/config/cosmicv3.4_signatures.csv
Date added: <2024-12-25 Wed>
Source: Custom
Contains metadata about cosmic signatures (v3.4) to aid in reporting the results of SigProfiler
Signature,Class,Proposed_aetiology,Studies,COSMIC_link
- Proposed aetiologies are taken directly from the COSMIC page
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/therapeutics/Clingen-Dosage-Sensitivity-2024-12-16.csv
Date added: <2024-12-19 Thu>
Source: ClinGen
A summary of Gene-Disease validity curations by ClinGen. Used primarily to identify potentially important genes that are affected by called repetitive regions
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/aggregated_msi.tsv
Date added: <2024-12-18 Wed>
Source: Custom
dbVar and ClinVar repetitive element data, combined using a custom script. Intended to help identify when repetitive elements (mainly microsatellites and tandem duplications) observed in the data have been found in other experiments. Includes somatic tandem duplications from dbVar and microsatellites from ClinVar.
chr\tstart\tstop\tgenes\tsource\ttype\taccession
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/aggregated_cnv.tsv
Date added: <2024-12-18 Wed>
Source: Custom
dbVar and gnomAD CNV data, combined using a custom script. Intended to help identify when CNVs observed in the data have been found in other experiments. Includes somatic CNVs from dbVar and gnomad.v4.1 non-neuro CNVs.
chr\tstart\tstop\tgenes\tsource\ttype\taccession
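As a rough sketch of how such an aggregated table could be queried, the snippet below looks up rows that overlap an observed interval. It assumes the file carries the header row shown above, that coordinates can be treated as closed intervals, and that chromosome names match the query exactly; the function name `find_overlaps` and the example rows are made up for illustration and are not taken from the custom script.

```python
import csv
import io

COLUMNS = ["chr", "start", "stop", "genes", "source", "type", "accession"]


def find_overlaps(handle, chrom, start, stop):
    """Return rows whose [start, stop] interval overlaps the query interval."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if row["chr"] != chrom:
            continue
        # Two closed intervals overlap when each one starts before the other ends
        if int(row["start"]) <= stop and int(row["stop"]) >= start:
            hits.append(row)
    return hits


# Made-up rows for illustration; real rows would come from aggregated_cnv.tsv
example = io.StringIO(
    "\t".join(COLUMNS) + "\n"
    "chr1\t100\t500\tGENE_A\tdbVar\tcopy number gain\tacc_1\n"
    "chr1\t900\t1200\tGENE_B\tgnomAD\tDEL\tacc_2\n"
    "chr2\t100\t500\tGENE_C\tdbVar\tcopy number loss\tacc_3\n"
)
for hit in find_overlaps(example, "chr1", 450, 950):
    print(hit["accession"], hit["source"], hit["type"])
```

A linear scan is enough for occasional lookups; for many queries an interval index would be the more usual choice.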
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/gnomADv4.1.0_Exomes
Date added: <2024-11-6>
Source: https://gnomad.broadinstitute.org/downloads#v4
GnomAD exome data for all available chromosomes
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/Cosmic_CompleteCNA_v101_GRCh38.tsv
Date added: <2024-12-13 Fri>
Source: https://cancer.sanger.ac.uk/cosmic/download/cosmic/v101/completecna
Cosmic copy number data
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/gnomADv4.1.0_Exomes/random
Date added: <2024-12-18 Wed>
Source: Custom
Randomly sampled subsets of all the gnomAD exome data, taking 10% of the original variants. Created to work around memory issues when running on the complete gnomAD set.
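The notes do not record how the 10% subsets were drawn. One plausible approach, shown here purely as an assumption, is to stream each VCF, keep every header line, and keep each record independently with probability 0.1. The function name `subsample_vcf_lines` and the toy VCF are hypothetical.

```python
import random


def subsample_vcf_lines(lines, fraction=0.1, seed=0):
    """Yield all header lines and roughly `fraction` of the records."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    for line in lines:
        if line.startswith("#"):
            # Always keep meta-information lines and the column header
            yield line
        elif rng.random() < fraction:
            # Keep each record independently with probability `fraction`
            yield line


# Tiny made-up VCF for illustration
vcf = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"]
vcf += [f"chr1\t{pos}\t.\tA\tG\t.\tPASS\t.\n" for pos in range(1, 1001)]
kept = list(subsample_vcf_lines(vcf, fraction=0.1, seed=0))
print(len(kept) - 2, "of 1000 records kept")
```

Because records are streamed one at a time, memory use stays flat regardless of file size, but the retained fraction is only approximately 10%.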
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/gatk_resources
Date added: <2024-12-17 Tue>
Source: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
Contains 1000G project data, and Mills gold standard indels. Recommended for use with BQSR, as described in this faq
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/gnomad.v4.1.cnv.non_neuro.vcf.gz
Date added: <2024-12-17 Tue>
Source: https://datasetgnomad.blob.core.windows.net/dataset/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz.tbi
gnomAD copy number data, obtained the same way as the SV data
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/gnomad.v4.1.sv.sites.vcf.gz
Date added: FILL
Source: https://datasetgnomad.blob.core.windows.net/dataset/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz.tbi
Structural variants from gnomAD, see this for an overview of the data generation
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=CNV,Description="Copy Number Polymorphism">
##CPX_TYPE_INS_iDEL="Insertion with deletion at insertion site."
##CPX_TYPE_INVdel="Complex inversion with 3' flanking deletion."
##CPX_TYPE_INVdup="Complex inversion with 3' flanking duplication."
##CPX_TYPE_dDUP="Dispersed duplication."
##CPX_TYPE_dDUP_iDEL="Dispersed duplication with deletion at insertion site."
##CPX_TYPE_delINV="Complex inversion with 5' flanking deletion."
##CPX_TYPE_delINVdel="Complex inversion with 5' and 3' flanking deletions."
##CPX_TYPE_delINVdup="Complex inversion with 5' flanking deletion and 3' flanking duplication."
##CPX_TYPE_dupINV="Complex inversion with 5' flanking duplication."
##CPX_TYPE_dupINVdel="Complex inversion with 5' flanking duplication and 3' flanking deletion."
##CPX_TYPE_dupINVdup="Complex inversion with 5' and 3' flanking duplications."
##CPX_TYPE_piDUP_FR="Palindromic inverted tandem duplication, forward-reverse orientation."
##CPX_TYPE_piDUP_RF="Palindromic inverted tandem duplication, reverse-forward orientation."
##FILTER=<ID=FAIL_MANUAL_REVIEW,Description="Low-quality variant that did not pass manual review of supporting evidence">
##FILTER=<ID=HIGH_NCR,Description="Unacceptably high rate of no-call GTs">
##FILTER=<ID=IGH_MHC_OVERLAP,Description="SVs that are overlapped by over 50% by IGH or MCH regions, these variants are of low confidence">
##FILTER=<ID=LOWQUAL_WHAM_SR_DEL,Description="deletions under1Kb that are uniquely from wham and have SR-only support">
##FILTER=<ID=MULTIALLELIC,Description="Multiallelic site">
##FILTER=<ID=OUTLIER_SAMPLE_ENRICHED,Description="SVs that are enriched for non-reference genotypes in outlier samples, likely indicating noisy or unreliable genotypes">
##FILTER=<ID=REDUNDANT_LG_CNV,Description="Multiple large CNVs called at the same locus likely indicates unreliable clustering and/or low-quality multiallelic locus">
##FILTER=<ID=REFERENCE_ARTIFACT,Description="Likely reference artifact sites that are homozygous alternative in over 99% of the samples">
##FILTER=<ID=UNRESOLVED,Description="Variant is unresolved">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Predicted copy state">
##FORMAT=<ID=CNQ,Number=1,Type=Integer,Description="Read-depth genotype quality">
##FORMAT=<ID=EV,Number=.,Type=String,Description="Classes of evidence supporting final genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MANUAL,Number=1,Type=String,Description="Reason for a failure from manual review">
##FORMAT=<ID=PE_GQ,Number=1,Type=Integer,Description="Paired-end genotype quality">
##FORMAT=<ID=PE_GT,Number=1,Type=Integer,Description="Paired-end genotype">
##FORMAT=<ID=RD_CN,Number=1,Type=Integer,Description="Predicted copy state">
##FORMAT=<ID=RD_GQ,Number=1,Type=Integer,Description="Read-depth genotype quality">
##FORMAT=<ID=SR_GQ,Number=1,Type=Integer,Description="Split read genotype quality">
##FORMAT=<ID=SR_GT,Number=1,Type=Integer,Description="Split-read genotype">
##INFO=<ID=ALGORITHMS,Number=.,Type=String,Description="Source algorithms">
##INFO=<ID=BOTHSIDES_SUPPORT,Number=0,Type=Flag,Description="Variant has read-level support for both sides of breakpoint.Indicates higher-confidence variants.">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate">
##INFO=<ID=CPX_INTERVALS,Number=.,Type=String,Description="Genomic intervals constituting complex variant.">
##INFO=<ID=CPX_TYPE,Number=1,Type=String,Description="Class of complex variant.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=END2,Number=1,Type=Integer,Description="End position of the structural variant on CHR2">
##INFO=<ID=EVIDENCE,Number=.,Type=String,Description="Classes of random forest support.">
##INFO=<ID=LOW_CONFIDENCE_REPETITIVE_LARGE_DUP,Number=0,Type=Flag,Description="Duplications over 5Kb that are overlapped by segmental duplicates or simple repeats, these variants are not as high confident as others because of the affect by the repetitive sequences">
And more...
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/clinvar.vcf.gz
Date added: 2024-12-12
Source: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20241208.vcf.gz
All vcf data from clinVar for GRCh38
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=2024-12-08
##source=ClinVar
##reference=GRCh38
##ID=<Description="ClinVar Variation ID">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status of germline classification for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Aggregate germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Germline classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence">
##INFO=<ID=ONCDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDB">
##INFO=<ID=ONCDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDBINCL">
##INFO=<ID=ONCDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for oncogenicity classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=ONCDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for oncogenicity classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=ONC,Number=.,Type=String,Description="Aggregate oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ONCINCL,Number=.,Type=String,Description="Oncogenicity classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=ONCREVSTAT,Number=.,Type=String,Description="ClinVar review status of oncogenicity classification for the Variation ID">
##INFO=<ID=ONCCONF,Number=.,Type=String,Description="Conflicting oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=SCIDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDB">
##INFO=<ID=SCIDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDBINCL">
##INFO=<ID=SCIDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for somatic clinial impact classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=SCIDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for somatic clinical impact classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=SCIREVSTAT,Number=.,Type=String,Description="ClinVar review status of somatic clinical impact for the Variation ID">
##INFO=<ID=SCI,Number=.,Type=String,Description="Aggregate somatic clinical impact for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=SCIINCL,Number=.,Type=String,Description="Somatic clinical impact classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
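The ORIGIN field in the header above is an additive flag: the listed values are summed when several origins apply. A minimal sketch of decoding it, using only the values given in the header description (the function name `decode_origin` is hypothetical):

```python
# Values copied from the ORIGIN description in the ClinVar VCF header
ORIGIN_FLAGS = {
    1: "germline",
    2: "somatic",
    4: "inherited",
    8: "paternal",
    16: "maternal",
    32: "de-novo",
    64: "biparental",
    128: "uniparental",
    256: "not-tested",
    512: "tested-inconclusive",
    1073741824: "other",
}


def decode_origin(value):
    """Translate a summed ORIGIN value into its list of labels."""
    value = int(value)
    if value == 0:
        return ["unknown"]
    # Each listed value is a distinct bit, so a bitwise AND picks out the parts
    return [label for bit, label in ORIGIN_FLAGS.items() if value & bit]


print(decode_origin("3"))
```

For example, a value of 3 decodes to germline plus somatic, since 3 = 1 + 2.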
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/nstd102_clinical_sv.csv
Date added: 2024-12-17
Source: https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd102/
“Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as “submitted” and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).”
Study ID,Variant ID,Variant Region type,Variant Call type,Sampleset ID,Method,Analysis ID,Validation,Variant Samples,Subject Phenotype,Clinical Interpretation,Assembly Name,Chromosome Accession,Chromosome,Outer Start,Start,Inner Start,Inner End,End,Outer End,Placement Type,Remap Score
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/structural/somatic_sv.bed
Date added: 2024-12-16
Source: https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/dbvarhub/hg38/somatic_sv.bed
Somatic structural variants from dbVar
# First line
chr1 61734 248930189 nssv459313 1000 . 61734 248930189 0,0,255 SNP array copy number gain Over 1MB Gene(s) affected: Multiple, Position: chr1:61735-248930189, Size: 248868455, Type: copy number gain, Study: Walter et al 2009, Method: SNP array Walter et al 2009 90 to 100 nssv15156970
Column explanation
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of feature on chromosome"
uint chromEnd; "End position of feature on chromosome"
string name; "dbVar Variant Accession"
uint score; "Score"
char[1] strand; "+ or - for strand"
uint thickStart; "Start position of feature on chromosome"
uint thickEnd; "End position of feature on chromosome"
uint reserved; "Colors indicate Variant Type and Clinical Significance."
string method; "Discovery Method type"
string type; "Variant Type"
string length; "Variant Length Range"
string label; "Gene, Position, Size, Type, Study, Method"
string study; "dbVar Study"
string overlap; "Range of Reciprocal Overlap with Pathogenic Variant"
string pathogenic_acc; "dbVar Pathogenic Variant Accession with highest reciprocal overlap"
Contains data from:
- estd192 (COSMIC), Date: 2017-03-29
- nstd102 (Clinical_Structural_Variants), Date: 2023-09-30
- nstd202 (Ghazali_et_al_2021), Date: 2021-04-08
- nstd94 (Helman_et_al_2014), Date: 2014-05-12
- nstd11 (Walter_et_al_2009), Date: 2009-10-13
- nstd125 (Wills_et_al_2016), Date: 2016-09-20
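The 16-column layout listed above can be turned into named records with a small parser. The sketch below assumes the file is tab-delimited (several fields contain spaces) and rebuilds the first line with field boundaries inferred from the column list, so those boundaries are an assumption. The `reserved` column is kept as a string because it holds an RGB triple such as `0,0,255`. The function name `parse_somatic_sv_line` is hypothetical.

```python
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "reserved", "method", "type", "length",
    "label", "study", "overlap", "pathogenic_acc",
]
INT_COLUMNS = {"chromStart", "chromEnd", "score", "thickStart", "thickEnd"}


def parse_somatic_sv_line(line):
    """Split one tab-delimited line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BED_COLUMNS):
        raise ValueError(f"expected {len(BED_COLUMNS)} fields, got {len(fields)}")
    record = dict(zip(BED_COLUMNS, fields))
    for key in INT_COLUMNS:
        record[key] = int(record[key])
    return record


# First line of the file, with field boundaries inferred (an assumption)
EXAMPLE = "\t".join([
    "chr1", "61734", "248930189", "nssv459313", "1000", ".",
    "61734", "248930189", "0,0,255", "SNP array", "copy number gain",
    "Over 1MB",
    "Gene(s) affected: Multiple, Position: chr1:61735-248930189, "
    "Size: 248868455, Type: copy number gain, Study: Walter et al 2009, "
    "Method: SNP array",
    "Walter et al 2009", "90 to 100", "nssv15156970",
])
record = parse_somatic_sv_line(EXAMPLE)
print(record["name"], record["type"], record["chromEnd"] - record["chromStart"])
```

Note that chromEnd minus chromStart reproduces the Size value in the label, which is consistent with BED's zero-based start coordinate.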
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/variants/dbvar.GRCh38.variant_call.somatic.vcf.gz
Date added: FILL
Source: dbVar
All somatic variant calls on GRCh38 from dbVar
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20231030
##reference=GCF_000001405.39
##ALT=<ID=<CNV>,Description="Copy number variable region">
##ALT=<ID=<DEL>,Description="Deletion relative to the reference">
##ALT=<ID=<DUP>,Description="Region of elevated copy number relative to the reference">
##ALT=<ID=<INS>,Description="Insertion of sequence relative to the reference">
##ALT=<ID=<INV>,Description="Inversion of reference sequence">
##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in dbVar">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=DESC,Number=1,Type=String,Description="Any additional information about this call (free text, enclose in double quotes)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=String,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Second (To) Chromosome in a translocation pair">
##INFO=<ID=REGIONID,Number=.,Type=String,Description="The parent variant region accession(s)">
##INFO=<ID=EXPERIMENT,Number=1,Type=Integer,Description="The experiment_id (from EXPERIMENTS tab) of the experiment that was used to generate this call">
##INFO=<ID=EVENT,Number=.,Type=String,Description="The parent variant region accession of a mutation event">
##INFO=<ID=LINKS,Number=.,Type=String,Description="Link(s) to external database(s) - see LINKS tab of dbVar submission template for examples">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
##INFO=<ID=CLNACC,Number=.,Type=String,Description="Accessions and version numbers assigned by ClinVar">
##INFO=<ID=clinical_source,Number=1,Type=String,Description="Source of clinical significance">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates that the record is a somatic mutation. NOT for clinical assertions, i.e. cancer. See also ORIGIN.">
##INFO=<ID=ORIGIN,Number=1,Type=String,Description="Origin of allele, if known; should be one of (biparental, de novo, germline, inherited, maternal, not applicable, not provided, not-reported, paternal, tested-inconclusive, uniparental, unknown, see ClinVar for details). See also SOMATIC">
##INFO=<ID=PHENO,Number=.,Type=String,Description="Phenotype(s) thought to associated with this call. NOT for clinical assertions (submit to ClinVar). (free text, enclose in double quotes)">
##INFO=<ID=SAMPLE,Number=1,Type=String,Description="sample_id from dbVar submission; every call must have SAMPLE or SAMPLESET, but NOT BOTH">
##INFO=<ID=SAMPLESET,Number=1,Type=Integer,Description="sampleset_id from dbVar submission; every call must have SAMPLESET or SAMPLE but NOT BOTH">
##INFO=<ID=VALIDATED,Number=0,Type=Flag,Description="Validated by follow-up experiment">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Variation sequence">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Global Allele count">
##INFO=<ID=AF,Number=.,Type=Float,Description="Global Allele frequency">
##INFO=<ID=AN,Number=.,Type=String,Description="Global Allele name">
#CHROM POS ID REF ALT QUAL FILTER INFO
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/blacklists/ENCFF356LFX.bed
Date added: ???
Source: Encode project
Blacklisted regions provided by the ENCODE project for the GRCh38 assembly
Absolute path: /ssh:shannc@161.200.107.77:/data/project/stemcell/shannc/reference/therapeutics/Clingen-Dosage-Sensitivity-2024-12-16.csv
Date added: 2024-12-16
Source: Clingen curated dosage sensitivity regions
Contains curated assignments of dosage sensitivity to genes and regions
"GENE/REGION","HGNC/ISCA","GRCh37","GRCh38","HAPLOINSUFFICIENCY","TRIPLOSENSITIVITY","ONLINE REPORT","DATE"