Gene Information
- Gene information: gene_info.csv
- GTEx gene expression and disease relevance: gene_disease_gtex_tissue_expression.csv.gz
Filename: gene_info.csv
All the gene features you might need to build your prediction model are stored in this file. It contains the symbol, identifiers (gene locus and protein), type (e.g. protein-coding or not), cellular location and protein class of the genes, as well as their Gene Ontology annotations.
| Column name | Description |
|---|---|
| symbol | gene symbol |
| hgnc_id | HGNC official identifier |
| entrez_id | Entrez gene identifier |
| ensembl_gene_id | ENSEMBL gene identifier |
| uniprot_id | UniProt protein identifier (note that there can be several proteins for one gene) |
| locus_type | type of this genomic locus |
| locus_group | group classification for this genomic locus |
| go_id | Gene Ontology (GO) term identifier |
| go_label | Gene Ontology (GO) term name |
| evidence_type | evidence type according to GO (please refer to the Gene Ontology documentation on evidence codes) |
| reported_count | how many times this type of evidence has been reported (useful for replicability) |
| protein_class | ChEMBL druggable genome classification of the protein |
| target_class | target class |
| topology_type | topology information |
| target_location | cellular location |
| ExAC_LoF | Resilient to Loss of Function according to ExAC |
| pc_mouse_gene_identity | mouse ortholog |
| GTEX_median_all_tissues | median expression across all GTEx tissues |
| description | gene description |
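Because `go_id` and `evidence_type` sit next to per-gene columns such as `symbol`, the file probably holds several rows per gene, one per GO annotation. That layout is an assumption, not something stated above. Under that assumption, a minimal sketch for collapsing the annotations into one list per gene could look like this; the helper names `go_terms_per_gene` and `load_go_terms` are hypothetical:

```python
import csv
from collections import defaultdict


def go_terms_per_gene(rows):
    """Group GO term identifiers by gene symbol.

    `rows` is any iterable of dicts keyed by the column names above,
    for example a csv.DictReader over gene_info.csv.
    Assumption: one row per (gene, GO term) combination.
    """
    terms = defaultdict(set)
    for row in rows:
        go_id = (row.get("go_id") or "").strip()
        if go_id:  # rows without a GO annotation are skipped
            terms[row["symbol"]].add(go_id)
    # Sorted lists give a stable, de-duplicated result per gene.
    return {symbol: sorted(ids) for symbol, ids in terms.items()}


def load_go_terms(path="gene_info.csv"):
    """Read the CSV from disk and group its GO terms by gene symbol."""
    with open(path, newline="", encoding="utf-8") as handle:
        return go_terms_per_gene(csv.DictReader(handle))
```

Genes that have no GO annotation at all do not appear in the returned dictionary, so a lookup with `.get(symbol, [])` is the safer access pattern.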
Filename: gene_disease_gtex_tissue_expression.csv.gz
This compressed file links two pieces of information that are important for building the prediction model:
- A) The relevant tissue for a disease from a systematic mining of the scientific literature (see this scientific report by Vinod Kumar and colleagues at GSK)
- B) The genes specifically expressed in the disease-affected tissue
Hence, you can combine the tissue and expression information in your model to assess whether successful drug targets are also expressed in the disease-affected tissue (note that GTEx measures mRNA expression).
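One way to use this file as a model feature is to look up the expression score for each gene and disease pair. The sketch below is only an illustration: the helper names are hypothetical, and keeping the highest score when a pair occurs for several tissues is an assumption, not the authors' method.

```python
import csv
import gzip


def expression_features(rows):
    """Map (ensembl_gene_id, disease_id) to the highest expression_score seen.

    `rows` is any iterable of dicts keyed by the file's column names.
    Assumption: when a pair is listed for several tissues, keep the maximum.
    """
    features = {}
    for row in rows:
        key = (row["ensembl_gene_id"], row["disease_id"])
        score = float(row["expression_score"])
        features[key] = max(score, features.get(key, score))
    return features


def load_expression_features(path="gene_disease_gtex_tissue_expression.csv.gz"):
    """Read the gzipped CSV directly, without decompressing it on disk."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return expression_features(csv.DictReader(handle))
```

A gene and disease pair that is missing from the file can then be given a default, for instance `features.get((gene_id, disease_id), 0.0)`. Whether zero is a sensible default depends on your model.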
| Column name | Description |
|---|---|
| entrez_id | Entrez gene identifier |
| ensembl_gene_id | ENSEMBL gene identifier |
| symbol | gene symbol |
| disease_id | disease identifier |
| disease_label | disease name |
| tissue_label | tissue name as described in GTEx |
| source | GTEx version 6 |
| max_fold_change | gene expression fold change, reported only if mRNA expression of this gene in the indicated tissue is at least 5-fold above the median tissue and within 5-fold of the highest-expressing tissue |
| expression_score | normalised gene expression score for max_fold_change |
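The tissue-specificity rule described for `max_fold_change` can be read as two thresholds applied to one gene's expression across tissues. The sketch below is one interpretation of that rule. The function name, the parameter names and the use of a plain median over the supplied tissues are assumptions, and it does not reproduce how the file itself was generated.

```python
from statistics import median


def specific_tissues(expression_by_tissue, min_fold=5.0, max_gap=5.0):
    """Return {tissue: fold change over the median tissue} for passing tissues.

    A tissue passes when its expression is
      * at least `min_fold` times the median across tissues, and
      * within a factor `max_gap` of the highest-expressing tissue.
    """
    values = list(expression_by_tissue.values())
    if not values:
        return {}
    mid = median(values)
    top = max(values)
    if mid <= 0:
        return {}  # fold change over a zero median is undefined
    return {
        tissue: value / mid
        for tissue, value in expression_by_tissue.items()
        if value >= min_fold * mid and value * max_gap >= top
    }
```

For example, with hypothetical expression values `{"a": 1, "b": 2, "c": 3, "d": 100, "e": 30}` the median is 3, so tissues `d` and `e` both pass: each is at least 15, and each is within 5-fold of the maximum of 100.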
In the example below, the gene MUC7 is specifically expressed in the minor salivary gland.
gunzip -c gene_disease_gtex_tissue_expression.csv.gz | head -5
0,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,tissue_label,source,max_fold_change,expression_score
0,4589,ENSG00000171195,MUC7,EFO_0007383,Mumps virus infectious disease,Minor Salivary Gland,GTExv6,57385.21,0.99
1,4589,ENSG00000171195,MUC7,EFO_1000384,Mixed Tumor of the Salivary Gland,Minor Salivary Gland,GTExv6,57385.21,0.99
2,4589,ENSG00000171195,MUC7,EFO_0003826,salivary gland neoplasm,Minor Salivary Gland,GTExv6,57385.21,0.99
3,4589,ENSG00000171195,MUC7,EFO_1000344,Major Salivary Gland Carcinoma,Minor Salivary Gland,GTExv6,57385.21,0.99

We can double-check that on the Open Targets portal.
