Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
create_index_pchic_sqlite3_db.py	create_index_pchic_sqlite3_db.py
digest_genome.py	digest_genome.py
get_tissue_genes.py	get_tissue_genes.py
init_eqtl_genotype_db.py	init_eqtl_genotype_db.py
init_eqtl_meta.py	init_eqtl_meta.py
init_fragment_db.py	init_fragment_db.py
init_gene_variant_db.py	init_gene_variant_db.py
init_hic_db_postgres.py	init_hic_db_postgres.py
init_hic_db_sqlite.py	init_hic_db_sqlite.py
init_hic_meta.py	init_hic_meta.py
init_pchic_gene_tables.py	init_pchic_gene_tables.py
init_pchic_meta.py	init_pchic_meta.py
init_pchic_sqlite_db.py	init_pchic_sqlite_db.py
init_variant_fragment_table_pchic.py	init_variant_fragment_table_pchic.py
tidy_promoter_baitmap_file.py	tidy_promoter_baitmap_file.py

CoDeS3D: Contextualising Developmental SNPs in Three Dimensions

Database setup

CoDeS3D requires a PostgreSql database to run. This is because it integrates different types and sources of data, and using a database improves efficiency.

Install PostgreSql if you haven't already.
If you plan to use more than a few Hi-C or eQTL datasets, you may want to create a tablespace for your database. To do that,
1. Create a directory in location with adequate disk space
```
mkdir /path/to/tablespace
chown -R postgres /path/to/tablespace
chrgrp -R postgres /path/to/tablespace
```
1. Start psql as postgres and create tablespace
```
psql -U postgres -h localhost
CREATE TABLESPACE tblspace_codes3d LOCATION '</path/to/tablespace>';
```
1. Confirm that a tablespace is created
```
\db+
```
Create the codes3d user,

CREATE USER codes3d WITH PASSWORD 'your password' CREATEDB;
CREATE DATABASE codes3d OWNER codes3d;

Then create the codes3d_commons database,

CREATE DATABASE codes3d_commons OWNER codes3d tablespace tblspace_codes3d;
\q

Hi-C data pre-processing and installation

First process your raw Hi-C data with the Juicer pipeline. Ensure that you use the GRCh38 reference human genome. The final outputs of the Hi-C processing should have the following columns: name of the paired-ends reads, strand (str), chromosome (chr), position (pos), restriction site fragment (frag), and mapping quality score (score) for both read pairs. (Read pairs with mapping quality score < 30 will be removed during installation.)
Create a sub-directory for the processed Hi-C libraries in codes3d/lib/hic/<restriction enzyme>. After executing the subsequent steps, your Hi-C library directory should have a structure like the following:

codes3d/lib/hic/DpnII/
├── GCA_000001405.15_GRCh38_no_alt_analysis_set.DpnII.fragments.bed
├── GCA_000001405.15_GRCh38_no_alt_analysis_set.DpnII.fragments.db
├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
├── dna.fragments.db
└── hic_data
    ├── A549-ACE2_0h_Ho2020
    │   ├── A549_ACE2_0h_Ho2020_GSM4955378_merged_nodups.db
    │   └── A549_ACE2_0h_Ho2020_GSM4955379_merged_nodups.db
    |
    └── TeloHAEC_Lalonde2019
        ├── TeloHAEC_GSM3593256_merged_nodups.db
        └── TeloHAEC_GSM3593257_merged_nodups.db

Digest the human genome with the restriction enzyme that was used to create your Hi-C library. See python data_preparation/digest_genome.py -h for help.
Load the Hi-C libraries into sqlite3 or PostgreSql databases with the appropriate helper scripts: data_preparation/hic_db_sqlite.py for sqlite3 or data_preparation/hic_db_postgres.py for PostgreSql.
Update the lib/meta_info/meta_hic.txt with information about the Hi-C cells. This file should have the following columns: library, name, cell_type, tissue, disease sex, ethnicity, karyotype, age, enzyme, and tags.
Finally, load the Hi-C meta-info to the codes3d_commons PostgreSql database with data_preparation/init_hic_meta.py.

eQTL data

CoDeS3D requires genotype and expression datasets that are prepared using GTEx's genotype and RNA-seq pipelines. In addition, follow steps 1, 2, and 3 of the GTEx's QTL pipeline to produce your normalised expression and combined covariates files.
Place all the required file in a sub-directory in codes3d/lib/eqtls/. Your file structure should look like this:

codes3d/lib/eqtls/GEUVADIS/
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.lookup_table.txt.gz
├── GEUVADIS.445_samples.rnaseqc_tpm.gct.gz
├── GEUVADIS.445_samples.rnaseqc_tpm_median.gct.gz
├── GEUVADIS_covariates
│   └── GEUVADIS.445_samples.covariates.txt
├── GEUVADIS_expression_matrices
│   ├── GEUVADIS.445_samples.normalized_expression.bed.gz
│   └── GEUVADIS.445_samples.normalized_expression.bed.gz.tbi
└── GEUVADIS_genotypes
    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bed
    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bim
    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.fam
    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.log
    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz
    └── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz.tbi

Update the codes3d/lib/meta_info/meta_eqtls.txt file to include the name, tissue, project, and tags of your eQTL project.
Load the updated meta-info to the codes3d_commons database using the data_preparation/init_eqtl_meta.py helper script.
Create a database for your eQTL project. The name should have the syntax eqtls_<eQTL project name>

CREATE DATABASE eqtls_<eQTL project name> owner codes3d tablespace <tablespace>;

Create a variants table in your new eQTL database using

data_preparation/init_gene_variant_db.py

Create variant_lookup_<restriction fragment> tables for the restriction enzymes that used to generate your Hi-C libraries

data_preparation/init_fragment_db.py

Your eQTL database should have the following tables:

eqtls_geuvadis=> \d
                 List of relations
 Schema |          Name          | Type  |  Owner  
--------+------------------------+-------+---------
 public | variant_lookup_hindiii | table | codes3d
 public | variant_lookup_mboi    | table | codes3d
 public | variants               | table | codes3d
(3 rows)

Finally, update docs/codes3d.conf to point CoDeS3D scripts to your eQTL datasets e.g.

.
.
.
[GEUVADIS]
eqtl_dir=../lib/eqtls/GEUVADIS/
GENE_FP: %(eqtl_dir)sGEUVADIS.445_samples.rnaseqc_tpm_median.gct.gz
GENOTYPES_DIR: %(eqtl_dir)sGEUVADIS_genotypes
GENOTYPES_FP: %(GENOTYPES_DIR)s/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz
COVARIATES_DIR: %(eqtl_dir)sGEUVADIS_covariates
EXPRESSION_DIR: %(eqtl_dir)sGEUVADIS_expression_matrices
.
.

Reference tables

CoDeS3D needs genes and SNP reference files to run. The gene reference file should be in .BED format: chrom, start, end, gene name, gene id (Ensembl). Similarly, you will SNP reference files (e.g. human_9606_b151_GRCh38p7). Lastly, CoDeS3D uses the RsMergeArch.bcp.gz reference to resolve SNPs with multiple rsIDs. Your reference files should have the following structure.

codes3d/lib/reference_files/
├── genes
│   ├── gene_reference.bed
│   ├── gene_reference.db
│   └── gene_reference_sorted.txt
└── snps
    ├── human_9606_b151_GRCh38p7
    │   ├── bed_chr_1.bed
    │   ├── bed_chr_10.bed
    │   ├── bed_chr_11.bed
    │   ├── bed_chr_12.bed
    │   ├── bed_chr_13.bed
    │   ├── bed_chr_14.bed
    │   ├── bed_chr_15.bed
    │   ├── bed_chr_16.bed
    │   ├── bed_chr_17.bed
    │   ├── bed_chr_18.bed
    │   ├── bed_chr_19.bed
    │   ├── bed_chr_2.bed
    │   ├── bed_chr_20.bed
    │   ├── bed_chr_21.bed
    │   ├── bed_chr_22.bed
    │   ├── bed_chr_3.bed
    │   ├── bed_chr_4.bed
    │   ├── bed_chr_5.bed
    │   ├── bed_chr_6.bed
    │   ├── bed_chr_7.bed
    │   ├── bed_chr_8.bed
    │   ├── bed_chr_9.bed
    │   ├── bed_chr_MT.bed
    │   ├── bed_chr_X.bed
    │   └── bed_chr_Y.bed
    ├── human_9606_b151_GRCh38p7_RsMergeArch.bcp.gz
    └── human_9606_b151_GRCh38p7_RsMergeArch.pairs.bcp.gz

Create a genes table in the codes3d_commons database. Make sure the .BED file has gene coordinates in GrCh38.

data_preparation/init_gene_variant_db.py

Create gene_lookup_<restriction enzyme> lookup tables in the codes3d_commons database.

data_preparation/init_fragment_db.py

Your codes3d_commons database should look like the following:

codes3d_commons=> \d
               List of relations
 Schema |        Name         | Type  |  Owner  
--------+---------------------+-------+---------
 public | gene_lookup_hindiii | table | codes3d
 public | gene_lookup_mboi    | table | codes3d
 public | genes               | table | codes3d
 public | meta_eqtls          | table | codes3d
 public | meta_hic            | table | codes3d
(5 rows)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

data_preparation

data_preparation

README.md

CoDeS3D: Contextualising Developmental SNPs in Three Dimensions

Database setup

Hi-C data pre-processing and installation

eQTL data

Reference tables

Files

data_preparation

Directory actions

More options

Directory actions

More options

Latest commit

History

data_preparation

Folders and files

parent directory

README.md

CoDeS3D: Contextualising Developmental SNPs in Three Dimensions

Database setup

Hi-C data pre-processing and installation

eQTL data

Reference tables