ARSENAL

This repo contains all code for the ARSENAL language modeling project.

Pretrained models and data not already released with other publications are available at https://www.synapse.org/Synapse:syn72351987/wiki/. Please see the wiki there for important information on the individual files.

To set up a conda environment for this repo, run:

bash setup_env.sh [ENV NAME]

Model Training

To train an ARSENAL model, you will need a YAML config (.yml) file. Examples can be found in the src/regulatory_lm/config/ folder. This file should contain the parameters for the model's embedder, encoder, and decoder (LM head) modules, the training parameters (learning rate, number of epochs, mask probability, etc.), and the data files (training dataset, reference genome, etc.).
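As a quick sanity check before launching a run, the sketch below verifies that a loaded config covers the groups of settings listed above. The section names (embedder, encoder, decoder, training, data) and all values are hypothetical placeholders, not the repo's actual schema; consult the example configs in src/regulatory_lm/config/ for the real key names. The config is represented as a plain dict, as a YAML loader such as yaml.safe_load would return.

```python
# Hypothetical top-level sections; the real key names live in the example
# configs under src/regulatory_lm/config/ and may differ.
REQUIRED_SECTIONS = ("embedder", "encoder", "decoder", "training", "data")


def missing_sections(config, required=REQUIRED_SECTIONS):
    """Return the required top-level sections that are absent from a config dict."""
    return [name for name in required if name not in config]


# Illustrative values only; none of these are taken from the repo.
example_config = {
    "embedder": {"hidden_dim": 768},
    "encoder": {"num_layers": 12},
    "training": {"learning_rate": 1e-4, "num_epochs": 10, "mask_prob": 0.15},
    "data": {"train_dataset": "peaks.bed", "reference_genome": "genome.fa"},
}

print(missing_sections(example_config))  # ['decoder']
```

A check like this only catches missing groups of settings; it does not validate individual parameter names or values.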

The modeling options are defined in src/regulatory_lm/modeling/model.py, and an exhaustive list of the parameters used can be found in the relevant training script.

To train an ARSENAL model, navigate to the src/ folder and run the following command:

python -m regulatory_lm.modeling.train_peaks_with_repeat_suppression_and_fourier_loss [PATH_TO_CONFIG]

Note that this script implements the Fourier attribution loss described in the paper's addendum. To disable this loss, as is done in the paper's main text, set its loss weight to 0 in the config file. This is the default.

Important Notebooks

We provide notebooks for important use cases of the ARSENAL model:

notebooks/regulatory_region_analysis.ipynb - runs visualization and nucleotide dependency analyses for supplied regulatory regions

notebooks/guided_generation.ipynb - runs supervised model-guided sequence generation as demonstrated in the paper. Can easily be extended to other use cases and objectives. notebooks/chrombpnet_comparison.ipynb demonstrates how to use the ChromBPNet model to evaluate these generations if desired. notebooks/differential_seq_stats.ipynb demonstrates how to use ChromBPNet to predict activities of a series of generated sequences and compare across models.

notebooks/supervised_variant_scoring_african.ipynb and notebooks/supervised_variant_scoring_yoruban.ipynb - run statistics on supervised variant scores (see below). Both require ground truth scores from DART-EVAL.

Downstream Supervised Models

To apply ARSENAL embeddings to train a downstream ChromBPNet model, this repo should be installed (probably in its own environment).

First, run:

export ARSENAL_MODEL_DIR=[PATH TO ARSENAL REPO]

To train an ARSENAL+ChromBPNet model, run the following command:

chrombpnet train --model_type arsenal-chrombpnet --out_dir [OUTPUT DIR] --input_embedding_dim 768 --arsenal_output_type embedding --peaks [PEAK FILE] --negatives [NEGATIVE FILE] --bigwig [BIGWIG FILE] --bias [BIAS MODEL FILE] --fasta [REFERENCE GENOME] --chrom_sizes [CHROM SIZES FILE] --arsenal_model [ARSENAL MODEL .PTH FILE] --arsenal_input_size 350 --num_layers_avg [LAST N EMBEDDING LAYERS TO AVERAGE]

To score variants using this trained model, run the following command:

snp_score -l [VARIANT LIST] -g [REFERENCE GENOME] -s [CHROM SIZES FILE] --model_type arsenal-chrombpnet --model [BEST MODEL .ckpt FILE] --out_prefix [OUTPUT PREFIX/DIR] --total_shuf 2

To train a regular ChromBPNet model for comparison, run the following command:

chrombpnet train --model_type chrombpnet --out_dir [OUTPUT DIR] --peaks [PEAK FILE] --negatives [NEGATIVE FILE] --bigwig [BIGWIG FILE] --bias [BIAS MODEL FILE] --fasta [REFERENCE GENOME] --chrom_sizes [CHROM SIZES FILE]

To score variants using this trained model, run the following command:

snp_score -l [VARIANT LIST] -g [REFERENCE GENOME] -s [CHROM SIZES FILE] --model_type chrombpnet --model [BEST MODEL .ckpt FILE] --out_prefix [OUTPUT PREFIX/DIR] --total_shuf 2
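The two chrombpnet train commands above share most of their flags and differ only in the model type and the ARSENAL-specific options. When scripting both runs for a comparison, a small helper like the sketch below can keep the shared inputs in one place. The helper name, the file paths, and the num_layers_avg value are illustrative assumptions. Only the flag names come from the commands above, and the sketch assumes the CLI accepts its flags in any order.

```python
import shlex


def chrombpnet_train_args(model_type, out_dir, shared, arsenal=None):
    """Build the argument list for `chrombpnet train`.

    `shared` maps flag names (without leading dashes) to values used by both
    model types; `arsenal` holds the extra flags for arsenal-chrombpnet.
    """
    args = ["chrombpnet", "train", "--model_type", model_type, "--out_dir", out_dir]
    flags = dict(shared)
    if model_type == "arsenal-chrombpnet":
        if not arsenal:
            raise ValueError("arsenal-chrombpnet requires the ARSENAL-specific flags")
        flags.update(arsenal)
    for name, value in flags.items():
        args += [f"--{name}", str(value)]
    return args


# Placeholder paths; substitute your own files.
shared = {
    "peaks": "peaks.bed",
    "negatives": "negatives.bed",
    "bigwig": "signal.bw",
    "bias": "bias_model_file",
    "fasta": "genome.fa",
    "chrom_sizes": "genome.chrom.sizes",
}
arsenal = {
    "input_embedding_dim": 768,
    "arsenal_output_type": "embedding",
    "arsenal_model": "arsenal_model.pth",
    "arsenal_input_size": 350,
    "num_layers_avg": 4,  # illustrative value, not a recommendation
}

print(shlex.join(chrombpnet_train_args("arsenal-chrombpnet", "out_arsenal", shared, arsenal)))
print(shlex.join(chrombpnet_train_args("chrombpnet", "out_baseline", shared)))
```

The returned list can be passed to subprocess.run(args, check=True) in an environment where ARSENAL_MODEL_DIR has been exported as described above.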

TF-MoDISco Analysis

To run TF-MoDISco analysis on ARSENAL models, navigate to src/ and run the following command:

bash regulatory_lm/evals/run_modisco_pipeline.sh [PEAK FILE] [ARSENAL MODEL DIR] [CHECKPOINT NUMBER] [OUTPUT DIR] [DATA FORMAT (bed or narrowpeak)] [REFERENCE GENOME] [MEME MOTIF DB]
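Because the pipeline script takes seven positional arguments, it is easy to pass them in the wrong order. The sketch below assembles them in the order given above and rejects a data format other than bed or narrowpeak. The helper name and all paths are hypothetical, and the early format check is an added convenience rather than something the script is known to do.

```python
import shlex

VALID_FORMATS = ("bed", "narrowpeak")


def modisco_pipeline_args(peak_file, model_dir, checkpoint, out_dir,
                          data_format, genome, motif_db):
    """Return the argv list for run_modisco_pipeline.sh, in positional order."""
    if data_format not in VALID_FORMATS:
        raise ValueError(
            f"data_format must be one of {VALID_FORMATS}, got {data_format!r}"
        )
    return [
        "bash", "regulatory_lm/evals/run_modisco_pipeline.sh",
        peak_file, model_dir, str(checkpoint), out_dir,
        data_format, genome, motif_db,
    ]


# Placeholder arguments; the command is meant to be run from the src/ folder.
args = modisco_pipeline_args(
    "peaks.narrowPeak", "runs/arsenal", 10, "modisco_out",
    "narrowpeak", "genome.fa", "motifs.meme",
)
print(shlex.join(args))
```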

DART-EVAL Benchmarking

We include benchmarking on two zero-shot DART-EVAL tasks in the ARSENAL paper. Code to run these tasks is available in the regulatory_lm branch of that repo, and full documentation for running the commands can be found there. Note that you will likely need to install the rotary-embedding-torch package in the environment you use for DART-EVAL.

Analysis With Other DNALMs

We include results from Caduceus and HyenaDNA in the paper.

The notebook notebooks/nuc_deps_other_models.ipynb allows for visualization of regulatory regions using these two models.

To run TF-MoDISco using these models, you can run the following commands:

bash regulatory_lm/evals/modisco_dnalms/run_modisco_pipeline_hyena.sh [PEAK FILE] [OUTPUT DIR] [DATA FORMAT (bed or narrowpeak)] [REFERENCE GENOME] [MEME MOTIF DB]

bash regulatory_lm/evals/modisco_dnalms/run_modisco_pipeline_caduceus.sh [PEAK FILE] [OUTPUT DIR] [DATA FORMAT (bed or narrowpeak)] [REFERENCE GENOME] [MEME MOTIF DB]
