-
Notifications
You must be signed in to change notification settings - Fork 59
Generate contribution score bigwigs
ChromBPNet models have two heads - (1) profile head (to predict the shape of a profile) and (2) counts head (to predict the total counts in a profile). Please use the chrombpnet_nobias.h5 model for this. Here we will provide tools to derive sequence contribution scores for either or both the heads.
chrombpnet contribs_bw [-h] -m MODEL_H5 -r REGIONS -g GENOME -c CHROM_SIZES -op OUTPUT_PREFIX [-pc {counts,profile} [{counts,profile} ...]] [-os OUTPUT_PREFIX_STATS] [-t TQDM]
[-d DEBUG_CHR [DEBUG_CHR ...]]
required arguments:
-m MODEL_H5, --model-h5 MODEL_H5
Path model .h5 file
-r REGIONS, --regions REGIONS
10 column bed file of regions for contribution score predictions
-g GENOME, --genome GENOME
Genome fasta
-c CHROM_SIZES, --chrom-sizes CHROM_SIZES
Chromosome sizes 2 column tab-separated file
-op OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Output prefix for bigwig files
optional arguments:
-pc {counts,profile} [{counts,profile} ...], --profile-or-counts {counts,profile} [{counts,profile} ...]
use either counts or profile or both for running shap (defaults to running both)
-os OUTPUT_PREFIX_STATS, --output-prefix-stats OUTPUT_PREFIX_STATS
Output stats on bigwig
-t TQDM, --tqdm TQDM Use tqdm. If yes then you need to have it installed.
-d DEBUG_CHR [DEBUG_CHR ...], --debug-chr DEBUG_CHR [DEBUG_CHR ...]
Run for specific chromosomes only (e.g. chr1 chr2) for debugging
Note: Note that prefix can include a directory path and prefix for the output file. Make sure that the directory in output_prefix exists. Make sure that regions in the input bed file can be expanded to inputlen (default to 2114) regions without overflowing out of the chromosomes. If this condition is not satisfied the program will return with a error or provided unexpected output. From version 0.1.6 onwards this case is handled by the code.
Note: If you are doing large scale runs, keep in mind that the disk usage of
output_prefix.profile_scores.h5 andoutput_prefix.counts_scores.h5 can be lot around 3Gb each. One way to compress this is to store onlyshapfield in the .h5 files and delete theprojected_shapandrawkey fields. This structure is however needed for TFModisco runs, so you can reconstruct them temporarily as follows - therawfield is simply the one-hot encoding of the regions inoutput_prefix.interpreted_regions.bedand theprojected_shapis equivalent toshapmultiplied by theraw` value. For more information refer to this link - https://github.com/kundajelab/chrombpnet/blob/a101087d90f193e1cc6dba1bad1b5fc9761aa129/chrombpnet/evaluation/interpret/interpret.py#L43.
The following two files are created using the output_prefix as prefix for the output when both profile_or_counts includes both profile and counts. If only one of the option is used, only one of the files is generated. Note that prefix can include a directory path and prefix for the output file.
-
output_prefix.profile_scores.h5: A h5 file containing a dictionary with contribution scores for the profile head of the chrombpnet model. The dictionary consists of the following keysraw,shapandprojected_shap. The values of each of these keys are explained below. All the values have the shape(L,4,inputlen)where L is the number of inputs given inregionsbed file andinputlenis inferred from the input length of themodel_h5. This.h5file can be input tochrombpnet modisco_motifscommand for running TF-Modisco.-
rawkey consists of one hot encoding of a sequence from theregionsbed file. -
shapkey consists of hypothetical contribution scores obtained from doing deepshap. These values will be used in the MODISCO pipeline (for de-novo motif discovery) next. -
projected_shapkey consists of values obtained by multiplyingshapvalues withrawvalues. These values will be used in building browser tracks.
-
-
output_prefix.profile_scores.bw: Contribution scores stored in bigwig format. Can be visualized with dynseq track in Washu Genomics browser. -
output_prefix.counts_scores.h5: A h5 file containing a dictionary with contribution scores for the count head of the chrombpnet model. Has similar structure as theoutput_prefix.profile_scores.h5. This.h5file can be input tochrombpnet modisco_motifscommand for running TF-Modisco -
output_prefix.counts_scores.bw: Contribution scores stored in bigwig format. Can be visualized with dynseq track in Washu Genomics browser. -
output_prefix.interpreted_regions.bed: A bed file with regions used for interpretations. The script filters regions whose resulting sequences dont have the same length asinput_leninferred from the model. This bed file might have regions fewer thanregionsprovided as input.