Bin log2 ratios missing due to preceding gc-masked bins #547

Description

@tridgley

Summary: ~1% of the bins in our sample.cnr files do not have log2 ratios
Version: CNVkit 0.9.7

Details: We first ran the CNVkit batch pipeline to build a flat reference from the hg38 genome and to call CNVs on aligned WGS data from tumor samples only (no normal samples):
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam --normal --fasta ../reference/hg38/genome.fa --output-reference ../CNVref/hg38_flat_reference.cnn --method wgs

To call CNVs on additional samples afterwards, we reused hg38_flat_reference.cnn:
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam -r ../CNVref/hg38_flat_reference.cnn -p 0

At the fix step of the batch pipeline, 4862 bins were discarded because gc>0.7 or gc<0.3, according to mask_bad_bins in fix.py:
    if 'gc' in cnarr:
        mask |= (cnarr['gc'] > .7) | (cnarr['gc'] < .3)
    return mask
This line was output, as expected:
Keeping 582687 of 587549 bins
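
For readers who want to reproduce the masking decision on the bins quoted in this report, here is a minimal sketch of the GC condition alone. It is not the CNVkit implementation: `gc_mask`, `GC_MIN` and `GC_MAX` are hypothetical names, plain Python lists stand in for the `cnarr` columns, and any other criteria that `mask_bad_bins` applies before the quoted lines are left out.

```python
# Thresholds taken from the quoted mask_bad_bins snippet.
GC_MAX = 0.7
GC_MIN = 0.3


def gc_mask(gc_values):
    """Return True for each bin whose GC fraction is above GC_MAX or below GC_MIN."""
    return [(gc > GC_MAX) or (gc < GC_MIN) for gc in gc_values]


# GC values of the three reference bins shown in this report.
reference_gc = [0.707834, 0.630569, 0.519185]
mask = gc_mask(reference_gc)
kept = [gc for gc, bad in zip(reference_gc, mask) if not bad]

print(mask)  # [True, False, False]
print(f"Keeping {len(kept)} of {len(reference_gc)} bins")  # Keeping 2 of 3 bins
```

The log line above is consistent with the reported count: 587549 - 582687 = 4862 masked bins.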

Let bb_i be the position of the i-th bad bin identified by the fix step. After the masking step, the good bin at position bb_i+i (for the first bad bin, the bin immediately after it) gets an empty log2 ratio in the .cnr file. In our case, 4724 bins had missing log2 ratios. For instance, here is the first bad bin in our genome that triggers the problem:

hg38_flat_reference.cnn (Bad bin at 996335-1001339 has gc>0.7):
chromosome start end gene log2 depth gc rmask spread
chr1 996335 1001339 - 0 1 0.707834 0.0603517 0
chr1 1001339 1006344 - 0 1 0.630569 0.267932 0
chr1 1006344 1011348 - 0 1 0.519185 0.60012 0

target.cnn (Bins have sufficient coverage and log2 coverage is present):
chromosome start end gene depth log2
chr1 996335 1001339 - 18.8086 4.23332
chr1 1001339 1006344 - 15.4959 3.95382
chr1 1006344 1011348 - 12.5116 3.64519

sample.cnr (Bad bin at pos bb_1 was masked, but the next good bin 1001339-1006344 at pos bb_1+1 is missing its log2 ratio):
chromosome  start    end      gene  depth    log2       weight
chr1        991331   996335   -     12.497   -0.337158  0.896654
chr1        1001339  1006344  -     15.4959             0.896664
chr1        1006344  1011348  -     12.5116  -0.272669  0.896654

Consequently, this affects the good bins at positions bb_2+2 after the 2nd mask, bb_3+3 after the 3rd mask, and so on. It appears that the old, unmasked indices are being referenced when log2 values are removed, so the offset grows across the genome (see the IGV screenshot below). The problem then cascades into segmentation, resulting in many large segments without log2 ratios.
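
To make the suspected mechanism concrete, here is a toy model of it. It rests on an assumption, not on a reading of the CNVkit source: values stay keyed by their original (pre-mask) positions while the output rows are renumbered from 0, and values are then matched to rows by label. `simulate_label_misalignment` is a hypothetical helper written for this illustration.

```python
def simulate_label_misalignment(n_bins, bad_positions):
    """Return original positions of kept bins that end up without a value.

    Toy model: kept values remain keyed by their ORIGINAL positions, while the
    output table is renumbered 0..len(kept)-1 and filled by label lookup.
    """
    bad = set(bad_positions)
    # Original positions of the bins that survive the mask.
    kept = [i for i in range(n_bins) if i not in bad]
    # Values still carry the old labels; the bad labels are simply absent.
    values_by_old_label = {orig: f"log2@{orig}" for orig in kept}

    missing = []
    for new_row, orig in enumerate(kept):
        # Row label `new_row` is looked up among the old labels. The lookup
        # fails exactly when `new_row` equals a masked original position.
        if new_row not in values_by_old_label:
            missing.append(orig)
    return missing


# 12 bins with bad bins at original positions 2, 5 and 9.
print(simulate_label_misalignment(12, [2, 5, 9]))  # [3, 7]
```

In this toy run the empty bins are 3 = 2+1 and 7 = 5+2, which is the bb_i+i pattern described above. The third bad position (9) lies past the end of the shortened 9-row table and leaves no gap, so in this model the number of empty bins can be smaller than the number of masked bins. That would fit 4724 empty bins versus 4862 masked bins, but it is not a confirmed explanation.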

[Image: IGV screenshot (Screen Shot 2020-10-20 at 7 33 51 PM)]
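
To list the affected bins in a .cnr file, a small stdlib-only check like the one below may help. It is a sketch under two assumptions: the file is tab-separated with the header shown in the sample.cnr excerpt, and a missing log2 ratio appears as an empty field. `find_missing_log2` is a hypothetical helper and not part of CNVkit.

```python
import csv
import io


def find_missing_log2(cnr_lines):
    """Yield (chromosome, start, end) for every row whose log2 field is empty."""
    reader = csv.DictReader(cnr_lines, delimiter="\t")
    for row in reader:
        if not (row["log2"] or "").strip():
            yield row["chromosome"], int(row["start"]), int(row["end"])


# The three sample.cnr rows from this report, with the log2 field left empty
# for the bin at 1001339-1006344.
example = (
    "chromosome\tstart\tend\tgene\tdepth\tlog2\tweight\n"
    "chr1\t991331\t996335\t-\t12.497\t-0.337158\t0.896654\n"
    "chr1\t1001339\t1006344\t-\t15.4959\t\t0.896664\n"
    "chr1\t1006344\t1011348\t-\t12.5116\t-0.272669\t0.896654\n"
)

print(list(find_missing_log2(io.StringIO(example))))
# [('chr1', 1001339, 1006344)]
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `with open("sample.cnr") as fh: print(sum(1 for _ in find_missing_log2(fh)))` to count the affected bins.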
