Summary: ~1% of the bins in our sample.cnr files do not have log2 ratios
Version: CNVkit 0.9.7
Details: The CNVkit batch pipeline was first run to create a flat reference from the hg38 genome and to call CNVs on aligned WGS data for tumor samples only (no normal samples):
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam --normal --fasta ../reference/hg38/genome.fa --output-reference ../CNVref/hg38_flat_reference.cnn --method wgs
For CNV calling of additional samples thereafter, we used the hg38_flat_reference:
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam -r ../CNVref/hg38_flat_reference.cnn -p 0
At the fix step of the batch pipeline, 4862 bins were thrown out due to gc>0.7 or gc<0.3, according to mask_bad_bins in fix.py:
    if 'gc' in cnarr:
        mask |= (cnarr['gc'] > .7) | (cnarr['gc'] < .3)
    return mask
This line was output, as expected:
Keeping 582687 of 587549 bins
Let bb_i be the position of the i-th bad bin identified by the fix step. After the masking step, the next good bin at position bb_i+i gets an empty log2 ratio in the cnr file. In our case, 4724 bins had missing log2 ratios. For instance, here is the first bad bin in our genome that leads to the problem being reported:
hg38_flat_reference.cnn (Bad bin at 996335-1001339 has gc>0.7):
chromosome start end gene log2 depth gc rmask spread
chr1 996335 1001339 - 0 1 0.707834 0.0603517 0
chr1 1001339 1006344 - 0 1 0.630569 0.267932 0
chr1 1006344 1011348 - 0 1 0.519185 0.60012 0
target.cnn (Bins have sufficient coverage and log2 coverage is present):
chromosome start end gene depth log2
chr1 996335 1001339 - 18.8086 4.23332
chr1 1001339 1006344 - 15.4959 3.95382
chr1 1006344 1011348 - 12.5116 3.64519
sample.cnr (Bad bin at pos bb_1 was masked, but the next good bin 1001339-1006344 at pos bb_1+1 is missing its log2 ratio):
chromosome start end gene depth log2 weight
chr1 991331 996335 - 12.497 -0.337158 0.896654
chr1 1001339 1006344 - 15.4959 0.896664
chr1 1006344 1011348 - 12.5116 -0.272669 0.896654
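To list every bin affected in this way, the .cnr file can be scanned for rows with an empty log2 field. The following is a minimal sketch, assuming the file is tab-separated with the header shown above. The function name find_missing_log2 is hypothetical and not part of CNVkit.

```python
import csv
import io


def find_missing_log2(cnr_handle):
    """Return (chromosome, start, end) for rows whose log2 field is empty or NaN."""
    reader = csv.DictReader(cnr_handle, delimiter="\t")
    missing = []
    for row in reader:
        value = (row["log2"] or "").strip().lower()
        if value in ("", "nan"):
            missing.append((row["chromosome"], int(row["start"]), int(row["end"])))
    return missing


# The three sample.cnr rows from this report, with tab separators restored.
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tdepth\tlog2\tweight\n"
    "chr1\t991331\t996335\t-\t12.497\t-0.337158\t0.896654\n"
    "chr1\t1001339\t1006344\t-\t15.4959\t\t0.896664\n"
    "chr1\t1006344\t1011348\t-\t12.5116\t-0.272669\t0.896654\n"
)
missing_bins = find_missing_log2(example)
print(missing_bins)
```

On a real file, the same function could be called with a handle from open("sample.cnr", newline="").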
The same shift then affects the good bins at positions bb_2+2 after the 2nd mask, bb_3+3 after the 3rd mask, and so on. It appears that the old, unmasked indices are being referenced for log CN removal, so the offset grows across the genome (see IGV screenshot below). The problem cascades into segmentation, resulting in many large segments without log2 ratios.
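The suspected mechanism can be illustrated with a toy model. This is a sketch of the hypothesis only, not CNVkit's actual code: it assumes that bad-bin positions are computed on the unfiltered array and then applied to the filtered one. The function name and the GC and log2 values are made up for illustration.

```python
def blank_with_stale_positions(gc_values, log2_values):
    """Drop bad-GC bins, then blank log2 using positions from before the drop.

    Returns a list of (original_position, log2_or_None) for the kept bins.
    """
    bad_positions = [i for i, gc in enumerate(gc_values) if gc > 0.7 or gc < 0.3]
    bad_set = set(bad_positions)
    kept = [[i, log2] for i, log2 in enumerate(log2_values) if i not in bad_set]
    # Hypothesized bug: positions that were valid for the unfiltered list
    # are reused on the shorter, filtered list.
    for pos in bad_positions:
        if pos < len(kept):
            kept[pos][1] = None
    return [tuple(item) for item in kept]


gc_values = [0.5, 0.75, 0.5, 0.5, 0.2, 0.5, 0.5, 0.5]  # bad bins at positions 1 and 4
log2_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
result = blank_with_stale_positions(gc_values, log2_values)
blanked = [pos for pos, log2 in result if log2 is None]
print(blanked)
```

In this toy input the bad bins sit at positions 1 and 4, and the bins that lose their log2 value are the ones originally at positions 2 and 6, that is bb_1+1 and bb_2+2, which is the pattern described in this report.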
