Feature Request: giraffe mapping of CRAMs with multiple RGs #151

@jjfarrell

The present WDL workflow supports only one RG per CRAM. When the workflow creates paired-end FASTQ files, the RG is not preserved, so insert size estimates are based on reads pooled from all RGs in the CRAM. This is problematic when the RGs have different insert sizes. Unfortunately, if the RG is added to the paired-end FASTQ files, the kmc step breaks because it no longer recognizes the read pairs.

If a CRAM has multiple RGs, it should first be split into one BAM file per RG (https://www.htslib.org/doc/samtools-split.html). The paired-end FASTQ files collated from each BAM could then preserve the RG. The giraffe alignment would then be based on the insert size of each RG in the CRAM. The giraffe-mapped BAMs, one per RG, can then be merged into a single BAM with all of the RGs in the header and each read properly tagged with its original RG.
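
A rough sketch of the split, collate, and merge steps described above, written as plain shell functions rather than WDL tasks. The function names, the output file layout, and the explicit sort before merging are assumptions for illustration, not part of the current workflow. The samtools flags should be checked against the installed version, and the CRAM reference is assumed to be resolvable by htslib. The giraffe step itself is left out here.

```shell
# Print the ID of every @RG line in a SAM header read from stdin.
list_rg_ids() {
  awk -F'\t' '$1 == "@RG" {
    for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4)
  }'
}

# Split a CRAM into one BAM per read group, then write paired FASTQs per RG.
# Usage: split_cram_to_fastq_by_rg sample.cram outdir   (hypothetical layout)
split_cram_to_fastq_by_rg() {
  cram="$1"
  outdir="$2"
  mkdir -p "$outdir"

  # %! in the filename format expands to the @RG ID.
  samtools split --output-fmt BAM -f "$outdir/%!.bam" "$cram"

  samtools view -H "$cram" | list_rg_ids | while read -r rg; do
    # Collate so mates are adjacent, then emit R1/R2; unpaired reads are dropped.
    samtools collate -u -O "$outdir/$rg.bam" |
      samtools fastq -1 "$outdir/$rg.R1.fastq.gz" -2 "$outdir/$rg.R2.fastq.gz" \
        -0 /dev/null -s /dev/null -n -
  done
}

# Merge per-RG giraffe BAMs into one BAM; samtools merge expects sorted input.
# Usage: merge_rg_bams merged.bam rgA.giraffe.bam rgB.giraffe.bam ...
merge_rg_bams() {
  out="$1"
  shift
  sorted=""
  for bam in "$@"; do
    samtools sort -o "${bam%.bam}.sorted.bam" "$bam"
    sorted="$sorted ${bam%.bam}.sorted.bam"
  done
  # Unquoted on purpose: word splitting yields one argument per file
  # (this assumes the paths contain no spaces).
  samtools merge -o "$out" $sorted
}
```

Each per-RG FASTQ pair would be mapped separately, so giraffe estimates the fragment length distribution from a single RG. As long as every per-RG BAM carries a distinct RG ID, `samtools merge` should keep all of the `@RG` lines in the merged header.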

This is how the RG is presently specified when running giraffe in the WDL tasks. The RG ID is hard-coded as "1", so the RGs in the original CRAM are lost.

        vg giraffe \
          --progress \
          --read-group "ID:1 LB:lib1 SM:~{in_sample_name} PL:illumina PU:unit1" \
          --sample "~{in_sample_name}" \
          --output-format BAM \
          ~{in_giraffe_options} \
          --ref-paths ~{in_ref_dict} \
          -f ~{in_left_read_pair_chunk_file} -f ~{in_right_read_pair_chunk_file} \
          -x ~{in_xg_file} \
          -H ~{in_gbwt_file} \
          -g ~{in_ggbwt_file} \
          -d ~{in_dist_file} \
          -m ~{in_min_file} \
          -t ~{in_map_cores} > ~{in_sample_name}.${READ_CHUNK_ID}.bam
    >>>
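
Instead of the hard-coded string, the `--read-group` value could be derived from the original `@RG` header line of each RG. The helper below is a minimal sketch of that idea. The function name is hypothetical, and it assumes giraffe accepts the same space-separated `TAG:value` form used in the command above, so an `@RG` field whose value contains spaces (a `DS` description, for example) would need extra handling.

```shell
# Convert tab-separated @RG header lines on stdin into the space-separated
# "TAG:value TAG:value ..." form used by --read-group in the command above.
rg_line_to_giraffe_arg() {
  awk -F'\t' '$1 == "@RG" {
    out = $2
    for (i = 3; i <= NF; i++) out = out " " $i
    print out
  }'
}
```

For example, `samtools view -H rgA.bam | rg_line_to_giraffe_arg` would print one line per `@RG` entry, which could then be passed as `--read-group "..."` when mapping that RG's reads.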
