Skip to content

Bamsurgeon addsnv.py takes 4 hours to add one SNP to a bam file of size 3.1MB ? #225

@TrangNg-Th

Description

@TrangNg-Th

Hi,
I recently used Bamsurgeon to add an SNP at a specific location for a specific region of the bam file. I processed the bam file and extracted a smaller bam file that contains only a small region around the SNP that I'm interested in. However, upon running the script, bamsurgeon addsnp.py takes around 4 hours for one last step: "merging reads".
My question is: What does this merging step do, and is this step important for the final modified bam file?

For clarification, here was the script that I used:
singularity exec
--bind $extracted_reads_dir:/extracted_reads
--bind $ref_fasta_dir:/ref_Macaq
--bind $input_snp_dir:/input_snps
--bind $output_snp_dir:/output_snp
--bind $tmp_dir:/tmp
bamsurgeon.sif python3
-O bamsurgeon/bin/addsnv.py
--procs 1
-v /input_snps/$test_snv
-f /extracted_reads/$filename
-r /ref_Macaq/modif_ref-relabeled.fna
--vcf /tmp/${sample}${chromosome}${start_pos}-${end_pos}modif.vcf
-o /output_snp/${sample}
${chromosome}_${start_pos}-${end_pos}_modif.bam
--tmpdir /tmp
--aligner mem
--seed 1234
--mindepth 3
--coverdiff 0.1
--ignoreref
--picardjar /picard.jar

The messages given by bamsurgeon:
16:25:16.194 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat May 11 16:25:16 EDT 2024] SamToFastq INPUT=/tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.bam FASTQ=/tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.1.fastq SECOND_END_FASTQ=/tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.2.fastq INCLUDE_NON_PRIMARY_ALIGNMENTS=false VALIDATION_STRINGENCY=SILENT OUTPUT_PER_RG=false COMPRESS_OUTPUTS_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat May 11 16:25:16 EDT 2024] Executing as ****** on Linux 4.18.0-513.9.1.el8_9.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.22+7-post-Ubuntu-0ubuntu220.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.27.3
[Sat May 11 16:25:17 EDT 2024] picard.sam.SamToFastq done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=5821693952
INFO 2024-05-11 16:25:17,766 haplo_chr10_10571334_10571334 aligning /tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.1.fastq,/tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.2.fastq with mem
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 76 sequences (11476 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 38, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (380, 434, 479)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (182, 677)
[M::mem_pestat] mean and std.dev: (432.43, 74.17)
[M::mem_pestat] low and high boundaries for proper pairs: (83, 776)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 76 reads in 0.039 CPU sec, 0.041 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 1 -M -Y /ref_Macaq/modif_ref-relabeled.fna /tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.1.fastq /tmp/haplo_chr10_10571334_10571334.tmpbam.49293d2d-1fcf-49cb-8767-f632f79c7457.2.fastq
[main] Real time: 5.075 sec; CPU: 3.588 sec
INFO 2024-05-11 16:25:23,580 haplo_chr10_10571334_10571334 avgincover: 38.000000, avgoutcover: 38.000000
INFO 2024-05-11 16:25:23,753 done making mutations, merging mutations into /extracted_reads/39237_chr10_10566334-10576334_sampled.bam --> /output_snp/39237_chr10_10566334-10576334_modif.bam
INFO 2024-05-11 20:30:28,614 loading donor reads into dictionary...
INFO 2024-05-11 20:30:28,652 secondary reads count:0
INFO 2024-05-11 20:30:28,652 supplementary reads count:0
INFO 2024-05-11 20:30:28,653 loaded 76 reads, (0 excluded, 0 null or secondary or supplementary--> ignored)

INFO 2024-05-11 20:30:28,746 replaced 76 reads (0 excluded )

INFO 2024-05-11 20:30:28,746 kept 0 secondary reads.

INFO 2024-05-11 20:30:28,747 kept 0 supplementary reads.

INFO 2024-05-11 20:30:28,747 ignored 0 non-primary reads in target BAM.

INFO 2024-05-11 20:30:29,559 vcf output written to /tmp/39237_chr10_10566334-10576334_modif.vcf39237_chr10_10566334-10576334_modif.addsnv.snp_chr10_10571334.vcf

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions