tedsharpe left a comment
Just a few suggestions. Might want to omit "/cromwell_root" wherever it appears, as I suggested in the last PR.
| @@ -1,17 +1,11 @@
| version: 1.2
I haven't reviewed this. Going to assume it's OK.
| @@ -0,0 +1,92 @@
| # incorporate samtools and bcftools, plus some additional but most common genomics tools
Not reviewed here -- appears to be the same as the version in sh_ingest_high_pass_inputs, which I've already reviewed.
| @@ -0,0 +1,12 @@
| VERSION = 0.1.2
Not reviewed here -- appears to be the same as the version in sh_ingest_high_pass_inputs, which I've already reviewed.
| @@ -0,0 +1,6 @@
| #!/bin/bash
Not reviewed here -- appears to be the same as the version in sh_ingest_high_pass_inputs, which I've already reviewed.
| time gcloud storage cp ~{bam} ~{local_bam}
|
| mv ~{bai} "~{local_bam}.bai"
| mosdepth -x -n -Q1 ~{prefix} ~{local_bam} || echo "mosdepth failed somehow"
Is it really your intention to swallow mosdepth failures with just a message to the log?
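To illustrate the concern, here is a minimal sketch contrasting the two behaviours. `failing_tool` is a hypothetical stand-in for a failing `mosdepth` call and is not part of the PR; the point is only that `cmd || echo ...` resets the exit status to 0, so the task would be reported as successful.

```shell
#!/bin/bash
# Hypothetical stand-in for a failing `mosdepth` invocation (assumption for illustration).
failing_tool() { return 3; }

# Pattern from the diff: the failure is logged, but the step still exits 0.
swallow() {
    failing_tool || echo "mosdepth failed somehow"
}

# Alternative: log to stderr and propagate the original exit status.
propagate() {
    failing_tool || {
        local rc=$?
        echo "mosdepth failed with exit code ${rc}" >&2
        return "${rc}"
    }
}

swallow_rc=0
swallow >/dev/null || swallow_rc=$?

propagate_rc=0
propagate 2>/dev/null || propagate_rc=$?

echo "swallow: ${swallow_rc}, propagate: ${propagate_rc}"
```

If swallowing the failure is intentional (for example, because coverage is treated as optional), a comment saying so next to the command would make that explicit.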
| samtools view --no-PG -H ~{local_bam} > header.txt
| grep -v "^@SQ" header.txt
|
| # fix SM in the RG lines
| grep "^@RG" header.txt > rg_lines.txt
| if ! grep -qF "SM:" rg_lines.txt; then
| sed -i "s/$/SM:tbd/" rg_lines.txt
| fi
| awk -v sm="~{sample_name}" -F '\t' 'BEGIN {OFS="\t"} { for (i=1; i<=NF; ++i) { if ($i ~ "SM:") $i="SM:"sm } print}' \
| rg_lines.txt \
| > fixed_rg_lines.txt
| cat fixed_rg_lines.txt
|
| # paste things back
| grep -v "^@RG" header.txt > otherlines.txt
| cat otherlines.txt fixed_rg_lines.txt > fixed_header.txt
All this could be replaced with a one-liner:
samtools view --no-PG -H "~{local_bam}" | \
sed -E "/^@RG/s/( SM:[^ ]*|$)/ SM:~{sample_name}/" > fixed_header.txt
# note: the whitespace characters inside the sed expression are literal tabs, not spaces
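A quick way to sanity-check the suggested substitution without a BAM at hand is to feed it a synthetic header. The sketch below wraps the same expression in a hypothetical helper, `fix_rg_sm` (not part of the PR), and builds the tab with `$'\t'` so the tabs survive copy and paste. It assumes the sample name contains no `/` or `&`, which sed would interpret specially.

```shell
#!/bin/bash
# Hypothetical helper wrapping the suggested sed expression.
# Reads a SAM header on stdin and sets (or appends) SM:<sample_name> on every @RG line.
fix_rg_sm() {
    local sample_name="$1"
    local tab=$'\t'
    # On @RG lines, replace the first "<TAB>SM:<value>" field if present;
    # otherwise match the end of the line and append "<TAB>SM:<sample_name>".
    sed -E "/^@RG/s/(${tab}SM:[^${tab}]*|\$)/${tab}SM:${sample_name}/"
}

# Synthetic header: one @RG line with an SM field, one without.
printf '@HD\tVN:1.6\n@RG\tID:a\tSM:old\tPL:PACBIO\n@RG\tID:b\tPL:PACBIO\n' \
    | fix_rg_sm "new"
```

Non-@RG lines pass through unchanged, so the line order of the header is preserved, unlike the version in the diff, which moves all @RG lines to the end.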
| }
| }
|
| task ResetSamplename {
This seems like an awfully dangerous thing to do (arbitrary replacement of sample name). The task doesn't appear to be in use, currently. I hope we'll use it judiciously.
| ###################################################################################
| # generate PBI
| call PBUtils.PBIndex as Index {input: bam = uBAM}
| call BU.GetReadGroupInfo as RG {input: uBAM = uBAM, keys = ['SM', 'LB', 'PU']}
doesn't look like we actually need the SM value
* update Hifiasm to version 0.19.5
* update how Hifiasm outputs are compressed (bgz replacing gz), also
* monitor hifiasm resources usage
* update docker used in PBSV tasks to the version coming with official SMRTLink releases (2.9.0)
* change how the 2-step PBSV process is done (following the recommended way now)
* to version 2.0.7
* using TRF bed
* conditionally phase sv (requires phased bam)
* generates its own vcf.gz and tbi
Overhaul how small variants are called in the WG pipelines
* default to use DV to call small variants, Clair3 analysis needs to be requested explicitly
* retire the Pepper toolchain completely from the CCS pipeline, using DV directly
* for R10.4+ ONT data, also use DV directly
* older ONT data would still use the PEPPER-DV-Margin pipeline
* offers GPU version (though based on, it's not worth it yet)
* update how bam haplotagging is done
Cleanup structural variants calling
* experiment with SNF2 phasing SV calls (implicitly depends on small variants calling now)
* tune PBSV calling
- discover now supports --hifi
- output vcf.gz and tbi
- less verbose logging by default
Misc.:
* optimizations to BAM merging and metrics workflow
* updates coverage collection step
* new R script to visualize log from vm_monitoring_script.sh
* organize dockstore.yml file a bit
* make WDL validation shell script more usable
* update pbmm2 and pbindex to versions in SMRTLink
* update GeneralUtils.wdl
- two bash-like new tasks [CoerceMapToArrayOfPairs, CoerceArrayOfPairsToMap]
- cleanup task CollapseArrayOfStrings
* update resource allocations to tasks
- NanoplotFromBam (also changes docker)
- MosDepthWGS
* incorporates gcloud cli (not just gsutil)
* integrate libdeflate for more speedups
incorporate new tasks and optimize them
* [CountMethylCallReads, GatherReadsWithoutMethylCalls]
from sh_beans
* [GetPileup, BamToRelevantPileup]
from sh_more_atomic_qc
* [GetReadGroupLines, GetSortOrder, SplitNameSortedUbam]
from sh_ont_fc
* [SamtoolsFlagStats, ParseFlagStatsJson]
from sh_trvial_stats
* [FilterBamByLen, InferSampleName]
from sh_seqkit
* [CountAlignmentRecords, StreamingBamErrored, CountAlignmentRecordsByFlag]
from sh_maha_aln_metrics
* [ResetSamplename]
from sh_ingest_singlerg
* [MergeBamsWithSamtools]
from sh_ont_fc.Utils.wdl
* [BamToFastq]
from sh_more_bam_qcs
and optimize it with
sh_ingest_singlerg.Utils.wdl
delete
* GetSortOrder as that's now implemented in GatherBamMetadata
* Drop2304Alignments as that's no longer used
update dockers to the latest
c905a5a to a9ae6c3
CHERRY-PICK FROM VARIOUS QC/METRICS BRANCHES:
* collect information about ML/MM tags in a long-read BAM
(sh_beans)
* a heuristic way to find peaks in a distribution (using dyst)
(sh_dyst_peaker)
* filter reads by length in a BAM
* collect some read quality stats from (length-filtered) FASTQ/BAM
(sh_seq_kit)
* VerifyBamID2 (for contamination estimation)
* naive sex-concordance check
(sh_more_atomic_qc)
* check fingerprint of a single BAM file
(sh_sample_fp)
* collect SAM flag stats
(sh_trivial_stats)
* make BeanCounter finalization optional
(wdl/pipelines/TechAgnostic/Utility/CountTheBeans.wdl)
* custom struct for sub-workflow config using a JSON
(wdl/pipelines/TechAgnostic/Utility/LongReadsContaminationEstimation.wdl)
* make fingerprint checking subworkflow control size filtering
(wdl/tasks/QC/FPCheckAoU.wdl)
(wdl/pipelines/TechAgnostic/Utility/VerifyBamFingerprint.wdl)
* fix a warning by IDE/miniwdl complaining WDL stdlib function length only applies to Array
(wdl/tasks/Utility/BAMutils.wdl)
* various updates to Finalize
(wdl/tasks/Utility/Finalize.wdl)
New tasks in (wdl/tasks/Utility/GeneralUtils.wdl) to
* correctly convert Map to TSV
* concatenate files
a9ae6c3 to 57dcfaf
* AlignAndCheckFingerprintCCS.wdl
* CollectPacBioAlignedMetrics.wdl
* CollectSMRTCellUnalignedMetrics.wdl
57dcfaf to be67190
(CHERRY-PICK & follow up to PR 406)
* SampleLevelAlignedMetrics.wdl
* PBCLRWholeGenome.wdl
* new struct in AlignedBamQCandMetrics.wdl to facilitate as-sub-workflow calling
* change parameters name for fingerprint workflows
be67190 to 295be09
* make saving of reads without methylation SAM tags optional
* better parameter naming
295be09 to 74388b3
ea29e7c to 29c570b
29c570b to 50d7235
34d754d to c4b670f
(affects contamination estimation)
handle (ref-independent, alignment) a single piece of on-instrument demultiplexed CCS bam
c4b670f to 1e26de7
30c1a99 to 3feed59
3feed59 to af68ab7
af68ab7 to 3e93ef1
Following up #418 with more optimizations, and two new pipelines for
ingesting a single demultiplexed PacBio HiFi (u)BAM (aka a readgroup uBAM)