Contributing.md
19 additions & 8 deletions
@@ -2,7 +2,7 @@
2
2
3
3
# `sequence_handling` Design Principles
4
4
5
-
This document records the guiding principles for developing and extending the [`sequence_handling`](https://www.google.com/search?q=%5Bhttps://github.com/MorrellLAB/sequence_handling%5D\(https://github.com/MorrellLAB/sequence_handling\)) pipeline. It is intended as an enduring reference for contributors, particularly during the ongoing modernization of the pipeline to support current GATK versions, long-read sequencing technologies (ONT and PacBio HiFi), and updated tooling.
5
+
This document records the guiding principles for developing and extending the [`sequence_handling`](https://github.com/MorrellLAB/sequence_handling) pipeline. It is intended as an enduring reference for contributors, particularly during the ongoing modernization of the pipeline to support current GATK versions, long-read sequencing technologies (ONT and PacBio HiFi), and updated tooling.
6
6
7
7
These principles apply to all new handlers, to revisions of existing handlers, and to supporting infrastructure such as config files and Slurm job scripts.
8
8
@@ -43,7 +43,7 @@ Consistent terminology prevents ambiguity across documentation, code, and issues
43
43
- **Config files** (e.g., `Config`) define all parameters for a run. A handler should source a single config and rely entirely on the variables defined there.
44
44
- **HelperScripts** (in `HelperScripts`) are scripts for handling multiple samples.
45
45
- **Sequence_Accessories** (e.g., `PanDepthCoverage.sh`) are tools that may be used occasionally or that supplement `sequence_handling`.
46
-
- This separation of concerns -- job logic in Handlers, scheduling logic in SlurmJobScripts, parameters in Config -- should be preserved as the pipeline grows.
46
+
- This separation of concerns - job logic in Handlers, scheduling logic in SlurmJobScripts, parameters in Config - should be preserved as the pipeline grows.
47
47
48
48
When adding support for new sequencing platforms or tools, follow this multi-layer structure. Do not mix scheduling directives into handler logic.
49
49
@@ -54,7 +54,7 @@ When adding support for new sequencing platforms or tools, follow this multi-lay
54
54
All shell code must be safe, readable, and linter-compliant.
55
55
56
56
- Use strict shell options at the top of every handler: `set -euo pipefail`. This ensures the script exits on errors, treats unset variables as errors, and catches failures in pipelines.
57
-
- All handlers should pass [`shellcheck`](https://www.google.com/search?q=%5Bhttps://www.shellcheck.net/%5D\(https://www.shellcheck.net/\)) without warnings. `shellcheck` compliance is the baseline standard.
57
+
- All handlers should pass [`shellcheck`](https://www.shellcheck.net/) without warnings. `shellcheck` compliance is the baseline standard.
58
58
- Where applicable, code should also satisfy [DeepSource](https://deepsource.com/) static analysis alerts.
59
59
- Variables that come from the config should be validated before use (e.g., check that a file path is non-empty and the file exists before passing it to a tool).
60
60
- Avoid `eval`, unquoted variable expansions in paths, and other patterns that are fragile or unsafe in HPC environments.
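The strict-mode and config-validation points above can be sketched as follows. This is a minimal illustration, not code from the pipeline: the helper name `require_file` and the example variable `REF_GEN` are assumptions made for the sketch.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: fail early if a config variable is unset, empty,
# or does not point to a readable file. Takes the variable NAME, not its value.
require_file() {
    local var_name="$1"
    # Indirect expansion with an empty default, so `set -u` does not abort
    # before we can print a useful message. No `eval` is needed.
    local value="${!var_name:-}"
    if [[ -z "${value}" ]]; then
        echo "ERROR: ${var_name} is not set in the config" >&2
        return 1
    fi
    if [[ ! -r "${value}" ]]; then
        echo "ERROR: ${var_name} points to a missing or unreadable file: ${value}" >&2
        return 1
    fi
}
```

A handler could call `require_file REF_GEN` immediately after sourcing the config; under `set -e` a failed check stops the handler before any tool is invoked.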
@@ -102,7 +102,7 @@ Code that runs silently and produces incorrect results is worse than code that f
102
102
Handlers should produce output that is useful for both humans and AI-assisted troubleshooting.
103
103
104
104
- Log the tool version, key parameters, input files, and output files at the start and end of each handler run. This information should appear in both stdout and the Slurm log.
105
-
- Capture metrics that are meaningful for QC: read counts, mapping rates, coverage depth, duplication rates, variant counts, and so forth -- whichever are appropriate for the handler's task.
105
+
- Capture metrics that are meaningful for QC: read counts, mapping rates, coverage depth, duplication rates, variant counts, and so forth - whichever are appropriate for the handler's task.
106
106
- Use structured, grep-friendly log lines where possible (e.g., `[Fastplong] Sample: ${SAMPLE} | Reads before: ${N_BEFORE} | Reads after: ${N_AFTER}`).
107
107
- Do not suppress stderr from tools unless you have explicitly handled the error conditions. Suppressed errors are a common source of silent failures in pipelines.
108
108
- Metadata captured at runtime (tool versions, parameters, timestamps) should be written to a per-sample or per-run log file that persists after the job completes.
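One way to produce the structured log lines described above is a small formatting helper. This is a sketch under stated assumptions: the function name `log_fields` and the `LOG_FILE` variable in the usage note are hypothetical, not part of the existing handlers.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print one grep-friendly line of the form
#   [Tag] key: value | key: value | ...
log_fields() {
    local tag="$1"; shift
    local line="" field
    for field in "$@"; do
        # Add the " | " separator only when something is already on the line.
        line+="${line:+ | }${field}"
    done
    printf '[%s] %s\n' "${tag}" "${line}"
}
```

For example, `log_fields Fastplong "Sample: ${SAMPLE}" "Reads before: ${N_BEFORE}" "Reads after: ${N_AFTER}" | tee -a "${LOG_FILE}"` would write the line to stdout (and therefore the Slurm log) and append it to a persistent per-sample log file.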
@@ -128,7 +128,7 @@ All active development occurs on the `dev` branch.
128
128
129
129
- New handlers, refactored handlers, and bug fixes should be developed on `dev` (or on feature branches that merge into `dev`). The `main` branch reflects the last stable release.
130
130
- `dev` is the integration target: it should remain functional and pass basic testing at all times, even if individual features are incomplete.
131
-
- When a meaningful set of changes has accumulated -- for example, the addition of long-read support, or a GATK version bump -- prepare a new versioned release. Update the changelog, tag the release, and archive via Zenodo so the version is citable.
131
+
- When a meaningful set of changes has accumulated - for example, the addition of long-read support, or a GATK version bump - prepare a new versioned release. Update the changelog, tag the release, and archive via Zenodo so the version is citable.
132
132
- Pull requests into `dev` should include: updated config stanzas (if new parameters are introduced), handler-level comments explaining non-obvious logic, and a brief description in the PR of what changed and why.
133
133
- Breaking changes to the config format or handler interface should be clearly flagged in the changelog and in a migration note for existing users.
134
134
@@ -139,7 +139,7 @@ The `dev` branch is where the next version of `sequence_handling` is being built
139
139
## 8. Documentation
140
140
141
141
- Documentation of the workflow is provided in the `README.md` file. Further details are available in the [wiki](https://github.com/MorrellLAB/sequence_handling/wiki).
142
-
- [`Dependencies`](https://www.google.com/search?q=%5Bhttps://github.com/MorrellLAB/sequence_handling/wiki/Dependencies%5D\(https://github.com/MorrellLAB/sequence_handling/wiki/Dependencies\)) are listed on a dedicated page, which should be updated so users can easily identify the tools needed for successful execution.
142
+
- [`Dependencies`](https://github.com/MorrellLAB/sequence_handling/wiki/Dependencies) are listed on a dedicated page, which should be kept up to date so users can easily identify the tools needed for successful execution.
143
143
- Where necessary, clarifying comments should be included in code, particularly for more complex operations.
144
144
* * *
145
145
@@ -157,21 +157,32 @@ _Last updated: March 2026. To be revised as the pipeline evolves._
157
157
158
158
* * *
159
159
160
-
## Next steps
160
+
## Recently implemented
161
161
162
162
- Update from GATK v4.1 to GATK v4.6.
163
163
164
164
- Add an accessory script to generate an AllSites VCF for [pixy v2.0](https://github.com/ksamuk/pixy). The specific examples for using bcftools mpileup and GATK are [here](https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html).
165
165
166
-
- Document all changes to the new version in a Release file similar to that for [v3.0.0](https://github.com/MorrellLAB/sequence_handling/releases/tag/v3.0.0). Recent updates to the dev branch include replacing' fastp' and' fastplong' with`fastp` and `fastplong` for quality assessment and adapter trimming. This replaces the full front end of the workflow. We have also added `minimap2` for long-read mapping and updated the `config` to include read-mapping presets for PacBio (HiFi reads) and ONT (Q20 reads).
166
+
- Document all changes to the new version in a Release file similar to that for [v3.0.0](https://github.com/MorrellLAB/sequence_handling/releases/tag/v3.0.0). Recent updates to the `dev` branch include switching to `fastp` and `fastplong` for quality assessment and adapter trimming, which replaces the full front end of the workflow. We have also added `minimap2` for long-read mapping and updated the `config` to include read-mapping presets for PacBio (HiFi reads) and ONT (Q20 reads).
167
167
168
168
- Make sure that the concatenation of gzipped fastq files uses `zcat` rather than `cat`. They don't produce the same results.
169
169
170
170
- Determine if new GATK indel or SV callers require any changes in our workflow.
171
171
172
+
172
173
- Assessment (March 2026): no immediate mandatory workflow changes for germline SNP/indel calling. The current `Haplotype_Caller -> Genomics_DB_Import -> Genotype_GVCFs` path remains valid under GATK 4.6.
173
174
- Indel-specific note: GATK 4.6 includes HaplotypeCaller fixes (including long-deletion edge cases), but does not introduce a replacement germline indel caller that requires handler redesign.
174
175
- SV-specific note: GATK 4.6 includes SV tooling improvements (for annotation/concordance), but full production SV calling is still typically handled by dedicated SV workflows/tools (e.g., GATK-SV WDL stack, pbsv, Sniffles2). Integrating a new SV-calling branch in `sequence_handling` is optional future work, not a blocker for this release.
175
176
- Recommended follow-up: add a dedicated design note before implementing any optional SV branch (inputs, caller choice, output normalization, and filtering strategy).
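On the `zcat` versus `cat` item in the list above: `cat` on gzipped inputs writes the compressed members back to back, whereas `zcat` writes the decompressed reads, so the two outputs are not interchangeable. A minimal sketch of a helper that decompresses the inputs and recompresses them into a single gzip stream follows. The function name `concat_fastq_gz` is hypothetical, and the sketch assumes GNU `zcat` and `gzip` are on the `PATH`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: merge gzipped FASTQ files into one gzipped output.
# Usage: concat_fastq_gz OUTPUT.fastq.gz INPUT1.fastq.gz [INPUT2.fastq.gz ...]
concat_fastq_gz() {
    local out="$1"; shift
    if [[ "$#" -eq 0 ]]; then
        echo "ERROR: no input files given" >&2
        return 1
    fi
    # zcat streams the decompressed records of every input, in order;
    # gzip then writes them back out as one compressed stream.
    # With pipefail set, a corrupt input makes the whole pipeline fail.
    zcat -- "$@" | gzip -c > "${out}"
}
```

Recompressing costs CPU time compared with a plain `cat`, so this trades speed for an output that is a single, uniformly compressed stream.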
176
177
178
+
## Next steps
179
+
180
+
- Implement a long-read variant caller, probably [`Clair3`](https://github.com/HKU-BAL/Clair3), that will pick appropriate error models given our various read types already specified in the updated `config` file. The read types we know we need to handle include ONT R9 reads, ONT Q20 reads, and PacBio HiFi. For `Clair3`, it may also be important to track which basecaller was used for ONT reads; that would need to be added to the `config` file. This long-read path should bypass most of the GATK germline workflow and land on a VCF that enters a dedicated downstream filtering step. The recommended joint calling path for Clair3 outputs is [`GLnexus`](https://github.com/dnanexus-rnd/GLnexus) rather than GATK's Genomics_DB_Import/Genotype_GVCFs (since GLnexus is designed for consolidating VCFs from non-GATK callers). If needed, we should also be able to create an "AllSites VCF" similar to that for pixy.
181
+
182
+
- There is also a need to integrate code for some steps in handling Ultima Genomics UG100 resequencing data. For now, this could probably be exclusively in [sequence_accessories](https://github.com/MorrellLAB/sequence_accessories/tree/master). This involves joint variant calling with [GLnexus](https://github.com/dnanexus-rnd/GLnexus) using [GLnexus.sh](https://github.com/pmorrell/Utilities/blob/030effbd0599dd0a0d823cfa19c3bf90bd5e150c/variant_calling/GLnexus.sh#L15) and filtering of those variants using [UG100_filter.sh](https://github.com/MorrellLAB/sequence_accessories/blob/master/Accessories/UG100_filter.sh).
183
+
184
+
- Integrate existing code for `bcftools mpileup` calling as an alternative for "SNP-only" calling from short-read sequencing. The [bcftools_mpileup.sh](https://github.com/pmorrell/Utilities/blob/030effbd0599dd0a0d823cfa19c3bf90bd5e150c/bcftools_mpileup.sh) script could be added as an additional branch in the workflow after BAM files are sorted and indexed.
185
+
186
+
- In `sequence_handling_fastp`, we should probably trim the opening number selection to eliminate " 13 | GBS_Demultiplex (in progress)" and all the Nanopore Workflow options. We are integrating long-read protocols into the main workflow as side channels. Handlers that are no longer deployed can be moved to a "Deprecated" directory.
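For the planned `Clair3` handler, the read type from the `config` has to be translated into Clair3's `--platform` value (`ont`, `hifi`, or `ilmn`). A minimal sketch of that mapping follows; the function name and the read-type labels `ONT_R9`, `ONT_Q20`, and `PACBIO_HIFI` are hypothetical placeholders, not existing config values. Model selection, which may depend on the ONT basecaller, is deliberately left to a separate config variable.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map a config read type to a Clair3 --platform value.
# The read-type labels are placeholders for whatever the config ends up using.
clair3_platform() {
    local read_type="$1"
    case "${read_type}" in
        ONT_R9|ONT_Q20) echo "ont" ;;
        PACBIO_HIFI)    echo "hifi" ;;
        *)
            # Fail loudly rather than guessing an error model.
            echo "ERROR: unsupported read type for Clair3: ${read_type}" >&2
            return 1
            ;;
    esac
}
```

A handler could then pass `--platform="$(clair3_platform "${READ_TYPE}")"` to `run_clair3.sh`, with `READ_TYPE` standing in for the eventual config variable.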