PlasmidScreen detects engineered DNA in sequencing reads using a Kraken2 minimizer/kmer scan (taxid 32630 in sliding windows). Optional codon adaptation index (CAI) scoring uses DIAMOND blastx for ORF coordinates and host taxonomy, compared against host codon usage tables from the Codon Statistics Database.
Design specification (inputs/outputs): docs/DESIGN.md
| Step | Tool | Purpose |
|---|---|---|
| 1 | CSDB build | codon_tables.json + optional taxonomy_parents.json |
| 2 | Kraken2 | Engineered k-mer labels (Natural / Synthetic) |
| 3 | DIAMOND blastx | ORF intervals (qstart/qend) and host taxid (staxids) |
| 4 | CAI | Codon adaptation vs host reference (Natural reads only in full screen) |
| 5 | Combined call | ScreenResult.per_read[].engineered_overall / overall_label |
Each read in screen_result.per_read gets a single engineered_overall flag and overall_label (Natural or Synthetic) from the thresholds you pass to run_screen which can be a result from the two components K-mer scan, and Codon CAI which each are optionally ran. If either are found to contain engineering, ScreenResult object will report engineering.
| Signal | Parameter | Counts toward overall engineered when |
|---|---|---|
| K-mer scan | --threshold, --window_size |
Engineered minimizers (taxid 32630) in a window ≥ threshold → k-mer Synthetic |
| Codon CAI | --codon-cai-threshold (optional) |
CAI < threshold on a scored Natural read (requires codon step) |
- Without
--codon-cai-threshold, overall matches the k-mer scan only (engineered_scan.synthetic_count==overall_synthetic_count). - With
--codon-cai-threshold, a read can be k-mer Natural but overall Synthetic if CAI is low. - Use
screen_result.overall_synthetic_count,engineered_read_ids, andnatural_read_ids_overallfor run-level summaries (not onlyengineered_scan.synthetic_count).
The written engineered report TSV (report.txt) still records k-mer scan labels only. The combined decision lives in the library ScreenResult / ReadFlagDetail objects.
from plasmidScreen import (
build_codon_reference,
analyze_codon_adaptation,
run_screen,
taxids_from_kraken_output,
)
# Step 1 — build reference JSON from Codon Statistics Database
build_codon_reference(
"codon_usage/",
csdb_archive="/path/to/codonstatsdb_March2022.tar.gz", # optional; auto-download if omitted
)
# Full pipeline: Kraken engineered scan + codon optimization detection on Natural reads
screen_result = run_screen(
"reads.fa",
kraken_db="/path/to/kraken/db",
diamond_db="/path/to/protein.dmnd",
codon_usage_dir="codon_usage/",
engineered_report_path="engineered_report.txt", # omit for in-memory results only
codon_cai_engineered_threshold=0.7,
)
# K-mer-only vs combined counts (differ when --codon-cai-threshold is set)
print("k-mer synthetic:", screen_result.engineered_scan.synthetic_count)
print("overall synthetic:", screen_result.overall_synthetic_count)
print(len(screen_result.codon_adaptation))
for r in screen_result.per_read:
print(
r.read_id,
r.overall_label,
r.engineered_overall,
r.engineered_methods,
r.cai_vs_host,
)
print(screen_result.overall_synthetic_count, screen_result.engineered_read_ids)# Default: import all taxids in the CSDB archive (~5.2 GB download on first run)
python plasmidScreen.py build
# Use an existing CSDB archive (no download)
python plasmidScreen.py build \
--csdb-archive /data/codonstatsdb_March2022.tar.gz \
--no-download-csdb
# Subset: explicit taxids or taxids seen in a Kraken classifications file
python plasmidScreen.py build --taxids 9606,511145
python plasmidScreen.py build --taxids-file hosts.txtThe archive is cached under ~/.local/share/PlasmidScreen/ unless --csdb-archive is set.
Output: codon_usage/codon_tables.json and optionally taxonomy_parents.json.
python plasmidScreen.py screen reads.fa report.txt /path/to/kraken/db \
--diamond-db /path/to/protein.dmnd \
--codon-usage-dir ~/.local/share/PlasmidScreen/codon_usage \
--codon-usage-output codon_usage.tsv
# Reuse Kraken classifications (no Kraken subprocess)
python plasmidScreen.py screen reads.fa report.txt /path/to/kraken/db \
--no-run-kraken --kraken-output-path kraken.out \
--diamond-db /path/to/protein.dmnd
# Save DIAMOND TSV for debugging / codon-only reruns
python plasmidScreen.py screen reads.fa report.txt /path/to/kraken/db \
--diamond-db /path/to/protein.dmnd \
--debug-write-diamond-out --diamond-output-path diamond.tsv
# Re-run codon step from saved DIAMOND output
python plasmidScreen.py screen reads.fa report.txt /path/to/kraken/db \
--no-run-diamond --diamond-output-path diamond.tsv \
--no-run-kraken --kraken-output-path kraken.out
# Low CAI can flag overall engineered (on Natural reads with CAI scores)
python plasmidScreen.py screen reads.fa report.txt /path/to/kraken/db \
--diamond-db /path/to/protein.dmnd \
--codon-cai-threshold 0.7 \
--threshold 25
# Engineered k-mer scan only (no codon / DIAMOND; overall == k-mer labels)
python plasmidScreen.py screen reads.fa report.txt /path/to/kraken/db --skip-codon-usageIf a host taxid has no resolvable codon table, screening raises MissingCodonReferenceError — build or extend the reference first (no online fetch at runtime).