Skip to content

1613358894/TypeTreeFlow

TypeTreeFlow

TypeTreeFlow is a command-line LPSN-first type-strain genome acquisition and audit workflow for microbial novel species studies.

The current workflow starts from validly published correct species, discovers NCBI Assembly candidates, enriches candidate evidence from BioSample and culture collection metadata, prepares curator-reviewable type-strain selections, and writes stable manifests, audit tables, and run summaries. It is intentionally guarded: dry runs are safe by default, and real execution requires explicit opt-in flags.

The long-term goal is to collect auditable type-strain genomes and 16S sequences, compare a query genome against references with ANI, build a 16S phylogeny, and report reproducible tables, figures, name maps, and summaries. The current release focuses on the LPSN-first acquisition workflow, strict evidence boundaries, stable I/O contracts, resume behavior, fake-runner tested execution wrappers, and clear safety controls.

GTDB support is retained for legacy/local metadata workflows and as a discovery or evidence layer. It is not the authority for species boundaries in the current LPSN-first route. Manual external type-genome registration is implemented for curator-provided local FASTA files: TypeTreeFlow can validate external_genomes.tsv, plan installs, copy reviewed FASTA files into genomes/references/, and write external manifest.tsv and name_map.tsv records or merge them into an existing manifest when explicitly requested. It does not automate ATCC Genome Portal or other provider portals, does not log in, scrape, purchase, or download from external portals, and does not treat external_genome_id as an NCBI assembly_accession.

Current capabilities

  • Build a species checklist from an offline LPSN cache or guarded official LPSN API access, retaining validly published ICNP correct-name species and writing excluded-taxa audit rows.
  • Preserve user-provided checklist workflows for cases where users already have an authoritative nomenclatural source.
  • Generate NCBI Assembly candidates from a local discovery cache or guarded real NCBI assembly discovery with explicit opt-in.
  • Enrich candidate evidence from local or guarded Entrez BioSample metadata.
  • Parse culture collection deposit IDs from LPSN/checklist, NCBI Assembly, BioSample, strain, organism, and notes text as auditable evidence.
  • Prepare and validate offline strain-selection TSVs from candidate evidence, with strict, balanced, review-only, and representative policies. balanced auto-selects strong type-evidence candidates; representative is an exploratory top-ranked fallback and does not confirm type status.
  • Apply manual curator evidence from a review template when an external source confirms equivalence to an LPSN type-strain deposit.
  • Run one-command genus acquisition dry runs that preserve intermediate checklist, candidate, audit, selection, manifest, name-map, and summary files.
  • Drive guarded NCBI Datasets downloads from selected selection-TSV rows.
  • Register manually reviewed external genome FASTA files into genomes/references/, manifest.tsv, and name_map.tsv without using NCBI assembly accessions.
  • Plan provider registration proposals from curator-authored provider_request.tsv files as review-only outputs under provider/. Provider planning is always dry-run-only: it records counts and proposal rows for human review, but it does not log in, download, install FASTA files, write manifests, or change completion metrics.
  • Keep NCBI Assembly completion separate from external-inclusive completion; external registered genomes can improve local downstream readiness without changing NCBI-only completion counts.
  • Explicitly write completion audit tables from a species checklist and existing manifest with --write-completion-audit.
  • Summarize external registered genome records from an existing manifest in report-only mode, keeping them separate from NCBI Assembly-backed records.
  • Summarize existing provider registration planning outputs in report-only mode as review-only counts, without triggering provider planning, downloads, credential handling, FASTA installation, manifest changes, or completion metric changes.
  • Plan and run guarded resume-mode barrnap, FastANI, Entrez 16S fallback, and MAFFT/trimAl/IQ-TREE wrappers.
  • Select type-material records from local GTDB metadata TSVs for legacy or direct GTDB-based workflows.
  • Write report/summary.md from existing files without making species conclusions.

The CLI can run guarded resume-mode FastANI and write an ANI PNG from parsed results. It does not parse Newick trees. Guarded phylogeny execution writes a Newick treefile only; it does not render a tree figure.

See CHANGELOG.md for release notes.

Documentation

Start with docs/index.md for the full documentation map.

Historical plans and run evidence are indexed from docs/index.md. They are evidence snapshots, not current behavior contracts or required release gates.

Taxonomic scope

TypeTreeFlow's primary acquisition route is LPSN-first. LPSN or an equivalent authoritative checklist defines the expected species set; NCBI Assembly, BioSample, GTDB, and local caches are evidence/discovery layers for available genome and sequence data.

LPSN is the naming authority for validly published and legitimate prokaryotic names. TypeTreeFlow can filter LPSN-derived records to validly published correct-name species, including official correct name (...) annotations, and write excluded synonym, misspelling, not-validly-published, pro-correct, and Candidatus rows for review. It still does not make species conclusions: report/summary.md only reports traceable computational results from recorded manifests and output files.

For formal new-species publication work, review the generated checklist, candidate, selection, source-audit, manifest.tsv, name_map.tsv, and report/summary.md against LPSN or an equivalent authoritative checklist before drawing taxonomic conclusions. Use --source-audit-policy strict for formal downloads or publication-facing analyses when genome and 16S records are mixed from different sources.

Strict type-strain selection requires evidence tying an NCBI Assembly accession to the species type-strain equivalence set. A regular culture collection deposit for the same species is not enough unless it is explicitly part of, or proven equivalent to, that type strain.

Installation

Use Python 3.10 or newer.

python -m pip install -e .
python -m pip install -e ".[test]"

On Windows, editable installs place the typetreeflow console script in your Python Scripts directory. If typetreeflow --help is not found after pip install -e ., confirm that directory is on PATH. In PowerShell, you can print the expected Scripts directory with:

python -c "import site; print(site.USER_BASE + '\\Scripts')"

You can also continue to run the CLI directly:

python typetreeflow.py --help

Core Python dependencies are declared in pyproject.toml. Real guarded downloads additionally require the datasets executable on PATH. Real barrnap execution requires barrnap. Real FastANI execution requires fastANI. Real phylogeny execution requires mafft, trimal, and iqtree2. Some conda IQ-TREE builds install the executable as iqtree; create an iqtree2 alias/symlink or use a build that provides iqtree2. Entrez-backed operations require network access, --email, and the relevant enable flag.

Local environment files

TypeTreeFlow can load local KEY=VALUE environment files before reading environment defaults. If --env-file PATH is supplied, that file is loaded. Otherwise, existing .env, .env.local, typetreeflow.env, or lpsn.env files in the current directory are loaded when present. These files are intended to stay local and ignored by git; do not commit real credentials.

Copy typetreeflow.env.example to a local file such as lpsn.env, fill it in locally, then omit --email when running guarded NCBI/Entrez commands:

Copy-Item typetreeflow.env.example lpsn.env
# Edit lpsn.env locally. Do not commit it.
python typetreeflow.py --env-file lpsn.env --version

Supported environment defaults:

  • TYPETREEFLOW_EMAIL: default for --email.
  • TYPETREEFLOW_API_KEY: default for --api-key.
  • TYPETREEFLOW_LPSN_EMAIL or TYPETREEFLOW_LPSN_USERNAME: official LPSN account identifier.
  • TYPETREEFLOW_LPSN_PASSWORD: official LPSN password.

Quickstart and common commands

Start with the high-level workflow commands. They wrap the lower-level stages, write run_state.json, and keep review/download boundaries explicit:

typetreeflow --help
python typetreeflow.py --help
typetreeflow --version
typetreeflow doctor

Recommended v2.2.0 workflows

For ordinary users, verify-genus is the main entry point. It prepares the LPSN-first checklist, NCBI Assembly candidate evidence, optional BioSample evidence, selection table, manifest, download preflight summary, report, and workflow state in one command.

Plan a genus verification from local caches:

typetreeflow verify-genus Fusobacterium \
  --lpsn-cache data/fusobacterium_lpsn_species_cache.tsv \
  --discovery-cache data/fusobacterium_discovery_records.tsv \
  --biosample-cache data/fusobacterium_biosample_records.tsv \
  --enrich-biosample \
  --policy balanced \
  --source-audit-policy strict \
  --strains-per-species 1 \
  --outdir results/fusobacterium_verify \
  --force

Plan the same shape with guarded live LPSN, NCBI Assembly, and BioSample lookups. Network use is opt-in and requires --email for NCBI/Entrez:

typetreeflow verify-genus Fusobacterium \
  --enable-lpsn-api \
  --enable-ncbi-discovery \
  --enable-biosample-entrez \
  --email user@example.org \
  --policy balanced \
  --source-audit-policy strict \
  --outdir results/fusobacterium_verify \
  --force

By default, verify-genus stops at reviewable planning. Review selection/user_selection.tsv, selection/download_preflight_summary.tsv, manifest.tsv, and report/summary.md before any real download. To accept the generated selection and execute guarded NCBI Datasets downloads, pass both download opt-ins:

typetreeflow verify-genus Fusobacterium \
  --lpsn-cache data/fusobacterium_lpsn_species_cache.tsv \
  --discovery-cache data/fusobacterium_discovery_records.tsv \
  --biosample-cache data/fusobacterium_biosample_records.tsv \
  --enrich-biosample \
  --policy balanced \
  --source-audit-policy strict \
  --outdir results/fusobacterium_verify \
  --auto-accept-selection \
  --enable-downloads \
  --force

The download pair is deliberately strict: --enable-downloads is ignored for real execution unless paired with --auto-accept-selection in verify-genus. For a manual stop, omit both or add --review-required.

After genomes are ready, high-level 16S extraction can be requested with barrnap:

typetreeflow verify-genus Fusobacterium \
  --lpsn-cache data/fusobacterium_lpsn_species_cache.tsv \
  --discovery-cache data/fusobacterium_discovery_records.tsv \
  --biosample-cache data/fusobacterium_biosample_records.tsv \
  --enrich-biosample \
  --policy balanced \
  --outdir results/fusobacterium_verify \
  --auto-accept-selection \
  --enable-downloads \
  --extract-16s barrnap \
  --force

--extract-16s barrnap depends on a genome-ready manifest produced by guarded download or external local FASTA registration, and it requires barrnap on PATH.

Inspect or continue a run:

typetreeflow status --outdir results/fusobacterium_verify
typetreeflow next-step --outdir results/fusobacterium_verify
typetreeflow status --outdir results/fusobacterium_verify --json

Package a reviewed delivery directory:

typetreeflow package-results \
  --outdir results/fusobacterium_verify \
  --delivery-dir results/fusobacterium_delivery \
  --include all

The delivery package includes manifest, selected-accession and evidence summaries, optional reports, copied genome FASTA files, optional 16S FASTA files, and run_state.json when present. It does not copy credentials, environment files, API keys, NCBI ZIP caches, pytest caches, or temporary directories.

Run the release verification matrix for balanced plus representative policies:

typetreeflow verify-release-genus Fusobacterium \
  --lpsn-cache data/fusobacterium_lpsn_species_cache.tsv \
  --discovery-cache data/fusobacterium_discovery_records.tsv \
  --biosample-cache data/fusobacterium_biosample_records.tsv \
  --enrich-biosample \
  --outdir results/v2_2_0_release_verification \
  --policies balanced,representative \
  --force

This writes per-policy outdirs plus results/v2_2_0_release_verification/verification_matrix.tsv and release_verification_summary.md.

Selection policy semantics:

Policy Automatic selection Intended use
strict Only strict-confirmed / LPSN type-strain matches. Formal type-strain download planning.
balanced Only strong type-evidence rows: strict_confirmed or likely_type_material. Default candidate collection when type-material evidence is required but strict LPSN match evidence may be incomplete.
representative Top-ranked fallback per species, including ordinary unconfirmed candidates. Exploratory downloads only; unconfirmed rows are marked representative_only and representative_not_type_confirmed.
review-only None. Complete manual review before selection.

Do not mix the evidence tiers. strict_confirmed is strict type-strain evidence. likely_type_material is a reviewable risk layer, not strict deposit-equivalent completion. representative_only is exploratory only and must not be counted as strict type-strain completion. Representative-only manifests also carry type_confirmation_status=representative_not_type_confirmed and preflight summaries carry representative_only_scope=exploratory_only_not_strict_type_strain_completion.

BioSample enrichment is recommended for strict and balanced selection because BioSample deposit IDs can improve evidence quality. Strict confirmation still requires an accepted NCBI/BioSample deposit ID to match the LPSN/checklist type-strain equivalence set, or accepted curator evidence proving that equivalence. Type-material wording alone remains likely_type_material.

For external provider data, keep planning and local FASTA registration separate. TypeTreeFlow does not automatically log in to, scrape, purchase from, or download from ATCC, DSMZ, JCM, NCTC, or other provider portals. Provider planning is a metadata/review handoff only:

typetreeflow \
  --plan-provider-registration data/provider_request.tsv \
  --outdir results/provider_plan \
  --force

That command writes review files under provider/; it does not write manifest.tsv, name_map.tsv, external_genomes.tsv, installed FASTA files, or NCBI download plans. After a curator legally obtains a FASTA and records the local path, checksum, type-material assertion, and terms review, register it explicitly:

typetreeflow \
  --register-external-genomes data/external_genomes.tsv \
  --outdir results/external_registration \
  --dry-run

typetreeflow \
  --register-external-genomes data/external_genomes.tsv \
  --outdir results/external_registration \
  --merge-manifest

External registered genomes keep provider-native IDs in external fields and manifest notes. They must not be mixed into NCBI assembly_accession.

Advanced/manual recovery commands

The lower-level primitives remain supported for developers, audits, and special recovery work. They are not the recommended entry point for ordinary runs. Prefer verify-genus, status, next-step, and package-results unless you need to repair or inspect one stage in isolation.

Run a minimal legacy/local GTDB dry run:

typetreeflow \
  --genus Aliivibrio \
  --gtdb-metadata tests/fixtures/gtdb_metadata_small.tsv \
  --outdir output_dry_run \
  --dry-run

Audit a user-provided species checklist:

typetreeflow \
  --genus Aliivibrio \
  --gtdb-metadata tests/fixtures/gtdb_metadata_small.tsv \
  --species-checklist examples/species_checklist_minimal.tsv \
  --dry-run

Convert an offline LPSN child-taxa export into a species checklist:

typetreeflow \
  --lpsn-child-taxa examples/fusobacterium_lpsn_child_taxa_minimal.tsv \
  --write-species-checklist results/offline_smoke/species_checklist_from_lpsn.tsv \
  --write-excluded-lpsn-taxa results/offline_smoke/excluded_lpsn_child_taxa.tsv

Generate candidates from a local discovery cache:

typetreeflow \
  --species-checklist results/offline_smoke/species_checklist_from_lpsn.tsv \
  --discover-assembly-candidates \
  --discovery-cache examples/discovery_records_minimal.tsv \
  --outdir results/offline_smoke \
  --dry-run

Prepare an offline selection TSV:

typetreeflow \
  --outdir results/offline_smoke \
  --prepare-selection \
  --selection-policy balanced \
  --strains-per-species 1

Selection policy semantics:

Policy Automatic selection Intended use
strict Only strict-confirmed / LPSN type-strain matches. Formal type-strain download planning.
balanced Only strong type-evidence rows: strict_confirmed or likely_type_material. Default candidate collection when type-material evidence is required but strict LPSN match evidence may be incomplete.
representative Top-ranked fallback per species, including ordinary unconfirmed candidates. Exploratory downloads only; unconfirmed rows are marked representative_only and representative_not_type_confirmed.
review-only None. Complete manual review before selection.

balanced and representative are intentionally different. balanced still requires strong type evidence before preselecting a row. representative may download a useful genome for exploration, but it is not type-strain confirmation and must not be counted as strict completion. Selection TSV rows carry evidence_level values strict_confirmed, likely_type_material, or representative_only; manifest notes carry matching type_confirmation_status values confirmed_type_strain, likely_type_material, or representative_not_type_confirmed. Generated selection rows also include semicolon-delimited ranking_reasons and, for unselected strict/balanced candidates, blocking_reasons to explain ranking evidence and policy blockers without changing selection behavior.

For strict or balanced acquisition, enable BioSample enrichment and guarded BioSample Entrez lookup when real NCBI lookups are appropriate:

typetreeflow \
  --species-checklist results/fusobacterium_acquisition/species_checklist.tsv \
  --discover-assembly-candidates \
  --enable-ncbi-discovery \
  --enrich-biosample \
  --enable-biosample-entrez \
  --email user@example.org \
  --selection-policy balanced \
  --outdir results/fusobacterium_acquisition_refresh \
  --force

For exploratory representative planning, keep it dry-run and review the representative_only rows before treating any output as biological evidence:

typetreeflow \
  --outdir results/offline_smoke \
  --prepare-selection \
  --selection-policy representative \
  --strains-per-species 1 \
  --dry-run

Validate and plan from a curator-edited selection:

typetreeflow \
  --outdir results/offline_smoke \
  --selection-tsv results/offline_smoke/selection/user_selection.tsv \
  --dry-run \
  --force

Manual external genome registration dry run:

typetreeflow \
  --register-external-genomes examples/external_genomes_minimal.tsv \
  --outdir results/external_registration_minimal \
  --dry-run

This validates examples/external_genomes_minimal.tsv and writes external_genome_registration_results.tsv and external_genome_install_plan.tsv for review. Valid rows are planned for genomes/references/<normalized_id>.fna; invalid rows are retained as skipped plan rows. It does not create manifest.tsv, copy FASTA files, or run the NCBI download workflow. The bundled example uses a tiny synthetic FASTA fixture and external_source=external_registered_fixture; it is only for workflow demonstration and is not a real provider or ATCC genome download. Relative genome_fasta_path values are resolved relative to the TSV location. Manual registration assumes the curator has already obtained any external FASTA through permitted means outside TypeTreeFlow. The CLI does not log in to, scrape, purchase from, or download from external provider portals.

Provider registration planning dry run:

typetreeflow \
  --plan-provider-registration provider_request.tsv \
  --outdir results/provider_spike

Minimal synthetic provider planning fixture:

python typetreeflow.py --plan-provider-registration examples/provider_request_minimal.tsv --outdir results/provider_plan_minimal --force

This writes provider/provider_registration_plan.tsv and provider/proposed_external_genomes.tsv for curator review. The command is dry-run-only even without --dry-run. It reads the request TSV and writes review files only; it does not contact provider portals, log in, handle credentials, download or copy FASTA files, write external_genomes.tsv, manifest.tsv, name_map.tsv, or create cache/ncbi/download_plan.tsv. Existing provider planning outputs require --force to overwrite. The bundled minimal provider request is synthetic and provider-neutral; it validates reviewable plan and proposal outputs only, not provider automation. Provider proposal rows do not count toward NCBI Assembly strict completion or external-inclusive completion. If a curator accepts proposed rows, the handoff is manual: prepare a local external_genomes.tsv and run the existing external registration workflow explicitly. Provider planning notes call out missing terms review, local FASTA paths, SHA-256 checksums, and manual-review flags so the handoff can be completed without treating proposal rows as installed genomes.

Install reviewed external genome FASTA files:

typetreeflow \
  --register-external-genomes examples/external_genomes_minimal.tsv \
  --outdir results/external_registration_minimal

Non-dry-run registration writes the same validation results and install plan, then copies only planned FASTA files to genomes/references/ and writes external_genome_install_results.tsv, manifest.tsv, and name_map.tsv. External manifest rows keep assembly_accession empty, use external_registered_genome provenance, and preserve the external genome ID in notes. Invalid rows do not block valid rows from installing or being written to the manifest, but the CLI exits non-zero when any row is skipped as invalid, fails, has an installed checksum mismatch, or no manifest-eligible row remains. This still does not write an NCBI download plan or report.

If manifest.tsv already exists, non-dry-run external registration is protected by default and exits with an error. Use --merge-manifest to append eligible external registered genome rows to the existing manifest while preserving existing NCBI rows and record order:

typetreeflow \
  --register-external-genomes data/external_genomes.tsv \
  --outdir results/fusobacterium_acquisition \
  --merge-manifest

The merge keeps existing records first, appends new external records, skips duplicates with the same external genome ID or installed genome path, and stabilizes only new conflicting record_id or normalized_id values. --force remains the overwrite mode for rebuilding the external registration manifest from install results, and cannot be combined with --merge-manifest. Dry-runs never merge manifest files.

Once the manifest exists, --report-only can generate report/summary.md from existing files. External registered genomes appear in their own section and in provenance counts, but remain separate from NCBI Assembly-backed records. Registered external genomes with installed local FASTA paths can enter downstream planning as mixed-provenance references. If existing provider planning outputs are present under provider/, the same report also adds review-only provider registration planning counts, including proposed rows missing local FASTA paths or checksums. It does not read provider_request.tsv, rerun provider planning, download, log in, install proposed genomes, write manifests, or change completion audit metrics.

typetreeflow \
  --outdir results/external_registration_minimal \
  --report-only

To explicitly write completion audit tables from a checklist and existing manifest, run:

typetreeflow \
  --species-checklist <path> \
  --outdir <outdir> \
  --write-completion-audit

This writes source_audit/completion_audit.tsv and source_audit/completion_summary.tsv. --report-only only consumes an existing completion summary when present; it does not generate the audit. The completion audit reports NCBI Assembly strict completion separately from external-inclusive strict completion. Registered external genomes can improve external-inclusive local readiness after validation and manifest registration, but they do not change NCBI Assembly strict completion. Manifest records marked representative_only, representative_not_type_confirmed, or likely_type_material are risk-layered for review and do not inflate strict completion counts.

Run the LPSN-first genus acquisition path from local caches:

typetreeflow \
  --acquire-genus Fusobacterium \
  --lpsn-cache data/fusobacterium_lpsn_species_cache.tsv \
  --discovery-cache data/fusobacterium_discovery_records.tsv \
  --biosample-cache data/fusobacterium_biosample_records.tsv \
  --enrich-biosample \
  --selection-policy strict \
  --source-audit-policy strict \
  --strains-per-species 1 \
  --outdir results/fusobacterium_acquisition \
  --dry-run

Run the same acquisition shape with guarded live lookups:

typetreeflow \
  --acquire-genus Fusobacterium \
  --enable-lpsn-api \
  --enable-ncbi-discovery \
  --email user@example.org \
  --enable-synonym-discovery \
  --selection-policy strict \
  --source-audit-policy strict \
  --strains-per-species 1 \
  --outdir results/fusobacterium_acquisition \
  --dry-run

For strict or balanced selection, BioSample evidence can improve type-material coverage before final review. Because --acquire-genus is a dry-run orchestrator, use guarded BioSample Entrez during a real discovery/enrichment refresh, then reuse the written caches in the acquisition dry run: Strict confirmation still requires a BioSample/NCBI-derived deposit ID to match an LPSN type-strain ID; type-material wording alone remains likely_type_material.

typetreeflow \
  --species-checklist results/fusobacterium_acquisition/species_checklist.tsv \
  --discover-assembly-candidates \
  --enable-ncbi-discovery \
  --enrich-biosample \
  --enable-biosample-entrez \
  --email user@example.org \
  --selection-policy balanced \
  --outdir results/fusobacterium_acquisition_refresh \
  --force

Drive guarded downloads from a reviewed selection TSV:

typetreeflow \
  --outdir results/fusobacterium_acquisition \
  --selection-tsv results/fusobacterium_acquisition/selection/user_selection.tsv \
  --selection-policy strict \
  --source-audit-policy strict \
  --strains-per-species 1 \
  --enable-downloads \
  --force

Selection-driven dry-runs and real downloads write selection/download_preflight_summary.tsv before the download plan is acted on. It summarizes selected evidence risk and plan status counts, including strict_confirmed, likely_type_material, representative_only, external_registered, download_planned, and download_not_applicable. representative_only is explicitly exploratory and is not strict type-strain completion.

Resume with existing outputs:

python typetreeflow.py --outdir results --resume --dry-run
python typetreeflow.py --outdir results --resume --dry-run --skip-ani --skip-tree

Run a query-genome ANI dry run from a legacy/local manifest path:

typetreeflow \
  --genus Bacillus \
  --gtdb-metadata gtdb_metadata.tsv \
  --query-genome query.fna \
  --query-16s query_16s.fasta \
  --outdir results \
  --threads 8 \
  --dry-run

Write a report from existing files only:

python typetreeflow.py --outdir results --report-only

Manual curator-evidence helper commands are documented in docs/lpsn_first_acquisition.md. Use them when generated candidates require publication, culture collection, or explicit BioSample/INSDC evidence before strict selection.

Safety model

--dry-run is the default development and review mode. It writes plans, manifests, summaries, and fake-runner outputs where appropriate, but does not contact remote services or run guarded external tools.

Real actions require explicit opt-in flags. For analysis/download stages, --dry-run has precedence over every enable flag. A command containing --dry-run --enable-downloads still performs a dry run.

--resume continues from an existing output directory. --force rebuilds outputs that would otherwise be protected. --resume and --force are mutually exclusive.

Candidate discovery and BioSample enrichment are acquisition stages: local-cache modes are offline, while guarded real NCBI/Entrez modes require --email and explicit enable flags. --api-key is optional and is passed through to Biopython Entrez when provided.

The source-audit gate is controlled by:

--source-audit-policy permissive|warn|strict
  • permissive: do not block selection-driven planning on source audit status.
  • warn: default; allow planning but preserve warning rows for review.
  • strict: block source-audit-sensitive rows unless evidence supports the selected genome/source relationship.

Guarded real execution flags

Stage Enable flag Notes
downloads --enable-downloads Guarded NCBI Datasets ZIP download path.
barrnap --enable-barrnap Resume-mode local 16S extraction when barrnap is installed.
Entrez 16S --enable-entrez --email user@example.org Guarded 16S fallback; dry runs never contact Entrez.
BioSample Entrez --enable-biosample-entrez --email user@example.org Guarded BioSample enrichment; local --biosample-cache mode remains offline.
NCBI assembly discovery --enable-ncbi-discovery --email user@example.org Guarded real candidate discovery; local --discovery-cache mode remains offline.
FastANI --enable-fastani Resume-mode local ANI when fastANI is installed and --query-genome is provided.
phylogeny --enable-phylo Resume-mode MAFFT, trimAl, and IQ-TREE wrappers.
LPSN API --enable-lpsn-api Guarded official LPSN API adapter for --lpsn-genus; local --lpsn-cache mode remains offline.

Examples:

typetreeflow \
  --genus Bacillus \
  --gtdb-metadata gtdb_metadata.tsv \
  --outdir results \
  --enable-downloads
typetreeflow \
  --outdir results \
  --resume \
  --enable-entrez \
  --email user@example.org
typetreeflow \
  --outdir results \
  --resume \
  --query-genome query.fna \
  --enable-fastani \
  --skip-tree
typetreeflow \
  --outdir results \
  --resume \
  --enable-phylo \
  --skip-ani

Output directories

TypeTreeFlow writes stable, reviewable outputs under --outdir. The most common top-level paths are:

  • manifest.tsv: selected records and local file paths.
  • name_map.tsv: normalized IDs and source labels.
  • taxonomy/: checklist comparison and species-scope audit outputs.
  • candidates/: assembly candidates and diagnostics.
  • selection/: generated and curator-edited selection TSVs, plus download preflight risk summaries.
  • manual_review_report.md: human-readable report for strict/balanced species left unselected by the manual review template workflow.
  • source_audit/: genome/16S and culture collection source audit rows.
  • provider/: dry-run-only provider registration plans and proposed external genome rows.
  • cache/ncbi/: NCBI download plans, discovery caches, and lookup caches.
  • genomes/references/: installed genome FASTA references.
  • rrna/: 16S plans, extracted sequences, and Entrez fallback outputs.
  • ani/: FastANI plans, raw outputs, summaries, and optional plot.
  • phylo/: 16S alignment, trimming, and IQ-TREE outputs.
  • report/summary.md: traceable report from recorded files.
  • run_summary.json: machine-readable run summary.

See docs/output_layout.md for path contracts, docs/schemas.md for table fields, and docs/statuses.md for status values.

Testing

Install test dependencies, then run:

python -m pip install -e ".[test]"
pytest -q

On Windows environments where the default pytest temp directory is blocked, use a repository-local temporary directory:

pytest tests/test_docs_consistency.py -q --basetemp .pytest_tmp -p no:cacheprovider

Contributing

Before changing behavior, read CONTRIBUTING.md and docs/maintenance.md. Keep README as the user entry point; put detailed design, path, schema, status, release, and historical evidence material in the relevant docs.

Citation

If you use this workflow in a study, cite the repository version or release tag and the external tools/databases used, including LPSN, NCBI Datasets, GTDB, barrnap, FastANI, MAFFT, trimAl, and IQ-TREE as applicable.

License

See LICENSE.

Known limitations

  • TypeTreeFlow does not make taxonomic species conclusions. It reports recorded computational results and audit evidence for human review.
  • LPSN is the nomenclatural authority for the LPSN-first route; GTDB is retained as a metadata/evidence layer and for legacy/local workflows.
  • Official LPSN API use requires the optional lpsn Python client and credentials configured outside this repository.
  • NCBI discovery, BioSample enrichment, Entrez fallback, downloads, barrnap, FastANI, and phylogeny execution are guarded and require explicit opt-in.
  • Guarded real FastANI execution is resume-only, requires --query-genome, and requires the fastANI executable on PATH.
  • Guarded real phylogeny execution is resume-only and requires mafft, trimal, and iqtree2 on PATH.
  • Candidate generation can read a local discovery cache, or contact NCBI only with --enable-ncbi-discovery --email.
  • Synonym-aware candidate discovery is off by default and available only with --enable-synonym-discovery; synonym hits require manual review and remain assigned to the checklist correct species.
  • External registered genomes are summarized from manifest state; merging is limited to appending installed external records to an existing manifest and does not merge external rows into the NCBI download workflow.

About

TypeTreeFlow: tools and workflows for type strain data download and release packaging.

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages