This repository documents the computational workflow for De Novo Genome Assembly using Oxford Nanopore Technologies (ONT) long-read sequencing. It was developed as a pedagogical framework for high school students to explore bioinformatics through the lens of graph theory and inductive logic: rather than "re-sequencing" (mapping reads to a known template), we use the raw physical overlaps of stochastic DNA fragments to infer the original biological architecture of a genome.
Genome assembly provides the "ground truth" of a genome's structure. ONT long reads address the short-read problem by spanning complex repetitive regions that are otherwise computationally intractable to resolve. High-quality assemblies support:
- Phylogenetic Inference: Understanding evolutionary trajectories with high-resolution genomic data.
- Structural Variant Discovery: Identifying large-scale rearrangements (insertions, inversions) that short reads often miss.
- Functional Annotation: Transitioning from "syntax" (nucleotide strings) to "semantics" (gene function).
Unlike short-read assemblers, which typically use De Bruijn graphs, long-read assembly follows the Overlap-Layout-Consensus (OLC) paradigm:
- Overlap: Every read is compared against others to find shared sequences.
- Layout: A graph is constructed where reads are nodes and overlaps are edges. We simplify this graph to find the most likely linear path (a Hamiltonian path).
- Consensus: Because raw ONT reads have a higher error rate, we use statistical "voting" and Bayesian polishing to determine the correct base calls.
[Image of Overlap-Layout-Consensus OLC assembly algorithm]
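In practice, all three OLC stages run inside a single assembler invocation. A minimal Flye sketch (file paths are placeholders; `--nano-hq` may suit newer high-accuracy basecalls better than `--nano-raw`):

```bash
# Draft assembly with Flye: overlap, layout, and consensus in one run.
# data/reads.fastq is a placeholder for your ONT read file.
conda activate assembly_env
flye --nano-raw data/reads.fastq \
     --out-dir output/assembly \
     --threads 16
```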
Unlike traditional short-read sequencing, which functions like a fragmented puzzle, ONT provides long reads that span complex genomic repeats. Three concepts frame this workflow:
- Inductive Assembly: We do not rely on a reference "map." Instead, we use the raw physical overlaps of DNA molecules to infer the underlying biological architecture.
- The Noise-Length Trade-off: While ONT provides massive contiguity (long fragments), it introduces stochastic base-calling errors. Our workflow applies Bayesian polishing to refine the final consensus sequence (see the polishing sketch after this list).
- Biological Semantics: We move from syntax (the raw $A, T, C, G$ code) to semantics (the meaning of genes) through functional annotation.
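As a concrete sketch of that polishing step, assuming Medaka is installed (it appears in Phase III below but is not part of the environments created later) and a Flye draft exists at `output/assembly/assembly.fasta`:

```bash
# Polish the draft consensus with Medaka's neural-network caller.
# Paths are placeholders; Medaka can be added to assembly_env via:
#   conda install -n assembly_env -c bioconda -c conda-forge medaka
medaka_consensus \
    -i data/reads.fastq \
    -d output/assembly/assembly.fasta \
    -o output/polished \
    -t 8
```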
The analysis is performed on the Innovator HPC (SDState) using modular environments to ensure computational reproducibility.
| Phase | Methodology | Tools | Goal |
|---|---|---|---|
| I. Quality Control | Deductive filtering | NanoPlot, FastQC | Ensure data integrity |
| II. Assembly | OLC reconstruction | Flye, Minimap2 | Maximize N50 (contiguity) |
| III. Polishing | Neural network correction | Medaka, Racon | Maximize consensus accuracy |
| IV. Annotation | HMM gene prediction | Prokka, BUSCO | Functional interpretation |
[Image of genome assembly and annotation pipeline workflow]
Follow these steps to establish your workspace and build the necessary computational environments.
```bash
# 1. Sign in to the Innovator Cluster
ssh [your_username]@innovator.sdstate.edu
# 2. Navigate to your research space and create the project skeleton
# This follows a logical hierarchy to separate raw inputs from derived outputs.
mkdir -p genome_project/{data,scripts,envs,output,tutorials,logs}
cd genome_project
# 3. Load the Anaconda module available on Innovator
module load anaconda
# 4. Initialize conda for your shell (only needs to be done once)
conda init bash
source ~/.bashrc
# 5. Create the specialized Assembly environment
# This includes Flye and Raven for OLC assembly, plus Racon and NanoPolish for signal-level refinement
conda create -n assembly_env -c bioconda -c conda-forge \
flye \
raven-assembler \
seqkit \
nanopolish \
racon \
minimap2 \
samtools -y
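# Optional sanity check (a minimal sketch): confirm the assembly environment
# resolved correctly before building the others. These version flags are
# standard for these tools.
conda activate assembly_env
flye --version && minimap2 --version && samtools --version
conda deactivate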
# 6. Create the Evaluation environment
conda create -n eval_env -c bioconda -c conda-forge \
busco \
quast \
seqkit -y
# 7. Create the Circularization environment
conda create -n circulator_env -c bioconda -c conda-forge \
circlator -y
# 8. Create the Annotation environment (including Enveomics/Evo2)
conda create -n annotation_env -c bioconda -c conda-forge \
prokka \
enveomics-python -y
```

On the Innovator HPC, managing resources is an exercise in capacity planning. You must balance the technical requirements of your tools with the available hardware to avoid "over-provisioning" (wasting resources) or "under-provisioning" (causing the tool to crash).
The `sinfo` command is your "logical map" of the cluster. It tells you which partitions are available and how many nodes are currently idle.
- To see all partitions: `sinfo`
- To see detailed node status (CPUs/Memory/GPUs): `sinfo -o "%P %n %c %m %G %a"` (where `%c` = CPUs, `%m` = memory, `%G` = generic resources/GPUs)
Even for light testing or pre-processing with SeqKit, you shouldn't run code on the login node. Instead, start an Interactive Session, which "borrows" a compute node's resources while keeping you in a live terminal.
```bash
# Requesting an interactive session on the 'comp' partition
srun --partition=comp --nodes=1 --cpus-per-task=4 --mem=16G --pty bash
```
- `--pty bash`: This flag is critical; it opens a pseudo-terminal so you can interact with the node.
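For long-running steps like the assembly itself, a batch script is a better fit than an interactive session. A minimal `sbatch` sketch (resource sizes follow the table below; the partition name and time limit are assumptions to adjust for Innovator):

```bash
#!/bin/bash
#SBATCH --partition=comp          # assumed partition; check with sinfo
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=12:00:00           # assumed wall-clock limit
#SBATCH --job-name=flye_assembly
#SBATCH --output=logs/%x_%j.out   # logs/ was created in the project skeleton

module load anaconda
conda activate assembly_env       # if this fails in batch mode, try: source activate assembly_env

flye --nano-raw data/reads.fastq \
     --out-dir output/assembly \
     --threads "$SLURM_CPUS_PER_TASK"
```

Save it as `scripts/assembly.sbatch` and submit with `sbatch scripts/assembly.sbatch` from the project root.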
To select the appropriate amount of resources, look at the algorithmic complexity of the tool you are using.
- Tool Type: Does the tool support multi-threading (e.g., `flye --threads` or `prokka --cpus`)?
- Strategy: Most assemblers (`Flye`, `Raven`) scale well up to 16 or 32 CPUs. Beyond that, you often hit "diminishing returns" where the overhead of managing threads outweighs the speed gain.
- Formula: If you have 1 million reads, 8–16 CPUs is usually a "sweet spot" for assembly.
- The "Rule of Thumb": Assembly is memory-intensive because the overlap graph must be stored in RAM.
- Calculation: For bacterial genomes, 32GB to 64GB of RAM is usually sufficient. For larger genomes, a rough estimate scales with the size of your input dataset (see the SeqKit check below).
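To apply these rules, first measure the input. A quick SeqKit check (the file path is a placeholder):

```bash
# -a adds extended statistics (N50, quartiles) to the summary of
# read counts, total bases, and read lengths.
conda activate assembly_env
seqkit stats -a data/reads.fastq
```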
- Tools: Most traditional assemblers are CPU-bound. You only need a GPU for:
  - Basecalling (e.g., `Guppy` or `Dorado`).
  - Polishing: certain versions of `Medaka` use GPU acceleration (via TensorFlow) to speed up neural network inference.
- Command: Use `--gres=gpu:1` in your `sbatch` or `srun` command if the tool specifically supports CUDA (example below).
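For example, a GPU-backed interactive session might look like this (the `gpu` partition name is an assumption; confirm with `sinfo`):

```bash
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --pty bash
```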
| Tool | Recommended CPUs | RAM | GPU Needed? |
|---|---|---|---|
| SeqKit | 4–8 | 8GB | No |
| Flye / Raven | 16–32 | 64GB+ | No |
| NanoPolish | 8–16 | 32GB | No |
| Medaka | 4–8 | 16GB | Yes (if available) |
| Prokka | 8 | 16GB | No |
Ask your mentee to run a small job and then inspect it with `seff [JOB_ID]` after it finishes.
- The Goal: If `seff` shows "CPU Efficiency: 10%", they requested too many CPUs.
- The Logic: Why is it ethically and technically better to request exactly what you need rather than "the maximum possible"? (Hint: think about the queue wait times for other researchers.)
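A minimal post-job check (the job ID is a placeholder for the one returned by `sbatch`):

```bash
# Efficiency summary: compare requested vs. used CPU and memory.
seff 123456
# Fuller accounting record from Slurm's database:
sacct -j 123456 --format=JobID,Elapsed,MaxRSS,TotalCPU,State
```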