TRITEX Barley Assembly Protocol

During my time at IPK, Germany (June & July 2025), we sequenced and assembled a barley landrace following the TRITEX protocol.

This repository contains the scripts used to assemble the genome from HiFi and Hi-C reads.

The TRITEX protocol is well-documented and clearly explained. However, executing the code is not always straightforward. As of now, I understand that the Genetics and Genomic Crop Research group is actively working to improve and update the protocol, making it easier to execute.

In the meantime, this repository provides the scripts and Conda environments I used throughout the entire process, which may help others attempting a similar assembly.

Contact me for any doubt: [email protected] or find me on twitter

We will refer to our barley landrace as GDB_136, used MorexV3 as reference genome, and we based our work on the original TRITEX repository. However, we used modified scripts and Bash code to better follow and adapt the pipeline.

Clone the original repository and update the scripts according to your software paths, project names, data specifications, and locations.

Below is the file and directory structure of the working environment:

tritex_barley_assembly/
├── 01.5_splitFASTQ.sh
├── 01_bam2fastq.sh
├── 02_contig_assembly.sh
├── 03_check_contig_stats.sh
├── 04_gfa2fa.sh
├── 05_digest_enzyme_HiC.sh
├── 06_HiC_mapping.sh
├── 07_prepare_reference.sh
├── 08_guide_map_create.R
├── 09_map_guide_map.sh
├── 10_create_assembly_object.R
├── 11_break_chimeric_scaffold.R
├── 12_fix_chimeras_scaffold.R
├── 13_make_HiC_matrix.R
├── 14_second_round_scaffold.R
├── 15_500kb_scaffolds.R
├── 16_collinear_plot.R
├── 17.05_MapInspectorShiny.R          # This steps are optional, and up to today, I have not been able to run it successully
├── 17_MapInspectorShiny.R             # This steps are optional, and up to today, I have not been able to run it successully
├── 18_pseudomol_stats.R
├── 19_compile_pseudomol.R
#
#                                      # Until here you can see our own scripts to follow the pipeline. Down here find all data, outputs and TRITEX repository.
#
├── assembly/                          # Empty folder to store assembly data
├── HiC/                               # Folder containing the raw HiC data
├── Hi-Fi/                             # Folder containing the raw Hi-Fi data
├── HOWTO.txt
├── mapping/                           # Empty folder that will contain the HiC mapped data to the scaffolds
├── MorexV3.fa                         # The reference genome of the species
├── output/                            # Empty folder to store output files
├── pseudomolecules/                   # Empty folder to store pseudomolecules compiled
├── test_env.yml                       # yml file to create a conda environment to perform (almost) whole pipeline
└── tritexassembly.bitbucket.io/       # The repository with the scripts, download and modified from IPK repositoy

Feel free to adapt this structure to your own naming conventions or computational setup. The initial data folders should look like:

IPK/
├── HiC/
│   ├── 3401768_GDB136_S3_L002_R1_001.fastq.gz  
│   └── 3401768_GDB136_S3_L002_R2_001.fastq.gz  
└── Hi-Fi/
    └── Gm84096_250523_032626_s1.hifi_reads.bam

Note: you may have the Hi-Fi data in fastq format. This is allright too

Download as well the reference genome in the main folder, MorexV3 in this case for barley.

Start cloning the repository:

git clone https://github.com/jsarriaa/TRITEX_Barley_assembly_protocol.git

And create and activate the environment to work on:

conda env create -f test_env.yml
conda activate IPK-tritex

NOTE

TRITEX was originally prepared to work with some proprietary software such as NovoSort. In this case, we have adapted it to work with open-source alternatives.

Initially, the pipeline was designed for macOS systems, with some scripts using macOS-specific syntax (e.g., .zsh instead of .sh). Some of those scripts have been modified to work on Linux servers.

R scripts were originally intended to be run in RStudio. However, this may not be convenient when working on remote servers (via SSH), where graphical interfaces are unavailable. All custom R scripts have been adapted to run in the command line using Rscript. Feel free to adapt them back to RStudio if you prefer working in that environment.

NOTE 2
These are the recommended software versions for the TRITEX pipeline, as suggested by the authors:

Tools

Z shell, v5.0.2

GNU Parallel, v20150222

Hifiasm, v0.15.1-r334

gfatools, v0.5

SAMtools, v1.9

bgzip & tabix (via BCFtools), v1.9

Seqtk, v1.0-r76

Bedtools, v2.26.0

R, v3.5.1

Pigz (parallel gzip), v2.3.4

Minimap2, v2.14

Cutadapt, v1.15

EMBOSS, v6.6.0

BBMap, v37.93

faToTwoBit & twoBitInfo (UCSC tools), v3.69

If you don't have access to the commercial NovoSort, you can follow the alternative SAMtools-based approach provided in the TRITEX shell scripts.

Required R Libraries

data.table, v1.11.8

stringi, v1.2.4

igraph, v1.2.2

zoo, v1.8-3

parallel (included in R-core ≥ 2.14)

Transform Hi-Fi format (if needed)

You may have the Hi-Fi data in a .bam file instead of a compressed .fastq. Run this script to addapt the data for TRITEX protocol:

 ## Edit variables on the script
./01_bam2fastq.sh

Split Hi-Fi data

Hi-Fi is heavy, and split it make the scaffold building easier and more optimal. Run:

./01.5_splitFASTQ.sh Hi-Fi/<input.fastq>

It is prepaired to split the file into 20 parts. Edit line parts=20 if you desire.

Contigs building

Using Hifiasm build contigs from the Hi-Fi.

Set up properly the directory and variables and run:

./02_contig_assembly.sh

Check and generate your contigs stats, runnning:

./03_check_contig_stats.sh

It will outcome files summarizing the generated contigs from hifiasm, and the uncontigs ones as well.

Interesting data, as the N50, N90, mean and contig size can be extracted from it.

Then, transform de gfa file into fasta format, also seting contigs names:

./04_gfa2fa.sh

Map HiC data

First, you need to digest in silico the hifiasm assembly with the restriction enzyme used for the preparation of the Hi-C libraries.

At the following script remember to properly indicate which enzyme has been used to prepare your library.

./05_digest_enzyme_HiC.sh

Take care about the script at bitbucket/shell/digest_emboss.zsh. You may have to change the paths for your exectuables, such as restrict, bedtools and rebase.

Now, map the HiC data. Ensure at the script set up the variables, enzyme used, and executables paths. Output will be stored at mapping folder.

./06_HiC_mapping.sh

Creating the guide map table

From a reference genome, create a guide map table. Take a look with calm to the TRITEX guide map documentation to understand the database.

Obtain from a reference genome a single copy 100bp regions. In this case, barley, we are using MorexV3.

bitbucket/miscellaneous/mask_assembly.zsh script requires to change the paths to the executables.

Also remember to change at 07_prepare_reference.sh your own paths and variables and run:

./07_prepare_reference.sh

After masking, prepare the guide map. Take a look and set up the variables, and prepare the output files names.

./08_guide_map_create.R

Mapping the guide map sequences to the genome

Modify with your variables and run:

./09_map_guide_map.sh

From now on, work will be mainly performed at R. You must start creating the assembly object as an R database (remember to set variables before running):

Rscript 10_create_assembly_object.R

All ouptuts from now on will be located at /pseudomolecules folder. Future steps will generate more databases. It is recomended to name databases to be able to distingish (for instance, adding the date).

Breaking chimeric scaffold

By mistake scaffolds wrongly constructed may appear (in example, sequences from diference chromosomes but a similar region might be combined together, and must be corrected manually).

For this, generate diagnostic plots:

Rscript 11_break_chimeric_scaffold.R

It generate plots for all contigs of more than 1 Mb (do not forget to change variables).

The curation step must be performed by you manually. Take a carefully view to the generated pdf. You should find chimeras such as pages 3, 6 and 9 of this example.

Then, specify the bin positions in where the chimeras have to be broken. Do it editing the script 12_fix_chimeras_scaffold.R. Modify the variable i as the page you found an chimera, and and set and estimated range where you observe the chimera. Add as many times you need this block of code. Once you have all ready, run:

Rscript 12_fix_chimeras_scaffold.R

Now, you will have a new assembly object. Check at the new pdf produced that now the chimeric scaffolds are now properly divided, and carry on:

Make HiC contact matrix

Prepare the variables at the file and run the script:

Rscript 13_make_HiC_matrix.R

Check the produced pdfs and excel sheet.

If everything is allright, carry on. Let's do a second round of looking to chimeras. Now, decreasing down to at least 500kb scaffolds. Prepare and run the script to first fixing chimeras, or plotting specific scaffolds:

Rscript 14_second_round_scaffold.R

Then: (remember to use the new updated database, and set up variables)

Rscript 15_500kb_scaffolds.R

Colinear plot

This step is not mandatory but recomended. Looks for collinearities between the scaffolds, helping to spot inverted or missplaced things. Pay attention at the number of chromosomes that your species have. Once set everything up, run:

Rscript 16_collinear_plot.R

Manual scaffold curation: Map Inspector Shiny

IMPORTANT NOTE: Those following scripts have not been tested. Use on your own knowledge. Recomended trying to follow the original TRITEX guide.

At the colinear plots you can observe inverted scaffolds, like at this example.

When you spot one of this scaffolds to be corrected, go to the assembly excel file (produced at steps as the 15 script) and modify the new orientation value for the scaffold.

Download and use Shiny, you can follow this manual recomended by TRITEX pipeline.

You can try to run 17.05_MapInspectorShiny.R and 17_MapInspectorShiny.R with Rscript but note I have not tested them properly, because Map Inspector Shiny can not be installed with the specifications provided by the pipeline documentation. Try it on your own if you consider.

Compile pseudomolecules

Note: at scripts I indicate you should work with the corrected database by Map shiny inspector if you have, if not (as me) work with the obtained after the script nº15.

First, lets get some data from the outcome pseudomolecules. Prepare the script with the variables (look over the script that you have to specify the number of chr of your species):

Rscript 18_pseudomol_stats.R

And compile it:

Same, set uo variables and work over the corrected database You must prperly modify the file: /tritexassembly.bitbucket.io/R/pseudomolecule_construction.R and edit the compile_psmol function with the proper paths as variables.

Rscript 19_compile_pseudomol.R

NOTE: take a lot of care about the executable version of samtools. It is different (and must be) than last versions and the recomended one. It worked for me installing samtools version 1.21 in another path different and used it at this step, while the rest of the pipeline only worked for me at version 1.22.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

TRITEX Barley Assembly Protocol

Transform Hi-Fi format (if needed)

Split Hi-Fi data

Contigs building

Map HiC data

Creating the guide map table

Mapping the guide map sequences to the genome

Breaking chimeric scaffold

Make HiC contact matrix

Colinear plot

Manual scaffold curation: Map Inspector Shiny

Compile pseudomolecules

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
Hi-Fi		Hi-Fi
HiC		HiC
R		R
assembly		assembly
mapping		mapping
miscellaneous		miscellaneous
output		output
pseudomolecules		pseudomolecules
shell		shell
tritexassembly.bitbucket.io		tritexassembly.bitbucket.io
01.5_splitFASTQ.sh		01.5_splitFASTQ.sh
01_bam2fastq.sh		01_bam2fastq.sh
02_contig_assembly.sh		02_contig_assembly.sh
03_check_contig_stats.sh		03_check_contig_stats.sh
04_gfa2fa.sh		04_gfa2fa.sh
05_digest_enzyme_HiC.sh		05_digest_enzyme_HiC.sh
06.log		06.log
06_HiC_mapping.sh		06_HiC_mapping.sh
07_prepare_reference.sh		07_prepare_reference.sh
08_guide_map_create.R		08_guide_map_create.R
09_map_guide_map.sh		09_map_guide_map.sh
10_create_assembly_object.R		10_create_assembly_object.R
11_break_chimeric_scaffold.R		11_break_chimeric_scaffold.R
12_fix_chimeras_scaffold.R		12_fix_chimeras_scaffold.R
13_make_HiC_matrix.R		13_make_HiC_matrix.R
14_second_round_scaffold.R		14_second_round_scaffold.R
15_500kb_scaffolds.R		15_500kb_scaffolds.R
16_collinear_plot.R		16_collinear_plot.R
17.05_MapInspectorShiny.R		17.05_MapInspectorShiny.R
17_MapInspectorShiny.R		17_MapInspectorShiny.R
18_pseudomol_stats.R		18_pseudomol_stats.R
19_compile_pseudomol.R		19_compile_pseudomol.R
README.md		README.md

jsarriaa/TRITEX_Barley_assembly_protocol

Folders and files

Latest commit

History

Repository files navigation

TRITEX Barley Assembly Protocol

Transform Hi-Fi format (if needed)

Split Hi-Fi data

Contigs building

Map HiC data

Creating the guide map table

Mapping the guide map sequences to the genome

Breaking chimeric scaffold

Make HiC contact matrix

Colinear plot

Manual scaffold curation: Map Inspector Shiny

Compile pseudomolecules

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages