Misassembly Correction
Currently, RaGOO supports two ways of correcting potentially misassembled contigs. The first is chimeric contig correction, which relies solely on assembly-to-reference alignments to identify chimeric contigs. The second is misassembly correction, which first uses assembly-to-reference alignments to find potential misassemblies, then aligns sequencing reads to the assembly to validate the proposed breakpoints. The former is much faster but less robust. The latter is slower and requires additional data, but allows for more confidence in the misassembly corrections. See below for further details regarding each.
As stated above, misassembly correction uses assembly-to-reference alignments to identify potentially misassembled contigs. It then aligns user-provided sequencing reads to the assembly to validate potential breakpoints in the assembly. As with chimeric contig correction, one can provide a GFF file, and contigs will not be broken within any feature specified therein. The GFF file will also be lifted over to the new corrected assembly. To invoke this mode in RaGOO, one must supply sequencing reads with the -R parameter. These can be in fasta or fastq format, and may be gzipped. As of today, only one file is allowed, so please concatenate your data into one file if necessary. In the near future, I will add support for multiple files.
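Since only one reads file is accepted, several files have to be merged first. The following is a minimal sketch of one way to do that; combine_reads is a hypothetical helper (not part of RaGOO), and the file names in the usage comment are placeholders. It assumes all inputs are of the same kind (all plain or all gzipped). Concatenated gzip files form a valid multi-member gzip stream, which most tools that read gzipped input accept, but it is worth confirming on your own data.

```shell
# combine_reads is a hypothetical helper, not part of RaGOO.
# Usage: combine_reads OUTPUT INPUT [INPUT ...]
combine_reads() {
    out="$1"
    shift
    # Plain concatenation: valid for fasta/fastq text, and also for
    # gzipped inputs, since gzip members can be joined back to back.
    cat "$@" > "$out"
}

# Example (placeholder file names):
#   combine_reads all_reads.fastq.gz lane1.fastq.gz lane2.fastq.gz
```

The combined file can then be passed to RaGOO with the -R parameter.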
In addition to the -R parameter, the user must specify the type of sequencing reads with the -T parameter. As of today, only short reads ('sr') and accurate long reads ('corr', such as error-corrected long reads or CCS reads) are allowed. Though both have reasonable runtimes, short reads are currently the slower option. For a tomato genome (~1 Gbp) and ~40X coverage of short-read data, the whole pipeline takes a few hours. I am actively working on ways to allow for the use of long noisy reads, but it has proven to be a little less straightforward.
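Putting the -R and -T parameters together, an invocation might look like the sketch below. The wrapper run_misasm_correction is hypothetical (not part of RaGOO); it only checks that the read type is one of the two values described above before calling ragoo.py. The positional order (contigs, then reference) is assumed to match the chimeric correction commands shown later on this page.

```shell
# run_misasm_correction is a hypothetical wrapper, not part of RaGOO.
# Usage: run_misasm_correction READS TYPE CONTIGS REFERENCE
run_misasm_correction() {
    reads="$1"
    read_type="$2"
    contigs="$3"
    reference="$4"
    # Only 'sr' (short reads) and 'corr' (accurate long reads) are supported.
    case "$read_type" in
        sr|corr) ;;
        *)
            echo "unsupported read type: $read_type (expected sr or corr)" >&2
            return 1
            ;;
    esac
    ragoo.py -R "$reads" -T "$read_type" "$contigs" "$reference"
}

# Example (placeholder file names):
#   run_misasm_correction reads.fastq.gz sr contigs.fasta reference.fasta
```

Without the wrapper, the equivalent direct call is ragoo.py -R reads.fastq.gz -T sr contigs.fasta reference.fasta.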
The output for misassembly correction can be found in ragoo_output/ctg_alignments. It contains the final broken fasta file (<prefix>.misasm.break.fa) and, if a GFF file was provided, an updated GFF file (<prefix>.misasm.broken.gff). When using downstream tools, such as a final lift-over step, these are the files to use as input. That is because, after misassembly correction, it is the broken contigs, not the original contigs, that are ordered and oriented to produce the final ragoo.fasta file.
Chimeric contig correction relies only on assembly-to-reference alignments to break potentially chimeric contigs. Since it does not have a validation step as in misassembly correction, it looks for very large structural differences between the two assemblies. The main advantage is that it is very fast and does not require additional data.
First, put all necessary files in the current working directory:
ln -s /path/to/contigs.fasta
ln -s /path/to/reference.fasta
ln -s /path/to/genes.gff # optional
Next, chimeric contig correction is invoked with the -b flag.
ragoo.py -b contigs.fasta reference.fasta
One can also specify a GFF file with the -gff flag. This ensures that contigs are never broken within a GFF feature. It also lifts over the GFF coordinates to the new chimera-broken contigs.
ragoo.py -b -gff genes.gff contigs.fasta reference.fasta
All of the results from chimera breaking can be found in ./ragoo_output/chimera_break. First, input contigs are aligned to the reference, and those alignments (inter_contigs_against_ref.paf) are used to produce an initial set of broken contigs (<prefix>.inter.chimera.broken.fa). This initial set only addresses interchromosomal chimeras. Next, the same process is repeated for intrachromosomal breaks (intra_contigs_against_ref.paf and <prefix>.intra.chimera.broken.fa).
If a GFF file was provided, there will also be two lifted-over GFF files corresponding to the two aforementioned fasta files (<prefix>.inter.chimera_broken.gff and <prefix>.intra.chimera_broken.gff). Again, no contig will be broken within a GFF feature.
Notably, <prefix>.intra.chimera.broken.fa is now the new set of contigs that RaGOO will use for ordering and orienting. That means that the final pseudomolecules are an ordering and orienting of these contigs, and all intermediate output refers to these contig headers. Accordingly, <prefix>.intra.chimera_broken.gff is the new set of GFF features that should be used for downstream lift-over.
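To get a quick sense of how many contigs were broken, one can compare the number of fasta records before and after correction. This is a minimal sketch; count_contigs is a hypothetical helper (not part of RaGOO), and it assumes uncompressed fasta files in which every record header starts with '>'.

```shell
# count_contigs is a hypothetical helper, not part of RaGOO.
# Prints the number of fasta records (lines starting with '>') in a file.
count_contigs() {
    grep -c '^>' "$1"
}

# Example (substitute your own prefix for <prefix>):
#   count_contigs contigs.fasta
#   count_contigs "ragoo_output/chimera_break/<prefix>.intra.chimera.broken.fa"
```

A larger count in the broken file than in the input indicates that at least one contig was split.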
I would like to acknowledge Steven Salzberg and Aleksey Zimin for their help with the misassembly correction. The underlying algorithms come from Aleksey Zimin and the MaSuRCA assembler.