Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions docs/installation.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,14 +102,15 @@ SPAdes algorithms. This includes:
- MPI-aware version of SPAdes pipeline hpcSPAdes

These tools are not built by default and therefore must be built separately. One
can pass `-SPADES_ENABLE_PROJECTS="semicolon-separated list of projects"` to enable building only
can pass `-DSPADES_ENABLE_PROJECTS="semicolon-separated list of projects"` to enable building only
subset of SPAdes components. The components are:

- `spades`
- [`spades_tools`](standalone.md)
- [`binspareader`](binspreader.md)
- [`binspreader`](binspreader.md)
- [`pathracer`](pathracer.md)
- [`spaligner`](spaligner.md)
- [`splitter`](splitter.md)
- [`hpcspades`](hpc.md)

By default, only SPAdes and SPAdes tools are enabled (so
Expand Down
151 changes: 151 additions & 0 deletions docs/splitter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
# SPlitteR - Repeat resolution in assembly graph using synthetic long reads

SPlitteR is a tool that uses synthetic long reads (SLRs) to improve the contiguity of HiFi assemblies. Given a SLR library and a HiFi assembly graph in the GFA format, SpLitteR resolves repeats in the assembly graph using linked-reads and generates a simplified (more contiguous) assembly graph with corresponding scaffolds.

## Compilation

To compile SPlitteR, run

```
./spades_compile.sh -DSPADES_ENABLE_PROJECTS=splitter
```

## Input

The tool requires

- Assembly graph file in [GFA 1.0 format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md), with scaffolds included as path lines.
- SLR library in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file). The tool supports SLR libraries produced using 10X Genomics Chromium and UST TELL-Seq technologies. Other SLR technologies, such as stLFR or LoopSeq can potentially be used as an input if converted to 10X or TELL-Seq format.

SPlitteR supports [LJA](https://github.com/AntonBankevich/LJA) and [Flye](https://github.com/fenderglass/Flye) assembly graphs out of the box. Other assembly graphs should prefferably be converted into blunt format by e.g. [GetBlunted](https://github.com/vgteam/GetBlunted) utility.

### UST TELLSeq format

TELL-Seq library should include barcodes, left reads, and right reads as three separate FASTQ files.

For example, if you have a TELL-Seq library

``` bash
tellseq_reads_I1.fastq.gz
tellseq_reads_R1.fastq.gz
tellseq_reads_R2.fastq.gz
```

YAML file should look like this:

``` bash

[
{
orientation: "fr",
type: "tell-seq",
right reads: [
"/FULL_PATH_TO_DATASET/tellseq_reads_R2.fastq.gz"
],
left reads: [
"/FULL_PATH_TO_DATASET/tellseq_reads_R1.fastq.gz"
],
aux: [
"/FULL_PATH_TO_DATASET/tellseq_reads_I1.fastq.gz"
]
}
]
```

### 10X Genomics Chromium format

10X library should be in FASTQ format with barcodes attached as BC:Z or BX:Z tags:

```
@COOPER:77:HCYNTBBXX:1:1216:22343:0 BX:Z:AAAAAAAAAACATAGT
CCAGGTAGGATTATGGAATTGGTATAAGCGATCAAACTCAATATTTTTGGTGCGGTGACAGACGCCTTCTGGCAGATGATGGGCTTGTCGTAAGTGTGGT
+
GGAGGGAAGGGGIGIIAGAGAGGGGGIAGGGGGGGAGGGGGGGGGGGGAAAGGAGGGGGIGIGGGGGGGAGGAGGIGAIAGGIGGGGIGGGGGGGGGGGG
```

For example, if you have an SLR library

``` bash

lib_slr_1.fastq.gz
lib_slr_2.fastq.gz
```

YAML file should look like this:

``` bash

[
{
orientation: "fr",
type: "clouds10x",
right reads: [
"/FULL_PATH_TO_DATASET/lib_slr_2.fastq.gz"
],
left reads: [
"/FULL_PATH_TO_DATASET/lib_slr_1.fastq.gz"
]
}
]
```

### Command line

Synopsis: `splitter <graph (in binary or GFA)> <SLR library description (in YAML)> <path to output directory> [OPTION...]`

Main options:

- `-t` Number of threads to use (default: 1/2 of available threads)
- `--mapping-k` k-mer length for read mapping (default: 31)
- `-Gmdbg|-Gblunt` Assembly graph type: mDBG (LJA) or blunted (Flye)
- `-Mdiploid|-Mmeta` Repeat resolution mode (diploid or meta)
- `--assembly-info` Path to metaFlye assembly_info.txt file (meta mode, metaFlye graphs only)

Barcode index construction:
- `--count-threshold` Minimum number of reads for barcode index
- `--frame-size` Resolution of the barcode index
- `--length-threshold` Minimum scaffold graph edge length (meta mode option)
- `--linkage-distance` Reads are assigned to the same fragment on long edges based on the linkage distance
- `--min-read-threshold` Minimum number of reads for path cluster extraction
- `--relative-score-threshold` Relative score threshold for path cluster extraction
- `--sampling-factor` Downsample input SLR reads by this factor

Repeat resolution:
- `--score` Score threshold for link index.
- `--tail-threshold` Barcodes are assigned to the first and last <tail_threshold> nucleotides of the edge.
- `--scaffold-links` Use scaffold links in addition to graph links for repeat resolution

Developer options:
- `--ref` Reference path for repeat resolution evaluation
- `--bin-load` Load read-to-graph alignment
- `--debug` Produce lots of debug data, save read-to-graph alignment
- `--tmp-dir` Scratch directory to use
- `-h, --help ` Print help message

Example command lines:

- Assembly produced by LJA from HiFi diploid human dataset, with 10X SLR library (HPC compressed)\
`splitter lja_output/mdbg/mdbg.hpc.gfa 10x_dataset.yaml output -Mdiploid -Gmdbg`
- Assembly produced by metaFlye from metagenomic dataset, with TELL-Seq SLR library\
`splitter metaflye_output/assembly_graph.gfa tellseq_dataset.yaml output --assembly-info metaflye_output/assembly_info.txt -Mmeta -Gblunt`
-

## Output

SPlitteR stores all output files in output directory `<output_dir> `, which is set by the user.

- `<output_dir>/assembly_graph.gfa` input assembly graph in mDBG encoding
- `<output_dir>/resolved_graph.gfa` output assembly graph after repeat resolution
- `<output_dir>/contigs.fasta` output scaffolds

In addition

- `<output_dir>/edge_transform.tsv` map from input graph edges to resolved graph edges
- `<output_dir>/vertex_stats.tsv` Statistics for complex vertices
- `<output_dir>/resolved_graph.fasta` Sequences of the resolved graph edges

## References

If you are using **SPlitteR** in your research, please cite:

[Tolstoganov et al., 2024](https://doi.org/10.7717/peerj.18050)
2 changes: 1 addition & 1 deletion src/cmake/proj.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

# Side-by-side subprojects layout: automatically set the
# SPADES_EXTERNAL_${project}_SOURCE_DIR using SPADES_ALL_PROJECTS
set(SPADES_ALL_PROJECTS "spades;hammer;ionhammer;corrector;spaligner;spades_tools;binspreader;pathracer;hpcspades")
set(SPADES_ALL_PROJECTS "spades;hammer;ionhammer;corrector;spaligner;spades_tools;binspreader;pathracer;splitter;hpcspades")
set(SPADES_EXTRA_PROJECTS "mts;online_vis;cds_subgraphs")
set(SPADES_KNOWN_PROJECTS "${SPADES_ALL_PROJECTS};${SPADES_EXTRA_PROJECTS}")
set(SPADES_ENABLE_PROJECTS "" CACHE STRING
Expand Down
5 changes: 5 additions & 0 deletions src/common/assembly_graph/core/construction_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,11 @@ class ConstructionHelper {
return graph_.CreateVertex(std::move(data), id1, id2);
}

VertexId CreateVertex(VertexData data, VertexData cdata, VertexId id1 = 0, VertexId id2 = 0) {
return graph_.CreateVertex(std::move(data), std::move(cdata), id1, id2);
}


};

}
16 changes: 13 additions & 3 deletions src/common/assembly_graph/core/debruijn_data.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ class DeBruijnDataMaster;

class DeBruijnVertexData {
friend class DeBruijnDataMaster;
typedef size_t LinkId;
typedef omnigraph::impl::LinkId LinkId;

enum OverlapKind {
ComplexOverlap,
Expand Down Expand Up @@ -255,11 +255,21 @@ class DeBruijnDataMaster {
return EdgeData(!(data.nucls()));
}

// FIXME: support complex overlaps
VertexData conjugate(const VertexData &data) const {
if (!data.has_complex_overlap() or data.links().empty())
return data.clone();
VERIFY_MSG(false, "Conjugation of complex overlap data is not implemented")
VERIFY_MSG(false, "Conjugate overlap data has to be provided separately for complex vertices")
}

template<class LinkConjugator>
VertexData conjugate(const VertexData &data, LinkConjugator &&link_conjugator) const {
if (!data.has_complex_overlap() || data.links().empty())
return data.clone();
std::vector<LinkId> conjugated_links;
conjugated_links.reserve(data.links().size());
for (const auto &lid : data.links())
conjugated_links.push_back(link_conjugator(lid));
return VertexData(conjugated_links);
}

size_t length(const EdgeData& data) const {
Expand Down
48 changes: 30 additions & 18 deletions src/common/assembly_graph/core/graph.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
#pragma once

#include "adt/iterator_range.hpp"
#include "id_storage.hpp"
#include "observable_graph.hpp"
#include "coverage.hpp"
#include "debruijn_data.hpp"
Expand All @@ -25,7 +26,7 @@ class DeBruijnGraph: public omnigraph::ObservableGraph<DeBruijnDataMaster> {
typedef base::EdgeId EdgeId;
typedef base::VertexId VertexId;
typedef base::VertexIt VertexIt;
typedef DataMasterT::LinkId LinkId;
typedef omnigraph::impl::LinkId LinkId;
typedef VertexIt VertexIterator;
typedef VertexIterator iterator; // for for_each
typedef const VertexIterator const_iterator; // for for_each
Expand All @@ -44,14 +45,15 @@ class DeBruijnGraph: public omnigraph::ObservableGraph<DeBruijnDataMaster> {

std::pair<EdgeId, EdgeId> link;
unsigned overlap;
LinkId conjugate_;
};

typedef std::vector<Link> LinkStorage;
LinkStorage link_storage_;
static constexpr unsigned LINK_ID_BIAS = 3;
omnigraph::IdStorage<Link> lstorage_;

public:
DeBruijnGraph(unsigned k)
: base(k), coverage_index_(*this), link_storage_{} {}
: base(k), coverage_index_(*this), lstorage_(LINK_ID_BIAS) {}

CoverageIndex<DeBruijnGraph>& coverage_index() {
return coverage_index_;
Expand All @@ -76,9 +78,22 @@ class DeBruijnGraph: public omnigraph::ObservableGraph<DeBruijnDataMaster> {
data(v).set_overlap(ovl);
}

using base::conjugate;

LinkId add_link(EdgeId e1, EdgeId e2, unsigned ovl) {
link_storage_.emplace_back(e1, e2, ovl);
return link_storage_.size() - 1;
LinkId fwd = lstorage_.create(e1, e2, ovl);
if (e1 == this->conjugate(e2)) {
lstorage_.at(fwd.int_id()).conjugate_ = fwd;
return fwd;
}
LinkId conj = lstorage_.create(this->conjugate(e2), this->conjugate(e1), ovl);
lstorage_.at(fwd.int_id()).conjugate_ = conj;
lstorage_.at(conj.int_id()).conjugate_ = fwd;
return fwd;
}

LinkId conjugate(LinkId id) const {
return lstorage_.at(id.int_id()).conjugate_;
}

void add_link(VertexId v, LinkId idx) {
Expand All @@ -102,7 +117,7 @@ class DeBruijnGraph: public omnigraph::ObservableGraph<DeBruijnDataMaster> {
links.erase(std::remove_if(links.begin(),
links.end(),
[this, &e](const LinkId &link_id) {
return link_storage_[link_id].link.first == e;
return lstorage_.at(link_id.int_id()).link.first == e;
}), links.end());
}

Expand All @@ -111,16 +126,16 @@ class DeBruijnGraph: public omnigraph::ObservableGraph<DeBruijnDataMaster> {
links.erase(std::remove_if(links.begin(),
links.end(),
[this, &e](const LinkId &link_id) {
return link_storage_[link_id].link.second == e;
return lstorage_.at(link_id.int_id()).link.second == e;
}), links.end());
}

auto links(VertexId v) const {
return data(v).links();
}

const auto& link(size_t idx) const {
return link_storage_[idx];
const auto& link(LinkId idx) const {
return lstorage_.at(idx.int_id());
}

bool is_complex(VertexId v) const {
Expand All @@ -141,16 +156,13 @@ class DeBruijnGraph: public omnigraph::ObservableGraph<DeBruijnDataMaster> {
VERIFY_MSG(false, "Link " << in.int_id() << " -> " << out.int_id() << " was not found for vertex " << v.int_id());
}

void lreserve(size_t size) { link_storage_.reserve(size); }
void lreserve(size_t size) { lstorage_.reserve(size); }

auto link_begin() { return link_storage_.begin(); }
auto link_end() { return link_storage_.end(); }
auto link_begin() const { return link_storage_.begin(); }
auto link_end() const { return link_storage_.end(); }
auto links() { return adt::make_range(link_begin(), link_end()); }
auto links() const { return adt::make_range(link_begin(), link_end()); }
size_t link_size() const { return lstorage_.size(); }

size_t link_size() const {return link_storage_.size(); }
VertexData ConjugateData(const VertexData &data) const {
return master().conjugate(data, [this](LinkId lid) { return conjugate(lid); });
}

using base::AddVertex;
using base::AddEdge;
Expand Down
Loading
Loading