| 1 | +# BBTools container |
| 2 | + |
| 3 | +Main tool: [BBTools](https://bbmap.org/) |
| 4 | + |
| 5 | +Code repository: https://sourceforge.net/projects/bbmap/ and https://github.com/bbushnell/BBTools |
| 6 | + |
| 7 | +Additional tools: |
| 8 | + |
| 9 | +- samtools: 1.23.1 |
| 10 | +- htslib: 1.23.1 |
| 11 | +- sambamba: 1.0.1 |
| 12 | + |
| 13 | +Basic information on how to use this tool: |
| 14 | + |
| 15 | +- executable: `*.sh` |
| 16 | +- help: Program descriptions and options are shown when running the shell scripts with no parameters. |
| 17 | +- version: `--version` |
| 18 | +- description: |
| 19 | +> BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf, fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving. |
| 20 | + |
| 21 | +Additional information: |
| 22 | + |
| 23 | +| Script | Purpose | Comment | |
| 24 | +|--------|---------|---------| |
| 25 | +| **bbcms.sh** | Performs error correction using a Count-Min Sketch | Intended for metagenome assembly | |
| 26 | +| **bbcountunique.sh** | Counts unique kmers in reads | | |
| 27 | +| **bbduk.sh** | Trims, filters or masks reads using kmers | | |
| 28 | +| **bbmap.sh** | Splice-aware aligner for short reads | | |
| 29 | +| **bbmapskimmer.sh** | BBMap version designed for high levels of multimapping | | |
| 30 | +| **bbmask.sh** | Masks references based on various things, such as sequence complexity | | |
| 31 | +| **bbmerge.sh** | Merges overlapping paired reads | | |
| 32 | +| **bbmerge-auto.sh** | Same as bbmerge, but tries to allocate all memory on the node | Use this version for kmer operations like extend | |
| 33 | +| **bbnorm.sh** | Normalizes reads based on coverage | Mainly for use prior to single-cell assembly | |
| 34 | +| **bbsplit.sh** | BBMap version that maps to multiple references simultaneously | Intended for decontamination; similar to Seal | |
| 35 | +| **bbversion.sh** | Prints the version of BBTools | | |
| 36 | +| **bbwrap.sh** | Wraps BBMap to process many files using same reference | Saves time by loading the index only once | |
| 37 | +| **calctruequality.sh** | Allows recalibration of quality scores from mapped reads | This generates the correction matrix; BBDuk does the recalibration | |
| 38 | +| **callgenes.sh** | Fast prokaryotic gene caller | Integrated into BBSketch | |
| 39 | +| **callvariants.sh** | Fast variant caller | | |
| 40 | +| **callvariants2.sh** | Same as callvariants.sh with the "multisample" flag | | |
| 41 | +| **clumpify.sh** | Shrinks compressed fastq files, and can remove duplicate reads | Also supports error correction | |
| 42 | +| **comparesketch.sh** | Compares sketches locally, without using a sketch server | | |
| 43 | +| **crossblock.sh** | Alias for decontaminate.sh | | |
| 44 | +| **cutgff.sh** | Cuts out features defined by gff file | E.g., generates one fasta entry per gene from a gff and an assembly | |
| 45 | +| **cutprimers.sh** | Cuts out subregions of ribosomes | Mainly for 16S analysis | |
| 46 | +| **decontaminate.sh** | Pool-level decontamination for single-cell MDA-amplified genomes | | |
| 47 | +| **dedupe.sh** | Removes duplicate and fully-contained sequences | Can also be used to cluster 16S sequences | |
| 48 | +| **dedupe2.sh** | Version of dedupe that supports more hash keys for greater sensitivity | | |
| 49 | +| **dedupebymapping.sh** | Deduplicates reads based on mapping coordinates | | |
| 50 | +| **demuxbyname.sh** | Demultiplexes based on sequence headers | | |
| 51 | +| **filterbyname.sh** | Filters based on sequence headers | | |
| 52 | +| **filterbytaxa.sh** | Filters sequences based on taxonomic classification | Used with NCBI datasets | |
| 53 | +| **filterbytile.sh** | Removes reads that are in low quality areas on flowcell | | |
| 54 | +| **filterqc.sh** | Part of JGI's fastq filtering pipeline | | |
| 55 | +| **filtersam.sh** | Filters sam files to remove reads with multiple unsupported mismatches | Designed for NovaSeq | |
| 56 | +| **gitable.sh** | Used to process NCBI taxonomy data | | |
| 57 | +| **khist.sh** | Alias for bbnorm.sh with flags for making a kmer frequency histogram | | |
| 58 | +| **kmercountexact.sh** | Counts kmers and produces a histogram | Uses more memory than BBNorm but allows exact counts | |
| 59 | +| **kmercountmulti.sh** | Cardinality estimation over multiple kmer lengths | Uses LogLog; does not produce a histogram | |
| 60 | +| **mapPacBio.sh** | BBMap version designed for PacBio or Nanopore reads | Reads longer than 5kbp get broken into 5kbp shreds | |
| 61 | +| **mergesketch.sh** | Allows multiple sketches to be combined | | |
| 62 | +| **msa.sh** | Alignment tool | Used with cutprimers.sh to cut subsections out of 16S | |
| 63 | +| **mutate.sh** | Generates synthetic genomes by randomly mutating the input | | |
| 64 | +| **muxbyname.sh** | Multiplexes multiple files, renaming sequences based on input file name | Opposite of demuxbyname.sh | |
| 65 | +| **partition.sh** | Splits a sequence file into multiple files | | |
| 66 | +| **pileup.sh** | Calculates coverage from sam files | | |
| 67 | +| **plotflowcell.sh** | Produces statistics about flowcell positions | | |
| 68 | +| **processhi-c.sh** | Custom trimming for Hi-C reads | In development | |
| 69 | +| **randomreads.sh** | Generates synthetic data from real genome reference | Highly customizable | |
| 70 | +| **readqc.sh** | Short read quality report | Alternative to fastqc | |
| 71 | +| **reformat.sh** | Converts sequence files to another format | Has many additional options, includes subsampling | |
| 72 | +| **rename.sh** | Renames sequences in various ways, such as adding a prefix | | |
| 73 | +| **repair.sh** | Fixes broken pairing in fastq files | | |
| 74 | +| **representative.sh** | Makes a smaller subset of a reference dataset by eliminating redundancy | Designed for use with BBSketch output | |
| 75 | +| **rqcfilter2.sh** | Filtering pipeline used at JGI | portal.nersc.gov/dna/microbial/assembly/bushnell/RQCFilterData.tar | |
| 76 | +| **seal.sh** | Counts kmer matches between query and reference sequences | | |
| 77 | +| **sendsketch.sh** | Fast taxonomic classifier using webservers at JGI | | |
| 78 | +| **shred.sh** | Breaks sequences into shorter, fixed-length pieces | | |
| 79 | +| **shuffle.sh** | Randomly reorders input file | Crashes if input doesn't fit in memory | |
| 80 | +| **shuffle2.sh** | Randomly reorders input file | Supports larger files, but output might be less random | |
| 81 | +| **sketch.sh** | Makes reference sketches on a per-TaxID basis | | |
| 82 | +| **sketchblacklist.sh** | Makes sketch blacklists of common kmers | | |
| 83 | +| **sortbyname.sh** | Sorts sequences by name, length, quality, taxa, and other things | | |
| 84 | +| **summarizequast.sh** | Generates box plots for multiple quast reports | | |
| 85 | +| **tadpipe.sh** | Preprocessing and assembly pipeline using tadpole | | |
| 86 | +| **tadpole.sh** | Fast short read assembler | | |
| 87 | +| **tadwrapper.sh** | Runs Tadpole with multiple kmer lengths to select the best assembly | | |
| 88 | +| **taxserver.sh** | Starts taxonomy and sketch servers | | |
| 89 | +| **testformat.sh** | Determines if file is fasta, fastq, interleaved, etc. by reading first few lines | | |
| 90 | +| **testformat2.sh** | Generates extensive statistics by reading the full file | | |
| 91 | +| **translate6frames.sh** | Translates nucleotide sequence into amino acid sequence in all frames | | |
| 92 | +| **vcf2gff.sh** | Converts vcf format to gff format | | |
| 93 | + |
| 94 | + |
| 95 | +Full documentation: https://bbmap.org/docs |
| 96 | + |
| 97 | +## Example Usage |
| 98 | + |
| 99 | +(adapted from `/opt/bbmap/pipelines/covid/processCorona.sh`) |
| 100 | + |
| 101 | +Interleave a pair of FASTQ files for downstream processing: |
| 102 | + |
| 103 | +```text |
| 104 | +reformat.sh \ |
| 105 | + in1=${SAMPLE}_R1.fastq.gz \ |
| 106 | + in2=${SAMPLE}_R2.fastq.gz \ |
| 107 | + out=${SAMPLE}.fq.gz |
| 108 | +``` |
| 109 | + |
| 110 | +Split into SARS-CoV-2 and non-SARS-CoV-2 reads: |
| 111 | + |
| 112 | +```text |
| 113 | +bbduk.sh ow -Xmx1g \ |
| 114 | + in=${SAMPLE}.fq.gz \ |
| 115 | + ref=REFERENCE.fasta \ |
| 116 | + outm=${SAMPLE}_viral.fq.gz \ |
| 117 | + outu=${SAMPLE}_nonviral.fq.gz \ |
| 118 | + k=25 |
| 119 | +``` |
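
The two steps above can be chained in a small wrapper script. The sketch below is a minimal illustration and not part of the container: the `run` helper, the `DRY_RUN` switch, and the default `sample1` prefix are hypothetical, and `REFERENCE.fasta` remains a placeholder path. With `DRY_RUN=1` (the default here) it only prints the commands, so the argument construction can be checked without BBTools installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical defaults; override them from the environment.
SAMPLE="${SAMPLE:-sample1}"                 # prefix of the paired FASTQ files
REFERENCE="${REFERENCE:-REFERENCE.fasta}"   # placeholder reference path
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, anything else = execute

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: interleave the read pair into a single file.
run reformat.sh \
  in1="${SAMPLE}_R1.fastq.gz" \
  in2="${SAMPLE}_R2.fastq.gz" \
  out="${SAMPLE}.fq.gz"

# Step 2: split reads by kmer match against the reference.
run bbduk.sh ow -Xmx1g \
  in="${SAMPLE}.fq.gz" \
  ref="$REFERENCE" \
  outm="${SAMPLE}_viral.fq.gz" \
  outu="${SAMPLE}_nonviral.fq.gz" \
  k=25
```

Saved under an assumed name such as `split_viral.sh`, it could be invoked as `SAMPLE=mysample REFERENCE=ref.fasta DRY_RUN=0 bash split_viral.sh` once the printed commands look right. Quoting the expansions keeps sample names with spaces from splitting into separate arguments.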