Documentation

Documentation for Version 0.5

What's New

Everything!

## Overview

As next-generation DNA sequencing becomes increasingly commonplace, the demand for powerful, sophisticated, yet easy-to-use analysis software has increased dramatically. The Marth lab at Boston College is at the forefront of genomic software development, addressing a large fraction of the analysis problems from read mapping to variant analysis. To best serve the research community, the _gkno_ package has been developed to address the following requirements of next-generation data analysis:

* a unified launcher bringing together software tools into a single environment,
* a general framework for generating linked lists of tools, allowing integrated pipelines to be constructed,
* a web environment providing easy access to documentation and tutorials, user forums, blogs and bug reporting.

The web environment keeps people up to date on the work being performed with gkno and on useful information that different users post in the forum. The documentation, including the tutorials, provides clear instructions on how to download and execute gkno, as well as more in-depth information about the included tools, pipelines and configuration files. A core goal of the package is to enable inexperienced users to simply download and execute predetermined analysis pipelines in order to generate sensible results for their research projects. The intricacies of the pipelines (including which processing tools to use and which parameter sets are sensible) are all hidden in configuration files, and only advanced users need interrogate them.

Installing gkno

Download and installation instructions.

gkno launcher description

The gkno launcher is designed to bring the wealth of next-generation DNA sequencing analysis software into a single, easy-to-use command line. The power of the launcher is the ability to bring together multiple tools into a single analysis pipeline with the minimum of required user input. The pipeline is defined in a configuration file that can be quickly and easily constructed and is then available for repeated use. When the command line is executed, gkno generates a Makefile that is automatically (unless specified otherwise by the user) executed using the GNU make framework. This system ensures that each tool is aware of its file dependencies and includes rules to determine how all of the necessary files are to be created. If a tool fails, any files created in the failed step are deleted and the user is informed of where the problems occurred. This ensures that no partially constructed files are made available to the user, which could otherwise lead to analysis based on incomplete data.
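
The dependency handling described above can be illustrated with a small sketch. This is not gkno's actual Makefile generator: the function names `render_rule` and `render_makefile`, the file names and the commands are all hypothetical. It assumes one make rule per task and uses GNU make's `.DELETE_ON_ERROR` special target as one possible way of removing the output of a failed step.

```python
def render_rule(target, dependencies, command):
    """Return the text of a single GNU make rule (recipe lines start with a tab)."""
    return f"{target}: {' '.join(dependencies)}\n\t{command}\n"


def render_makefile(rules):
    """Join rules into a Makefile; the first rule's target becomes the default goal."""
    # .DELETE_ON_ERROR asks GNU make to delete a target if its recipe fails.
    # Targets starting with a period are not chosen as the default goal.
    header = ".DELETE_ON_ERROR:\n\n"
    return header + "\n".join(render_rule(*rule) for rule in rules)


# Hypothetical two-step pipeline (placeholder commands); the final target is listed first.
makefile = render_makefile([
    ("sample.vcf", ["sample.bam"], "call-variants sample.bam > sample.vcf"),
    ("sample.bam", ["sample.fq"], "align-reads sample.fq > sample.bam"),
])
print(makefile)
```

Because each rule names its dependencies, make can work out that `sample.bam` must be built before `sample.vcf`, and will stop with an error if `sample.fq` is missing and no rule exists to create it.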

Single tool mode

Each tool in gkno is described by a json configuration file. This file describes the executable commands, the tool location, all of the allowed command line arguments, the expected parameters, data types and default values. In general, the user should have no need to deal with the configuration files, but a complete description of the format of the configuration files is given in the 'Configuration files' section. A list of all the available tools can be seen by typing:

gkno --help

In order to run a tool, the user simply needs to specify the name of the tool to run. In order to get extra information (e.g. the available command line arguments), help can be displayed by typing:

gkno <tool> --help

Pipeline mode

The gkno launcher can be used to launch any of the available pipelines. Including the term pipe as the first argument instructs gkno to operate in the pipeline mode. To see a list of all available pipelines, type:

gkno pipe --help

In order to see all of the available command line arguments for a particular pipeline, the following command line can be used:

gkno pipe <pipeline name> --help

Executing the command line above lists all of the arguments available as part of the specified pipeline. The pipeline arguments are not, however, the complete set of arguments available to all of the constituent tools. If the user wishes to set a parameter in one of the pipeline's tools that is not available as a pipeline command line argument, all of the tool's arguments are accessible by giving the pipeline task as an argument and then including the arguments for that task in quotation marks. For example, if the fastq-vcf pipeline is executed and the --use-best-n-alleles argument in freebayes requires modification, the following command line is valid:

gkno pipe fastq-vcf --freebayes "--use-best-n-alleles 5"

All of the arguments for the task (in this example, freebayes) are contained within the quotation marks. The pipelines are designed in such a way that the commonly accessed arguments for each of the constituent tools are accessible via the standard command line, but advanced options may require using this syntax.
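
As a rough illustration of how a quoted task string can be broken back into individual arguments, the sketch below uses Python's standard `shlex` module. The function name `split_task_arguments` is hypothetical and this is not necessarily how gkno parses the string internally.

```python
import shlex


def split_task_arguments(quoted):
    """Split the quoted string passed to a pipeline task into argument tokens."""
    return shlex.split(quoted)


# The shell strips the outer quotation marks, so the launcher would receive
# "--use-best-n-alleles 5" as a single string for the freebayes task.
tokens = split_task_arguments("--use-best-n-alleles 5")
print(tokens)  # ['--use-best-n-alleles', '5']
```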

Logging

gkno usage is logged in order to keep track of which tools/pipelines are most commonly used in the community. Every time gkno is launched, an ID of the form tool/ or pipe/ is generated and sent back to the Marth lab. No information about the user or their location is tracked, only the tool or pipeline that was used.

Configuration files

The python code describing the gkno launcher is general and has no knowledge of the individual tools it comprises. In order to generate executable scripts (Makefiles are created that are executed using GNU make), configuration files are required to describe the individual tool command lines and how the different tools interact in a pipeline. These configuration files are in json format and the file contents for tools and pipelines are different.

This section of the documentation describes the format of the json configuration files in some detail and is not intended for the user just wanting to get started with the gkno package. For a more hands on description of how to use gkno or modify specific aspects of the configuration files, specific tutorials with worked examples have been developed. These are included in the documentation, but are also available on the gkno website under the Tutorials tab.

Tool configuration files

The tool configuration files describe all of the information necessary to run each of the individual tools. Each individual tool configuration file may contain multiple ‘tools’, each describing a different mode of operation. For example, the software tool MosaikBuild can be used to construct a reference file that is readable by the Mosaik aligner from a standard fasta reference file, or it can be used to generate a read archive using input fastq files. Each mode of operation is distinct and has different command line arguments and so they appear separately in the configuration file (as mosaik-build-reference and mosaik-build-fastq).

json elements

Each tool is described in json format using the following elements. Unless otherwise stated, the element is required in the configuration file and its omission will cause gkno to terminate.

description: a brief description of the tool and its role. This text appears in the pipeline help and so its inclusion is necessary in order to ensure clarity.
path: the location of the executable file within the gkno package
executable: the name of the executable file
precommand: command line arguments to be included prior to the executable file, for example java -Xmx4g -jar
modifier: text to be included after the executable, but prior to any of the tool's command line arguments, for example, sort or index in bamtools
help: the help command for this tool (usually --help or -h)
arguments: a list of all the valid command line arguments for this tool. Each argument is supplied with all the information necessary for gkno. To do this, the following elements are supplied for each argument (unless specified as optional).
- description: a brief description of the command line argument used in the help messages.
- input: A Boolean indicating if this argument is associated with an input file.
- output: A Boolean indicating if this argument is associated with an output file.
- resource: A Boolean indicating if the file associated with this argument is a resource file. This will (unless overridden by the user) assume that the file is in the gkno resource directory.
- required: A Boolean indicating if the file associated with this argument is required for successful operation of the tool. If required is set to true and the file is not provided, gkno will terminate highlighting that this file is missing.
- dependent: indicates that the tool is dependent on the existence of this file. The executable script is a Makefile executed using GNU make. Before running the provided command line, the existence of the dependent files is checked. If one of these files does not exist, make will check to see if the script contains a rule for how to create this file. If it does, this file will be created; otherwise the script will fail.
- type: the expected data type associated with this argument. This can be one of the following: string, int, float or flag. On the command line, all arguments will expect a value to be provided unless the data type is set to flag.
- alternative: a short form version of the command line argument. For example, the argument could be --fastq and the alternative would likely be -f.
- extension (optional): the suffix of the file associated with this option. If there is no such file, then this element does not need to be set. If multiple extensions are allowed, they can be separated by a pipe. For example, fa and fasta are valid extensions for a fasta reference file and so this field would be populated with fa|fasta.
- default (optional): the default parameter to be given to this command line argument.
- use for filenames (optional): if the output file is not defined and there are multiple input files provided to the tool, the input file with this value set to true will be used to construct the output filename. The input extension will be replaced with the output extension. For example, if the input filename is input_test.fq and the output file extension is defined as bam, the output filename will be input_test.bam if not defined by the user.
- stub (optional): if the output from a tool is a set of files and the output argument does not contain the file extension, then the output is a stub and this option is set as true. In this case, the following argument (outputs) is also required.
- outputs (optional): a list of the output suffixes that will be generated by the tool.
- if input is stream: this option is available for input file arguments. Ordinarily, the tool accepts a file as input and so the input argument would be set to the filename. If the input is a stream, this element allows the command line to be modified so that the value of this element is provided instead of a filename (for example, bamtools would require stdin instead of the filename). It is also possible that, if the input is a stream, the argument should not appear on the command line at all, but should be replaced with a different argument (for example, freebayes ordinarily expects the command line argument --bam, but if a stream is used as input, the argument --stdin is expected in its place). If this is the case, this element is set to replace and the 'replace argument with' entry must be provided.
- replace argument with (optional): to be included only if 'if input is stream' (explained above) is set to replace. If set, this needs two inputs:
  - argument: the command line argument that is to be used as a replacement,
  - value: the value which is supplied. This can be set as blank.
additional files: ADD TEXT
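
The behaviour of the extension and 'use for filenames' elements can be sketched in a few lines of Python. The helper names `has_valid_extension` and `build_output_filename` are hypothetical and are only meant to illustrate the rules described above, not to reproduce gkno's own code.

```python
import os


def has_valid_extension(filename, extension_field):
    """Check a filename against a pipe-separated extension field such as 'fa|fasta'."""
    allowed = extension_field.split("|")
    return any(filename.endswith("." + ext) for ext in allowed)


def build_output_filename(input_filename, output_extension):
    """Replace the input extension with the output extension."""
    stem, _ = os.path.splitext(input_filename)
    return stem + "." + output_extension


print(has_valid_extension("reference.fasta", "fa|fasta"))  # True
print(has_valid_extension("reference.txt", "fa|fasta"))    # False
print(build_output_filename("input_test.fq", "bam"))       # input_test.bam
```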

Example tool configuration file

As an example, a section of the configuration file for freebayes is included below. The actual file can be found in the <gkno_path>/config_files/tools directory. This example contains a sample of the provided arguments, but shows the expected syntax of the file.

{
  "tools" : {
    "freebayes" : {
      "description" : "Bayesian variant and haplotype calling",
      "path" : "freebayes/bin",
      "executable" : "freebayes",
      "help" : "--help|-h",
      "arguments" : {
        "--bam" : {
          "description" : "Add FILE to the set of BAM files to be analyzed.",
          "input" : "true",
          "use for filenames" : "true",
          "output" : "false",
          "resource" : "false",
          "required" : "true",
          "dependent" : "true",
          "alternative" : "-b",
          "extension" : "bam",
          "if input is stream" : "replace",
          "replace argument with" : {
            "argument" : "--stdin",
            "value" : ""
          },
          "type" : "string"
        },
        "--no-snps" : {
          "description" : "Ignore SNP alleles.",
          "input" : "false",
          "output" : "false",
          "resource" : "false",
          "required" : "false",
          "dependent" : "false",
          "alternative" : "-I",
          "type" : "flag"
        }
      }
    }
  }
}
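
To show how such a file might be turned into a command line, the following sketch loads a trimmed copy of the example and assembles the command for freebayes, including the stream replacement. The function `build_command` and its `streamed` parameter are hypothetical; the sketch assumes, as in the example, that Booleans are stored as the strings "true" and "false".

```python
import json

# A trimmed copy of the freebayes example above.
CONFIG = json.loads("""
{
  "tools": {
    "freebayes": {
      "path": "freebayes/bin",
      "executable": "freebayes",
      "arguments": {
        "--bam": {
          "required": "true",
          "type": "string",
          "if input is stream": "replace",
          "replace argument with": {"argument": "--stdin", "value": ""}
        },
        "--no-snps": {"required": "false", "type": "flag"}
      }
    }
  }
}
""")


def build_command(tool, values, streamed=()):
    """Assemble a command line for one tool.

    values maps argument names to values (True/False for flags);
    streamed names the input arguments that are fed from a stream.
    """
    parts = [f"{tool['path']}/{tool['executable']}"]
    for name, spec in tool["arguments"].items():
        if name in streamed and spec.get("if input is stream") == "replace":
            # Swap the argument for its stream replacement.
            replacement = spec["replace argument with"]
            parts.append(replacement["argument"])
            if replacement["value"]:
                parts.append(replacement["value"])
        elif name in values:
            if spec["type"] == "flag":
                if values[name]:
                    parts.append(name)  # flags take no value
            else:
                parts.extend([name, str(values[name])])
        elif spec["required"] == "true":
            raise ValueError(f"missing required argument: {name}")
    return " ".join(parts)


freebayes = CONFIG["tools"]["freebayes"]
from_file = build_command(freebayes, {"--bam": "sample.bam", "--no-snps": True})
from_stream = build_command(freebayes, {}, streamed={"--bam"})
print(from_file)    # freebayes/bin/freebayes --bam sample.bam --no-snps
print(from_stream)  # freebayes/bin/freebayes --stdin
```

Omitting --bam without marking it as streamed raises an error, mirroring the rule that gkno terminates when a required file is not provided.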

### Pipeline configuration files

The pipeline configuration file describes the tools to be used, the order of use and any linkage between the tools. A small number of pipeline command line arguments are also defined in this configuration file. In general, these always include the input and output paths describing where files are to be found and where output files are to be written. Any major options within constituent tools that the user will likely want to modify can also be included. There are three blocks within the pipeline configuration file:

* workflow: an ordered list of the tools to be executed,
* linkage: describes the interrelationships between the individual tools,
* arguments: definitions of the command line options that can be set for this pipeline.

The individual components of the pipeline configuration file are discussed in the following sections.

Workflow

The workflow is a simple list containing all of the tools appearing in the pipeline, in the order in which they are to be executed.

Linkage

The linkage section provides connections between the different tools in the pipeline. These linkages do not necessarily apply between consecutive tools, but can link any two tools in the pipeline. Inside the linkage block is a block for each tool (if necessary). Within each of these tool blocks is a block for each of the tool's command line arguments for which the value depends on the output of a previous tool. The configuration file contains the following pieces of information for each linked command:

tool: the tool whose output is to be used for the current tool's input,
output command: the command from the tool specified in tool that generates the output to be used,
extension: in the case where the tool specified generates an output stub, the extension of the file to be linked is required. For example, the MosaikAligner output command (-out) is a stub and the output extensions are defined as ‘.bam’, ‘.special.bam’ and ‘.multiple.bam’. If a later tool is linked to the MosaikAligner output, the extension is required to define which of the files is required (e.g. .bam).

In the configuration file shown in the 'Example pipeline configuration file' section, the -q input to MosaikBuildFastq is linked to the -q value in the fastqCheck tool. This means that having set the value of -q for fastqCheck, there is no need to worry about the input to MosaikBuildFastq as this is already explicitly handled. Of course, the user can override any of these linkages on the command line if they desire.
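
The way linked values propagate can be sketched as follows. The function `resolve_linkage` is hypothetical, as is the tool name `downstreamTool`, which stands in for any later tool that reads one file from the MosaikAligner stub. The sketch assumes that a value set explicitly by the user takes priority over a linked value.

```python
def resolve_linkage(linkage, values):
    """Fill in linked tool arguments from the values of earlier tools.

    linkage follows the layout described above; values maps
    tool -> {argument: value} and is updated in place. Values already
    present for a linked argument are kept, so user overrides win.
    """
    for tool, arguments in linkage.items():
        for argument, link in arguments.items():
            source = values.get(link["tool"], {}).get(link["output command"])
            if source is None:
                continue  # the source tool has no value to pass on
            # For stub outputs, the extension selects one of the generated files.
            source += link.get("extension", "")
            values.setdefault(tool, {}).setdefault(argument, source)
    return values


linkage = {
    "MosaikBuildFastq": {
        "-q": {"tool": "fastqCheck", "output command": "-q"},
    },
    # Hypothetical later tool linked to one file of the MosaikAligner stub.
    "downstreamTool": {
        "-in": {"tool": "MosaikAligner", "output command": "-out", "extension": ".bam"},
    },
}
values = {"fastqCheck": {"-q": "reads.fq"}, "MosaikAligner": {"-out": "sample"}}
resolved = resolve_linkage(linkage, values)
print(resolved["MosaikBuildFastq"]["-q"])  # reads.fq
print(resolved["downstreamTool"]["-in"])   # sample.bam
```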

Arguments

The arguments block contains information about all the allowable command line arguments for the pipeline. There are four standard arguments that appear for all pipelines and then any number of arguments specified for each individual pipeline. All of the definitions can accept the following information:

tool: the tool with which the command should be associated. For input files (for example), this would typically be the first tool in the pipeline that uses the input files. For the four standard pipeline arguments, this is set to pipeline.
command: the command line argument for the specified tool, to which the argument is applied.
alternative: an alternative, short-form version of the argument (e.g. -hs for --hash-size).
default: the default value given to the argument.

The standard arguments available in all of the pipelines are:
--name (-n): the standard name to apply to output files. In the configuration file shown in the 'Example pipeline configuration file' section, the input files (-q and -q2) are specified, but the name of the output is not. The name of the output defaults to the default value of --name, but this can be set by the user.
--input-path (-ip): the input path of input files. If the input files are not listed with a path, the assumption is that the files reside in the current working directory. Setting --input-path will force gkno to assume all unspecified input files (except for resource files, see below) are available in the directory specified by --input-path.
--output-path (-op): similar to the input path. All output files are output to the --output-path unless a path is provided with the filename.
--resource-path (-rp): all files listed in the tool configuration file as resources files are assumed to be available in the resource path directory. By default, this is in the gkno root directory, but can be modified using this command.

Any extra command line arguments are built up with the same information. It is customary to include all individual tool arguments that users will likely want to modify as pipeline arguments. More in-depth or esoteric commands are not included, but are always accessible using the tool-specific commands (see the 'Pipeline mode' section).
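
A minimal sketch of how pipeline arguments could be routed to their tools is given below. The function `route_pipeline_arguments` is hypothetical, and the definitions are a subset of those found in the singleSampleAlignment example; defaults are applied when the user supplies no value.

```python
def route_pipeline_arguments(definitions, supplied):
    """Translate pipeline-level arguments into per-tool arguments.

    definitions is the "arguments" block; supplied maps the long or
    short pipeline argument to the value given on the command line.
    """
    # Allow the short form (alternative) to stand in for the long form.
    long_form = {d["alternative"]: name for name, d in definitions.items()}
    given = {long_form.get(name, name): value for name, value in supplied.items()}

    routed = {}
    for name, definition in definitions.items():
        value = given.get(name, definition.get("default"))
        if value is None:
            continue  # not set and no default
        tool = definition["tool"]
        # Standard arguments belong to the pipeline itself and keep their own name.
        key = name if tool == "pipeline" else definition["command"]
        routed.setdefault(tool, {})[key] = value
    return routed


definitions = {
    "--name": {"tool": "pipeline", "command": "null", "alternative": "-n", "default": "output"},
    "--fastq": {"tool": "fastqCheck", "command": "-q", "alternative": "-q"},
    "--sequencing-technology": {
        "tool": "MosaikBuildFastq", "command": "-st", "alternative": "-st", "default": "illumina",
    },
}
routed = route_pipeline_arguments(definitions, {"-q": "reads.fq"})
print(routed)
```

Here the short form -q is resolved to --fastq and passed to fastqCheck, while --name and --sequencing-technology fall back to their defaults.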

Example pipeline configuration file

As an example, the pipeline configuration file for the singleSampleAlignment pipeline is shown below. This is a simple pipeline consisting of only three tools.

{
  "workflow" : [
    "fastqCheck",
    "MosaikBuildFastq",
    "MosaikAligner"
  ],
  "linkage" : {
    "MosaikBuildFastq" : {
      "-q" : {
        "tool" : "fastqCheck",
        "output command" : "-q"
      },
      "-q2" : {
        "tool" : "fastqCheck",
        "output command" : "-q2"
      }
    },
    "MosaikAligner" : {
      "-in" : {
        "tool" : "MosaikBuildFastq",
        "output command" : "-out"
      },
      "json parameters" : {
        "tool" : "fastqCheck",
        "output command" : "-out"
      }
    }
  },
  "arguments" : {
    "--name" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-n",
      "default" : "output"
    },
    "--input-path" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-ip",
      "default" : "$(PWD)"
    },
    "--output-path" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-op",
      "default" : "$(PWD)"
    },
    "--resource-path" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-rp",
      "default" : "$(RESOURCES)"
    },
    "--fastq" : {
      "tool" : "fastqCheck",
      "command" : "-q",
      "alternative" : "-q"
    },
    "--fastq2" : {
      "tool" : "fastqCheck",
      "command" : "-q2",
      "alternative" : "-q2"
    },
    "--median-fragment-length" : {
      "tool" : "MosaikBuildFastq",
      "command" : "-mfl",
      "alternative" : "-mfl"
    },
    "--sequencing-technology" : {
      "tool" : "MosaikBuildFastq",
      "command" : "-st",
      "alternative" : "-st",
      "default" : "illumina"
    }
  }
}

## Available tools

The toolkit is dynamic and extra tools can be added by the Marth lab or others (in collaboration with the Marth lab). A list of currently available tools, along with a brief description and links to references, is included below:

### Mosaik

Mosaik is the Marth lab's sequence read alignment software and comprises multiple elements, each of which is described below.

**MosaikBuild**

MosaikBuild is used to convert a fasta format reference file into a native format used by the alignment software. Sequence reads themselves also require conversion into a format that the aligner can read. This is also achieved using MosaikBuild.

**MosaikJump**

A hash-based algorithm is used to perform alignments within Mosaik. To facilitate this, a jump database is required. This database is generated using the MosaikJump utility.

**MosaikAligner**

MosaikAligner description.

### Bamtools

Bamtools description.

### Freebayes

Freebayes description.
