| 1 | +trunc_seq |
| 2 | +========= |
| 3 | + |
| 4 | +`trunc_seq.pl` is a script to truncate sequence files. |
| 5 | + |
| 6 | +* [Synopsis](#synopsis) |
| 7 | +* [Description](#description) |
| 8 | +* [Usage](#usage) |
| 9 | +* [Options](#options) |
| 10 | +* [Output](#output) |
| 11 | +* [Run environment](#run-environment) |
| 12 | +* [Dependencies](#dependencies) |
| 13 | +* [Author - contact](#author---contact) |
| 14 | +* [Citation, installation, and license](#citation-installation-and-license) |
| 15 | +* [Changelog](#changelog) |
| 16 | + |
| 17 | +## Synopsis |
| 18 | + |
| 19 | + perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl |
| 20 | + |
| 21 | +**or** |
| 22 | + |
| 23 | + perl trunc_seq.pl file_of_filenames_and_coords.tsv |
| 24 | + |
| 25 | +## Description |
| 26 | + |
| 27 | +This script truncates sequence files according to the given |
| 28 | +coordinates. The features/annotations in RichSeq files (e.g. EMBL or |
| 29 | +GenBank format) will also be adapted accordingly. Use option **-o** to |
| 30 | +specify a different output sequence format. A single sequence file can |
| 31 | +be given directly to the script together with its truncation |
| 32 | +coordinates: the start position as the first argument, the stop position |
| 33 | +as the second, and (the path to) the sequence file as the third. In this |
| 34 | +case the truncated sequence entry is printed to *STDOUT*. Input sequence |
| 35 | +files should contain only one sequence entry; if a multi-sequence file is |
| 36 | +used as input, only the **first** sequence entry is truncated. |
| 37 | + |
| 38 | +Alternatively, a file of filenames (fof) with respective coordinates |
| 39 | +and sequence files in the following **tab-separated** format can be |
| 40 | +given to the script (the header is optional): |
| 41 | + |
| 42 | + #start stop seq-file |
| 43 | + 300 9000 (path/to/)seq-file |
| 44 | + 50 1300 (path/to/)seq-file2 |
| 45 | + |
| 46 | +With a fof the resulting truncated sequence files are printed into a |
| 47 | +results directory. Use option **-r** to specify a different results |
| 48 | +directory than the default. |
| 49 | + |
| 50 | +It is also possible to truncate a RichSeq sequence file loaded into the |
| 51 | +[Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser |
| 52 | +from the Sanger Institute: select a subsequence and then go to Edit -> |
| 53 | +Subsequence (and Features). |
| 54 | + |
| 55 | +## Usage |
| 56 | + |
| 57 | + perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk |
| 58 | + |
| 59 | +**or** |
| 60 | + |
| 61 | + perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta |
| 62 | + |
| 63 | +**or** |
| 64 | + |
| 65 | + perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv |
| 66 | + |
| 67 | +## Options |
| 68 | + |
| 69 | +- **-h**, **-help** |
| 70 | + |
| 71 | + Help (perldoc POD) |
| 72 | + |
| 73 | +- **-o**=*str*, **-outformat**=*str* |
| 74 | + |
| 75 | + Specify a different sequence format for the output (files) \[fasta, embl, or gbk\] |
| 76 | + |
| 77 | +- **-r**=*str*, **-result\_dir**=*str* |
| 78 | + |
| 79 | + Path to result folder for fof input \[default = './trunc\_seq\_results'\] |
| 80 | + |
| 81 | +- **-v**, **-version** |
| 82 | + |
| 83 | + Print version number to *STDOUT* |
| 84 | + |
| 85 | +## Output |
| 86 | + |
| 87 | +- *STDOUT* |
| 88 | + |
| 89 | + If a single sequence file is given to the script, the truncated sequence |
| 90 | + entry is printed to *STDOUT*. Redirect or pipe into another tool as |
| 91 | + needed. |
| 92 | + |
| 93 | +**or** |
| 94 | + |
| 95 | +- ./trunc_seq_results |
| 96 | + |
| 97 | + If a fof is given to the script, all output files are stored in a |
| 98 | + results folder. |
| 99 | + |
| 100 | +- ./trunc_seq_results/seq-file_trunc_start_stop.format |
| 101 | + |
| 102 | + Truncated output sequence files are named after the input file, with |
| 103 | + 'trunc' and the corresponding start and stop positions appended. |
| 104 | + |
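To illustrate the naming pattern `seq-file_trunc_start_stop.format`, here is a small shell sketch that predicts an output file name from a fof line. The helper `trunc_name` is hypothetical and not part of `trunc_seq.pl`. It assumes that the directory part and the last file extension of the input are dropped before the suffix is added, so the script's real behavior may differ in edge cases.

```shell
# Hypothetical helper (not part of trunc_seq.pl): predict the output file name
# usage: trunc_name START STOP INPUT_FILE OUTPUT_FORMAT
trunc_name() {
    tn_base=$(basename "$3")   # drop the directory part of the input path
    tn_base=${tn_base%.*}      # drop the last extension, if there is one (assumption)
    printf '%s_trunc_%s_%s.%s\n' "$tn_base" "$1" "$2" "$4"
}

trunc_name 300 9000 path/to/seq-file.embl embl   # prints seq-file_trunc_300_9000.embl
```

Such a helper can be handy for checking in a wrapper script whether all expected files have appeared in the results directory.
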
| 105 | +## Run environment |
| 106 | + |
| 107 | +The Perl script runs under Windows and UNIX flavors. |
| 108 | + |
| 109 | +## Dependencies |
| 110 | + |
| 111 | +- [**BioPerl**](http://www.bioperl.org) (tested version 1.007001) |
| 112 | + |
| 113 | +## Author - contact |
| 114 | + |
| 115 | +Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) |
| 116 | + |
| 117 | +## Citation, installation, and license |
| 118 | + |
| 119 | +For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). |
| 120 | + |
| 121 | +## Changelog |
| 122 | + |
| 123 | +* v0.2 (2015-12-07) |
| 124 | + * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` into a single script |
| 125 | + * Now allows single file input and file of filenames (fof) input with coordinates |
| 126 | + * output for single file input printed to *STDOUT* now |
| 127 | + * output for fof input printed into files in a result directory, new option **-r** to specify result directory |
| 128 | + * included a POD instead of a simple usage text |
| 129 | + * included `pod2usage` with Pod::Usage |
| 130 | + * included 'use autodie' pragma |
| 131 | + * options with Getopt::Long |
| 132 | + * output format now specified with option **-o** |
| 133 | + * included version switch, **-v** |
| 134 | + * fixed bug to remove input filepaths from fof input for output files |
| 135 | + * skip empty or comment lines (/^#/) in fof input |
| 136 | + * check and warn if input seq file has more than one seq entry |
| 137 | +* v0.1 (2013-02-08) |
| 138 | + * `trunc_seq.pl` handled only single sequence input; the additional wrapper script `run_trunc_seq.pl` handled fof input |