Commit d66126a

update documentation
updates to online application documentation, plus some formatting changes, and FAQ updates
1 parent 8e4f710 commit d66126a

7 files changed

+190
-140
lines changed

docs/FAQ.md

Lines changed: 38 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -37,7 +37,7 @@ programmer understands but a casual user might not, as well as rationale.
3737
However, the only Cram files that can be used must either have a valid reference
3838
`UR` tag in the `@SQ` header, i.e. the original local reference fasta file is
3939
still available, or have an embedded reference sequence in the Cram file itself,
40-
i.e. generated with output option `embed_ref=1`). Using an external reference
40+
i.e. generated with output option `embed_ref=1`. Using an external reference
4141
fasta file is not supported, a limitation unfortunately imposed by Bio::DB::HTS,
4242
not by Bio::ToolBox. Lacking these, you are best to simply back-convert the Cram
4343
file to Bam format using `samtools` prior to usage.
@@ -47,15 +47,28 @@ programmer understands but a casual user might not, as well as rationale.
4747
CSV files appear perfectly benign, but are in fact a can of worms: mandatory or
4848
optional quoting, empty or undefined values, spaces, character escaping, text
4949
encoding, and so on. This mostly affects reading files. Most (all?) bioinformatic
50-
text formats are tab-delimited, so CSV support is intentionally absent.
50+
text formats are tab-delimited, so CSV support is intentionally absent. With that
51+
said, if you provide an output file name with a `.csv` extension, it will write
52+
a (crude) CSV file. There are (currently) no attempts at quoting or escaping
53+
characters, so if your content contains commas you can expect errors. Your best
54+
bet is to write a TSV file.
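
A minimal Python sketch of why unquoted CSV output breaks on embedded commas. This is not BioToolBox code; the row values are made up for illustration, and the quoted variant uses Python's standard `csv` module only as a point of comparison.

```python
import csv
import io

# Hypothetical row: the second field contains a comma.
row = ["gene1", "kinase, putative", "42"]

# Crude CSV: join on commas with no quoting or escaping.
crude_line = ",".join(row)
# Splitting the crude line back apart no longer recovers the original fields.
crude_fields = crude_line.split(",")

# Tab-delimited output is unaffected by commas in the content.
tsv_line = "\t".join(row)
tsv_fields = tsv_line.split("\t")

# For comparison, a CSV writer that quotes fields round-trips correctly.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_fields = next(csv.reader(io.StringIO(buffer.getvalue())))

print(len(row), len(crude_fields), len(tsv_fields), len(quoted_fields))
```

The crude line gains an extra column on reading, while the tab-delimited and properly quoted versions keep the original three fields.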
55+
56+
- How do I get a plain table without all that metadata junk in TXT files?
57+
58+
BioToolBox applications write tab-delimited text files with a header row.
59+
Additional metadata and comments may be written at the beginning of the file,
60+
prefixed with a `#` symbol. This is a (mostly) universal comment symbol indicating
61+
that the line can safely be ignored. But sometimes you just want a plain table to
62+
import into a spreadsheet program, for example. Provide an output file name with
63+
a `.tsv` extension and it will write a plain TSV file sans metadata.
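
For an existing TXT file, the same effect can be approximated after the fact. The sketch below assumes only what is described above: metadata lines start with `#`, while the header row and data rows do not. The function name and the sample lines are hypothetical.

```python
def plain_table(lines):
    """Drop lines starting with '#', keeping the header row and data rows."""
    return [line for line in lines if not line.startswith("#")]


# Hypothetical file content, for illustration only.
example = [
    "# some metadata comment",
    "# another metadata comment",
    "Name\tScore",
    "geneA\t1.5",
]

table = plain_table(example)
print("\n".join(table))
```

Note that this would also drop a UCSC-style header line prefixed with `#`, so it is only appropriate for files whose header row is unprefixed.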
5164

5265
- How do I get my UCSC gene table (refFlat, knownGene, genePred, etc) recognized?
5366

54-
UCSC doesn't have official file extensions, and their downloads page just
55-
have `.txt.gz` extensions. Furthermore, they don't have proper column headers. Downloads
56-
from the table browser will stick a header line, prefixed with a `#` but no space
57-
between it and the first word. I have work arounds to detect those headers, but
58-
what about the files from the download page?
67+
UCSC doesn't have official file extensions, and the files on their downloads page just have
68+
`.txt.gz` extensions. Furthermore, they don't have proper column headers.
69+
Downloads from the table browser will include a header line, prefixed with a `#`
70+
but no space between it and the first word. I have workarounds to detect those
71+
headers, but what about the files from the download page?
5972

6073
Programs that are designed to potentially interpret a gene table, such as
6174
[get_datasets](apps/get_datasets.md), will "taste" a file for potential UCSC
@@ -65,9 +78,9 @@ programmer understands but a casual user might not, as well as rationale.
6578

6679
Some programs accept a `--noheader` flag, and it will insert dummy column headers.
6780

68-
Otherwise, you can help yourself by changing the extension from `.txt` to something
69-
more descriptive, like `.refflat`, `.genepred`, `.knowngene`, or even the most
70-
generic `.ucsc`. Don't forget the `.gz` if it's compressed.
81+
Otherwise, you can help yourself by changing the extension from `.txt` to
82+
something more descriptive, like `.refflat`, `.genepred`, `.knowngene`, or even
83+
the most generic `.ucsc`. Don't forget the `.gz` if it's compressed.
7184

7285
- What is the difference between Start and Start0?
7386

@@ -76,25 +89,26 @@ programmer understands but a casual user might not, as well as rationale.
7689
between the two.
7790

7891
Many annotation formats come in two flavors of coordinate system: 1-base system
79-
(counting each nucleotide in a sequence starting at 1) or 0-base (or interbase) system
80-
(counting between bases, hence starting at 0). The GFF family of annotation file
81-
formats (including GTF and GFF3) use 1-base. The UCSC family of annotation formats
82-
(BED, refFlat, genePred, etc) use 0-base. SAM files are 1-based, but binary BAM files
83-
are internally 0-based, while VCF files are 1-based. In other words, every format is
84-
different. The [BioPerl](https://bioperl.org) libraries, of which much of BioToolBox
85-
was initially based on, uses 1-base for everything. BioToolBox inherently transforms
86-
0-based coordinates to 1-base formats internally, at least when it is aware of what
87-
the file is using, hence the purpose of naming columns differently.
92+
(counting each nucleotide in a sequence starting at 1) or 0-base (or interbase)
93+
system (counting between bases, hence starting at 0). The GFF family of
94+
annotation file formats (including GTF and GFF3) use 1-base. The UCSC family of
95+
annotation formats (BED, refFlat, genePred, etc) use 0-base. SAM files are
96+
1-based, but binary BAM files are internally 0-based, while VCF files are
97+
1-based. In other words, every format is different. The
98+
[BioPerl](https://bioperl.org) libraries, on which much of BioToolBox was
99+
initially based, use 1-base for everything. BioToolBox inherently transforms
100+
0-based coordinates to 1-base formats internally, at least when it is aware of
101+
what the file format is using, hence the purpose of naming columns differently.
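
As a rough illustration of the two coordinate systems, the following Python sketch converts between a 0-based (interbase, half-open) interval, as used by BED, and a 1-based inclusive interval, as used by GFF. The function names are hypothetical and this is not how BioToolBox itself is implemented.

```python
def interbase_to_one_based(start0, end):
    """Convert a 0-based, half-open interval to a 1-based, inclusive one."""
    return start0 + 1, end


def one_based_to_interbase(start, end):
    """Convert a 1-based, inclusive interval to a 0-based, half-open one."""
    return start - 1, end


# A BED interval of 0..100 covers the first 100 bases of a sequence,
# which a GFF file would describe as 1..100.
print(interbase_to_one_based(0, 100))
print(one_based_to_interbase(1, 100))
```

Only the start coordinate shifts; the end coordinate has the same value in both systems, and the feature length is `end - start0` in interbase and `end - start + 1` in 1-base.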
88102

89103
- Why do so many programs reference a database and how do I use one?
90104

91105
In the early days of BioToolBox, much of the analysis was based on
92106
[BioPerl](https://bioperl.org) databases, notably
93-
[Bio::DB::SeqFeature::Store](https://metacpan.org/pod/Bio::DB::SeqFeature::Store),
94-
where annotation as well as datasets (microarray values) were stored. These were SQL
95-
databases, backed by either MySQL or SQLite. These are still supported, although less
96-
so as annotation files can now be parsed on the fly or datasets stored in bigWig or
97-
Bam databases.
107+
[Bio::DB::SeqFeature::Store](https://metacpan.org/pod/Bio::DB::SeqFeature::Store),
108+
where annotation as well as datasets (microarray values) were stored. These
109+
were SQL databases, backed by either MySQL or SQLite. These are still supported,
110+
although less so as annotation files can now be parsed on the fly or datasets
111+
stored in bigWig or Bam databases.
98112

99113
For annotation, working with a database can be arguably faster, especially when
100114
working with an annotation set over and over again. Use the BioPerl script,

docs/apps/get_binned_data.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ A program to collect data in bins across a list of features.
3131
5p_utr|3p_utr]
3232
--long collect each window independently
3333
-r --format <integer> number of decimal places for numbers
34+
--mapq <integer> minimum map quality of counted alignments
3435
3536
Bin specification:
3637
-b --bins <integer> number of bins feature is divided (10)
@@ -174,6 +175,15 @@ The command line flags and descriptions:
174175
Default is not to format, often leading to more than the intended
175176
significant digits.
176177

178+
- --mapq &lt;integer>
179+
180+
Specify the minimum mapping quality of alignments to be considered when
181+
counting from a Bam file. Default is 0, which will include all alignments,
182+
including multi-mapping alignments (typically MAPQ of 0). Set to an integer in
183+
the range 0..255. Only affects count methods, including `count`, `ncount`, and
184+
`pcount`. Other methods involving coverage, e.g. `mean`, do not filter
185+
alignments.
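
The effect of a minimum mapping quality on counting can be sketched in Python as follows. This is a conceptual illustration only, not the program's actual code: the function name and MAPQ values are hypothetical, and it assumes the threshold is inclusive (an alignment is counted when its MAPQ is greater than or equal to the minimum).

```python
def count_alignments(mapqs, min_mapq=0):
    """Count alignments whose MAPQ is at least min_mapq (assumed inclusive)."""
    if not 0 <= min_mapq <= 255:
        raise ValueError("min_mapq must be an integer in the range 0..255")
    return sum(1 for q in mapqs if q >= min_mapq)


# Hypothetical MAPQ values; the two zeros stand in for multi-mapping alignments.
mapqs = [0, 0, 3, 30, 60]

print(count_alignments(mapqs))      # default of 0 keeps every alignment
print(count_alignments(mapqs, 1))   # drops the MAPQ 0 alignments
print(count_alignments(mapqs, 30))  # keeps only the higher quality alignments
```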
186+
177187
### Bin specification
178188

179189
- --bins &lt;integer>

docs/apps/get_datasets.md

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@ get\_datasets.pl \[--options...\] --in &lt;filename> &lt;data1> &lt;data2...>
4141
--tpm calculate TPM values
4242
-r --format <integer> number of decimal places for numbers
4343
--discard <number> discard features whose sum below threshold
44+
--mapq <integer> minimum map quality of counted alignments
4445

4546
Adjustments to features:
4647
-x --extend <integer> extend the feature in both directions
@@ -256,12 +257,12 @@ The command line flags and descriptions:
256257
it was counted in an input region or not. This might be used when a
257258
more global normalization is needed.
258259

259-
The region method is best used with RNASeq data and a complete gene
260-
annotation table. The genome method is best used with partial annotation
261-
tables or other Seq types, such as ChIPSeq. This option can only be used
262-
with one of the count methods (count, ncount, pcount). A sum method may be
263-
cautiously allowed if, for example, using bigWig point data. The FPKM values
264-
are appended as additional columns in the output table.
260+
The region method is best used with RNASeq data and a complete gene
261+
annotation table. The genome method is best used with partial annotation
262+
tables or other Seq types, such as ChIPSeq. This option can only be used
263+
with one of the count methods (`count`, `ncount`, `pcount`). A sum method
264+
may be cautiously allowed if, for example, using bigWig point data. The FPKM
265+
values are appended as additional columns in the output table.
265266

266267
- --tpm
267268

@@ -284,6 +285,15 @@ The command line flags and descriptions:
284285
that were newly collected. For more advanced filtering, see
285286
[manipulate\_datasets.pl](https://metacpan.org/pod/manipulate_datasets.pl).
286287

288+
- --mapq &lt;integer>
289+
290+
Specify the minimum mapping quality of alignments to be considered when
291+
counting from a Bam file. Default is 0, which will include all alignments,
292+
including multi-mapping alignments (typically MAPQ of 0). Set to an integer in
293+
the range 0..255. Only affects count methods, including `count`, `ncount`, and
294+
`pcount`. Other methods involving coverage, e.g. `mean`, do not filter
295+
alignments.
296+
287297
### Adjustments to features
288298

289299
- --extend &lt;integer>

docs/apps/get_features.md

Lines changed: 19 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,6 @@ get\_features.pl --db &lt;name> --out &lt;filename>
1818

1919
Selection:
2020
-f --feature <type> feature: gene, mRNA, transcript, etc
21-
-u --sub include subfeatures (true if gff, gtf, refFlat)
2221

2322
Filter features:
2423
-l --list <filename> file of feature IDs to keep
@@ -39,11 +38,13 @@ get\_features.pl --db &lt;name> --out &lt;filename>
3938

4039
Report format options:
4140
-B --bed write BED6 (no --sub) or BED12 (--sub) format
41+
-u --sub include subfeatures when writing bed format
4242
-G --gff write GFF3 format
4343
-g --gtf write GTF format
4444
-r --refflat write UCSC refFlat format
4545
-t --tag <text> include specific GFF attributes in text output
4646
--coord include coordinates in text output
47+
--useid use ID as the BED name instead of default Name
4748

4849
General options:
4950
-o --out <filename> output file name
@@ -82,14 +83,6 @@ The command line flags and descriptions:
8283
is '`gene`'. For databases, an interactive list will be presented
8384
from which one or more may be chosen.
8485

85-
- --sub
86-
87-
Optionally include all child subfeatures in the output. For example,
88-
transcript, CDS, and/or exon subfeatures of a gene. This option is
89-
automatically enabled with GFF, GTF, or refFlat output; it may be
90-
turned off with `--nosub`. With BED output, it will force a BED12
91-
file to be written. It has no effect with standard text.
92-
9386
### Filter features
9487

9588
- --list &lt;file>
@@ -202,6 +195,13 @@ The command line flags and descriptions:
202195
With subfeatures enabled, write a BED12 (12-column BED) file.
203196
Otherwise, write a standard 6-column BED format file.
204197

198+
- --sub
199+
200+
Optionally include all child subfeatures (exons) in the output when
201+
writing BED format; this forces BED12 output. This option is
202+
automatically enabled with GFF, GTF, or refFlat output. It has no
203+
effect with standard text.
204+
205205
- --gff
206206

207207
Write a GFF version 3 (GFF3) format output file. Subfeatures are
@@ -231,6 +231,16 @@ The command line flags and descriptions:
231231
in other formats. This is automatically included when adjusting
232232
coordinate positions.
233233

234+
- --useid
235+
236+
Use the feature's Primary ID tag instead of the Display Name tag in
237+
the output Name column when writing to either a BED or UCSC (refFlat)
238+
format. By default the Display Name is used when available. From GTF files,
239+
this corresponds to the `gene_id` or `transcript_id` tags, rather than
240+
`gene_name` or `transcript_name`. For GFF3 files, these are the `ID` and
241+
`Name` tags, respectively.
242+
243+
234244
### General options
235245

236246
- --out &lt;filename>

docs/apps/get_gene_regions.md

Lines changed: 30 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -41,14 +41,20 @@ get\_gene\_regions.pl \[--options...\] --db &lt;text> --out &lt;filename>
4141
-K --chrskip <regex> skip features from certain chromosomes
4242

4343
Adjustments:
44-
-b --begin --start integer specify adjustment to start coordinate
45-
-e --end --stop integer specify adjustment to stop coordinate
44+
-b --begin --start integer specify adjustment to start coordinate
45+
-e --end --stop integer specify adjustment to stop coordinate
4646

47-
General options:
47+
Output options:
48+
-o --out <filename> specify output name
4849
--bed output as a bed6 format
49-
-o --out <filename> specify output name
50-
-z --gz compress output
51-
-v --version print version and exit
50+
--bedname specify what to use for bed name column
51+
[genename|geneid| default is 'featurename'
52+
transcriptname|transcriptid
53+
featurename]
54+
-z --gz compress output
55+
56+
General options:
57+
-v --version print version and exit
5258
-h --help
5359

5460
## OPTIONS
@@ -203,7 +209,11 @@ The command line flags and descriptions:
203209
a start adjustment will always modify the feature's 5'end, either
204210
the feature startpoint or endpoint, depending on its orientation.
205211

206-
### General options
212+
### Output options
213+
214+
- --out &lt;filename>
215+
216+
Specify the output filename.
207217

208218
- --bed
209219

@@ -213,10 +223,23 @@ The command line flags and descriptions:
213223

214224
Specify the output filename.
215225

226+
- --bedname &lt;name>
227+
228+
Specify what to use for the Name column in the output BED file.
229+
Several options are available, including:
230+
231+
geneid - The Primary ID of the parent Gene feature
232+
genename - The Display Name of the parent Gene feature
233+
transcriptid - The Primary ID of the parent Transcript feature
234+
transcriptname - The Display Name of the parent Transcript feature
235+
featurename - The generated name of the feature (default)
236+
216237
- --gz
217238

218239
Specify whether (or not) the output file should be compressed with gzip.
219240

241+
### General options
242+
220243
- --version
221244

222245
Print the version number.

docs/apps/get_relative_data.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@ get\_relative\_data.pl \[--options\] -i &lt;filename> &lt;data1> &lt;data2...>
3232
--avtype [type,type,...] alternative types of feature to avoid
3333
--long collect each window independently
3434
-r --format <integer> number of decimal places for numbers
35+
--mapq <integer> minimum map quality of counted alignments
3536

3637
Bin specification:
3738
-w --win <integer> size of windows, default 50 bp
@@ -195,6 +196,15 @@ The command line flags and descriptions:
195196
Default is not to format, often leading to more than the intended
196197
significant digits.
197198

199+
- --mapq &lt;integer>
200+
201+
Specify the minimum mapping quality of alignments to be considered when
202+
counting from a Bam file. Default is 0, which will include all alignments,
203+
including multi-mapping alignments (typically MAPQ of 0). Set to an integer in
204+
the range 0..255. Only affects count methods, including `count`, `ncount`, and
205+
`pcount`. Other methods involving coverage, e.g. `mean`, do not filter
206+
alignments.
207+
198208
### Bin specification
199209

200210
- --win &lt;integer>
