update documentation

tjparnell · tjparnell · commit d66126a91140 · 2025-10-12T10:47:07.000-06:00
updates to online application documentation, plus some formatting changes, and FAQ updates
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -37,7 +37,7 @@ programmer understands but a casual user might not, as well as rationale.
 	However, the only Cram files that can be used must either have a valid reference
 	`UR` tag in the `@SQ` header, i.e. the original local reference fasta file is
 	still available, or have an embedded reference sequence in the Cram file itself,
-	i.e. generated with output option `embed_ref=1`). Using an external reference
+	i.e. generated with output option `embed_ref=1`. Using an external reference
 	fasta file is not supported, a limitation unfortunately imposed by Bio::DB::HTS,
 	not by Bio::ToolBox. Lacking these, you are best to simply back-convert the Cram
 	file to Bam format using `samtools` prior to usage. 
@@ -47,15 +47,28 @@ programmer understands but a casual user might not, as well as rationale.
 	CSV files appear perfectly benign, but are in fact a can of worms: mandatory or
 	optional quoting, empty or undefined values, spaces, character escaping, text
 	encoding, and so on. This mostly affects reading files. Most (all?) bioinformatic
-	text formats are tab-delimited, so CSV support is intentionally absent.
+	text formats are tab-delimited, so CSV support is intentionally absent. With that
+	said, if you provide an output file name with a `.csv` extension, it will write
+	a (crude) CSV file. There are (currently) no attempts at quoting or escaping
+	characters, so if your content contains commas you can expect errors. Your best
+	bet is to write a TSV file.
+
+- How do I get a plain table without all that metadata junk in TXT files?
+
+	BioToolBox applications write tab-delimited text files with a header row.
+	Additional metadata and comments may be written at the beginning of the file,
+	prefixed with a `#` symbol. This is a (mostly) universal comment symbol indicating
+	that the line can safely be ignored. But sometimes you just want a plain table to
+	import into a spreadsheet program, for example. Provide an output file name with
+	a `.tsv` extension and it will write a plain TSV file sans metadata.
 	
 - How do I get my UCSC gene table (refFlat, knownGene, genePred, etc) recognized?
 
-	UCSC doesn't have official file extensions, and their downloads page just 
-	have `.txt.gz` extensions. Furthermore, they don't have proper column headers. Downloads 
-	from the table browser will stick a header line, prefixed with a `#` but no space
-	between it and the first word. I have work arounds to detect those headers, but 
-	what about the files from the download page?
+	UCSC doesn't have official file extensions, and the files on their downloads
+	page just have `.txt.gz` extensions. Furthermore, they don't have proper column
+	headers. Downloads from the table browser will include a header line, prefixed
+	with a `#` but no space between it and the first word. I have workarounds to
+	detect those headers, but what about the files from the download page?
 	
 	Programs that are designed to potentially interpret a gene table, such as
 	[get_datasets](apps/get_datasets.md), will "taste" a file for potential UCSC
@@ -65,9 +78,9 @@ programmer understands but a casual user might not, as well as rationale.
 	
 	Some programs accept a `--noheader` flag, and it will insert dummy column headers.
 	
-	Otherwise, you can help yourself by changing the extension from `.txt` to something 
-	more descriptive, like `.refflat`, `.genepred`, `.knowngene`, or even the most 
-	generic `.ucsc`. Don't forget the `.gz` if it's compressed.
+	Otherwise, you can help yourself by changing the extension from `.txt` to
+	something more descriptive, like `.refflat`, `.genepred`, `.knowngene`, or even
+	the most generic `.ucsc`. Don't forget the `.gz` if it's compressed.
 
 - What is the difference between Start and Start0?
 
@@ -76,25 +89,26 @@ programmer understands but a casual user might not, as well as rationale.
 	between the two.
 	
 	Many annotation formats come in two flavors of coordinate system: 1-base system
-	(counting each nucleotide in a sequence starting at 1) or 0-base (or interbase) system
-	(counting between bases, hence starting at 0). The GFF family of annotation file
-	formats (including GTF and GFF3) use 1-base. The UCSC family of annotation formats
-	(BED, refFlat, genePred, etc) use 0-base. SAM files are 1-based, but binary BAM files
-	are internally 0-based, while VCF files are 1-based. In other words, every format is
-	different. The [BioPerl](https://bioperl.org) libraries, of which much of BioToolBox
-	was initially based on, uses 1-base for everything. BioToolBox inherently transforms
-	0-based coordinates to 1-base formats internally, at least when it is aware of what
-	the file is using, hence the purpose of naming columns differently.
+	(counting each nucleotide in a sequence starting at 1) or 0-base (or interbase)
+	system (counting between bases, hence starting at 0). The GFF family of
+	annotation file formats (including GTF and GFF3) use 1-base. The UCSC family of
+	annotation formats (BED, refFlat, genePred, etc) use 0-base. SAM files are
+	1-based, but binary BAM files are internally 0-based, while VCF files are
+	1-based. In other words, every format is different. The
+	[BioPerl](https://bioperl.org) libraries, on which much of BioToolBox was
+	initially based, use 1-base for everything. BioToolBox inherently transforms
+	0-based coordinates to 1-base formats internally, at least when it is aware of
+	what the file format is using, hence the purpose of naming columns differently.
 
 - Why do so many programs reference a database and how do I use one?
 
 	In the early days of BioToolBox, much of the analysis was based on
 	[BioPerl](https://bioperl.org) databases, notably
-	[Bio::DB::SeqFeature::Store](https://metacpan.org/pod/Bio::DB::SeqFeature::Store),
-	where annotation as well as datasets (microarray values) were stored. These were SQL
-	databases, backed by either MySQL or SQLite. These are still supported, although less
-	so as annotation files can now be parsed on the fly or datasets stored in bigWig or
-	Bam databases. 
+	[Bio::DB::SeqFeature::Store](https://metacpan.org/pod/Bio::DB::SeqFeature::Store),
+	where annotation as well as datasets (microarray values) were stored. These
+	were SQL databases, backed by either MySQL or SQLite. These are still supported,
+	although less so as annotation files can now be parsed on the fly or datasets
+	stored in bigWig or Bam databases. 
 	
 	For annotation, working with a database can be arguably faster, especially when 
 	working with an annotation set over and over again. Use the BioPerl script, 
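
Note on the Start/Start0 entry in this file: the arithmetic behind the two column names can be sketched as follows. This is a minimal illustration in Python, not BioToolBox code; the function names are hypothetical, and it assumes the usual BED convention of a 0-based start with an end coordinate that is numerically the same in both systems.

```python
def bed_to_one_based(start0: int, end: int) -> tuple[int, int]:
    """Convert a 0-based (interbase) interval to a 1-based, inclusive one.

    Only the start shifts; the end value is the same number in both systems.
    """
    return start0 + 1, end


def one_based_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based, inclusive interval back to 0-based (interbase)."""
    return start - 1, end


# The first 100 bases of a chromosome: BED writes 0..100, GFF writes 1..100.
print(bed_to_one_based(0, 100))   # (1, 100)
print(one_based_to_bed(1, 100))   # (0, 100)
```

The interval length is `end - start0` in the 0-based system and `end - start + 1` in the 1-based system, which is why mixing the two silently produces off-by-one errors.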
diff --git a/docs/apps/get_binned_data.md b/docs/apps/get_binned_data.md
@@ -31,6 +31,7 @@ A program to collect data in bins across a list of features.
            5p_utr|3p_utr] 
      --long                              collect each window independently
      -r --format <integer>               number of decimal places for numbers
+     --mapq <integer>                    minimum map quality of counted alignments
      
      Bin specification:
      -b --bins <integer>                 number of bins feature is divided (10)
@@ -174,6 +175,15 @@ The command line flags and descriptions:
     Default is not to format, often leading to more than the intended 
     significant digits.
 
+- --mapq &lt;integer>
+
+	Specify the minimum mapping quality of alignments to be considered when
+	counting from a Bam file. Default is 0, which will include all alignments,
+	including multi-mapping (typically MAPQ of 0). Set to an integer in range
+	of 0..255. Only affects count methods, including `count`, `ncount`, and
+	`pcount`. Other methods involving coverage, e.g. `mean`, do not filter
+	alignments.
+
 ### Bin specification
 
 - --bins &lt;integer>
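
Note on the new `--mapq` option: the described behavior can be sketched as below. This is an illustration in Python, not the program's actual implementation; the function name and the list of MAPQ values are hypothetical, and it assumes "minimum" means an alignment is kept when its MAPQ is greater than or equal to the threshold.

```python
def count_alignments(mapqs, min_mapq=0):
    """Count alignments whose mapping quality is at least min_mapq."""
    if not 0 <= min_mapq <= 255:
        raise ValueError("min_mapq must be in the range 0..255")
    return sum(1 for q in mapqs if q >= min_mapq)


# Hypothetical MAPQ values for five alignments overlapping one bin.
mapqs = [0, 0, 3, 30, 60]
print(count_alignments(mapqs))      # 5: the default of 0 keeps everything
print(count_alignments(mapqs, 10))  # 2: only MAPQ 30 and 60 pass
```

With the default of 0, multi-mapping alignments (typically MAPQ 0) are still counted, matching the documented default.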
diff --git a/docs/apps/get_datasets.md b/docs/apps/get_datasets.md
@@ -41,6 +41,7 @@ get\_datasets.pl \[--options...\] --in &lt;filename> &lt;data1> &lt;data2...>
     --tpm                               calculate TPM values
     -r --format <integer>               number of decimal places for numbers
     --discard <number>                  discard features whose sum below threshold
+    --mapq <integer>                    minimum map quality of counted alignments
     
     Adjustments to features:
     -x --extend <integer>               extend the feature in both directions
@@ -256,12 +257,12 @@ The command line flags and descriptions:
         it was counted in an input region or not. This might be used when a 
         more global normalization is needed.
 
-    The region method is best used with RNASeq data and a complete gene 
-    annotation table. The genome method is best used with partial annotation 
-    tables or other Seq types, such as ChIPSeq. This option can only be used 
-    with one of the count methods (count, ncount, pcount). A sum method may be 
-    cautiously allowed if, for example, using bigWig point data. The FPKM values 
-    are appended as additional columns in the output table.
+    The region method is best used with RNASeq data and a complete gene
+    annotation table. The genome method is best used with partial annotation
+    tables or other Seq types, such as ChIPSeq. This option can only be used
+    with one of the count methods (`count`, `ncount`, `pcount`). A sum method
+    may be cautiously allowed if, for example, using bigWig point data. The FPKM
+    values are appended as additional columns in the output table.
 
 - --tpm
 
@@ -284,6 +285,15 @@ The command line flags and descriptions:
     that were newly collected. For more advanced filtering, see 
     [manipulate\_datasets.pl](https://metacpan.org/pod/manipulate_datasets.pl).
 
+- --mapq &lt;integer>
+
+	Specify the minimum mapping quality of alignments to be considered when
+	counting from a Bam file. Default is 0, which will include all alignments,
+	including multi-mapping (typically MAPQ of 0). Set to an integer in range
+	of 0..255. Only affects count methods, including `count`, `ncount`, and
+	`pcount`. Other methods involving coverage, e.g. `mean`, do not filter
+	alignments.
+
 ### Adjustments to features
 
 - --extend &lt;integer>
diff --git a/docs/apps/get_features.md b/docs/apps/get_features.md
@@ -18,7 +18,6 @@ get\_features.pl --db &lt;name> --out &lt;filename>
     
     Selection:
     -f --feature <type>           feature: gene, mRNA, transcript, etc
-    -u --sub                      include subfeatures (true if gff, gtf, refFlat)
     
     Filter features:
     -l --list <filename>          file of feature IDs to keep
@@ -39,11 +38,13 @@ get\_features.pl --db &lt;name> --out &lt;filename>
     
     Report format options:
     -B --bed                      write BED6 (no --sub) or BED12 (--sub) format
+    -u --sub                      include subfeatures when writing bed format
     -G --gff                      write GFF3 format
     -g --gtf                      write GTF format
     -r --refflat                  write UCSC refFlat format
     -t --tag <text>               include specific GFF attributes in text output
     --coord                       include coordinates in text output
+    --useid                       use ID as the BED name instead of default Name
     
     General options:
     -o --out <filename>           output file name
@@ -82,14 +83,6 @@ The command line flags and descriptions:
     is '`gene`'. For databases, an interactive list will be presented 
     from which one or more may be chosen.
 
-- --sub
-
-    Optionally include all child subfeatures in the output. For example, 
-    transcript, CDS, and/or exon subfeatures of a gene. This option is 
-    automatically enabled with GFF, GTF, or refFlat output; it may be 
-    turned off with `--nosub`. With BED output, it will force a BED12 
-    file to be written. It has no effect with standard text. 
-
 ### Filter features
 
 - --list &lt;file>
@@ -202,6 +195,13 @@ The command line flags and descriptions:
     With subfeatures enabled, write a BED12 (12-column BED) file. 
     Otherwise, write a standard 6-column BED format file. 
 
+- --sub
+
+    Optionally include all child subfeatures (exons) in the output when
+    writing a BED format; this forces a BED12 output. This option is 
+    automatically enabled with GFF, GTF, or refFlat output. It has no
+    effect with standard text. 
+
 - --gff
 
     Write a GFF version 3 (GFF3) format output file. Subfeatures are 
@@ -231,6 +231,16 @@ The command line flags and descriptions:
     in other formats. This is automatically included when adjusting 
     coordinate positions.
 
+- --useid
+
+    Use the feature's Primary ID tag instead of the Display Name tag for use in
+    the output Name column when writing to either a BED or UCSC (refFlat)
+    format. By default the Display Name is used when available. From GTF files,
+    this corresponds to the `gene_id` or `transcript_id` tags, rather than
+    `gene_name` or `transcript_name`. For GFF3 files, this would be `ID` and
+    `Name` tags.
+
+
 ### General options
 
 - --out &lt;filename>
diff --git a/docs/apps/get_gene_regions.md b/docs/apps/get_gene_regions.md
@@ -41,14 +41,20 @@ get\_gene\_regions.pl \[--options...\] --db &lt;text> --out &lt;filename>
     -K --chrskip <regex>          skip features from certain chromosomes
     
     Adjustments:
-    -b --begin --start integer     specify adjustment to start coordinate
-    -e --end --stop integer        specify adjustment to stop coordinate
+    -b --begin --start integer    specify adjustment to start coordinate
+    -e --end --stop integer       specify adjustment to stop coordinate
     
-    General options:
+    Output options:
+    -o --out <filename>           specify output name
     --bed                         output as a bed6 format
-    -o --out <filename>              specify output name
-    -z --gz                          compress output
-    -v --version                     print version and exit
+    --bedname                     specify what to use for bed name column
+       [genename|geneid|            default is 'featurename'
+       transcriptname|transcriptid
+       featurename]
+    -z --gz                       compress output
+
+    General options:
+    -v --version                  print version and exit
     -h --help
 
 ## OPTIONS
@@ -203,7 +209,11 @@ The command line flags and descriptions:
     a start adjustment will always modify the feature's 5'end, either 
     the feature startpoint or endpoint, depending on its orientation. 
 
-### General options
+### Output options
+
+- --out &lt;filename>
+
+    Specify the output filename.
 
 - --bed
 
@@ -213,10 +223,23 @@ The command line flags and descriptions:
 
     Specify the output filename.
 
+- --bedname &lt;name>
+
+    Specify what to use for the Name column in the output BED file.
+    Several options are available, including:
+    
+        geneid          - The Primary ID of the parent Gene feature
+        genename        - The Display Name of the parent Gene feature
+        transcriptid    - The Primary ID of the parent Transcript feature
+        transcriptname  - The Display Name of the parent Transcript feature
+        featurename     - The generated name of the feature (default)
+
 - --gz
 
     Specify whether (or not) the output file should be compressed with gzip.
 
+### General options
+
 - --version
 
     Print the version number.
diff --git a/docs/apps/get_relative_data.md b/docs/apps/get_relative_data.md
@@ -32,6 +32,7 @@ get\_relative\_data.pl \[--options\] -i &lt;filename> &lt;data1> &lt;data2...>
     --avtype [type,type,...]            alternative types of feature to avoid
     --long                              collect each window independently
     -r --format <integer>               number of decimal places for numbers
+    --mapq <integer>                    minimum map quality of counted alignments
     
     Bin specification:
     -w --win <integer>                  size of windows, default 50 bp
@@ -195,6 +196,15 @@ The command line flags and descriptions:
     Default is not to format, often leading to more than the intended 
     significant digits.
 
+- --mapq &lt;integer>
+
+	Specify the minimum mapping quality of alignments to be considered when
+	counting from a Bam file. Default is 0, which will include all alignments,
+	including multi-mapping (typically MAPQ of 0). Set to an integer in range
+	of 0..255. Only affects count methods, including `count`, `ncount`, and
+	`pcount`. Other methods involving coverage, e.g. `mean`, do not filter
+	alignments.
+
 ### Bin specification
 
 - --win &lt;integer>
diff --git a/docs/apps/manipulate_datasets.md b/docs/apps/manipulate_datasets.md