@@ -37,7 +37,7 @@ programmer understands but a casual user might not, as well as rationale.
3737 However, the only Cram files that can be used must either have a valid reference
3838 `UR` tag in the `@SQ` header, i.e. the original local reference fasta file is
3939 still available, or have an embedded reference sequence in the Cram file itself,
40- i.e. generated with output option `embed_ref=1`) . Using an external reference
40+ i.e. generated with output option `embed_ref=1`. Using an external reference
4141 fasta file is not supported, a limitation unfortunately imposed by Bio::DB::HTS,
4242 not by Bio::ToolBox. Lacking these, you are best to simply back-convert the Cram
4343 file to Bam format using `samtools` prior to usage.
@@ -47,15 +47,28 @@ programmer understands but a casual user might not, as well as rationale.
4747 CSV files appear perfectly benign, but are in fact a can of worms: mandatory or
4848 optional quoting, empty or undefined values, spaces, character escaping, text
4949 encoding, and so on. This mostly affects reading files. Most (all?) bioinformatic
50- text formats are tab-delimited, so CSV support is intentionally absent.
50+ text formats are tab-delimited, so CSV support is intentionally absent. With that
51+ said, if you provide an output file name with a `.csv` extension, it will write
52+ a (crude) CSV file. There are (currently) no attempts at quoting or escaping
53+ characters, so if your content contains commas you can expect errors. Your best
54+ bet is to write a TSV file.
55+
56+ - How do I get a plain table without all that metadata junk in TXT files?
57+
58+ BioToolBox applications write tab-delimited text files with a header row.
59+ Additional metadata and comments may be written at the beginning of the file,
60+ prefixed with a `#` symbol. This is a (mostly) universal comment symbol indicating
61+ that the line can safely be ignored. But sometimes you just want a plain table to
62+ import into a spreadsheet program, for example. Provide an output file name with
63+ a `.tsv` extension and it will write a plain TSV file sans metadata.
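
If you already have a TXT file with metadata, the same plain table can be recovered after the fact. The following is only a rough sketch, not BioToolBox code: it assumes that every metadata or comment line starts with `#` and that the first remaining line is the header row. The function name `read_plain_table` and the sample content are hypothetical.

```python
import csv
import io


def read_plain_table(handle):
    """Return (header, rows) from a tab-delimited handle.

    Assumption: any line starting with '#' is metadata or a comment,
    and the first remaining line is the column header row.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    # QUOTE_NONE: treat the content as plain tab-delimited text, no quoting
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = list(reader)
    return header, rows


# Hypothetical file content, used only for illustration
example = io.StringIO(
    "# hypothetical metadata line\n"
    "# another comment\n"
    "Chromosome\tStart\tStop\n"
    "chr1\t100\t200\n"
)
header, rows = read_plain_table(example)
```

Here `header` holds the column names and `rows` holds the data lines, ready to be written back out or handed to a spreadsheet import.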
5164
5265- How do I get my UCSC gene table (refFlat, knownGene, genePred, etc) recognized?
5366
54- UCSC doesn't have official file extensions, and their downloads page just
55- have `.txt.gz` extensions. Furthermore, they don't have proper column headers. Downloads
56- from the table browser will stick a header line, prefixed with a `#` but no space
57- between it and the first word. I have work arounds to detect those headers, but
58- what about the files from the download page?
67+ UCSC doesn't have official file extensions, and their downloads page just has
68+ `.txt.gz` extensions. Furthermore, the files don't have proper column headers.
69+ Downloads from the table browser will include a header line, prefixed with a `#`
70+ but no space between it and the first word. I have workarounds to detect those
71+ headers, but what about the files from the download page?
5972
6073 Programs that are designed to potentially interpret a gene table, such as
6174 [get_datasets](apps/get_datasets.md), will "taste" a file for potential UCSC
@@ -65,9 +78,9 @@ programmer understands but a casual user might not, as well as rationale.
6578
6679 Some programs accept a `--noheader` flag, and it will insert dummy column headers.
6780
68- Otherwise, you can help yourself by changing the extension from `.txt` to something
69- more descriptive, like `.refflat`, `.genepred`, `.knowngene`, or even the most
70- generic `.ucsc`. Don't forget the `.gz` if it's compressed.
81+ Otherwise, you can help yourself by changing the extension from `.txt` to
82+ something more descriptive, like `.refflat`, `.genepred`, `.knowngene`, or even
83+ the most generic `.ucsc`. Don't forget the `.gz` if it's compressed.
7184
7285- What is the difference between Start and Start0?
7386
@@ -76,25 +89,26 @@ programmer understands but a casual user might not, as well as rationale.
7689 between the two.
7790
7891 Many annotation formats come in two flavors of coordinate system: 1-base system
79- (counting each nucleotide in a sequence starting at 1) or 0-base (or interbase) system
80- (counting between bases, hence starting at 0). The GFF family of annotation file
81- formats (including GTF and GFF3) use 1-base. The UCSC family of annotation formats
82- (BED, refFlat, genePred, etc) use 0-base. SAM files are 1-based, but binary BAM files
83- are internally 0-based, while VCF files are 1-based. In other words, every format is
84- different. The [BioPerl](https://bioperl.org) libraries, of which much of BioToolBox
85- was initially based on, uses 1-base for everything. BioToolBox inherently transforms
86- 0-based coordinates to 1-base formats internally, at least when it is aware of what
87- the file is using, hence the purpose of naming columns differently.
92+ (counting each nucleotide in a sequence starting at 1) or 0-base (or interbase)
93+ system (counting between bases, hence starting at 0). The GFF family of
94+ annotation file formats (including GTF and GFF3) use 1-base. The UCSC family of
95+ annotation formats (BED, refFlat, genePred, etc) use 0-base. SAM files are
96+ 1-based, but binary BAM files are internally 0-based, while VCF files are
97+ 1-based. In other words, every format is different. The
98+ [BioPerl](https://bioperl.org) libraries, on which much of BioToolBox was
99+ initially based, use 1-base for everything. BioToolBox transforms 0-based
100+ coordinates to 1-base internally, at least when it is aware of what the file
101+ format is using, hence the purpose of naming columns differently.
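
The arithmetic behind the two column names is small enough to sketch. This is an illustration of the coordinate conventions only, not BioToolBox's own code, and the function names are hypothetical. It assumes the usual BED-style interval, where the start is 0-based and the end is exclusive, so only the start shifts when converting to a 1-based, inclusive interval.

```python
def start0_to_start(start0):
    """Convert a 0-based (interbase) start to a 1-based start."""
    if start0 < 0:
        raise ValueError("0-based start cannot be negative")
    return start0 + 1


def start_to_start0(start):
    """Convert a 1-based start to a 0-based (interbase) start."""
    if start < 1:
        raise ValueError("1-based start must be at least 1")
    return start - 1


def bed_to_one_based(start0, end):
    """Convert a BED-style interval (0-based start, exclusive end)
    to a 1-based inclusive interval. The end value is unchanged."""
    return start0_to_start(start0), end


# The first 100 bases of a sequence: BED-style 0..100, GFF-style 1..100
one_based = bed_to_one_based(0, 100)
```

Both intervals describe the same 100 bases; the length is `end - start0` in the 0-based form and `end - start + 1` in the 1-based form.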
88102
89103- Why do so many programs reference a database and how do I use one?
90104
91105 In the early days of BioToolBox, much of the analysis was based on
92106 [BioPerl](https://bioperl.org) databases, notably
93- [Bio::DB::SeqFeature::Store](https://metacpan.org/pod/Bio::DB::SeqFeature::Store),
94- where annotation as well as datasets (microarray values) were stored. These were SQL
95- databases, backed by either MySQL or SQLite. These are still supported, although less
96- so as annotation files can now be parsed on the fly or datasets stored in bigWig or
97- Bam databases.
107+ [Bio::DB::SeqFeature::Store](https://metacpan.org/pod/Bio::DB::SeqFeature::Store),
108+ where annotation as well as datasets (microarray values) were stored. These
109+ were SQL databases, backed by either MySQL or SQLite. These are still supported,
110+ although less so as annotation files can now be parsed on the fly or datasets
111+ stored in bigWig or Bam databases.
98112
99113 For annotation, working with a database can be arguably faster, especially when
100114 working with an annotation set over and over again. Use the BioPerl script,