
Commit 12b8a7f

nekrut, claude, and bgruening authored
Polish split_file_to_collection and splitfasta help with diagrams (#1820)
* Polish split_file_to_collection and splitfasta help with diagrams

  - Add PNG diagrams via macros showing split operations
  - Rewrite help with structured Description/Examples format
  - Bump splitfasta version to 0.5.2
  - All tests pass (20/20 + 2/2)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add examples to split_file_to_collection, fix help reference

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use bullet list for splitting modes

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restructure help: merge allocation into examples

  Drop abstract index table. Each allocation mode (alternating, batch, random) is now shown with the same FASTA input so the difference is immediately visible. Add plain-English annotations to each example.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use emoji sequence names in examples

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add regex substitution example for tabular column split

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add .lint_skip for pre-existing TestsCaseValidation warnings

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Bump split_file_to_collection version to 0.5.3

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Apply suggestion from @bgruening
* Delete tools/splitfasta/macros.xml
* Update splitFasta.xml
* Update split_file_to_collection.xml

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Björn Grüning <bjoern@gruenings.eu>
1 parent 72c547f commit 12b8a7f

7 files changed

Lines changed: 229 additions & 57 deletions

tools/splitfasta/splitFasta.xml

Lines changed: 29 additions & 3 deletions
@@ -1,4 +1,4 @@
-<tool id="rbc_splitfasta" name="Split Fasta" version="0.5.1" profile="23.0">
+<tool id="rbc_splitfasta" name="Split Fasta" version="0.5.2" profile="23.0">
 <description>files into a collection</description>
 <requirements>
 <requirement type="package" version="1.76">biopython</requirement>
@@ -52,8 +52,34 @@
 </test>
 </tests>
 <help><![CDATA[
-Takes an input FASTA file and writes entries (i.e. sequences) to separate datasets, which are organized in a dataset collection.
-There are two modes: 1) each sequence is written to its own data set which is named by the ID of the sequence or 2) The file is split into a given number of chunks which are numbered.
+
+===========
+Description
+===========
+
+Splits a FASTA file into separate datasets organized in a collection. Two modes are available:
+
+- **Each sequence in its own dataset** — one output file per sequence, named by the sequence ID
+- **Split into chunks** — sequences are distributed across a specified number of output files
+
+.. image:: $PATH_TO_IMAGES/split_fasta.png
+:alt: Split a FASTA file into a collection with one sequence per dataset
+:width: 620
+
+========
+Examples
+========
+
+**One sequence per dataset**
+
+A FASTA file with 3 sequences produces a collection of 3 datasets named ``seq_A``, ``seq_B``, ``seq_C``.
+
+-------
+
+**Split into 2 chunks**
+
+The same file split into 2 chunks produces ``part1`` (2 sequences) and ``part2`` (1 sequence).
+
 ]]></help>
 <citations>
 <citation type="bibtex">
49.9 KB
Lines changed: 44 additions & 0 deletions
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+TestsCaseValidation

tools/text_processing/split_file_to_collection/split_file_to_collection.xml

Lines changed: 106 additions & 54 deletions
@@ -1,4 +1,4 @@
-<tool id="split_file_to_collection" name="Split file" version="0.5.2">
+<tool id="split_file_to_collection" name="Split file" version="0.5.3">
 <description>to dataset collection</description>
 <macros>
 <xml name="regex_sanitizer">
@@ -30,7 +30,7 @@
 <param name="newfilenames" type="text" label="Base name for new files in collection"
 help="This will increment automatically - if input is 'file', then output is 'file0', 'file1', etc." value="split_file"/>
 <conditional name="select_allocate">
-<param name="allocate" type="select" label="Method to allocate records to new files" help="See the information section for a diagram">
+<param name="allocate" type="select" label="Method to allocate records to new files" help="See the help section for a diagram">
 <option value="random">At random</option>
 <option value="batch">Maintain record order</option>
 <option value="byrow" selected="true">Alternate output files</option>
@@ -527,58 +527,110 @@
 </test>
 </tests>
 <help><![CDATA[
-**Split file into a dataset collection**
-
-This tool splits a data set consisting of records into multiple data sets within a collection.
-A record can be for instance simply a line, a FASTA sequence (header + sequence), a FASTQ sequence
-(headers + sequence + qualities), etc. The important property is that the records either have a
-specific length (e.g. 4 lines for FASTQ) or that the beginning/end of a new record
-can be specified by a regular expression, e.g. ".*" for lines or ">.*" for FASTA.
-The tool has presets for text, tabular data sets (which are split after each line), FASTA (new records start with ">.*"), FASTQ (records consist of 4 lines), SDF (records start with "^BEGIN IONS") and MGF (records end with "^$$$$").
-For other data types the text delimiting records or the number of lines making up a record can be specified manually using the generic splitter.
-If the generic splitter is used, an option is also available to split records either before or after the
-separator. If a preset filetype is used, this is selected automatically (after for SDF, before for all
-others).
-
-If splitting by line (or by some other item, like a FASTA entry or an MGF record), the splitting can be either done alternatingly, in original record order, or at random.
-
-If t records are to be distributed to n new data sets, then the i-th record goes to data set
-
-* floor(i / t * n) (for batch),
-* i % n (for alternating), or
-* a random data set
-
-For instance, t=5 records are distributed as follows on n=2 data sets
-
-= === === ====
-i bat alt rand
-= === === ====
-0 0 0 0
-1 0 1 1
-2 0 0 1
-3 1 1 0
-4 1 0 0
-= === === ====
-
-If the five records are distributed on n=3 data sets:
-
-= === === ====
-i bat alt rand
-= === === ====
-0 0 0 0
-1 0 1 1
-2 1 2 2
-3 1 0 0
-4 2 1 1
-= === === ====
-
-Note that there are no guarantees when splitting at random that every result file will be non-empty, so downstream tools should be able to gracefully handle empty files.
-
-If a tabular file is used as input, you may choose to split by line or by column. If split by column, a new file is created for each unique value in the column.
-In addition, (Python) regular expressions may be used to transform the value in the column to a new value. Caution should be used with this feature, as it could transform all values to the same value, or other unexpected behavior.
-The default regular expression uses each value in the column without modifying it.
-
-Two modes are available for the tool. For the main mode, the number of output files is selected. In this case, records are shared out between this number of files. Alternatively, 'chunking mode' can be selected, which puts a fixed number of records (the 'chunk size') into each output file.
+
+===========
+Description
+===========
+
+Splits a dataset into multiple files organized as a dataset collection. Supports FASTA, FASTQ, tabular, text, MGF, SD-files, and generic record-based formats.
+
+Records can be defined by a fixed line count (e.g. 4 lines for FASTQ) or by a regular expression marking record boundaries (e.g. ``>.*`` for FASTA). Presets handle common formats automatically; the generic splitter allows custom separators.
+
+.. image:: $PATH_TO_IMAGES/split_file.png
+:alt: Split a dataset into a collection of files
+:width: 620
+
+You can control how many output files are created:
+
+- **Number of output files** — records are shared out between *n* files
+- **Chunk mode** — each file gets exactly *k* records (the last file may get fewer)
+
+For tabular input, you can also split by a column value — a new file is created for each unique value in the chosen column, with optional regex substitution.
+
+========
+Examples
+========
+
+The following examples use a FASTA file with 4 sequences as input::
+
+>🍎 >🍊 >🍋 >🍇
+ATCG GCTA TTAA CCGG
+
+-------
+
+**Alternating** (default) — records are dealt out round-robin, like cards. Split into 2 files::
+
+split_000000.fasta: split_000001.fasta:
+>🍎 (1st record) >🍊 (2nd record)
+ATCG GCTA
+>🍋 (3rd record) >🍇 (4th record)
+TTAA CCGG
+
+Records alternate: 🍎→file0, 🍊→file1, 🍋→file0, 🍇→file1.
+
+-------
+
+**Batch** — records stay in original order, split into contiguous blocks. Split into 2 files::
+
+split_000000.fasta: split_000001.fasta:
+>🍎 (1st record) >🍋 (3rd record)
+ATCG TTAA
+>🍊 (2nd record) >🍇 (4th record)
+GCTA CCGG
+
+First half goes to file 0, second half to file 1.
+
+-------
+
+**Random** — each record is assigned to a random file (seeded for reproducibility)::
+
+split_000000.fasta: split_000001.fasta:
+>🍎 >🍊
+ATCG GCTA
+>🍇 >🍋
+CCGG TTAA
+
+.. class:: warningmark
+
+Random mode does not guarantee every output file will be non-empty.
+
+-------
+
+**Chunk mode** — fixed number of records per file. With **chunk size** = 1::
+
+split_000000.fasta: split_000001.fasta: split_000002.fasta: split_000003.fasta:
+>🍎 >🍊 >🍋 >🍇
+ATCG GCTA TTAA CCGG
+
+-------
+
+**Split tabular by column value**
+
+A tabular file with a "group" column::
+
+gene group score
+geneA wnt 0.9
+geneB notch 0.7
+geneC wnt 0.8
+geneD notch 0.6
+
+Split by column 2 produces one file per unique value::
+
+wnt.tabular: notch.tabular:
+gene group score gene group score
+geneA wnt 0.9 geneB notch 0.7
+geneC wnt 0.8 geneD notch 0.6
+
+-------
+
+**Split tabular by column with regex substitution**
+
+Column values can be transformed before grouping using a regex match/replace pair. For example, if column 1 contains filenames like ``sample1.mgf``, ``sample2.mgf``, you can strip the extension::
+
+Match regex: (.*)\.mgf
+Replace with: \1
+
+This groups rows by the part before ``.mgf`` and names the output files accordingly (``sample1.tabular``, ``sample2.tabular`` instead of ``sample1.mgf.tabular``, ``sample2.mgf.tabular``).
 
 
 ]]></help>
 <citations>
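
Note on the allocation modes in the split_file_to_collection help: the removed text gave the rules as formulas (`floor(i / t * n)` for batch, `i % n` for alternating), while the new text shows them by example. The two descriptions can be checked against each other with a minimal Python sketch. This is an illustration only, not the tool's actual code: the function names `allocate` and `allocate_chunks` and the `seed` parameter are hypothetical, and only the mode values `random`, `batch`, and `byrow` are taken from the tool XML.

```python
import random


def allocate(num_records, num_files, mode, seed=0):
    """Return the output-file index assigned to each record index.

    Hypothetical helper; mode names follow the tool's option values.
    """
    if mode == "byrow":
        # Alternating: record i goes to file i % n.
        return [i % num_files for i in range(num_records)]
    if mode == "batch":
        # Contiguous blocks: floor(i / t * n), written with integer arithmetic.
        return [i * num_files // num_records for i in range(num_records)]
    if mode == "random":
        # Seeded so that repeated runs give the same assignment (assumed default seed).
        rng = random.Random(seed)
        return [rng.randrange(num_files) for _ in range(num_records)]
    raise ValueError(f"unknown allocation mode: {mode}")


def allocate_chunks(num_records, chunk_size):
    """Chunk mode: each file receives chunk_size records; the last may get fewer."""
    return [i // chunk_size for i in range(num_records)]


print(allocate(5, 2, "batch"))   # [0, 0, 0, 1, 1]
print(allocate(5, 3, "byrow"))   # [0, 1, 2, 0, 1]
print(allocate_chunks(4, 1))     # [0, 1, 2, 3]
```

With 4 records and 2 files, `byrow` gives `[0, 1, 0, 1]` and `batch` gives `[0, 0, 1, 1]`, which matches the Alternating and Batch examples in the new help. The `random` branch can leave a file with no records, which is what the warning in the help refers to.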
57.5 KB
Lines changed: 49 additions & 0 deletions
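
Note on the column split with regex substitution described in the split_file_to_collection help: the grouping can be sketched in Python with `re.sub`, since the removed help text states that Python regular expressions are used. This is a rough sketch under assumptions, not the tool's implementation: the function name `split_by_column`, its parameters, and the convention that `match=None` means "use the value unchanged" are all hypothetical.

```python
import re
from collections import defaultdict


def split_by_column(rows, column, match=None, replace=None):
    """Group rows by the (optionally transformed) value in a 1-based column.

    Hypothetical helper: each key of the result stands for one output file.
    """
    groups = defaultdict(list)
    for row in rows:
        key = row[column - 1]
        if match is not None:
            # Transform the column value before grouping, e.g. strip ".mgf".
            key = re.sub(match, replace, key)
        groups[key].append(row)
    return dict(groups)


rows = [
    ["sample1.mgf", "scan1"],
    ["sample2.mgf", "scan2"],
    ["sample1.mgf", "scan3"],
]
groups = split_by_column(rows, 1, match=r"(.*)\.mgf", replace=r"\1")
print(list(groups))  # ['sample1', 'sample2']
```

A substitution that maps every value to the same key would collapse all rows into one group, which is the caution the earlier help text raised about this feature.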
