|
1 | | -<tool id="split_file_to_collection" name="Split file" version="0.5.2"> |
| 1 | +<tool id="split_file_to_collection" name="Split file" version="0.5.3"> |
2 | 2 | <description>to dataset collection</description> |
3 | 3 | <macros> |
4 | 4 | <xml name="regex_sanitizer"> |
|
30 | 30 | <param name="newfilenames" type="text" label="Base name for new files in collection" |
31 | 31 | help="This will increment automatically - if input is 'file', then output is 'file0', 'file1', etc." value="split_file"/> |
32 | 32 | <conditional name="select_allocate"> |
33 | | - <param name="allocate" type="select" label="Method to allocate records to new files" help="See the information section for a diagram"> |
| 33 | + <param name="allocate" type="select" label="Method to allocate records to new files" help="See the help section for a diagram"> |
34 | 34 | <option value="random">At random</option> |
35 | 35 | <option value="batch">Maintain record order</option> |
36 | 36 | <option value="byrow" selected="true">Alternate output files</option> |
|
527 | 527 | </test> |
528 | 528 | </tests> |
529 | 529 | <help><![CDATA[ |
530 | | -**Split file into a dataset collection** |
531 | | -
|
532 | | -This tool splits a data set consisting of records into multiple data sets within a collection. |
533 | | -A record can be for instance simply a line, a FASTA sequence (header + sequence), a FASTQ sequence |
534 | | -(headers + sequence + qualities), etc. The important property is that the records either have a |
535 | | -specific length (e.g. 4 lines for FASTQ) or that the beginning/end of a new record |
536 | | -can be specified by a regular expression, e.g. ".*" for lines or ">.*" for FASTA. |
537 | | -The tool has presets for text, tabular data sets (which are split after each line), FASTA (new records start with ">.*"), FASTQ (records consist of 4 lines), MGF (records start with "^BEGIN IONS") and SDF (records end with "^$$$$"). |
538 | | -For other data types the text delimiting records or the number of lines making up a record can be specified manually using the generic splitter. |
539 | | -If the generic splitter is used, an option is also available to split records either before or after the |
540 | | -separator. If a preset filetype is used, this is selected automatically (after for SDF, before for all |
541 | | -others). |
542 | | -
|
543 | | -If splitting by line (or by some other item, like a FASTA entry or an MGF record), the splitting can be either done alternatingly, in original record order, or at random. |
544 | | -
|
545 | | -If t records are to be distributed to n new data sets, then the i-th record goes to data set |
546 | | -
|
547 | | -* floor(i / t * n) (for batch), |
548 | | -* i % n (for alternating), or |
549 | | -* a random data set |
550 | | -
|
551 | | -For instance, t=5 records are distributed as follows on n=2 data sets |
552 | | -
|
553 | | -= === === ==== |
554 | | -i bat alt rand |
555 | | -= === === ==== |
556 | | -0 0 0 0 |
557 | | -1 0 1 1 |
558 | | -2 0 0 1 |
559 | | -3 1 1 0 |
560 | | -4 1 0 0 |
561 | | -= === === ==== |
562 | | -
|
563 | | -If the five records are distributed on n=3 data sets: |
564 | | -
|
565 | | -= === === ==== |
566 | | -i bat alt rand |
567 | | -= === === ==== |
568 | | -0 0 0 0 |
569 | | -1 0 1 1 |
570 | | -2 1 2 2 |
571 | | -3 1 0 0 |
572 | | -4 2 1 1 |
573 | | -= === === ==== |
574 | | -
|
575 | | -Note that there are no guarantees when splitting at random that every result file will be non-empty, so downstream tools should be able to gracefully handle empty files. |
576 | | -
|
577 | | -If a tabular file is used as input, you may choose to split by line or by column. If split by column, a new file is created for each unique value in the column. |
578 | | -In addition, (Python) regular expressions may be used to transform the value in the column to a new value. Caution should be used with this feature, as it could transform all values to the same value, or other unexpected behavior. |
579 | | -The default regular expression uses each value in the column without modifying it. |
580 | | -
|
581 | | -Two modes are available for the tool. For the main mode, the number of output files is selected. In this case, records are shared out between this number of files. Alternatively, 'chunking mode' can be selected, which puts a fixed number of records (the 'chunk size') into each output file. |
| 530 | +
|
| 531 | +=========== |
| 532 | +Description |
| 533 | +=========== |
| 534 | +
|
| 535 | +Splits a dataset into multiple files organized as a dataset collection. Supports FASTA, FASTQ, tabular, text, MGF, SD-files, and generic record-based formats. |
| 536 | +
|
| 537 | +Records can be defined by a fixed line count (e.g. 4 lines for FASTQ) or by a regular expression marking record boundaries (e.g. ``>.*`` for FASTA). Presets handle common formats automatically; the generic splitter allows custom separators. |
| 538 | +
|
| 539 | +.. image:: $PATH_TO_IMAGES/split_file.png |
| 540 | + :alt: Split a dataset into a collection of files |
| 541 | + :width: 620 |
| 542 | +
|
| 543 | +You can control how many output files are created: |
| 544 | +
|
| 545 | +- **Number of output files** — records are shared out between *n* files |
| 546 | +- **Chunk mode** — each file gets exactly *k* records (the last file may get fewer) |
| 547 | +
|
| 548 | +For tabular input, you can also split by a column value — a new file is created for each unique value in the chosen column, with optional regex substitution. |
| 549 | +
|
| 550 | +======== |
| 551 | +Examples |
| 552 | +======== |
| 553 | +
|
| 554 | +The following examples use a FASTA file with 4 sequences as input:: |
| 555 | +
|
| 556 | + >🍎 >🍊 >🍋 >🍇 |
| 557 | + ATCG GCTA TTAA CCGG |
| 558 | +
|
| 559 | +------- |
| 560 | +
|
| 561 | +**Alternating** (default) — records are dealt out round-robin, like cards. Split into 2 files:: |
| 562 | +
|
| 563 | + split_000000.fasta: split_000001.fasta: |
| 564 | + >🍎 (1st record) >🍊 (2nd record) |
| 565 | + ATCG GCTA |
| 566 | + >🍋 (3rd record) >🍇 (4th record) |
| 567 | + TTAA CCGG |
| 568 | +
|
| 569 | +Records alternate: 🍎→file0, 🍊→file1, 🍋→file0, 🍇→file1. |
| 570 | +
|
| 571 | +------- |
| 572 | +
|
| 573 | +**Batch** — records stay in original order, split into contiguous blocks. Split into 2 files:: |
| 574 | +
|
| 575 | + split_000000.fasta: split_000001.fasta: |
| 576 | + >🍎 (1st record) >🍋 (3rd record) |
| 577 | + ATCG TTAA |
| 578 | + >🍊 (2nd record) >🍇 (4th record) |
| 579 | + GCTA CCGG |
| 580 | +
|
| 581 | +First half goes to file 0, second half to file 1. |
| 582 | +
|
| 583 | +------- |
| 584 | +
|
| 585 | +**Random** — each record is assigned to a random file (seeded for reproducibility):: |
| 586 | +
|
| 587 | + split_000000.fasta: split_000001.fasta: |
| 588 | + >🍎 >🍊 |
| 589 | + ATCG GCTA |
| 590 | + >🍇 >🍋 |
| 591 | + CCGG TTAA |
| 592 | +
|
| 593 | +.. class:: warningmark |
| 594 | +
|
| 595 | +Random mode does not guarantee every output file will be non-empty. |
| 596 | +
|
| 597 | +------- |
| 598 | +
|
| 599 | +**Chunk mode** — fixed number of records per file. With **chunk size** = 1:: |
| 600 | +
|
| 601 | + split_000000.fasta: split_000001.fasta: split_000002.fasta: split_000003.fasta: |
| 602 | + >🍎 >🍊 >🍋 >🍇 |
| 603 | + ATCG GCTA TTAA CCGG |
| 604 | +
|
| 605 | +------- |
| 606 | +
|
| 607 | +**Split tabular by column value** |
| 608 | +
|
| 609 | +A tabular file with a "group" column:: |
| 610 | +
|
| 611 | + gene group score |
| 612 | + geneA wnt 0.9 |
| 613 | + geneB notch 0.7 |
| 614 | + geneC wnt 0.8 |
| 615 | + geneD notch 0.6 |
| 616 | +
|
| 617 | +Split by column 2 produces one file per unique value:: |
| 618 | +
|
| 619 | + wnt.tabular: notch.tabular: |
| 620 | + gene group score gene group score |
| 621 | + geneA wnt 0.9 geneB notch 0.7 |
| 622 | + geneC wnt 0.8 geneD notch 0.6 |
| 623 | +
|
| 624 | +------- |
| 625 | +
|
| 626 | +**Split tabular by column with regex substitution** |
| 627 | +
|
| 628 | +Column values can be transformed before grouping using a regex match/replace pair. For example, if column 1 contains filenames like ``sample1.mgf``, ``sample2.mgf``, you can strip the extension:: |
| 629 | +
|
| 630 | + Match regex: (.*)\.mgf |
| 631 | + Replace with: \1 |
| 632 | +
|
| 633 | +This groups rows by the part before ``.mgf`` and names the output files accordingly (``sample1.tabular``, ``sample2.tabular`` instead of ``sample1.mgf.tabular``, ``sample2.mgf.tabular``). |
582 | 634 |
|
583 | 635 | ]]></help> |
584 | 636 | <citations> |
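The allocation rules described in the help text (``floor(i / t * n)`` for batch, ``i % n`` for alternating, and a fixed chunk size) can be checked with a small sketch. This is not the tool's own code: the function names are hypothetical, and the chunk rule ``i // k`` is an assumption inferred from the statement that each file receives *k* records, with the last file possibly receiving fewer.

```python
def batch_target(i: int, t: int, n: int) -> int:
    """Batch mode: record i of t goes to file floor(i / t * n).

    Integer arithmetic (i * n) // t gives the same value as the floor
    formula for non-negative integers and avoids float rounding.
    """
    return (i * n) // t


def alternating_target(i: int, n: int) -> int:
    """Alternating mode: records are dealt out round-robin, i % n."""
    return i % n


def chunk_target(i: int, k: int) -> int:
    """Chunk mode (assumed rule): k consecutive records per file."""
    return i // k


if __name__ == "__main__":
    t = 5  # number of records, as in the tables of the old help text
    for n in (2, 3):
        print("n =", n)
        print("  batch      ", [batch_target(i, t, n) for i in range(t)])
        print("  alternating", [alternating_target(i, n) for i in range(t)])
    print("chunk size 1:", [chunk_target(i, 1) for i in range(4)])
```

For t=5 the batch and alternating columns reproduce the two tables removed from the help text, and for the four-record FASTA example the same functions give the file assignments shown in the new Examples section.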
|
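The regex substitution example in the new help text (match ``(.*)\.mgf``, replace with ``\1``) can be tried out in plain Python. The sketch below assumes the match/replace pair behaves like ``re.sub`` applied to each column value; the diff does not show how the tool applies it, and ``group_rows_by_key`` is a hypothetical helper, not part of the tool.

```python
import re
from collections import defaultdict


def group_rows_by_key(rows, column, match, replace):
    """Group rows by the transformed value of one column.

    `column` is a 0-based index here (the tool's interface counts from 1).
    The key is computed with re.sub, which is an assumption about how the
    tool applies the match/replace pair.
    """
    groups = defaultdict(list)
    for row in rows:
        key = re.sub(match, replace, row[column])
        groups[key].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [
        ["sample1.mgf", "scan1"],
        ["sample2.mgf", "scan2"],
        ["sample1.mgf", "scan3"],
    ]
    groups = group_rows_by_key(rows, 0, r"(.*)\.mgf", r"\1")
    for key, members in groups.items():
        print(key, len(members))
```

Running a candidate pattern like this on a few real column values before using it in the tool is a cheap way to catch a pattern that collapses every value into the same key.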