A program to split (very) large .csv files column-wise based on some other metadata file with minimal memory overhead.
This needs the xsv binary in your PATH, since it uses it as a backend to do the heavy lifting. Install it from the BurntSushi/xsv repository.
You need Python 3.10 or later. Install metasplit with:
pip install git+https://github.com/MrHedmad/metasplit@main
Then you can use metasplit with the metasplit command.
Use metasplit --help for a list of arguments. The arguments should be self-explanatory, with the exception of the selection strings, which are explained here.
A selection string is a structured string of this form:
|--------------------| |---------||------------ >>>
/path/to/metadata/file@id_variable?meta_var1=value&meta_var2=value ...
                      ^           ^               ^
The query has many parts:
- /path/to/metadata/file: The (full) path to the metadata file used to select the columns of the input file.
- id_variable: The name of the column in the metadata file that holds the ids of the columns in the input file. Must be preceded by an @.
- After these two parts, the rest of the string is made up of selections:
  - The first selection always starts with a ?. This marks the beginning of the selections.
  - Every selection is of the form variable+sign+value(s). The variable is the column to consider in the metadata. The value(s) are either a single value (value) or a list of values ([value1,value2,value3]) to select the ids with. The sign is either = or !=, for the variable being equal to or not equal to the values, respectively.
  - Multiple selections may be chained together by starting each new selection with either & or |, for a logical AND or a logical OR with the previous selection.
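The grammar above can be illustrated with a small parser. This is only a sketch for reading the format, not metasplit's actual implementation: the names Selection and parse_selection_string are hypothetical, and it assumes that values contain no &, | or ] characters.

```python
import re
from typing import NamedTuple


class Selection(NamedTuple):
    connector: str     # "?" for the first selection, then "&" or "|"
    variable: str      # column to consider in the metadata
    sign: str          # "=" or "!="
    values: list[str]  # one value, or several if a [list] was given


# One selection: connector, variable, sign, then a [list] or a single value.
_SELECTION = re.compile(r"([?&|])([^=!?&|]+)(!=|=)(\[[^\]]*\]|[^&|]+)")


def parse_selection_string(text: str) -> tuple[str, str, list[Selection]]:
    """Split 'path@id_variable?var=value&...' into its three parts."""
    head, mark, tail = text.partition("?")
    path, at, id_variable = head.rpartition("@")
    if not (at and path and id_variable):
        raise ValueError(f"expected 'path@id_variable' in {text!r}")

    query = mark + tail
    selections: list[Selection] = []
    pos = 0
    while pos < len(query):
        match = _SELECTION.match(query, pos)
        # Only the first selection may (and must) start with "?".
        if match is None or (match.group(1) == "?") != (pos == 0):
            raise ValueError(f"malformed selection at {query[pos:]!r}")
        connector, variable, sign, raw = match.groups()
        if raw.startswith("["):
            values = [value.strip() for value in raw[1:-1].split(",")]
        else:
            values = [raw]
        selections.append(Selection(connector, variable, sign, values))
        pos = match.end()
    return path, id_variable, selections


if __name__ == "__main__":
    print(parse_selection_string(
        "~/metadata.csv@gene_id?type=[primary_tumor,metastasis]&study=tcga"
    ))
```

Splitting on the last @ before the first ? keeps paths that themselves contain an @ intact, which is a design choice of this sketch rather than documented behaviour.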
You can pass multiple selection strings as input, even from different metadata files. The IDs selected by each selection string are merged together (a union, a sort of "OR") to subset the final data file.
If you instead wish to only keep IDs that satisfy your selections in every metadata file (a sort of "AND"), you can pass the --intersect flag to do just that.
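The difference between the default behaviour and --intersect comes down to a set union versus a set intersection. A minimal sketch, assuming each selection string has already been resolved to a set of IDs (combine_ids is a hypothetical helper, not part of metasplit):

```python
def combine_ids(id_sets: list[set[str]], intersect: bool = False) -> set[str]:
    """Merge the ID sets produced by several selection strings."""
    if not id_sets:
        return set()
    if intersect:
        # Keep only IDs selected by every selection string ("AND").
        return set.intersection(*id_sets)
    # Keep IDs selected by at least one selection string ("OR").
    return set.union(*id_sets)


if __name__ == "__main__":
    from_metadata = {"s1", "s2", "s3"}
    from_clinical = {"s2", "s3", "s4"}
    print(sorted(combine_ids([from_metadata, from_clinical])))
    print(sorted(combine_ids([from_metadata, from_clinical], intersect=True)))
```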
Some examples of query strings:
- ~/metadata.csv@gene_id?sample_type=tumor: Read the ~/metadata.csv file, and select column ids in the gene_id column where the column sample_type is equal to tumor.
- ~/metadata.csv@gene_id?type=[primary_tumor,metastasis]&study=tcga: Similar to the previous example, select where type is either primary_tumor or metastasis AND the study is tcga.
- ~/metadata.csv@gene_id?study=tcga|selection=manually_selected: Select where study is equal to tcga OR the selection is manually_selected.
- ~/metadata.csv@sample_id?study=tcga ~/clinical_metadata.csv@patient_id?smoker=true|exposed_to_asbestos=true --intersect: Select in the metadata.csv file where study is equal to tcga. Then, select in the clinical_metadata.csv file where smoker is true OR exposed_to_asbestos is true. Keep only samples that satisfy both selections (due to the --intersect flag).
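To make the examples concrete, here is a rough sketch of how a chain of selections could be evaluated against metadata rows. It is not metasplit's code: select_ids and the toy metadata are invented for illustration, and it assumes selections are applied left to right with no operator precedence, which is how "AND or OR with the previous selection" reads.

```python
import csv
import io


def select_ids(rows, id_variable, selections):
    """Return the ids of rows matching a chain of selections.

    `selections` is a list of (connector, variable, sign, values) tuples,
    where connector is "?", "&" or "|" and sign is "=" or "!=".
    """
    selected = []
    for row in rows:
        keep = None
        for connector, variable, sign, values in selections:
            # "=" matches when the cell is one of the values, "!=" when it is not.
            hit = (row[variable] in values) == (sign == "=")
            if keep is None:
                keep = hit            # first selection
            elif connector == "&":
                keep = keep and hit   # AND with the previous result
            else:
                keep = keep or hit    # OR with the previous result
        if keep:
            selected.append(row[id_variable])
    return selected


# Toy metadata, made up for this example.
METADATA = """sample_id,study,selection
s1,tcga,none
s2,gtex,manually_selected
s3,gtex,none
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(METADATA)))
    # Mirrors "?study=tcga|selection=manually_selected".
    chain = [
        ("?", "study", "=", ["tcga"]),
        ("|", "selection", "=", ["manually_selected"]),
    ]
    print(select_ids(rows, "sample_id", chain))  # ['s1', 's2']
```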