README.md
Lines changed: 132 additions & 4 deletions
Original file line number
Diff line number
Diff line change
@@ -23,6 +23,7 @@ A short description of each tool follows. There is more detail in the [tool refe
23
23
*[tsv-join](#tsv-join) - Join lines from multiple files using fields as a key.
24
24
*[tsv-uniq](#tsv-uniq) - Filter out duplicate lines using fields as a key.
25
25
*[tsv-select](#tsv-select) - Keep a subset of the columns in the input.
26
+
*[tsv-summarize](#tsv-summarize) - Aggregate field values, summarizing across the entire file or grouped by key.
26
27
*[csv2tsv](#csv2tsv) - Convert CSV files to TSV.
27
28
*[number-lines](#number-lines) - Number the input lines.
28
29
*[Useful bash aliases](#useful-bash-aliases)
@@ -42,6 +43,8 @@ This outputs lines where field 3 satisfies (100 <= fieldval <= 200) and field 4
42
43
$ tsv-filter --ne 3:0 file.tsv | wc -l
43
44
```
44
45
46
+
See the [tsv-filter reference](#tsv-filter-reference) for details.
47
+
45
48
### tsv-join
46
49
47
50
Joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. Example:
@@ -53,6 +56,8 @@ This reads `filter.tsv`, creating a lookup table keyed on fields 1 and 3. `data.
53
56
54
57
Common uses for `tsv-join` are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, which limits the size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.
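The hash semi-join described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the `tsv-join` implementation (the tool itself is written in D); the function name `semi_join` and the sample rows are hypothetical.

```python
def semi_join(filter_lines, data_lines, key_fields):
    """Yield data lines whose key also appears in the filter lines.

    key_fields holds 1-based field numbers, mirroring the tools' convention.
    """
    def key_of(line):
        fields = line.rstrip("\n").split("\t")
        return tuple(fields[i - 1] for i in key_fields)

    # Build the in-memory lookup table from the filter file.
    # This table is what bounds the size of filter file that fits in memory.
    keys = {key_of(line) for line in filter_lines}

    # Stream the data lines, keeping only those with a matching key.
    for line in data_lines:
        if key_of(line) in keys:
            yield line


# Hypothetical sample data, keyed on fields 1 and 3.
filter_rows = ["a\tx\t1\n", "b\ty\t2\n"]
data_rows = ["a\tfoo\t1\n", "a\tfoo\t2\n", "b\tbar\t2\n", "c\tbaz\t3\n"]
matched = list(semi_join(filter_rows, data_rows, key_fields=[1, 3]))
print("".join(matched), end="")
```

Only the filter side is held in memory; the data side is streamed, which is why the filter file size is the limiting factor.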
55
58
59
+
See the [tsv-join reference](#tsv-join-reference) for details.
60
+
56
61
### tsv-uniq
57
62
58
63
Similar in spirit to the Unix `uniq` tool, `tsv-uniq` filters a dataset so there is only one copy of each line. `tsv-uniq` goes beyond Unix `uniq` in a couple ways. First, data does not need to be sorted. Second, equivalence is based on a subset of fields rather than the full line. `tsv-uniq` can also be run in an 'equivalence class identification' mode, where equivalent entries are marked with a unique id rather than being filtered. An example uniq'ing a file on fields 2 and 3:
@@ -64,6 +69,8 @@ $ tsv-uniq -f 2,3 data.tsv
64
69
65
70
As with `tsv-join`, this uses an in-memory lookup table to record unique entries. This ultimately limits the data sizes that can be processed. The author has found that datasets with up to about 10 million unique entries work fine, but performance degrades after that.
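A rough Python sketch of the in-memory lookup approach is shown below. It is not the tool's code; `uniq_by_fields` and the sample rows are hypothetical names used for illustration. It shows why the input does not need to be sorted and why memory grows with the number of unique keys.

```python
def uniq_by_fields(lines, key_fields):
    """Yield the first line seen for each distinct key.

    key_fields holds 1-based field numbers.
    """
    seen = set()  # one entry per unique key; grows with the data
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[i - 1] for i in key_fields)
        if key not in seen:
            seen.add(key)
            yield line


# Hypothetical unsorted input; equivalence is based on fields 2 and 3.
rows = ["1\ta\tx\n", "2\tb\ty\n", "3\ta\tx\n", "4\ta\tz\n"]
unique_rows = list(uniq_by_fields(rows, key_fields=[2, 3]))
print("".join(unique_rows), end="")
```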
66
71
72
+
See the [tsv-uniq reference](#tsv-uniq-reference) for details.
73
+
67
74
### tsv-select
68
75
69
76
A version of the Unix `cut` utility with the additional ability to re-order the fields. It also helps with header lines by keeping only the header from the first file (`--header` option). The following command writes fields [4, 2, 9] from a pair of files to stdout:
Reordering fields and managing headers are useful enhancements over `cut`. However, much of the motivation for writing it was to explore the D programming language and provide a comparison point against other common approaches to this task. Code for `tsv-select` is a bit more liberal with comments pointing out D programming constructs than code for the other tools.
75
82
83
+
See the [tsv-select reference](#tsv-select-reference) for details.
84
+
85
+
### tsv-summarize
86
+
87
+
`tsv-summarize` runs aggregation operations on fields, for example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. A single row of output is produced for the former, multiple rows for the latter. As an example, consider the file `data.tsv`:
88
+
```
89
+
color weight
90
+
red 6
91
+
red 5
92
+
blue 15
93
+
red 4
94
+
blue 10
95
+
```
96
+
The sum and mean weights are calculated as follows:
Note that it was not necessary to sort the file prior to using `--group-by`; this convenience is built in.
109
+
110
+
A number of aggregation operations are available; see the [tsv-summarize reference](#tsv-summarize-reference) for details.
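For readers who want to see what a grouped sum and mean amount to, here is a minimal Python sketch using the color and weight values from the example file above. It is an illustration only, not how `tsv-summarize` is implemented; the function name `summarize` is hypothetical, and reporting groups in order of first appearance is a choice made for this sketch.

```python
def summarize(rows, group_field, value_field):
    """Return (key, sum, mean) per group without sorting the input.

    Field numbers are 0-based indexes into each row tuple.
    """
    totals = {}  # dicts preserve insertion order in Python 3.7+
    for row in rows:
        key = row[group_field]
        value = float(row[value_field])
        count, total = totals.get(key, (0, 0.0))
        totals[key] = (count + 1, total + value)
    return [(key, total, total / count)
            for key, (count, total) in totals.items()]


# The rows of the example data.tsv, header line omitted.
rows = [("red", "6"), ("red", "5"), ("blue", "15"), ("red", "4"), ("blue", "10")]
summary = summarize(rows, group_field=0, value_field=1)
for color, weight_sum, weight_mean in summary:
    print(color, weight_sum, weight_mean, sep="\t")
```

Because each group is tracked in a lookup table, the 'red' and 'blue' rows can be interleaved in the input.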
111
+
76
112
### csv2tsv
77
113
78
114
Sometimes you have a CSV file. This program does what you expect: convert CSV data to TSV. Example:
@@ -89,6 +125,8 @@ A simpler version of the Unix 'nl' program. It prepends a line number to each li
89
125
$ number-lines myfile.txt
90
126
```
91
127
128
+
See the [number-lines reference](#number-lines-reference) for details.
129
+
92
130
### Useful bash aliases
93
131
94
132
Any number of convenient utilities can be created using shell facilities. A couple are given below. One of the most useful is `tsv-header`, which shows the field number for each column name in the header. Very useful when using numeric field indexes.
There are a number of toolkits with similar functionality. Here are a few:
145
+
There are a number of toolkits that have similar or related functionality. Several are listed below. Those handling CSV files handle TSV files as well:
108
146
109
147
*[csvkit](https://github.com/wireservice/csvkit) - CSV tools, written in Python.
110
148
*[csvtk](https://github.com/shenwei356/csvtk) - CSV tools, written in Go.
111
-
*[dplyr](https://github.com/hadley/dplyr) - Tools for tabular data in R storage formats. Written in R and C++.
149
+
*[GNU datamash](https://www.gnu.org/software/datamash/) - Performs numeric, textual and statistical operations on TSV files. Written in C.
150
+
*[dplyr](https://github.com/hadley/dplyr) - Tools for tabular data in R storage formats. Runs in an R environment, code is in C++.
112
151
*[miller](https://github.com/johnkerl/miller) - CSV and JSON tools, written in C.
113
152
*[tsvutils](https://github.com/brendano/tsvutils) - TSV tools, especially rich in format converters. Written in Python.
114
153
*[xsv](https://github.com/BurntSushi/xsv) - CSV tools, written in Rust.
115
154
155
+
The different toolkits are certainly worth investigating if you work with tabular data files. Several have quite extensive feature sets. Each toolkit has its own strengths; your workflow and preferences are likely to fit some toolkits better than others.
156
+
157
+
If you are wondering about the rationale for using TSV files, there is a very nice discussion in the [tsvutils README](https://github.com/brendano/tsvutils#the-philosophy-of-tsvutils) file.
158
+
116
159
## Installation
117
160
118
-
Download a D compiler (http://dlang.org/download.html). These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.068 or later, LDC version 0.17.0 or later.
161
+
Download a D compiler (http://dlang.org/download.html). These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.070 or later, LDC version 1.0.0 or later.
119
162
120
163
Clone this repository, select a compiler, and run `make` from the top level directory:
121
164
```
@@ -166,6 +209,8 @@ The simplest tool is `number-lines`. It is useful as an illustration of the code
166
209
167
210
`tsv-join` and `tsv-filter` also have relatively straightforward functionality, but support more use cases resulting in more code. `tsv-filter` in particular has more elaborate setup steps that take a bit more time to understand. `tsv-filter` uses several features like delegates (closures) and regular expressions not used in the other tools.
168
211
212
+
`tsv-summarize` is one of the more recent tools. It uses a more object-oriented style than the other tools, which makes it relatively easy to add new operations. It also makes quite extensive use of built-in unit tests.
213
+
169
214
The `common` directory has code shared by the tools. At present this is very limited: one helper class written as a template. In addition to being an example of a simple template, it also makes use of D ranges, a very useful sequence abstraction, and built-in unit tests.
170
215
171
216
New tools can be added by creating a new directory and a source tree following the same pattern as one of the existing tools.
@@ -213,7 +258,7 @@ $ make test-nobuild
213
258
214
259
### Unit tests
215
260
216
-
D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use more traditional invocation of the command line executables and diffs the output against a "gold" result set. The exception is`csv2tsv`, which uses both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
261
+
D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use the more traditional approach of invoking the command line executables and diffing the output against a "gold" result set. The exceptions are `csv2tsv` and `tsv-summarize`. These use both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
217
262
218
263
Tests for the command line executables are in the `tests` directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as examples of command line tool invocations. See the `tests.sh` file in each `test` directory, and the `test` makefile target in `makeapp.mk`.
tsv-summarize reads tabular data files (tab-separated by default), tracks field values for each unique key, and runs summarization algorithms. Consider the file data.tsv:
645
+
```
646
+
make color time
647
+
ford blue 131
648
+
chevy green 124
649
+
ford red 128
650
+
bmw black 118
651
+
bmw black 126
652
+
ford blue 122
653
+
```
654
+
655
+
The min and average times for each make are generated by the command:
Using `--group-by 1,2` will group by both 'make' and 'color'. Omitting `--group-by` entirely summarizes fields for the full file.
669
+
670
+
The program tries to generate useful headers, but custom headers can be specified. Example (using `-g` and `-H` shortcuts for `--group-by` and `--header`):
Most operators take custom headers in a similar way, generally following:
676
+
```
677
+
--<operator-name> FIELD[:header]
678
+
```
679
+
680
+
Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Example:
681
+
```
682
+
--median 2,3,4
683
+
```
684
+
685
+
Summarization operators available are:
686
+
```
687
+
count min mean stddev
688
+
retain max median unique-count
689
+
first range mad mode
690
+
last sum var values
691
+
```
692
+
693
+
Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled. Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found.
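The memory difference between operators can be sketched in Python as follows. This is an assumption-laden illustration, not the tool's design: the class names `RunningSum` and `MedianCollector` are hypothetical, and the values are the 'time' column from the example file above.

```python
import statistics


class RunningSum:
    """Streaming operator: keeps a single number regardless of input size."""

    def __init__(self):
        self.total = 0.0

    def add(self, text):
        # float() raises ValueError on a non-numeric entry, standing in
        # here for the error-and-terminate behavior described above.
        self.total += float(text)

    def result(self):
        return self.total


class MedianCollector:
    """Operator that must retain every value until the end of input."""

    def __init__(self):
        self.values = []

    def add(self, text):
        self.values.append(float(text))

    def result(self):
        return statistics.median(self.values)


times = ["131", "124", "128", "118", "126", "122"]
total, median = RunningSum(), MedianCollector()
for t in times:
    total.add(t)
    median.add(t)
print(total.result(), median.result())
```

The sum needs constant memory, while the median's list grows with the number of values read, which is where large inputs start to matter.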
694
+
695
+
**Options:**
696
+
*`--h|help` - Brief help.
697
+
*`--help-verbose` - Print full help.
698
+
*`--g|group-by n[,n...]` - Fields to use as key.
699
+
*`--H|header` - Treat the first line of each file as a header.
700
+
*`--w|write-header` - Write an output header even if there is no input header.