Commit 9e38a50

Tsv summarize (#7)
Initial version of tsv-summarize.
1 parent be1b6c5 commit 9e38a50

29 files changed

Lines changed: 4507 additions & 24 deletions

NOTICES.txt

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+This file contains 3rd party license notifications.
+
+* D Programming Language (http://dlang.org/).
+
+Unit tests included in common/src/getopt_inorder.d were adapted from unit
+tests in the source code for the D standard library std.getopt module.
+The std.getopt module is licensed under Boost Licence 1.0
+(http://boost.org/LICENSE_1_0.txt). Copyright and license text are:
+
+Copyright Andrei Alexandrescu 2008 - 2015.
+
+Boost Software License - Version 1.0 - August 17th, 2003
+
+Permission is hereby granted, free of charge, to any person or organization
+obtaining a copy of the software and accompanying documentation covered by
+this license (the "Software") to use, reproduce, display, distribute,
+execute, and transmit the Software, and to prepare derivative works of the
+Software, and to permit third-parties to whom the Software is furnished to
+do so, all subject to the following:
+
+The copyright notices in the Software and this entire statement, including
+the above license grant, this restriction and the following disclaimer,
+must be included in all copies of the Software, in whole or in part, and
+all derivative works of the Software, unless such copies or derivative
+works are solely in the form of machine-executable object code generated by
+a source language processor.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
+SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
+FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
+ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.

README.md

Lines changed: 132 additions & 4 deletions
@@ -23,6 +23,7 @@ A short description of each tool follows. There is more detail in the [tool refe
 * [tsv-join](#tsv-join) - Join lines from multiple files using fields as a key.
 * [tsv-uniq](#tsv-uniq) - Filter out duplicate lines using fields as a key.
 * [tsv-select](#tsv-select) - Keep a subset of the columns in the input.
+* [tsv-summarize](#tsv-summarize) - Aggregate field values, summarizing across the entire file or grouped by key.
 * [csv2tsv](#csv2tsv) - Convert CSV files to TSV.
 * [number-lines](#number-lines) - Number the input lines.
 * [Useful bash aliases](#useful-bash-aliases)
@@ -42,6 +43,8 @@ This outputs lines where field 3 satisfies (100 <= fieldval <= 200) and field 4
 $ tsv-filter --ne 3:0 file.tsv | wc -l
 ```
 
+See the [tsv-filter reference](#tsv-filter-reference) for details.
+
 ### tsv-join
 
 Joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. Example:
@@ -53,6 +56,8 @@ This reads `filter.tsv`, creating a lookup table keyed on fields 1 and 3. `data.
 
 Common uses for `tsv-join` are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, this limits the ultimate size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.
 
+See the [tsv-join reference](#tsv-join-reference) for details.
+
 ### tsv-uniq
 
 Similar in spirit to the Unix `uniq` tool, `tsv-uniq` filters a dataset so there is only one copy of each line. `tsv-uniq` goes beyond Unix `uniq` in a couple ways. First, data does not need to be sorted. Second, equivalence is based on a subset of fields rather than the full line. `tsv-uniq` can also be run in an 'equivalence class identification' mode, where equivalent entries are marked with a unique id rather than being filtered. An example uniq'ing a file on fields 2 and 3:
@@ -64,6 +69,8 @@ $ tsv-uniq -f 2,3 data.tsv
 
 As with `tsv-join`, this uses an in-memory lookup table to record unique entries. This ultimately limits the data sizes that can be processed. The author has found that datasets with up to about 10 million unique entries work fine, but performance degrades after that.
 
+See the [tsv-uniq reference](#tsv-uniq-reference) for details.
+
 ### tsv-select
 
 A version of the Unix `cut` utility with the additional ability to re-order the fields. It also helps with header lines by keeping only the header from the first file (`--header` option). The following command writes fields [4, 2, 9] from a pair of files to stdout:
@@ -73,6 +80,35 @@ $ tsv-select -f 4,2,9 file1.tsv file2.tsv
 
 Reordering fields and managing headers are useful enhancements over `cut`. However, much of the motivation for writing it was to explore the D programming language and provide a comparison point against other common approaches to this task. Code for `tsv-select` is a bit more liberal with comments pointing out D programming constructs than code for the other tools.
 
+See the [tsv-select reference](#tsv-select-reference) for details.
+
+### tsv-summarize
+
+tsv-summarize runs aggregation operations on fields, for example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. A single row of output is produced for the former, multiple rows for the latter. As an example, consider the file `data.tsv`:
+```
+color weight
+red 6
+red 5
+blue 15
+red 4
+blue 10
+```
+The sum and mean weights are calculated as follows:
+```
+$ tsv-summarize --header --sum 2 --mean 2 data.tsv
+weight_sum weight_mean
+40 8
+
+$ tsv-summarize --header --group-by 1 --sum 2 --mean 2 data.tsv
+color weight_sum weight_mean
+red 15 5
+blue 25 12.5
+```
+
+Note that it was not necessary to sort the file prior to using `--group-by`; this convenience is built in.
+
+A number of aggregation operations are available; see the [tsv-summarize reference](#tsv-summarize-reference) for details.
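Reviewer note: the unsorted `--group-by` behaviour described in this README section can be illustrated with a small Python sketch. It is not the D implementation from this commit. The function name `summarize` is hypothetical, and keeping groups in first-seen order is an assumption based only on the `red`, `blue` order in the sample output.

```python
def summarize(rows, key_index, value_index):
    """Group rows by one field and return (key, sum, mean) tuples.

    Groups are kept in first-seen order, so the input does not need
    to be sorted beforehand (assumed to mirror the README example).
    """
    totals = {}  # key -> [running sum, count]; dicts keep insertion order
    for row in rows:
        acc = totals.setdefault(row[key_index], [0.0, 0])
        acc[0] += float(row[value_index])
        acc[1] += 1
    return [(key, total, total / count) for key, (total, count) in totals.items()]


# The data.tsv rows from the README example, header removed.
rows = [("red", "6"), ("red", "5"), ("blue", "15"), ("red", "4"), ("blue", "10")]
result = summarize(rows, 0, 1)
for key, total, mean in result:
    print(key, total, mean)
```

Only a running sum and a count are stored per key, which is why sum and mean do not need the whole file in memory.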
+
 ### csv2tsv
 
 Sometimes you have a CSV file. This program does what you expect: convert CSV data to TSV. Example:
@@ -89,6 +125,8 @@ A simpler version of the Unix 'nl' program. It prepends a line number to each li
 $ number-lines myfile.txt
 ```
 
+See the [number-lines reference](#number-lines-reference) for details.
+
 ### Useful bash aliases
 
 Any number of convenient utilities can be created using shell facilities. A couple are given below. One of the most useful is `tsv-header`, which shows the field number for each column name in the header. Very useful when using numeric field indexes.
@@ -104,18 +142,23 @@ tsv-sort () { sort -t $'\t' $* ; }
 
 ### Other toolkits
 
-There are a number of toolkits with similar functionality. Here are a few:
+There are a number of toolkits that have similar or related functionality. Several are listed below. Those handling CSV files handle TSV files as well:
 
 * [csvkit](https://github.com/wireservice/csvkit) - CSV tools, written in Python.
 * [csvtk](https://github.com/shenwei356/csvtk) - CSV tools, written in Go.
-* [dplyr](https://github.com/hadley/dplyr) - Tools for tabular data in R storage formats. Written in R and C++.
+* [GNU datamash](https://www.gnu.org/software/datamash/) - Performs numeric, textual and statistical operations on TSV files. Written in C.
+* [dplyr](https://github.com/hadley/dplyr) - Tools for tabular data in R storage formats. Runs in an R environment; code is in C++.
 * [miller](https://github.com/johnkerl/miller) - CSV and JSON tools, written in C.
 * [tsvutils](https://github.com/brendano/tsvutils) - TSV tools, especially rich in format converters. Written in Python.
 * [xsv](https://github.com/BurntSushi/xsv) - CSV tools, written in Rust.
 
+The different toolkits are certainly worth investigating if you work with tabular data files. Several have quite extensive feature sets. Each toolkit has its own strengths; your workflow and preferences are likely to fit some toolkits better than others.
+
+If you are wondering about the rationale for using TSV files, there is a very nice discussion in the [tsvutils README](https://github.com/brendano/tsvutils#the-philosophy-of-tsvutils) file.
+
 ## Installation
 
-Download a D compiler (http://dlang.org/download.html). These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.068 or later, LDC version 0.17.0 or later.
+Download a D compiler (http://dlang.org/download.html). These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.070 or later, LDC version 1.0.0 or later.
 
 Clone this repository, select a compiler, and run `make` from the top level directory:
 ```
@@ -166,6 +209,8 @@ The simplest tool is `number-lines`. It is useful as an illustration of the code
 
 `tsv-join` and `tsv-filter` also have relatively straightforward functionality, but support more use cases resulting in more code. `tsv-filter` in particular has more elaborate setup steps that take a bit more time to understand. `tsv-filter` uses several features like delegates (closures) and regular expressions not used in the other tools.
 
+`tsv-summarize` is one of the more recent tools. It uses a more object oriented style than the other tools; this makes it relatively easy to add new operations. It also makes quite extensive use of built-in unit tests.
+
 The `common` directory has code shared by the tools. At present this is very limited, one helper class written as a template. In addition to being an example of a simple template, it also makes use of D ranges, a very useful sequence abstraction, and built-in unit tests.
 
 New tools can be added by creating a new directory and a source tree following the same pattern as one of the existing tools.
@@ -213,7 +258,7 @@ $ make test-nobuild
 
 ### Unit tests
 
-D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use more traditional invocation of the command line executables and diffs the output against a "gold" result set. The exception is `csv2tsv`, which uses both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
+D has an excellent facility for adding unit tests right with the code. The `common` utility functions in this package take advantage of built-in unit tests. However, most of the command line executables do not, and instead use more traditional invocation of the command line executables and diff the output against a "gold" result set. The exceptions are `csv2tsv` and `tsv-summarize`. These use both built-in unit tests and tests against the executable. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
 
 Tests for the command line executables are in the `tests` directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as examples of command line tool invocations. See the `tests.sh` file in each `test` directory, and the `test` makefile target in `makeapp.mk`.
 
@@ -592,6 +637,89 @@ $ tsv-select -f 1 --rest first data.tsv
 $ # Move fields 7 and 3 to the start of the line
 $ tsv-select -f 7,3 --rest last data.tsv
 ```
+### tsv-summarize reference
+
+Synopsis: tsv-summarize [options] file [file...]
+
+tsv-summarize reads tabular data files (tab-separated by default), tracks field values for each unique key, and runs summarization algorithms. Consider the file data.tsv:
+```
+make color time
+ford blue 131
+chevy green 124
+ford red 128
+bmw black 118
+bmw black 126
+ford blue 122
+```
+
+The min and average times for each make are generated by the command:
+```
+$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv
+```
+
+This produces:
+```
+make time_min time_mean
+ford 122 127
+chevy 124 124
+bmw 118 122
+```
+
+Using `--group-by 1,2` will group by both 'make' and 'color'. Omitting `--group-by` entirely summarizes fields for the full file.
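Reviewer note: the three grouping modes described here (one key field, several key fields, no key at all) can be modelled with a tuple key. This is a rough Python sketch, not the tool's code. The helper name `min_and_mean` is hypothetical, and the 1-based field numbers follow the command-line convention shown in the README.

```python
def min_and_mean(rows, key_fields, value_field):
    """Return {key tuple: (min, mean)} using 1-based field numbers.

    An empty key_fields list puts every row under the key (), which
    corresponds to summarizing the full file.
    """
    groups = {}
    for row in rows:
        key = tuple(row[i - 1] for i in key_fields)
        groups.setdefault(key, []).append(float(row[value_field - 1]))
    return {key: (min(vals), sum(vals) / len(vals)) for key, vals in groups.items()}


# Rows of the data.tsv example from the reference section, header removed.
cars = [
    ("ford", "blue", "131"),
    ("chevy", "green", "124"),
    ("ford", "red", "128"),
    ("bmw", "black", "118"),
    ("bmw", "black", "126"),
    ("ford", "blue", "122"),
]

by_make = min_and_mean(cars, [1], 3)           # like --group-by 1
by_make_color = min_and_mean(cars, [1, 2], 3)  # like --group-by 1,2
whole_file = min_and_mean(cars, [], 3)         # like omitting --group-by
print(by_make)
```

The `by_make` values agree with the `time_min` and `time_mean` columns in the README output above.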
+
+The program tries to generate useful headers, but custom headers can be specified. Example (using `-H` and `-g` shortcuts for `--header` and `--group-by`):
+```
+$ tsv-summarize -H -g 1 --min 3:fastest --mean 3:average data.tsv
+```
+
+Most operators take custom headers in a similar way, generally following:
+```
+--<operator-name> FIELD[:header]
+```
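Reviewer note: a small Python sketch of how an operator argument of the form `FIELD[:header]` might be interpreted. It is only an illustration and is not taken from the D source. The function name `parse_operator_arg` and the `field<N>_<op>` fallback for headerless input are hypothetical. The `<input header>_<op>` default is inferred from sample outputs such as `time_min`, and rejecting a custom header with several fields follows the README prose.

```python
def parse_operator_arg(op_name, arg, input_header=None):
    """Split an argument like '3', '2,3,4' or '3:fastest'.

    Returns a list of (field_number, output_header) pairs.
    """
    fields_part, has_custom, custom = arg.partition(":")
    fields = [int(f) for f in fields_part.split(",")]
    if has_custom and len(fields) > 1:
        # Rule stated in the README prose: one field per custom header.
        raise ValueError("a custom header requires a single field")
    pairs = []
    for field in fields:
        if has_custom:
            name = custom
        elif input_header is not None:
            name = f"{input_header[field - 1]}_{op_name}"
        else:
            name = f"field{field}_{op_name}"  # hypothetical fallback name
        pairs.append((field, name))
    return pairs


header = ["make", "color", "time"]
print(parse_operator_arg("min", "3:fastest", header))
print(parse_operator_arg("mean", "3", header))
print(parse_operator_arg("median", "2,3", header))
```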
+
+Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Example:
+```
+--median 2,3,4
+```
+
+Summarization operators available are:
+```
+   count       min        mean       stdev
+   retain      max        median     unique-count
+   first       range      mad        mode
+   last        sum        var        values
+```
+
+Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled. Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found.
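Reviewer note: the memory trade-off described in this paragraph can be sketched in Python. The class names `RunningMean` and `Median` are hypothetical and unrelated to the classes in the D source. The point is only that a mean needs two numbers per key while a median needs every value, and that a non-numeric entry surfaces as an error.

```python
import statistics


class RunningMean:
    """Keeps only a count and a sum, however many values arrive."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, text):
        self.total += float(text)  # float() raises ValueError on non-numeric text
        self.count += 1

    def result(self):
        return self.total / self.count


class Median:
    """Has to keep every value, so memory grows with the input."""

    def __init__(self):
        self.values = []

    def add(self, text):
        self.values.append(float(text))

    def result(self):
        return statistics.median(self.values)


mean_op = RunningMean()
median_op = Median()
for text in ["131", "124", "128", "118", "126", "122"]:
    mean_op.add(text)
    median_op.add(text)

print(mean_op.result(), median_op.result())
```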
+
+**Options:**
+* `--h|help` - Brief help.
+* `--help-verbose` - Print full help.
+* `--g|group-by n[,n...]` - Fields to use as key.
+* `--H|header` - Treat the first line of each file as a header.
+* `--w|write-header` - Write an output header even if there is no input header.
+* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
+* `--v|values-delimiter CHR` - Values delimiter. Default: vertical bar (|). (Single byte UTF-8 characters only.)
+* `--p|float-precision NUM` - 'Precision' to use printing floating point numbers. Affects the number of digits printed and exponent use. Default: 12
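Reviewer note: the description of `--p|float-precision` (number of digits printed and exponent use) reads like C-style `%g` formatting. That is an assumption. The README does not say which format specifier the tool uses. Under that assumption, Python's `g` format type shows what a precision of 12 would do to typical results. The helper name `format_float` is hypothetical.

```python
def format_float(value, precision=12):
    """Format with at most `precision` significant digits, %g style.

    Trailing zeros are dropped, and exponent notation is used once the
    decimal exponent reaches the precision.
    """
    return format(value, f".{precision}g")


print(format_float(12.5))        # a mean such as the 'blue' row in the README
print(format_float(127.0))       # whole-number results print without '.0'
print(format_float(749 / 6))     # long fractions are cut to 12 digits
print(format_float(1e15))        # large magnitudes switch to an exponent
print(format_float(749 / 6, 3))  # a smaller precision keeps fewer digits
```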
+
+**Operators:**
+* `--count` - Count occurrences of each unique key.
+* `--count-header STR` - Count occurrences of each unique key, use header STR.
+* `--retain n[,n...]` - Retain one copy of the field.
+* `--first n[,n...][:STR]` - First value seen.
+* `--last n[,n...][:STR]` - Last value seen.
+* `--min n[,n...][:STR]` - Min value. (Numeric fields only.)
+* `--max n[,n...][:STR]` - Max value. (Numeric fields only.)
+* `--range n[,n...][:STR]` - Difference between min and max values. (Numeric fields only.)
+* `--sum n[,n...][:STR]` - Sum of the values. (Numeric fields only.)
+* `--mean n[,n...][:STR]` - Mean (average). (Numeric fields only.)
+* `--median n[,n...][:STR]` - Median value. (Numeric fields only. Reads all values into memory.)
+* `--mad n[,n...][:STR]` - Median absolute deviation from the median. Raw value, not scaled. (Numeric fields only. Reads all values into memory.)
+* `--var n[,n...][:STR]` - Variance. (Sample variance, numeric fields only.)
+* `--stdev n[,n...][:STR]` - Standard deviation. (Sample st.dev, numeric fields only.)
+* `--unique-count n[,n...][:STR]` - Number of unique values. (Reads all values into memory.)
+* `--mode n[,n...][:STR]` - Mode. The most frequent value. (Reads all values into memory.)
+* `--values n[,n...][:STR]` - All the values, separated by --v|values-delimiter. (Reads all values into memory.)
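Reviewer note: the statistical operators in this list can be pinned down with Python's standard library, which helps when checking expected test output. This sketch uses the `time` and `make` columns from the reference example. It takes "sample" variance and standard deviation to mean the n - 1 denominator and "raw, not scaled" MAD to mean no consistency constant is applied, as the descriptions state. How the tool breaks ties for `--mode` is not documented here, so the sample data was chosen to have a single most frequent value.

```python
import statistics
from collections import Counter

times = [131.0, 124.0, 128.0, 118.0, 126.0, 122.0]
makes = ["ford", "chevy", "ford", "bmw", "bmw", "ford"]

# Numeric operators.
value_range = max(times) - min(times)
median = statistics.median(times)
# Raw median absolute deviation: median of |x - median|, no scaling factor.
mad = statistics.median([abs(t - median) for t in times])
var = statistics.variance(times)  # sample variance (n - 1 denominator)
stdev = statistics.stdev(times)   # sample standard deviation

# Text operators.
unique_count = len(set(makes))
mode = Counter(makes).most_common(1)[0][0]
values = "|".join(makes)  # '|' is the default values delimiter

print(value_range, median, mad, var, stdev)
print(unique_count, mode, values)
```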
 
 ### csv2tsv reference
 
common/makefile

Lines changed: 9 additions & 5 deletions
@@ -1,11 +1,15 @@
 include ../makedefs.mk
 
-srcs = src/tsvutil.d
-
 release: ;
 debug: ;
 clean: ;
-test test-release:
-	@echo '---> Running $(notdir $(basename $(CURDIR))) unit tests.'
-	$(DCOMPILER) $(unittest_flags) $(srcs)
+test: unittest
+test-release: ;
+test-nobuild: ;
+
+.PHONY: unittest
+unittest:
+	@echo '---> Running $(notdir $(basename $(CURDIR))) unit tests'
+	$(DCOMPILER) $(common_srcs) $(unittest_flags) src/tsvutil.d
+	$(DCOMPILER) $(common_srcs) $(unittest_flags) src/getopt_inorder.d
 	@echo '---> Unit tests completed successfully.'
