Commit 24ffbb7

Perf updates march2017 (#38)
* Publish benchmarks with csv2tsv updates.
* Wording updates.
* doc updates
* formatting
* formatting
1 parent 948f149 commit 24ffbb7

3 files changed

Lines changed: 87 additions & 63 deletions

csv2tsv/src/csv2tsv.d

Lines changed: 2 additions & 1 deletion
@@ -234,7 +234,8 @@ void csv2tsv(InputRange, OutputRange)
 is(Unqual!(ElementType!InputRange) == ubyte))
 {
 /* Writes are buffered to avoid byte-at-a-time output penalty. Writes are done on
-  * newline boundaries. This ensures valid utf-8 character sequences are written.
+  * newline boundaries, a simple but effective strategy. It has the side benefit that
+  * multi-byte utf-8 sequences are not split up when output is done.
   * Note: In Phobos version 2.073 and earlier it is important to do output with char
   * rather than ubyte. See issue 17229 (https://issues.dlang.org/show_bug.cgi?id=17229).
   */
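
The buffering strategy described in the comment above can be sketched roughly as follows. This is a minimal illustration in D, not the code in `csv2tsv.d`: the helper name `writeBuffered`, the `flushSize` parameter, and its 4096-byte default are all assumptions made for the example. Bytes are accumulated as `char` (following the note about issue 17229) and the buffer is written out only immediately after a newline. Because the newline byte never occurs inside a multi-byte UTF-8 sequence, flushing there cannot split one.

```d
import std.array : appender;
import std.range.primitives : put;
import std.stdio : stdout;

/* Hypothetical helper for illustration; not the csv2tsv implementation.
 * Copies bytes to `output`, buffering them as char and flushing only
 * right after a newline once the buffer holds at least `flushSize` bytes.
 */
void writeBuffered(OutputRange)(const(ubyte)[] input, auto ref OutputRange output,
                                size_t flushSize = 4096)
{
    auto buffer = appender!(char[]);

    foreach (b; input)
    {
        buffer.put(cast(char) b);

        // Flush on newline boundaries only, never in the middle of a line.
        if (b == '\n' && buffer.data.length >= flushSize)
        {
            put(output, buffer.data);
            buffer.clear();
        }
    }

    // Write whatever remains, e.g. a final line without a trailing newline.
    if (buffer.data.length > 0) put(output, buffer.data);
}

void main()
{
    auto input = cast(const(ubyte)[]) "a\tb\nc\td\n";
    writeBuffered(input, stdout.lockingTextWriter);
}
```

With a larger `flushSize` the output range receives fewer, larger writes; the trade-off is memory held in the buffer, which can grow past `flushSize` when a single line is very long.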

docs/AboutTheCode.md

Lines changed: 3 additions & 2 deletions
@@ -42,9 +42,10 @@ These tools were implemented with these trade-offs in mind. The code was deliber

 A useful aspect of D is that is additional optimization can be made as the need arises. Coding of these tools did utilize a several optimizations that might not have been done in an initial effort. These include:

-* The helper class in the `common` directory. This is an optimization for processing only the first N fields needed to for the particular invocation of the tool.
-* The template expansion done in `tsv-select`.
+* The `InputFieldReordering` class in the `common` directory. This is an optimization for processing only the first N fields needed for the individual command invocation. This is used by several tools.
+* The template expansion done in `tsv-select`. This reduces the number of if-tests in the inner loop.
 * Reusing arrays every input line, without re-allocating. Some programmers would do this naturally on the first attempt, for others it would be a second pass optimization.
+* The output buffering done in `csv2tsv`. The algorithm used naturally generates a single byte at a time, but writing a byte-at-a-time incurs a costly system call. Buffering the writes sped the program up signficantly.

 ## Building and makefile

docs/Performance.md

Lines changed: 82 additions & 60 deletions
@@ -11,67 +11,82 @@ Performance is a key motivation for writing tools like this in D rather an inter

 To gauge D's performance, benchmarks were run using these tools and a number of similar tools written in native compiled programming languages. Included were traditional Unix tools as well as several specialized toolkits. Programming languages involved were C, Go, and Rust.

-The D programs performed extremely well on these benchmarks, exceeding the author's expectations. They were the fastest on five of the six benchmarks run, by often by significant margins. This is impressive given that very little low-level programming was done. High level language constructs were used throughout, including the simplest forms of file I/O (no manual buffer management), GC (no manual memory management), built-in associative arrays and other facilities from the standard library, liberal use of functional programming constructs, etc. Performance tuning was done to identify poorly performing constructs, and templates were used in several places to improve performance, but nothing extensive. See [Coding philosophy](AboutTheCode.md#coding-philosophy) for the rationale behind these choices.
+The D programs performed extremely well on these benchmarks, exceeding the author's expectations. They were the fastest on all six benchmarks run, often by significant margins. This is impressive given that very little low-level programming was done. High level language constructs were used throughout, including the simplest forms of file I/O (no manual buffer management), GC (no manual memory management), built-in associative arrays and other facilities from the standard library, liberal use of functional programming constructs, etc. Performance tuning was done to identify poorly performing constructs, and templates were used in several places to improve performance, but nothing extensive. See [Coding philosophy](AboutTheCode.md#coding-philosophy) for the rationale behind these choices, as well as descriptions of the performance optimizations that were done.

 As with most benchmarks, there are important caveats. The tools used for comparison are not exact equivalents, and in many cases have different design goals and capabilities likely to impact performance. Tasks performed are highly I/O dependent and follow similar computational patterns, so the results may not transfer to other applications.

 Despite limitations of the benchmarks, this is certainly a good result. The benchmarks engage a fair range of programming constructs, and the comparison basis includes nine distinct implementations and several long tenured Unix tools. As a practical matter, performance of the tools has changed the author's personal work habits, as calculations that used to take 15-20 seconds are now instantaneous, and calculations that took minutes often finish in 10 seconds or so.

 ## Comparative benchmarks

-Six different tasks were used as benchmarks. Two forms of row filtering: numeric comparisons and regular expression match. Column selection (aka 'cut'). Join two files on a common key. Simple statistical calculations (e.g. mean of column values). Convert CSV files to TSV. For each there are at least two other tools providing the same functionality. Reasonably large files were used, one 4.8 GB, 7 million rows, the other 2.7 GB, 14 million rows. Smaller files were also tested, in the 500 MB - 1 GB range. Those results are not reported, but were consistent with the larger file results given below.
+Six tasks were used as benchmarks. Two forms of row filtering: numeric comparisons and regular expression match. Column selection (aka 'cut'). Join two files on a common key. Simple statistical calculations (e.g. mean of column values). Convert CSV files to TSV. Reasonably large files were used, one 4.8 GB, 7 million rows, the other 2.7 GB, 14 million rows. Tests against smaller files gave results consistent with the larger file tests.

-Tests were conducted on a MacBook Pro, 2.8 GHz, 16 GB RAM, 4 cores, 500 GB of flash storage. All tools were updated to current releases the day the benchmarks were run (Feb 18, 2017). Several of the specialty toolkits were built from current source code. Compilers used were: LDC 1.1 (D compiler, Phobos 2.071.2); clang 8.0.0 (C/C++); Rust 1.15.1; Go 1.8. Run-time was measured using the `time` facility. Each benchmark was run three times and the fastest run recorded.
+Tests were conducted on a MacBook Pro, 16 GB RAM, 4 cores, and flash storage. All tools were updated to current versions, and several of the specialty toolkits were built from current source code. Run-time was measured using the `time` facility. Each benchmark was run three times and the fastest run recorded.

-The specialty toolkits have been anonymized in the tables below. The purpose of these benchmarks is to gauge performance of the D tools, not make comparisons between other toolkits. The exception is the csv-to-tsv test, where the fastest toolkit is named. Toolkits used are from the set listed under [Other toolkits](../README.md#other-toolkits) in the README. Python tools were not benchmarked, this would be a useful addition. Tools that run in in-memory environments like R were excluded.
+The specialty toolkits are anonymized in the tables below. The purpose of these benchmarks is to gauge performance of the D tools, not make comparisons between other toolkits. (The exception is the csv-to-tsv benchmark. Each tool has had the best time in a prior version of this report and was therefore identified.) Links and info for these toolkits can be found in [Other toolkits](../README.md#other-toolkits) in the README. Python tools were not benchmarked, this would be a useful addition. Tools that run in in-memory environments like R were excluded.

 The worst performers were the Unix tools shipped with the Mac (`cut`, etc). It's worth installing the GNU coreutils package if you use command line tools on the Mac. (MacPorts and Homebrew can install these tools.)

+### Top four in each benchmark
+
+This table shows fastest times for each benchmark. Times are in seconds. Complete results for each benchmark are in the succeeding sections.
+
+| Benchmark | Tool/Time | Tool/Time | Tool/Time | Tool/Time |
+| ---------------------- | ------------: | --------: | --------: | --------: |
+| **Numeric row filter** | tsv-filter | mawk | GNU awk | Toolkit 1 |
+| (4.8 GB, 7M lines) | 4.34 | 11.71 | 22.02 | 53.11 |
+| **Regex row filter** | tsv-filter | GNU awk | mawk | Toolkit 1 |
+| (2.7 GB, 14M lines) | 7.11 | 15.41 | 16.58 | 28.59 |
+| **Column selection** | tsv-select | mawk | GNU cut | Toolkit 1 |
+| (4.8 GB, 7M lines) | 4.09 | 9.38 | 12.27 | 19.12 |
+| **Join two files** | tsv-join | Toolkit 1 | Toolkit 2 | Toolkit 3 |
+| (4.8 GB, 7M lines) | 20.78 | 104.06 | 194.80 | 266.42 |
+| **Summary statistics** | tsv-summarize | Toolkit 1 | Toolkit 2 | Toolkit 3 |
+| (4.8 GB, 7M lines) | 15.83 | 40.27 | 48.10 | 62.97 |
+| **CSV-to-TSV** | csv2tsv | csvtk | xsv | |
+| (2.7 GB, 14M lines) | 27.41 | 36.26 | 40.40 | |
+
 ### Numeric filter benchmark

 This operation filters rows from a TSV file based on a numeric comparison (less than, greater than, etc) of two fields in a line. A 7 million line, 29 column, 4.8 GB numeric data file was used. The filter matched 1.2 million lines.

 | Tool | Time (seconds) |
 | --------------------- | -------------: |
-| **tsv-filter** | 4.31 |
-| mawk (M. Brennan Awk) | 11.66 |
-| GNU awk | 21.80 |
-| Toolkit 1 | 52.92 |
-| awk (Mac built-in) | 284.96 |
-
-_Version info: GNU awk: GNU coreutils 8.26; mawk 1.3.4; OS X awk 20070501._
+| **tsv-filter** | 4.34 |
+| mawk (M. Brennan Awk) | 11.71 |
+| GNU awk | 22.02 |
+| Toolkit 1 | 53.11 |
+| awk (Mac built-in) | 286.57 |

 ### Regular expression filter benchmark

 This operation filters rows from a TSV file based on a regular comparison against a field. The regular expression used was '[RD].*(ION[0-2])', it was matched against a text field. The input file was 14 million rows, 49 columns, 2.7 GB. The filter matched 150K rows. Other regular expressions were tried, results were similar.

 | Tool | Time (seconds) |
 | --------------------- | -------------: |
-| **tsv-filter** | 7.14 |
-| GNU awk | 15.29 |
-| mawk (M. Brennan Awk) | 16.45 |
-| Toolkit 1 | 28.46 |
-| Toolkit 2 | 41.86 |
-| awk (Mac built-in) | 113.05 |
-| Toolkit 3 | 123.22 |
+| **tsv-filter** | 7.11 |
+| GNU awk | 15.41 |
+| mawk (M. Brennan Awk) | 16.58 |
+| Toolkit 1 | 28.59 |
+| Toolkit 2 | 42.72 |
+| awk (Mac built-in) | 113.55 |
+| Toolkit 3 | 125.31 |

 ### Column selection benchmark

 This is the traditional Unix `cut` operation. Surprisingly, the `cut` implementations were not the fastest. The test selected fields 1, 8, 19 from a 7 million line, 29 column, 4.8 GB numeric data file.

 | Tool | Time (seconds) |
 | --------------------- |--------------: |
-| **tsv-select** | 4.06 |
-| mawk (M. Brennan Awk) | 9.12 |
-| GNU cut | 12.22 |
-| Toolkit 1 | 19.05 |
-| GNU awk | 32.94 |
-| Toolkit 2 | 36.44 |
-| Toolkit 3 | 46.06 |
-| cut (Mac built-in) | 77.79 |
-| awk (Mac built-in) | 286.29 |
-
-_Version info: GNU cut: GNU coreutils 8.26_
+| **tsv-select** | 4.09 |
+| mawk (M. Brennan Awk) | 9.38 |
+| GNU cut | 12.27 |
+| Toolkit 1 | 19.12 |
+| Toolkit 2 | 32.90 |
+| GNU awk | 33.09 |
+| Toolkit 3 | 46.32 |
+| cut (Mac built-in) | 78.01 |
+| awk (Mac built-in) | 287.19 |

 _Note: GNU cut is faster than tsv-select on small files, e.g. 250 MB. See [Relative performance of the tools](#relative-performance-of-the-tools) for an example._

@@ -81,61 +96,68 @@ This test was done taking a 7 million line, 29 column numeric data file, splitti

 | Tool | Time (seconds) |
 | ------------ |--------------: |
-| **tsv-join** | 20.56 |
-| Toolkit 1 | 111.55 |
-| Toolkit 2 | 192.90 |
-| Toolkit 3 | 244.02 |
+| **tsv-join** | 20.78 |
+| Toolkit 1 | 104.06 |
+| Toolkit 2 | 194.80 |
+| Toolkit 3 | 266.42 |

 ### Summary statistics

 This test generates a set of summary statistics from the columns in a TSV file. The specific calculations were based on summary statistics available in the different available tools that had high overlap. The sets were not identical, but were close enough for rough comparison. Roughly, the count, sum, min, max, mean, and standard deviation of three fields from a 7 million row, 4.8 GB data file.

 | Tool | Time (seconds) |
 | ------------------|--------------: |
-| **tsv-summarize** | 15.77 |
-| Toolkit 1 | 39.90 |
-| Toolkit 2 | 47.87 |
-| Toolkit 3 | 62.88 |
-| Toolkit 4 | 67.44 |
+| **tsv-summarize** | 15.83 |
+| Toolkit 1 | 40.27 |
+| Toolkit 2 | 48.10 |
+| Toolkit 3 | 62.97 |
+| Toolkit 4 | 67.17 |

 ### CSV to TSV conversion

-This test converted a CSV file to TSV format. The file used was 14 million rows, 49 columns, 2.7 GB. This is the one benchmark where the D tools were outperformed by other tools.
+This test converted a CSV file to TSV format. The file used was 14 million rows, 49 columns, 2.7 GB. This is the most competitive of the benchmarks, each of the tools having been the fastest in a previous version of this report. The D tool, `csv2tsv`, was third fastest until buffered writes were used in version 1.1.1.

 | Tool | Time (seconds) |
 | ----------- |--------------: |
-| csvtk | 37.01 |
-| Toolkit 1 | 40.18 |
-| **csv2tsv** | 53.27 |
+| **csv2tsv** | 27.41 |
+| csvtk | 36.26 |
+| xsv | 40.40 |
+
+### Details
+
+* Machine: MacBook Pro, 2.8 GHz, 16 GB RAM, 4 cores, 500 GB flash storage, OS X Sierra.
+* Test files: The 7 million line, 4.8 GB file is the HEPMASS training set from the UCI Machine Learning repository, available [here](http://archive.ics.uci.edu/ml/datasets/HEPMASS). The 2.7 GB, 14 million row file is from the Forest Inventory and Analysis Database, U.S. Department of Agriculture. The first 14 million lines from the TREE.csv file, available [here](https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html).
+* Tools: Latest versions available as of 3/3/2017. Several built from latest source. Versions: tsv-utils-dlang 1.1.1; GNU cut (GNU coreutils) 8.26; GNU Awk 4.1.4; mawk 1.3.4 (Michael Brennan awk); OS X awk 20070501; Miller (mlr) 5.0.0; csvtk v0.5.0; xsv 0.10.3; GNU datamash 1.1.1.
+* Compilers: LDC 1.1 (D compiler, Phobos 2.071.2); Apple clang 8.0.0 (C/C++); Rust 1.15.1; Go 1.8.

 ## DMD vs LDC

 It is understood that the LDC compiler produces faster executables than the DMD compiler. But how much faster? To get some data, the set of benchmarks described above was used to compare to LDC and DMD. In this case, DMD version 2.073.1 was compared to LDC 1.1. LDC 1.1 uses an older version of the standard library (Phobos), version 2.071.2. LDC was faster on all benchmarks, in some cases up to a 2x delta.

 | Test/tool | LDC Time (seconds) | DMD Time (seconds) |
 | ----------------------------- |------------------: | -----------------: |
-| Numeric filter (tsv-filter) | 4.31 | 5.54 |
-| Regex filter (tsv-filter) | 7.14 | 11.33 |
-| Column select (tsv-select) | 4.06 | 9.46 |
-| Join files (tsv-join) | 20.56 | 40.97 |
-| Stats summary (tsv-summarize) | 15.77 | 18.25 |
-| CSV-to-TSV (csv2tsv) | 53.27 | 64.91 |
+| Numeric filter (tsv-filter) | 4.34 | 5.56 |
+| Regex filter (tsv-filter) | 7.11 | 11.29 |
+| Column select (tsv-select) | 4.09 | 9.46 |
+| Join files (tsv-join) | 20.78 | 41.23 |
+| Stats summary (tsv-summarize) | 15.83 | 18.37 |
+| CSV-to-TSV (csv2tsv) | 27.41 | 56.08 |

 ## Relative performance of the tools

 Runs against a 4.5 million line, 279 MB file were used to get a relative comparison of the tools. The original file was a CSV file, allowing inclusion of `csv2tsv`. The TSV file generated was used in the other runs. Execution time when filtering data is highly dependent on the amount of output, so different output sizes were tried. `tsv-join` depends on the size of the filter file, a file the same size as the output was used in these tests. Performance also depends on the specific command line options selected, so actuals will vary.

 | Tool | Records output | Time (seconds) |
 | ------------ | -------------: | -------------: |
-| tsv-filter | 513,788 | 0.65 |
-| number-lines | 4,465,613 | 0.97 |
-| cut (GNU) | 4,465,613 | 0.98 |
-| tsv-filter | 4,125,057 | 1.02 |
-| tsv-join | 65,537 | 1.19 |
-| tsv-select | 4,465,613 | 1.20 |
-| tsv-uniq | 65,537 | 1.23 |
-| tsv-uniq | 4,465,613 | 3.51 |
-| csv2tsv | 4,465,613 | 5.13 |
-| tsv-join | 4,465,613 | 5.87 |
-
-Performance of `tsv-filter` looks especially good. Even when outputting a large number of records it is not far off GNU `cut`. Unlike the larger file tests, GNU `cut` is faster than `tsv-select` on this metric. This suggests GNU `cut` may have superior buffer management strategies when operating on smaller files. `tsv-join` and `tsv-uniq` are fast, but show an impact when larger hash tables are needed (4.5M entries in the slower cases). `csv2tsv` is decidely slower than the other tools given the work it is doing. Investigation indicates this is likely due to the byte-at-at-time output style it uses.
+| tsv-filter | 513,788 | 0.66 |
+| number-lines | 4,465,613 | 0.98 |
+| cut (GNU) | 4,465,613 | 0.99 |
+| tsv-filter | 4,125,057 | 1.03 |
+| tsv-join | 65,537 | 1.20 |
+| tsv-select | 4,465,613 | 1.21 |
+| tsv-uniq | 65,537 | 1.26 |
+| csv2tsv | 4,465,613 | 2.55 |
+| tsv-uniq | 4,465,613 | 3.52 |
+| tsv-join | 4,465,613 | 5.86 |
+
+Performance of `tsv-filter` looks especially good. Even when outputting a large number of records it is not far off GNU `cut`. Unlike the larger file tests, GNU `cut` is faster than `tsv-select` on this metric. This suggests GNU `cut` may have superior buffer management strategies when operating on smaller files. `tsv-join` and `tsv-uniq` are fast, but show an impact when larger hash tables are needed (4.5M entries in the slower cases). `csv2tsv` has improved significantly in the latest release, but is still slower than the other tools given the work it is doing.
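
The "up to a 2x delta" statement in the DMD vs LDC section can be checked against the updated table with a little arithmetic. The sketch below is illustrative only: the function name `speedup` and the tool labels are made up for this example, and the times are copied from the added rows of that table. Dividing DMD time by LDC time gives ratios ranging from about 1.16x (tsv-summarize) to about 2.31x (tsv-select), so the largest gap is somewhat above 2x.

```d
import std.stdio : writefln;

/// Ratio of DMD time to LDC time; values above 1.0 mean the LDC build was faster.
/// (Hypothetical helper, named for this example only.)
double speedup(double dmdSeconds, double ldcSeconds)
{
    return dmdSeconds / ldcSeconds;
}

void main()
{
    // Times in seconds, copied from the updated "DMD vs LDC" table.
    static immutable string[] tools =
        ["tsv-filter (numeric)", "tsv-filter (regex)", "tsv-select",
         "tsv-join", "tsv-summarize", "csv2tsv"];
    static immutable double[] ldc = [4.34, 7.11, 4.09, 20.78, 15.83, 27.41];
    static immutable double[] dmd = [5.56, 11.29, 9.46, 41.23, 18.37, 56.08];

    // Print one ratio per benchmark, two decimal places.
    foreach (i, tool; tools)
        writefln("%-22s %.2fx", tool, speedup(dmd[i], ldc[i]));
}
```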
