Perf updates march2017 (#38)

jondegenhardt · web-flow · commit 24ffbb766538 · 2017-03-04T11:00:37.000-08:00
* Publish benchmarks with csv2tsv updates.

* Wording updates.

* doc updates

* formatting

* formatting
diff --git a/csv2tsv/src/csv2tsv.d b/csv2tsv/src/csv2tsv.d
@@ -234,7 +234,8 @@ void csv2tsv(InputRange, OutputRange)
         is(Unqual!(ElementType!InputRange) == ubyte))
 {
     /* Writes are buffered to avoid byte-at-a-time output penalty. Writes are done on
-     * newline boundaries. This ensures valid utf-8 character sequences are written.
+     * newline boundaries, a simple but effective strategy. It has the side benefit that
+     * multi-byte utf-8 sequences are not split up when output is done.
      * Note: In Phobos version 2.073 and earlier it is important to do output with char
      * rather than ubyte. See issue 17229 (https://issues.dlang.org/show_bug.cgi?id=17229).
      */
diff --git a/docs/AboutTheCode.md b/docs/AboutTheCode.md
@@ -42,9 +42,10 @@ These tools were implemented with these trade-offs in mind. The code was deliber
 
 A useful aspect of D is that is additional optimization can be made as the need arises. Coding of these tools did utilize a several optimizations that might not have been done in an initial effort. These include:
 
-* The helper class in the `common` directory. This is an optimization for processing only the first N fields needed to for the particular invocation of the tool.
-* The template expansion done in `tsv-select`.
+* The `InputFieldReordering` class in the `common` directory. This is an optimization for processing only the first N fields needed for the individual command invocation. It is used by several tools.
+* The template expansion done in `tsv-select`. This reduces the number of if-tests in the inner loop.
 * Reusing arrays every input line, without re-allocating. Some programmers would do this naturally on the first attempt, for others it would be a second pass optimization.
+* The output buffering done in `csv2tsv`. The algorithm used naturally generates a single byte at a time, but writing a byte at a time incurs a costly system call. Buffering the writes sped the program up significantly.
 
 ## Building and makefile
 
diff --git a/docs/Performance.md b/docs/Performance.md
@@ -11,67 +11,82 @@ Performance is a key motivation for writing tools like this in D rather an inter
 
 To gauge D's performance, benchmarks were run using these tools and a number of similar tools written in native compiled programming languages. Included were traditional Unix tools as well as several specialized toolkits. Programming languages involved were C, Go, and Rust.
 
-The D programs performed extremely well on these benchmarks, exceeding the author's expectations. They were the fastest on five of the six benchmarks run, by often by significant margins. This is impressive given that very little low-level programming was done. High level language constructs were used throughout, including the simplest forms of file I/O (no manual buffer management), GC (no manual memory management), built-in associative arrays and other facilities from the standard library, liberal use of functional programming constructs, etc. Performance tuning was done to identify poorly performing constructs, and templates were used in several places to improve performance, but nothing extensive. See [Coding philosophy](AboutTheCode.md#coding-philosophy) for the rationale behind these choices.
+The D programs performed extremely well on these benchmarks, exceeding the author's expectations. They were the fastest on all six benchmarks run, often by significant margins. This is impressive given that very little low-level programming was done. High level language constructs were used throughout, including the simplest forms of file I/O (no manual buffer management), GC (no manual memory management), built-in associative arrays and other facilities from the standard library, liberal use of functional programming constructs, etc. Performance tuning was done to identify poorly performing constructs, and templates were used in several places to improve performance, but nothing extensive. See [Coding philosophy](AboutTheCode.md#coding-philosophy) for the rationale behind these choices, as well as descriptions of the performance optimizations that were done.
 
 As with most benchmarks, there are important caveats. The tools used for comparison are not exact equivalents, and in many cases have different design goals and capabilities likely to impact performance. Tasks performed are highly I/O dependent and follow similar computational patterns, so the results may not transfer to other applications.
 
 Despite limitations of the benchmarks, this is certainly a good result. The benchmarks engage a fair range of programming constructs, and the comparison basis includes nine distinct implementations and several long tenured Unix tools. As a practical matter, performance of the tools has changed the author's personal work habits, as calculations that used to take 15-20 seconds are now instantaneous, and calculations that took minutes often finish in 10 seconds or so.
 
 ## Comparative benchmarks
 
-Six different tasks were used as benchmarks. Two forms of row filtering: numeric comparisons and regular expression match. Column selection (aka 'cut'). Join two files on a common key. Simple statistical calculations (e.g. mean of column values). Convert CSV files to TSV. For each there are at least two other tools providing the same functionality. Reasonably large files were used, one 4.8 GB, 7 million rows, the other 2.7 GB, 14 million rows. Smaller files were also tested, in the 500 MB - 1 GB range. Those results are not reported, but were consistent with the larger file results given below.
+Six tasks were used as benchmarks. Two forms of row filtering: numeric comparisons and regular expression match. Column selection (aka 'cut'). Join two files on a common key. Simple statistical calculations (e.g. mean of column values). Convert CSV files to TSV. Reasonably large files were used, one 4.8 GB, 7 million rows, the other 2.7 GB, 14 million rows. Tests against smaller files gave results consistent with the larger file tests.
 
-Tests were conducted on a MacBook Pro, 2.8 GHz, 16 GB RAM, 4 cores, 500 GB of flash storage. All tools were updated to current releases the day the benchmarks were run (Feb 18, 2017). Several of the specialty toolkits were built from current source code. Compilers used were: LDC 1.1 (D compiler, Phobos 2.071.2); clang 8.0.0 (C/C++); Rust 1.15.1; Go 1.8. Run-time was measured using the `time` facility. Each benchmark was run three times and the fastest run recorded.
+Tests were conducted on a MacBook Pro, 16 GB RAM, 4 cores, and flash storage. All tools were updated to current versions, and several of the specialty toolkits were built from current source code. Run-time was measured using the `time` facility. Each benchmark was run three times and the fastest run recorded.
 
-The specialty toolkits have been anonymized in the tables below. The purpose of these benchmarks is to gauge performance of the D tools, not make comparisons between other toolkits. The exception is the csv-to-tsv test, where the fastest toolkit is named. Toolkits used are from the set listed under [Other toolkits](../README.md#other-toolkits) in the README. Python tools were not benchmarked, this would be a useful addition. Tools that run in in-memory environments like R were excluded.
+The specialty toolkits are anonymized in the tables below. The purpose of these benchmarks is to gauge performance of the D tools, not make comparisons between other toolkits. (The exception is the csv-to-tsv benchmark. Each tool in that benchmark has had the best time in a prior version of this report and was therefore identified.) Links and info for these toolkits can be found in [Other toolkits](../README.md#other-toolkits) in the README. Python tools were not benchmarked; this would be a useful addition. Tools that run in in-memory environments like R were excluded.
 
 The worst performers were the Unix tools shipped with the Mac (`cut`, etc). It's worth installing the GNU coreutils package if you use command line tools on the Mac. (MacPorts and Homebrew can install these tools.)
 
+### Top four in each benchmark
+
+This table shows the fastest times for each benchmark. Times are in seconds. Complete results for each benchmark are in the succeeding sections.
+
+| Benchmark              |     Tool/Time | Tool/Time | Tool/Time | Tool/Time |
+| ---------------------- | ------------: | --------: | --------: | --------: |
+| **Numeric row filter** |    tsv-filter |      mawk |   GNU awk | Toolkit 1 |
+| (4.8 GB, 7M lines)     |          4.34 |     11.71 |     22.02 |     53.11 |
+| **Regex row filter**   |    tsv-filter |   GNU awk |      mawk | Toolkit 1 |
+| (2.7 GB, 14M lines)    |          7.11 |     15.41 |     16.58 |     28.59 |
+| **Column selection**   |    tsv-select |      mawk |   GNU cut | Toolkit 1 |
+| (4.8 GB, 7M lines)     |          4.09 |      9.38 |     12.27 |     19.12 |
+| **Join two files**     |      tsv-join | Toolkit 1 | Toolkit 2 | Toolkit 3 |
+| (4.8 GB, 7M lines)     |         20.78 |    104.06 |    194.80 |    266.42 |
+| **Summary statistics** | tsv-summarize | Toolkit 1 | Toolkit 2 | Toolkit 3 |
+| (4.8 GB, 7M lines)     |         15.83 |     40.27 |     48.10 |     62.97 |
+| **CSV-to-TSV**         |       csv2tsv |     csvtk |       xsv |           |
+| (2.7 GB, 14M lines)    |         27.41 |     36.26 |     40.40 |           |
+
 ### Numeric filter benchmark
 
 This operation filters rows from a TSV file based on a numeric comparison (less than, greater than, etc) of two fields in a line. A 7 million line, 29 column, 4.8 GB numeric data file was used. The filter matched 1.2 million lines.
 
 | Tool                  | Time (seconds) |
 | --------------------- | -------------: |
-| **tsv-filter**        |           4.31 |
-| mawk (M. Brennan Awk) |          11.66 |
-| GNU awk               |          21.80 |
-| Toolkit 1             |          52.92 |
-| awk (Mac built-in)    |         284.96 |
-
-_Version info: GNU awk: GNU coreutils 8.26; mawk 1.3.4; OS X awk 20070501._
+| **tsv-filter**        |           4.34 |
+| mawk (M. Brennan Awk) |          11.71 |
+| GNU awk               |          22.02 |
+| Toolkit 1             |          53.11 |
+| awk (Mac built-in)    |         286.57 |
 
 ### Regular expression filter benchmark
 
 This operation filters rows from a TSV file based on a regular comparison against a field. The regular expression used was '[RD].*(ION[0-2])', it was matched against a text field. The input file was 14 million rows, 49 columns, 2.7 GB. The filter matched 150K rows. Other regular expressions were tried, results were similar.
 
 | Tool                  | Time (seconds) |
 | --------------------- | -------------: |
-| **tsv-filter**        |           7.14 |
-| GNU awk               |          15.29 |
-| mawk (M. Brennan Awk) |          16.45 |
-| Toolkit 1             |          28.46 |
-| Toolkit 2             |          41.86 |
-| awk (Mac built-in)    |         113.05 |
-| Toolkit 3             |         123.22 |
+| **tsv-filter**        |           7.11 |
+| GNU awk               |          15.41 |
+| mawk (M. Brennan Awk) |          16.58 |
+| Toolkit 1             |          28.59 |
+| Toolkit 2             |          42.72 |
+| awk (Mac built-in)    |         113.55 |
+| Toolkit 3             |         125.31 |
 
 ### Column selection benchmark
 
 This is the traditional Unix `cut` operation. Surprisingly, the `cut` implementations were not the fastest. The test selected fields 1, 8, 19 from a 7 million line, 29 column, 4.8 GB numeric data file.
 
 | Tool                  | Time (seconds) |
 | --------------------- |--------------: |
-| **tsv-select**        |           4.06 |
-| mawk (M. Brennan Awk) |           9.12 |
-| GNU cut               |          12.22 |
-| Toolkit 1             |          19.05 |
-| GNU awk               |          32.94 |
-| Toolkit 2             |          36.44 |
-| Toolkit 3             |          46.06 |
-| cut (Mac built-in)    |          77.79 |
-| awk (Mac built-in)    |         286.29 |
-
-_Version info: GNU cut: GNU coreutils 8.26_
+| **tsv-select**        |           4.09 |
+| mawk (M. Brennan Awk) |           9.38 |
+| GNU cut               |          12.27 |
+| Toolkit 1             |          19.12 |
+| Toolkit 2             |          32.90 |
+| GNU awk               |          33.09 |
+| Toolkit 3             |          46.32 |
+| cut (Mac built-in)    |          78.01 |
+| awk (Mac built-in)    |         287.19 |
 
 _Note: GNU cut is faster than tsv-select on small files, e.g. 250 MB. See [Relative performance of the tools](#relative-performance-of-the-tools) for an example._
 
@@ -81,61 +96,68 @@ This test was done taking a 7 million line, 29 column numeric data file, splitti
 
 | Tool         | Time (seconds) |
 | ------------ |--------------: |
-| **tsv-join** |          20.56 |
-| Toolkit 1    |         111.55 |
-| Toolkit 2    |         192.90 |
-| Toolkit 3    |         244.02 |
+| **tsv-join** |          20.78 |
+| Toolkit 1    |         104.06 |
+| Toolkit 2    |         194.80 |
+| Toolkit 3    |         266.42 |
 
 ### Summary statistics
 
 This test generates a set of summary statistics from the columns in a TSV file. The specific calculations were based on summary statistics available in the different available tools that had high overlap. The sets were not identical, but were close enough for rough comparison. Roughly, the count, sum, min, max, mean, and standard deviation of three fields from a 7 million row, 4.8 GB data file.
 
 | Tool              | Time (seconds) |
 | ------------------|--------------: |
-| **tsv-summarize** |          15.77 |
-| Toolkit 1         |          39.90 |
-| Toolkit 2         |          47.87 |
-| Toolkit 3         |          62.88 |
-| Toolkit 4         |          67.44 |
+| **tsv-summarize** |          15.83 |
+| Toolkit 1         |          40.27 |
+| Toolkit 2         |          48.10 |
+| Toolkit 3         |          62.97 |
+| Toolkit 4         |          67.17 |
 
 ### CSV to TSV conversion
 
-This test converted a CSV file to TSV format. The file used was 14 million rows, 49 columns, 2.7 GB. This is the one benchmark where the D tools were outperformed by other tools.
+This test converted a CSV file to TSV format. The file used was 14 million rows, 49 columns, 2.7 GB. This is the most competitive of the benchmarks, each of the tools having been the fastest in a previous version of this report. The D tool, `csv2tsv`, was third fastest until buffered writes were used in version 1.1.1.
 
 | Tool        | Time (seconds) |
 | ----------- |--------------: |
-| csvtk       |          37.01 |
-| Toolkit 1   |          40.18 |
-| **csv2tsv** |          53.27 |
+| **csv2tsv** |          27.41 |
+| csvtk       |          36.26 |
+| xsv         |          40.40 |
+
+### Details
+
+* Machine: MacBook Pro, 2.8 GHz, 16 GB RAM, 4 cores, 500 GB flash storage, OS X Sierra.
+* Test files: The 7 million line, 4.8 GB file is the HEPMASS training set from the UCI Machine Learning repository, available [here](http://archive.ics.uci.edu/ml/datasets/HEPMASS). The 2.7 GB, 14 million row file is from the Forest Inventory and Analysis Database, U.S. Department of Agriculture. It consists of the first 14 million lines of the TREE.csv file, available [here](https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html).
+* Tools: Latest versions available as of 3/3/2017. Several built from latest source. Versions: tsv-utils-dlang 1.1.1; GNU cut (GNU coreutils) 8.26; GNU Awk 4.1.4; mawk 1.3.4 (Michael Brennan awk); OS X awk 20070501; Miller (mlr) 5.0.0; csvtk v0.5.0; xsv 0.10.3; GNU datamash 1.1.1.
+* Compilers: LDC 1.1 (D compiler, Phobos 2.071.2); Apple clang 8.0.0 (C/C++); Rust 1.15.1; Go 1.8.
 
 ## DMD vs LDC
 
 It is understood that the LDC compiler produces faster executables than the DMD compiler. But how much faster? To get some data, the set of benchmarks described above was used to compare to LDC and DMD. In this case, DMD version 2.073.1 was compared to LDC 1.1. LDC 1.1 uses an older version of the standard library (Phobos), version 2.071.2. LDC was faster on all benchmarks, in some cases up to a 2x delta.
 
 | Test/tool                     | LDC Time (seconds) | DMD Time (seconds) |
 | ----------------------------- |------------------: | -----------------: |
-| Numeric filter (tsv-filter)   |               4.31 |               5.54 |
-| Regex filter (tsv-filter)     |               7.14 |              11.33 |
-| Column select (tsv-select)    |               4.06 |               9.46 |
-| Join files (tsv-join)         |              20.56 |              40.97 |
-| Stats summary (tsv-summarize) |              15.77 |              18.25 |
-| CSV-to-TSV (csv2tsv)          |              53.27 |              64.91 |
+| Numeric filter (tsv-filter)   |               4.34 |               5.56 |
+| Regex filter (tsv-filter)     |               7.11 |              11.29 |
+| Column select (tsv-select)    |               4.09 |               9.46 |
+| Join files (tsv-join)         |              20.78 |              41.23 |
+| Stats summary (tsv-summarize) |              15.83 |              18.37 |
+| CSV-to-TSV (csv2tsv)          |              27.41 |              56.08 |
 
 ## Relative performance of the tools
 
 Runs against a 4.5 million line, 279 MB file were used to get a relative comparison of the tools. The original file was a CSV file, allowing inclusion of `csv2tsv`. The TSV file generated was used in the other runs. Execution time when filtering data is highly dependent on the amount of output, so different output sizes were tried. `tsv-join` depends on the size of the filter file, a file the same size as the output was used in these tests. Performance also depends on the specific command line options selected, so actuals will vary.
 
 | Tool         | Records output | Time (seconds) |
 | ------------ | -------------: | -------------: |
-| tsv-filter   |        513,788 |           0.65 |
-| number-lines |      4,465,613 |           0.97 |
-| cut (GNU)    |      4,465,613 |           0.98 |
-| tsv-filter   |      4,125,057 |           1.02 |
-| tsv-join     |         65,537 |           1.19 |
-| tsv-select   |      4,465,613 |           1.20 |
-| tsv-uniq     |         65,537 |           1.23 |
-| tsv-uniq     |      4,465,613 |           3.51 |
-| csv2tsv      |      4,465,613 |           5.13 |
-| tsv-join     |      4,465,613 |           5.87 |
-
-Performance of `tsv-filter` looks especially good. Even when outputting a large number of records it is not far off GNU `cut`. Unlike the larger file tests, GNU `cut` is faster than `tsv-select` on this metric. This suggests GNU `cut` may have superior buffer management strategies when operating on smaller files. `tsv-join` and `tsv-uniq` are fast, but show an impact when larger hash tables are needed (4.5M entries in the slower cases). `csv2tsv` is decidely slower than the other tools given the work it is doing. Investigation indicates this is likely due to the byte-at-at-time output style it uses.
+| tsv-filter   |        513,788 |           0.66 |
+| number-lines |      4,465,613 |           0.98 |
+| cut (GNU)    |      4,465,613 |           0.99 |
+| tsv-filter   |      4,125,057 |           1.03 |
+| tsv-join     |         65,537 |           1.20 |
+| tsv-select   |      4,465,613 |           1.21 |
+| tsv-uniq     |         65,537 |           1.26 |
+| csv2tsv      |      4,465,613 |           2.55 |
+| tsv-uniq     |      4,465,613 |           3.52 |
+| tsv-join     |      4,465,613 |           5.86 |
+
+Performance of `tsv-filter` looks especially good. Even when outputting a large number of records it is not far off GNU `cut`. Unlike the larger file tests, GNU `cut` is faster than `tsv-select` on this metric. This suggests GNU `cut` may have superior buffer management strategies when operating on smaller files. `tsv-join` and `tsv-uniq` are fast, but show an impact when larger hash tables are needed (4.5M entries in the slower cases). `csv2tsv` has improved significantly in the latest release, but is still slower than the other tools given the work it is doing.