Benchmark for splitting strings with a delimiter in C
iterManagement of iteration states in the function stackiter-structThe iterator struct and its associated functions (methods)strtokStandard C library function for tokenizing a string
Benchmark for the binaries compiled with -O3:
$ make -s bench 2>/dev/null
Benchmark 1: bin/sols-iter
Time (mean ± σ): 22.8 ms ± 2.6 ms [User: 21.0 ms, System: 1.6 ms]
Range (min … max): 21.7 ms … 51.3 ms 129 runs
Benchmark 2: bin/sols-iter-struct
Time (mean ± σ): 22.6 ms ± 1.1 ms [User: 21.0 ms, System: 1.4 ms]
Range (min … max): 21.8 ms … 32.1 ms 134 runs
Benchmark 3: bin/sols-strtok
Time (mean ± σ): 31.9 ms ± 1.6 ms [User: 30.1 ms, System: 1.7 ms]
Range (min … max): 29.8 ms … 40.1 ms 90 runs
Summary
bin/sols-iter-struct ran
1.01 ± 0.13 times faster than bin/sols-iter
1.42 ± 0.10 times faster than bin/sols-strtok
Benchmark 1: bin/smss-iter
Time (mean ± σ): 21.4 ms ± 0.3 ms [User: 20.8 ms, System: 0.4 ms]
Range (min … max): 20.6 ms … 22.5 ms 140 runs
Benchmark 2: bin/smss-iter-struct
Time (mean ± σ): 21.5 ms ± 0.6 ms [User: 20.9 ms, System: 0.4 ms]
Range (min … max): 20.7 ms … 26.4 ms 141 runs
Benchmark 3: bin/smss-strtok
Time (mean ± σ): 30.7 ms ± 1.6 ms [User: 30.2 ms, System: 0.4 ms]
Range (min … max): 28.9 ms … 37.5 ms 98 runs
Summary
bin/smss-iter ran
1.01 ± 0.03 times faster than bin/smss-iter-struct
1.44 ± 0.08 times faster than bin/smss-strtok
In the benchmark, iter and iter-struct were roughly 1.4 times faster than
strtok. Be warned that these were the result on my machine, and also that the
input strings are kind of contrived. We should not take this to imply that
strtok is slower in general.
When we run tests for sols (split one long string) and smss (split many
short strings), they fail with the output suggesting that the strtok version
produced a different output:
$ make test-sols
diff -q --from-file output/sols-iter output/sols-iter-struct output/sols-strtok
Files output/sols-iter and output/sols-strtok differ
make: *** [Makefile:54: test-sols] Error 1
$ make test-smss
diff -q --from-file output/smss-iter output/smss-iter-struct output/smss-strtok
Files output/smss-iter and output/smss-strtok differ
make: *** [Makefile:54: test-smss] Error 1
This is intentional. Let's see what is different about them in the sols
case:
$ make --silent run > /dev/null
$ wc --lines --total never output/sols-{iter{,-struct},strtok}
1650689 output/sols-iter
1650689 output/sols-iter-struct
1650688 output/sols-strtok
$ diff --from-file output/sols-strtok \
<(head -1650688 output/sols-iter) \
<(head -1650688 output/sols-iter-struct)
$ tail -n 3 output/sols-{iter{,-struct},strtok}
==> output/sols-iter <==
z
z
==> output/sols-iter-struct <==
z
z
==> output/sols-strtok <==
z
z
z
As you can see, strtok version does not end with an extra white space. This
is because strtok stops its execution when the last token is the empty
string. The input string as defined in bench.h ends with a comma as a
delimiter like so: "...z,z,z,". This behavior is indistinguishable from when
the input string ends without a comma: "...z,z,z". I decided to include the
last token in my implementation of iter and iter-struct, so that is why it
outputs an excessive empty line at the end.
For the smss test, you can see that the strtok version does not output any
white spaces:
$ grep '^$' output/smss-iter | wc -l
1024
$ grep '^$' output/smss-iter-struct | wc -l
1024
$ grep '^$' output/smss-strtok | wc -l
0
$ diff --ignore-blank-lines --from-file output/smss-{iter{,-struct},strtok}
In the implementation of iter and iter-struct, you can get this behavior by
not handling the finish case, thereby freeing a byte-length space in the
stack previously occupied by a boolean flag. I would not necessarily say this
is a bug in strtok. It is probably more apt to say that the initial interface
ended up being exposed as a public API without caring too much about this
corner case in mind.
Through more digging, i found a SO answer saying:
That's a limitation of strtok. The designers had whitespace-separated tokens in mind.
In the same thread, strsep(3) function is mentioned, and the man page of
which states that:
The strsep() function was introduced as a replacement for strtok(3), since the latter cannot handle empty fields. However, strtok(3) conforms to C89/C99 and hence is more portable.
delimited-ext.c has a struct definition for
DelimitedIter2, which accepts multiple delimiters.