GitHub - exflikt/c-delim-iter-bench: Benchmark for splitting strings with a delimiter in C

Benchmark for splitting strings with a delimiter in C

iter Management of iteration states in the function stack
iter-struct The iterator struct and its associated functions (methods)
strtok Standard C library function for tokenizing a string

Benchmark

Benchmark for the binaries compiled with -O3:

$ make -s bench 2>/dev/null
Benchmark 1: bin/sols-iter
  Time (mean ± σ):      22.8 ms ±   2.6 ms    [User: 21.0 ms, System: 1.6 ms]
  Range (min … max):    21.7 ms …  51.3 ms    129 runs

Benchmark 2: bin/sols-iter-struct
  Time (mean ± σ):      22.6 ms ±   1.1 ms    [User: 21.0 ms, System: 1.4 ms]
  Range (min … max):    21.8 ms …  32.1 ms    134 runs

Benchmark 3: bin/sols-strtok
  Time (mean ± σ):      31.9 ms ±   1.6 ms    [User: 30.1 ms, System: 1.7 ms]
  Range (min … max):    29.8 ms …  40.1 ms    90 runs

Summary
  bin/sols-iter-struct ran
    1.01 ± 0.13 times faster than bin/sols-iter
    1.42 ± 0.10 times faster than bin/sols-strtok
Benchmark 1: bin/smss-iter
  Time (mean ± σ):      21.4 ms ±   0.3 ms    [User: 20.8 ms, System: 0.4 ms]
  Range (min … max):    20.6 ms …  22.5 ms    140 runs

Benchmark 2: bin/smss-iter-struct
  Time (mean ± σ):      21.5 ms ±   0.6 ms    [User: 20.9 ms, System: 0.4 ms]
  Range (min … max):    20.7 ms …  26.4 ms    141 runs

Benchmark 3: bin/smss-strtok
  Time (mean ± σ):      30.7 ms ±   1.6 ms    [User: 30.2 ms, System: 0.4 ms]
  Range (min … max):    28.9 ms …  37.5 ms    98 runs

Summary
  bin/smss-iter ran
    1.01 ± 0.03 times faster than bin/smss-iter-struct
    1.44 ± 0.08 times faster than bin/smss-strtok

In the benchmark, iter and iter-struct were roughly 1.4 times faster than strtok. Be warned that these were the result on my machine, and also that the input strings are kind of contrived. We should not take this to imply that strtok is slower in general.

Test Failure

When we run tests for sols (split one long string) and smss (split many short strings), they fail with the output suggesting that the strtok version produced a different output:

$ make test-sols
diff -q --from-file output/sols-iter output/sols-iter-struct output/sols-strtok
Files output/sols-iter and output/sols-strtok differ
make: *** [Makefile:54: test-sols] Error 1
$ make test-smss
diff -q --from-file output/smss-iter output/smss-iter-struct output/smss-strtok
Files output/smss-iter and output/smss-strtok differ
make: *** [Makefile:54: test-smss] Error 1

This is intentional. Let's see what is different about them in the sols case:

$ make --silent run > /dev/null
$ wc --lines --total never output/sols-{iter{,-struct},strtok}
1650689 output/sols-iter
1650689 output/sols-iter-struct
1650688 output/sols-strtok
$ diff --from-file output/sols-strtok \
<(head -1650688 output/sols-iter) \
<(head -1650688 output/sols-iter-struct)
$ tail -n 3 output/sols-{iter{,-struct},strtok}
==> output/sols-iter <==
z
z


==> output/sols-iter-struct <==
z
z


==> output/sols-strtok <==
z
z
z

As you can see, strtok version does not end with an extra white space. This is because strtok stops its execution when the last token is the empty string. The input string as defined in bench.h ends with a comma as a delimiter like so: "...z,z,z,". This behavior is indistinguishable from when the input string ends without a comma: "...z,z,z". I decided to include the last token in my implementation of iter and iter-struct, so that is why it outputs an excessive empty line at the end.

For the smss test, you can see that the strtok version does not output any white spaces:

$ grep '^$' output/smss-iter | wc -l
1024
$ grep '^$' output/smss-iter-struct | wc -l
1024
$ grep '^$' output/smss-strtok | wc -l
0
$ diff --ignore-blank-lines --from-file output/smss-{iter{,-struct},strtok}

In the implementation of iter and iter-struct, you can get this behavior by not handling the finish case, thereby freeing a byte-length space in the stack previously occupied by a boolean flag. I would not necessarily say this is a bug in strtok. It is probably more apt to say that the initial interface ended up being exposed as a public API without caring too much about this corner case in mind.

Through more digging, i found a SO answer saying:

That's a limitation of strtok. The designers had whitespace-separated tokens in mind.

In the same thread, strsep(3) function is mentioned, and the man page of which states that:

The strsep() function was introduced as a replacement for strtok(3), since the latter cannot handle empty fields. However, strtok(3) conforms to C89/C99 and hence is more portable.

Feature Extension

delimited-ext.c has a struct definition for DelimitedIter2, which accepts multiple delimiters.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
split-many-short-strings		split-many-short-strings
split-one-long-string		split-one-long-string
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
bench.h		bench.h
delimited-ext.c		delimited-ext.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchmark

Test Failure

Feature Extension

About

Uh oh!

Contributors

Uh oh!

Languages

License

exflikt/c-delim-iter-bench

Folders and files

Latest commit

History

Repository files navigation

Benchmark

Test Failure

Feature Extension

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages