Skip to content

Latest commit

 

History

History
143 lines (115 loc) · 4.84 KB

README.md

File metadata and controls

143 lines (115 loc) · 4.84 KB

Benchmark for splitting strings with a delimiter in C

  • iter Management of iteration states in the function stack
  • iter-struct The iterator struct and its associated functions (methods)
  • strtok Standard C library function for tokenizing a string

Benchmark

Benchmark for the binaries compiled with -O3:

$ make -s bench 2>/dev/null
Benchmark 1: bin/sols-iter
  Time (mean ± σ):      22.8 ms ±   2.6 ms    [User: 21.0 ms, System: 1.6 ms]
  Range (min … max):    21.7 ms …  51.3 ms    129 runs

Benchmark 2: bin/sols-iter-struct
  Time (mean ± σ):      22.6 ms ±   1.1 ms    [User: 21.0 ms, System: 1.4 ms]
  Range (min … max):    21.8 ms …  32.1 ms    134 runs

Benchmark 3: bin/sols-strtok
  Time (mean ± σ):      31.9 ms ±   1.6 ms    [User: 30.1 ms, System: 1.7 ms]
  Range (min … max):    29.8 ms …  40.1 ms    90 runs

Summary
  bin/sols-iter-struct ran
    1.01 ± 0.13 times faster than bin/sols-iter
    1.42 ± 0.10 times faster than bin/sols-strtok
Benchmark 1: bin/smss-iter
  Time (mean ± σ):      21.4 ms ±   0.3 ms    [User: 20.8 ms, System: 0.4 ms]
  Range (min … max):    20.6 ms …  22.5 ms    140 runs

Benchmark 2: bin/smss-iter-struct
  Time (mean ± σ):      21.5 ms ±   0.6 ms    [User: 20.9 ms, System: 0.4 ms]
  Range (min … max):    20.7 ms …  26.4 ms    141 runs

Benchmark 3: bin/smss-strtok
  Time (mean ± σ):      30.7 ms ±   1.6 ms    [User: 30.2 ms, System: 0.4 ms]
  Range (min … max):    28.9 ms …  37.5 ms    98 runs

Summary
  bin/smss-iter ran
    1.01 ± 0.03 times faster than bin/smss-iter-struct
    1.44 ± 0.08 times faster than bin/smss-strtok

In the benchmark, iter and iter-struct were roughly 1.4 times faster than strtok. Be warned that these were the result on my machine, and also that the input strings are kind of contrived. We should not take this to imply that strtok is slower in general.

Test Failure

When we run tests for sols (split one long string) and smss (split many short strings), they fail with the output suggesting that the strtok version produced a different output:

$ make test-sols
diff -q --from-file output/sols-iter output/sols-iter-struct output/sols-strtok
Files output/sols-iter and output/sols-strtok differ
make: *** [Makefile:54: test-sols] Error 1
$ make test-smss
diff -q --from-file output/smss-iter output/smss-iter-struct output/smss-strtok
Files output/smss-iter and output/smss-strtok differ
make: *** [Makefile:54: test-smss] Error 1

This is intentional. Let's see what is different about them in the sols case:

$ make --silent run > /dev/null
$ wc --lines --total never output/sols-{iter{,-struct},strtok}
1650689 output/sols-iter
1650689 output/sols-iter-struct
1650688 output/sols-strtok
$ diff --from-file output/sols-strtok \
<(head -1650688 output/sols-iter) \
<(head -1650688 output/sols-iter-struct)
$ tail -n 3 output/sols-{iter{,-struct},strtok}
==> output/sols-iter <==
z
z


==> output/sols-iter-struct <==
z
z


==> output/sols-strtok <==
z
z
z

As you can see, strtok version does not end with an extra white space. This is because strtok stops its execution when the last token is the empty string. The input string as defined in bench.h ends with a comma as a delimiter like so: "...z,z,z,". This behavior is indistinguishable from when the input string ends without a comma: "...z,z,z". I decided to include the last token in my implementation of iter and iter-struct, so that is why it outputs an excessive empty line at the end.

For the smss test, you can see that the strtok version does not output any white spaces:

$ grep '^$' output/smss-iter | wc -l
1024
$ grep '^$' output/smss-iter-struct | wc -l
1024
$ grep '^$' output/smss-strtok | wc -l
0
$ diff --ignore-blank-lines --from-file output/smss-{iter{,-struct},strtok}

In the implementation of iter and iter-struct, you can get this behavior by not handling the finish case, thereby freeing a byte-length space in the stack previously occupied by a boolean flag. I would not necessarily say this is a bug in strtok. It is probably more apt to say that the initial interface ended up being exposed as a public API without caring too much about this corner case in mind.

Through more digging, i found a SO answer saying:

That's a limitation of strtok. The designers had whitespace-separated tokens in mind.

In the same thread, strsep(3) function is mentioned, and the man page of which states that:

The strsep() function was introduced as a replacement for strtok(3), since the latter cannot handle empty fields. However, strtok(3) conforms to C89/C99 and hence is more portable.

Feature Extension

delimited-ext.c has a struct definition for DelimitedIter2, which accepts multiple delimiters.