v3 spec makes confusing performance analysis #89

@Norbo11

Description

After implementing V3 we reach the "Is it fast?" section, which does some initial analysis of the performance.

First we are told to run:

time (bin/make_world | bin/step_world 0.1 1 > /dev/null)
time (bin/make_world | bin/$USER/step_world_v3_opencl 0.1 1 > /dev/null)

On my machine, the OpenCL version is indeed a little slower. This makes sense: we're only taking 1 step, so the realised speedup is not enough to counteract the overhead of initialising OpenCL.
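
Just to illustrate that this is a fixed start-up cost, a longer run amortises it away. A quick sketch, assuming the second argument is the step count as in the commands above:

time (bin/make_world | bin/step_world 0.1 1000 > /dev/null)                  # 1000 steps, sequential
time (bin/make_world | bin/$USER/step_world_v3_opencl 0.1 1000 > /dev/null)  # 1000 steps, OpenCL setup cost now negligible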

Then we are told that there is an extra overhead due to the formatting of world data, and are asked to run:

time (bin/make_world 1000 0.1   0  > /dev/null)   # text format
time (bin/make_world 1000 0.1   1  > /dev/null)   # binary format

And indeed we see that the binary format is, of course, a lot quicker to produce. However, the spec then says: "I would recommend using the binary format when not debugging, as otherwise your improvements in speed will be swamped by conversions." Surely this is an incorrect claim: the conversion only happens once at the beginning of every run, and as soon as our StepWorld function is invoked there is no file reading involved. So all we're really doing is shaving ~1 second off the total, which may be significant only for small-ish runs.
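
To make the point concrete, here is a rough check (only a sketch; same argument order as above, and assuming the trailing flag to step_world selects binary I/O as in the commands further down). The gap between the text and binary runs should stay at roughly the same ~1 second however many steps we take, because the conversion is paid once per run:

bin/make_world 1000 0.1 0 > /tmp/world.txt                          # text world
bin/make_world 1000 0.1 1 > /tmp/world.bin                          # binary world
time (cat /tmp/world.txt | bin/step_world 0.1 1000 0 > /dev/null)   # text in/out
time (cat /tmp/world.bin | bin/step_world 0.1 1000 1 > /dev/null)   # binary in/out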

The more confusing part comes next. We are told to run the following commands:

time (cat /tmp/world.bin | bin/step_world 0.1 0  1 > /dev/null) 
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 0  1 > /dev/null)
time (cat /tmp/world.bin | bin/step_world 0.1 1  1 > /dev/null) 
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 1  1 > /dev/null)

This would allow us to compare the time taken to execute 1 vs 0 steps in both versions of the program, thus computing the "marginal cost of each frame". I disagree with this for the following reasons:

  • There is a fair amount of noise around the timing of each command, so for this marginal cost to make any sense we would have to average the difference across a large number of samples (see the sketch after this list);
  • By the same argument as at the beginning, 1 step is not enough to overcome the overhead of setting up OpenCL. Surely the benefits are only realised once the number of steps passes a certain point. For example, setting n = 1000 and comparing the two on my machine shows that the sequential version takes 44s while the OpenCL version takes 3.5s.
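
A more robust way to estimate the marginal cost per frame would be something along these lines (just a sketch, reusing the spec's commands with a larger step count): repeat each measurement a few times, average, and then take (average at n = 1000 - average at n = 0) / 1000 for each version. That both averages out the per-run noise and amortises the OpenCL setup cost:

# five repeats of each measurement, to average out run-to-run noise
for i in 1 2 3 4 5; do time (cat /tmp/world.bin | bin/step_world 0.1 0    1 > /dev/null); done
for i in 1 2 3 4 5; do time (cat /tmp/world.bin | bin/step_world 0.1 1000 1 > /dev/null); done
for i in 1 2 3 4 5; do time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 0    1 > /dev/null); done
for i in 1 2 3 4 5; do time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 1000 1 > /dev/null); done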

Given that, how can the spec claim that "the GPU time per frame will be similar to or, more likely, quite a bit slower than the original CPU"? Our OpenCL implementation may not be fully optimised yet (inefficient memory accesses, etc.), but at n = 1000 it works out to roughly 3.5 ms per frame against roughly 44 ms for the sequential version, which is very far from being slower.
