Description
After implementing V3 we reach the "Is it fast?" section, which performs some initial analysis of the performance.
First we are told to run:
```shell
time (bin/make_world | bin/step_world 0.1 1 > /dev/null)
time (bin/make_world | bin/$USER/step_world_v3_opencl 0.1 1 > /dev/null)
```
On my machine, the OpenCL version is indeed slightly slower. This makes sense: we are only taking one step, so the realised speedup is not enough to offset the overhead of initialising OpenCL.
Next, the spec points out an extra overhead due to the formatting of the world data, and asks us to run:
```shell
time (bin/make_world 1000 0.1 0 > /dev/null) # text format
time (bin/make_world 1000 0.1 1 > /dev/null) # binary format
```
And indeed we see that the binary format is, of course, much quicker to produce. However, the spec then says that "I would recommend using the binary format when not debugging, as otherwise your improvements in speed will be swamped by conversions." Surely this claim is incorrect. The conversion happens only once, at the beginning of each run; as soon as our StepWorld function is invoked, there is no file reading involved. So all we are really doing is shaving ~1 second off the total, which is significant only for small-ish runs.
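To put that ~1 second in perspective, here is a back-of-envelope sketch (the run lengths are illustrative, based on my own timings below, not from the spec) of what fraction of total runtime the one-off conversion saving represents:

```python
def saving_fraction(saving_s: float, total_s: float) -> float:
    """Fraction of total runtime recovered by skipping the text conversion."""
    return saving_s / total_s

# A ~1 s one-off saving matters for a short run but is noise for a long one.
print(f"2 s run:  {saving_fraction(1.0, 2.0):.0%}")   # short run: large share
print(f"44 s run: {saving_fraction(1.0, 44.0):.0%}")  # long run: negligible
```

The saving is fixed while the stepping cost grows with the number of frames, so its relative importance shrinks as runs get longer.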
The more confusing part comes next. We are told to run the following commands:
```shell
time (cat /tmp/world.bin | bin/step_world 0.1 0 1 > /dev/null)
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 0 1 > /dev/null)
time (cat /tmp/world.bin | bin/step_world 0.1 1 1 > /dev/null)
time (cat /tmp/world.bin | bin/$USER/step_world_v3_opencl 0.1 1 1 > /dev/null)
```
This would allow us to compare the time taken to execute 1 step vs 0 steps in each version of the program, thus computing the "marginal cost of each frame". I disagree with this for the following reasons:
- There is a fair amount of noise in the timing of each command, so for this marginal cost to mean anything we would have to average the difference over a large number of samples;
- By the same logic as at the beginning, 1 step is not enough to overcome the overhead of starting OpenCL. Surely the benefits are only realised once the number of steps passes a certain threshold. For example, setting n = 1000 and comparing the two on my machine, the sequential version takes 44s while the OpenCL version takes 3.5s.
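On the first point, a minimal sketch of averaging the timings (assuming Python is available; the commented-out commands are the spec's, and the repeat count is an arbitrary choice):

```python
import statistics
import subprocess
import time

def mean_time(cmd: str, repeats: int = 20) -> float:
    """Mean wall-clock time (seconds) of a shell command over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Hypothetical usage with the spec's commands:
# one_step  = mean_time("cat /tmp/world.bin | bin/step_world 0.1 1 1 > /dev/null")
# zero_step = mean_time("cat /tmp/world.bin | bin/step_world 0.1 0 1 > /dev/null")
# print("marginal cost of one frame:", one_step - zero_step, "s")
```

Even then, the difference between two noisy means is itself noisy, so the per-frame estimate from a 1-vs-0-step comparison carries a wide error bar.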
Therefore, how can the spec claim that "the GPU time per frame will be similar to or, more likely, quite a bit slower than the original CPU"? It may be that our OpenCL implementation is not fully optimised yet (inefficient memory accesses, etc.), but it is very far from being slower.
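For what it's worth, dividing my n = 1000 totals from above by the step count (so the fixed overheads are amortised away) gives a rough per-frame comparison:

```python
# Rough per-frame cost from my n = 1000 measurements; with this many steps
# the one-off OpenCL initialisation cost is spread thinly enough to ignore.
seq_total_s, ocl_total_s, steps = 44.0, 3.5, 1000
seq_per_frame = seq_total_s / steps  # ~0.044 s per frame
ocl_per_frame = ocl_total_s / steps  # ~0.0035 s per frame
print(f"sequential: {seq_per_frame * 1000:.1f} ms/frame")
print(f"opencl:     {ocl_per_frame * 1000:.1f} ms/frame")
print(f"speedup:    {seq_total_s / ocl_total_s:.1f}x")
```

By this estimate the GPU is roughly an order of magnitude faster per frame, not slower.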