Description
To support destructible and sinkable types, in particular atomic refcounted types, tasks must zero-init their data buffer.
This is introduced in #144 to properly support the refcounted FlowEvent.
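For illustration, here is a minimal sketch of why a recycled, non-zeroed buffer breaks such types. The `Event` type, field names and buffer size below are made up for the example (this is not Weave's actual FlowEvent or Task layout): assigning into the buffer first destroys whatever "previous value" it appears to hold, and a destructor that relies on `isNil` to return early will chase a garbage pointer instead.

```nim
# Hypothetical illustration; `Event` is NOT Weave's actual FlowEvent type.
type
  EventObj = object
    refCount: int
  Event = object
    p: ptr EventObj          # expected to be nil while the slot is unused

proc `=destroy`(e: var Event) =
  # Relies on zero-initialized memory: nil means "nothing to release".
  if not e.p.isNil:
    dec e.p.refCount
    if e.p.refCount == 0:
      deallocShared(e.p)

var dataBuf: array[144, byte]        # stand-in for a recycled task data buffer
for b in dataBuf.mitems: b = 0xAA    # garbage left over from a previous task

let slot = cast[ptr Event](dataBuf.addr)
# The assignment below first destroys the "old" Event sitting in the slot.
# With garbage bytes, `e.p` is a dangling non-nil pointer and the destructor
# dereferences it; after `zeroMem(dataBuf.addr, dataBuf.len)` it would be nil
# and the destroy a harmless no-op.
slot[] = Event(p: createShared(EventObj))
```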
However, this adds a significant 17% overhead on very short-running tasks like Fibonacci(40).
Note: "significant" is relative; fib(40) spawns on the order of 2^40 tasks (i.e. trillions of them), and each task does less work than the zero-initialization itself.
The change: https://github.com/mratsim/weave/pull/144/files#diff-c5d52e34ee454756d2c729faec306b62L113
```diff
 proc newTaskFromCache*(): Task =
-  result = workerContext.taskCache.pop()
+  result = workerContext.taskCache.pop0()
   if result.isNil:
-    result = myMemPool().borrow(deref(Task))
-  # Zeroing is expensive, it's 96 bytes
-  # result.fn = nil # Always overwritten
-  # result.parent = nil # Always overwritten
-  # result.scopedBarrier = nil # Always overwritten
-  result.prev = nil
-  result.next = nil
-  result.start = 0
-  result.cur = 0
-  result.stop = 0
-  result.stride = 0
-  result.futures = nil
-  result.isLoop = false
-  result.hasFuture = false
+    result = myMemPool().borrow0(deref(Task))
+  # The task must be fully zero-ed including the data buffer
+  # otherwise datatypes that use custom destructors
+  # and that rely on "myPointer.isNil" to return early
+  # may read recycled garbage data.
+  # "FlowEvent" is such an example
+  # TODO: The perf cost to the following is 17% as measured on fib(40)
+  # # Zeroing is expensive, it's 96 bytes
+  # # result.fn = nil # Always overwritten
+  # # result.parent = nil # Always overwritten
+  # # result.scopedBarrier = nil # Always overwritten
+  # result.prev = nil
+  # result.next = nil
+  # result.start = 0
+  # result.cur = 0
+  # result.stop = 0
+  # result.stride = 0
+  # result.futures = nil
+  # result.isLoop = false
+  # result.hasFuture = false
```
The simple optimization would be to zero-init only the part of the buffer that will be overwritten.
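A rough sketch of that first option, using a hypothetical `setupTask` helper and a simplified Task layout (not Weave's actual API): only the first `sizeof(Args)` bytes can ever be interpreted as a "previous value" by the assignment, so only they need to be cleared.

```nim
const TaskDataSize = 144           # assumed buffer size, for illustration only

type
  TaskObj = object
    data: array[TaskDataSize, byte]   # stand-in for Weave's Task data buffer
  Task = ptr TaskObj

proc setupTask[Args: tuple](task: Task, args: sink Args) =
  # Option 1: zero only the bytes the argument tuple will occupy.
  # Assigning through the cast destroys the destination's previous value,
  # so those bytes (and only those) must read as nil/zero beforehand.
  zeroMem(task.data.addr, sizeof(Args))
  cast[ptr Args](task.data.addr)[] = args
```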
An alternative would be to zero-init the buffer only for non-trivial types, as detected by `supportsCopyMem`.
A third possibility would be to do both.
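A sketch of the second and third options combined, with the same hypothetical helper and field names as above: skip zeroing entirely when the argument tuple is trivially copyable, and otherwise clear only the bytes it occupies.

```nim
import std/typetraits              # supportsCopyMem

const TaskDataSize = 144           # assumed buffer size, for illustration only

type
  TaskObj = object
    data: array[TaskDataSize, byte]   # stand-in for Weave's Task data buffer
  Task = ptr TaskObj

proc setupTaskChecked[Args: tuple](task: Task, args: sink Args) =
  when not supportsCopyMem(Args):
    # Non-trivial arguments (custom destructor/copy hooks, e.g. refcounted
    # types): the destination must be zero-ed so that destroying its
    # "previous value" is a no-op.
    zeroMem(task.data.addr, sizeof(Args))
  # Trivial arguments are just bit-copied; no destructor can run on garbage,
  # so no zeroing is needed for them at all.
  cast[ptr Args](task.data.addr)[] = args
```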