Description
Each reference counted `tree_nodet` of `irept` essentially contains the following (a rough sketch is shown below the list):
- Reference count
- An id / data field
- A key-value look-up (`named_sub`)
- A linear collection (`sub`)
- Cached hash
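For reference, a rough sketch of such a node. The field names, container choices and the `unsigned` / `std::string` stand-ins for `irep_idt` are simplified assumptions for illustration, not the exact cbmc definitions:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct tree_nodet_sketch;

// An irept is essentially a (reference counted) pointer to a tree node.
struct irept_sketch
{
  tree_nodet_sketch *data = nullptr;
};

struct tree_nodet_sketch
{
  unsigned ref_count = 0;                        // reference count
  unsigned id = 0;                               // id / data field; stand-in for irep_idt, a 4 byte table index
  std::map<std::string, irept_sketch> named_sub; // key-value look-up
  std::vector<irept_sketch> sub;                 // linear collection
  std::size_t hash_code = 0;                     // cached hash
};
```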
I put some statistics-gathering code into the destructor of the `tree_node` class, and the few examples I looked at seemed to have a reasonable proportion of leaf nodes, where the `named_sub` and `sub` fields are empty. Despite these collections being empty, we still allocate the complete bookkeeping structure for them, which uses 48 bytes of memory, even though only the 4 bytes of the id value actually need to be stored. The hash look-up for a leaf node doesn't benefit from caching, as the hash is just the id in this case.
On x64 the pointer to `tree_nodet` is 8 bytes in size, with the data structure being 8-byte aligned. The alignment means the least significant bit of these pointers is guaranteed to be 0. Therefore we could re-purpose this bit and the memory of the pointer for the storage of leaf nodes: instead of storing a pointer for a leaf node, we can set the least significant bit to 1 and write the value of the id to the 4 highest bytes. Using the 4 highest bytes keeps the memory of the id aligned, as these bytes are 4-byte aligned, and allows forming pointers/references to it without performing bit shifts. I think this can be achieved relatively transparently to code using `irept`, by allocating the full data structure only when constructing an `irept` with a non-empty `named_sub` / `sub`, or when writable references to these fields are requested.
I think the above scheme, dual-purposing the pointer memory for either a pointer or an id value, would save the allocation and de-allocation of the entire 48-byte `tree_node` data structure for nearly all leaf nodes in memory. I have not yet had the time to implement and debug a portable version of this idea, so I don't yet know what its performance characteristics would look like. However, I wanted to share the details of this idea for future reference.
Activity
tautschnig commented on Oct 13, 2023
This does sound like a great idea, but perhaps we need to get there in two steps. The first one is to figure out how to move away from `static dt empty_d` and instead use `nullptr` wherever we currently have `&empty_d`. This will require tweaking the `read()` interface, I believe.

Once that's in place we can likely wrap `data` in a `union`. We probably don't even need to play 4-highest-bytes games and can instead write `2 * id + 1`, just like the encoding of `literalt` uses.
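Roughly, that encoding could look like the following. This is only an illustrative sketch using a plain `uintptr_t` slot rather than a `union` for brevity; the names are made up:

```cpp
#include <cstdint>

struct tree_nodet; // full node, only needed for non-leaf ireps

// Hypothetical encoding of the data slot: a leaf stores 2 * id + 1, so the
// low bit distinguishes it from a genuine tree_nodet pointer, which is
// always even thanks to alignment. This is the same trick literalt uses to
// pack variable number and sign into one integer.
using data_slot = std::uintptr_t;

inline data_slot encode_node(tree_nodet *node)
{
  return reinterpret_cast<data_slot>(node); // least significant bit is 0
}

inline data_slot encode_leaf(std::uint32_t id)
{
  return 2 * static_cast<data_slot>(id) + 1; // least significant bit is 1
}

inline bool is_leaf(data_slot slot)
{
  return (slot & 1) != 0;
}

inline std::uint32_t leaf_id(data_slot slot)
{
  // Unlike the 4-highest-bytes layout this needs a shift to decode, but it
  // avoids any byte-order assumptions.
  return static_cast<std::uint32_t>(slot >> 1);
}
```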
thomasspriggs commented on Oct 15, 2023
If we swap the type of `empty_d` to being a pointer, then it can still hold the empty value, so there is no need to allow `nullptr`. The value is the empty/`""`/0 `irep_idt` and the set flag. The `read()` function is a larger can of worms, as it can potentially return a non-valid reference. Similarly, `write()` could use some careful attention.

Storing the `irep_idt` in the 4 highest bytes keeps any pointers/references to it correctly aligned, which has performance benefits.

I have a work-in-progress branch here which passes the regression tests but not the unit tests - develop...thomasspriggs:cbmc:tas/irep_optimisation1_unions
It still needs -
- `read`/`write`.
- Fixing the unit tests and any actual problems they are revealing.

I have done some initial run time testing using `time ctest -L CORE -E unit -V -j30` for my branch. This yields -

With the same machine and test command on `develop` the results were -

This seems to show that the branch does at least offer a minor run-time benefit, without considering memory usage.
tautschnig commented on Oct 16, 2023
Thank you for this great piece of work! A couple of comments:
05ccb75009f67c (develop)
thomasspriggs/cbmc@78b7968452 (thomasspriggs/tas/irep_optimisation1_unions):
thomasspriggs commented on Oct 18, 2023
The const cast fixes are now merged.
I think the unit test failures were due to the setup of the particular machine I was using for debugging. I had not investigated them, but they are not reproducible on my usual machine.
kroening commented on Nov 3, 2023
A concern is the need for low-level pointer tweaks, and the need for branches when reading ireps.
Please consider the following alternative. Ireps have sharing, which you can use to make any frequently used irep cheaper, including the id-only ones.
thomasspriggs commented on Nov 3, 2023
@kroening - I share your concern with regard to making sure that low-level tweaks work as intended, which is one of the reasons I have made further work on this approach a relatively low priority. Although it seems like @tautschnig thinks it is worth pursuing further.
With regards to branches - my understanding is that on modern architectures if there is a cache miss and the value needs to be fetched from main memory, then the wait time can be far longer than for a pipeline interruption from a branch mis-prediction. Therefore an implementation with more branches may be counter-intuitively faster due to better cache locality. This is why it is important to evaluate performance changes based on benchmarks.
For a leaf node, the only data we need to store is the id. With `dstringt` the full string has already been stored in a table and deduplicated, so applying sharing to an id-only `irept` adds an unnecessary additional layer of indirection and reference counting, in my opinion. The `dstringt` is as small as or smaller than the reference count or the pointer/index to a reference counted structure, because it only holds an index itself. When the data in question is that small, copying it uses less memory than forming a reference to it. This is why I think that sharing or copy-on-write is not the optimal approach for id-only
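As a rough illustration of the size argument (assuming, as is my understanding, that `dstringt` holds nothing but a single table index):

```cpp
#include <cstdint>

// Illustrative stand-in, not the cbmc definition: the assumption is that
// dstringt only stores an index into a deduplicated string table.
struct dstringt_like
{
  std::uint32_t table_index;
};

static_assert(
  sizeof(dstringt_like) <= sizeof(void *),
  "copying the id is no bigger than storing a pointer to a shared copy, "
  "before even counting the shared node's reference count");
```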
`irept`s.

With the approach I outlined, the data of an id-only `irept` is stored in place of the reference to it, effectively inside its parent `irept`. This reduction in the levels of indirection is why I hypothesise that this approach may have beneficial cache-locality properties. Reducing the amount of memory used to hold the data structure may also have benefits in terms of memory bandwidth usage, beyond the advantage of lowering overall memory requirements.