
Idea for irept memory usage reduction. #7960

Open

Description

@thomasspriggs

Each reference-counted tree_nodet of irept essentially contains:

  • A reference count
  • An id / data field
  • A key-value look-up (named_sub)
  • A linear collection (sub)
  • A cached hash

I put some statistics-gathering code into the destructor of the tree_node class, and the few examples I looked at seemed to have a reasonable proportion of leaf nodes, where the named_sub and sub fields are empty. Despite these collections being empty, we still allocate the complete bookkeeping structure for them, which uses 48 bytes of memory, even though the id value which actually needs storing is only 4 bytes. The hash look-up for a leaf node doesn't benefit from caching, as the hash is simply the id in this case.

On x64 the pointer to tree_nodet is 8 bytes in size, with the data structure being 8-byte aligned. The alignment means the least significant bit of these pointers is guaranteed to be 0. Therefore we could re-purpose this bit, and the memory of the pointer, for the storage of leaf nodes. For a leaf node, instead of storing a pointer in this memory, we can set the least significant bit to 1 and write the value of the id to the 4 highest bytes. Using the 4 highest bytes keeps the memory of the id aligned, as these bytes are 4-byte aligned, and allows forming pointers/references to it without performing bit shifts. I think this can be achieved relatively transparently to code using irept, by allocating the full data structure only when constructing an irept with non-empty named_sub / sub, or when a writable reference to these fields is requested.
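To make the layout concrete, here is a minimal, hypothetical sketch of the tagging scheme described above, assuming a little-endian x64 target; leaf_or_ptr, make_leaf and leaf_id are illustrative names, not actual CBMC code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the tagging scheme, assuming little-endian x64.
struct tree_nodet; // the full 48-byte node, allocated only for non-leaf ireps

struct leaf_or_ptr
{
  // Holds either a tree_nodet* (LSB == 0, guaranteed by the 8-byte
  // alignment of allocations) or a tagged leaf: LSB == 1 with the
  // 32-bit id stored in the 4 highest bytes.
  std::uintptr_t bits;

  bool is_leaf() const
  {
    return bits & 1;
  }

  static leaf_or_ptr make_leaf(std::uint32_t id)
  {
    leaf_or_ptr result;
    result.bits = 1; // sets the tag bit, zeroes the remaining bytes
    // Place the id in the top 4 bytes; these are 4-byte aligned, so a
    // pointer/reference to the stored id needs no bit shifts.
    std::memcpy(reinterpret_cast<char *>(&result.bits) + 4, &id, sizeof(id));
    return result;
  }

  std::uint32_t leaf_id() const
  {
    assert(is_leaf());
    std::uint32_t id;
    std::memcpy(&id, reinterpret_cast<const char *>(&bits) + 4, sizeof(id));
    return id;
  }
};
```

On a big-endian target the tag bit and the id bytes would overlap, so a portable implementation would need per-endianness offsets.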

I think the scheme outlined above, dual-purposing the pointer memory for either a pointer or an id value, would save the allocation and de-allocation of the entire 48-byte tree_node data structure for nearly all leaf nodes in memory. I have not yet had the time to implement and debug a portable version of this idea, so I don't yet know what its performance characteristics would look like. However, I wanted to share the details of this idea for future reference.

Activity

tautschnig (Collaborator) commented on Oct 13, 2023

This does sound like a great idea, but perhaps we need to get there in two steps. The first one is to figure out how to move away from static dt empty_d and instead use nullptr wherever we currently have &empty_d. This will require tweaking the read() interface, I believe.

Once that's in place we can likely wrap data in a union. We probably don't even need to play 4-highest-bytes games and can instead write 2 * id + 1, just like the encoding of literalt uses.
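A rough sketch of what this 2 * id + 1 encoding could look like; datat and the helper names here are illustrative, not the actual irept internals:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the 2 * id + 1 encoding, mirroring how literalt
// packs a variable number and a sign bit into a single word.
struct tree_nodet; // full node, only allocated for non-leaf ireps

union datat
{
  tree_nodet *node;     // even bit patterns: a real pointer (8-byte aligned)
  std::uintptr_t value; // odd bit patterns: 2 * id + 1 for a leaf
};

inline bool is_leaf(const datat &d)
{
  return d.value & 1;
}

inline datat make_leaf(std::uintptr_t id)
{
  datat d;
  d.value = 2 * id + 1; // tag in the low bit, id in the remaining bits
  return d;
}

inline std::uintptr_t leaf_id(const datat &d)
{
  assert(is_leaf(d));
  return d.value >> 1; // one shift on every leaf-id access
}
```

Unlike the high-bytes layout, this encoding is endianness-independent, but reading the id always costs a shift and a reference to the stored id cannot be formed directly.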

thomasspriggs (Contributor, Author) commented on Oct 15, 2023

> This does sound like a great idea, but perhaps we need to get there in two steps. The first one is to figure out how to move away from static dt empty_d and instead use nullptr wherever we currently have &empty_d. This will require tweaking the read() interface, I believe.

If we swap the type of empty_d to being a pointer, then it can still hold the empty value, so there is no need to allow nullptr. The stored value would be the empty/""/0 irep_idt together with the flag set. The read() function is a larger can of worms, as it can potentially return a non-valid reference. Similarly, write() could use some careful attention.

> Once that's in place we can likely wrap data in a union. We probably don't even need to play 4-highest-bytes games and can instead write 2 * id + 1, just like the encoding of literalt uses.

Storing the irep_idt in the 4 highest bytes keeps any pointers/references to it correctly aligned, which has performance benefits.

I have a work-in-progress branch here, which passes the regression tests but not the unit tests: develop...thomasspriggs:cbmc:tas/irep_optimisation1_unions
It still needs:

  • Careful checking of read/write.
  • Fixing the unit tests and any actual problems they are revealing.
  • Portability.
  • Memory usage analysis.

I have done some initial run-time testing using time ctest -L CORE -E unit -V -j30 for my branch. This yields:

100% tests passed, 0 tests failed out of 89

Label Time Summary:
CBMC    = 2020.14 sec*proc (89 tests)
CORE    = 2020.14 sec*proc (89 tests)

Total Test time (real) = 299.87 sec

real	4m59.875s
user	22m45.095s
sys	1m40.742s

With the same machine and test command on develop, the results were:

100% tests passed, 0 tests failed out of 89

Label Time Summary:
CBMC    = 2077.61 sec*proc (89 tests)
CORE    = 2077.61 sec*proc (89 tests)

Total Test time (real) = 321.93 sec

real	5m21.932s
user	24m32.950s
sys	1m48.697s

This seems to show that the branch offers at least a minor run-time benefit, even before considering memory usage.

tautschnig (Collaborator) commented on Oct 16, 2023

Thank you for this great piece of work! A couple of comments:

  1. Would you mind PR-ing those const-cast fixes ASAP? Scary stuff...
  2. Cloning and testing your branch I wasn't able to reproduce any unit test failures - perhaps you've fixed them already?
  3. The following regression test data includes memory usage information, and confirms minor CPU (and memory) savings (with various irrelevant lines snipped):
    05ccb75009f67c (develop)
cbmc-github.git/regression/cbmc$ /usr/bin/time -v ../test.pl -C -c "../../../build/bin/cbmc" > /dev/null
  User time (seconds): 59.53
  System time (seconds): 9.47
  Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09.42
  Maximum resident set size (kbytes): 557596
cbmc-github.git/regression/cbmc$ /usr/bin/time -v ../test.pl -C -c "../../../build/bin/cbmc" > /dev/null
  User time (seconds): 58.61
  System time (seconds): 9.90
  Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.41
  Maximum resident set size (kbytes): 557596

thomasspriggs/cbmc@78b7968452 (thomasspriggs/tas/irep_optimisation1_unions):

cbmc-github.git/regression/cbmc$ /usr/bin/time -v ../test.pl -C -c "../../../build/bin/cbmc" > /dev/null
  User time (seconds): 56.42
  System time (seconds): 9.21
  Elapsed (wall clock) time (h:mm:ss or m:ss): 1:05.53
  Maximum resident set size (kbytes): 542800
cbmc-github.git/regression/cbmc$ /usr/bin/time -v ../test.pl -C -c "../../../build/bin/cbmc" > /dev/null
  User time (seconds): 56.22
  System time (seconds): 9.21
  Elapsed (wall clock) time (h:mm:ss or m:ss): 1:05.34
  Maximum resident set size (kbytes): 542720

thomasspriggs (Contributor, Author) commented on Oct 18, 2023

The const cast fixes are now merged.

I think the unit test failures were due to the setup of the particular machine I was using for debugging. I have not investigated them, but they are not reproducible on my usual machine.

kroening (Member) commented on Nov 3, 2023

A concern is the need for low-level pointer tweaks, and the need for branches when reading ireps.

Please consider the following alternative. Ireps have sharing, which you can use to make any frequently used irep cheaper, including the id-only ones.

  1. On construction of an irep with an ID only, look into a global vector of ireps, indexed by the ID number. If there, return the address of that irep. If not there, add it.
  2. No action needed whatsoever on access!
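A minimal sketch of this caching scheme, with a stand-in irept type and shared_ptr standing in for the real reference counting; irept_stub and id_only_irep are illustrative names:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for irept; real ireps carry their own reference counting.
struct irept_stub
{
  std::size_t id;
  explicit irept_stub(std::size_t id) : id(id)
  {
  }
};

// Global vector of ID-only ireps, indexed by the ID number. The first
// request for an ID allocates it; later requests share the same node.
std::shared_ptr<irept_stub> id_only_irep(std::size_t id)
{
  static std::vector<std::shared_ptr<irept_stub>> cache;
  if(id >= cache.size())
    cache.resize(id + 1);
  if(!cache[id])
    cache[id] = std::make_shared<irept_stub>(id);
  // No special action on access: the result is an ordinary shared node.
  return cache[id];
}
```

All ID-only ireps with the same ID then share one allocation, at the cost of one table look-up per construction.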

thomasspriggs (Contributor, Author) commented on Nov 3, 2023

@kroening - I share your concern with regard to making sure that low-level tweaks work as intended, which is one of the reasons I have made further work on this approach a relatively low priority. Although it seems that @tautschnig thinks it is worth pursuing further.

With regard to branches - my understanding is that on modern architectures, if there is a cache miss and the value needs to be fetched from main memory, the wait time can be far longer than the pipeline stall caused by a branch mis-prediction. Therefore an implementation with more branches may, counter-intuitively, be faster due to better cache locality. This is why it is important to evaluate performance changes based on benchmarks.

For a leaf node, the only data we need to store is the ID. With dstringt, the full string has already been stored in a table and deduplicated, so applying sharing to an ID-only irept adds an unnecessary additional layer of indirection and reference counting, in my opinion. The dstringt is as small as or smaller than the reference count, or the pointer/index to a reference-counted structure, because it only holds an index itself. When the data in question is that small, copying it uses less memory than forming a reference to it. This is why I think that sharing or copy-on-write is not the optimal approach for ID-only irepts.
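The size argument can be made concrete with a stand-in type; dstring_like is illustrative, not the actual dstringt:

```cpp
#include <cstdint>

// Stand-in for dstringt: it holds only an index into the shared string
// table, so the whole object is 4 bytes.
struct dstring_like
{
  std::uint32_t index;
};

// Copying the id is never more expensive than referring to it: the value
// itself is no larger than a pointer (or a reference count) would be.
static_assert(
  sizeof(dstring_like) <= sizeof(void *),
  "copying the id is at most as costly as storing a pointer to it");
```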

With the approach I outlined, the data of an id-only irept is stored in place of the reference to it, effectively inside its parent irept. This reduction in the levels of indirection is why I hypothesise that this approach may have beneficial cache-locality properties. Reducing the amount of memory used to hold the data structure may also reduce memory-bandwidth usage, beyond lowering the overall memory requirements.
