Description
Describe the enhancement requested
The dictionary builders already have methods to insert whole arrays, but these can spend a lot of CPU time on work that may be unnecessary.
Take the following scenario: I have two sources of data. One of them is already dictionary encoded, the other is not. I would like to initialize the dictionary builder with the existing dictionary and only insert new items for the non-dictionary-encoded data. Now comes the important part: I'm OK with inserts potentially creating duplicates in the dictionary.
I would like to propose a new API, `PrependInitialDict`, that takes an array and must be called before anything is inserted into the indices array; otherwise it returns an error. Any new dictionary item inserted afterwards then starts at index `len(initialDict)+i`.
Theoretically it could even be designed to insert dicts multiple times, but I would suggest to start the API like this and only extend when we have the use cases.
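To make the intended semantics concrete, here is a minimal sketch using a toy string dictionary builder in plain Go. It is not the Arrow builder: `dictBuilder`, `AppendIndex`, `Append`, and `buildExample` are hypothetical names for illustration only, and only `PrependInitialDict` is the name proposed above. The sketch assumes that values from the initial dictionary are never looked up or deduplicated, which is why duplicates can appear.

```go
package main

import "errors"

// dictBuilder is a toy stand-in for a dictionary builder (hypothetical,
// not the Arrow type).
type dictBuilder struct {
	dict    []string       // dictionary values; the initial dictionary comes first
	lookup  map[string]int // memo of values added through Append only
	indices []int          // the indices array
}

func newDictBuilder() *dictBuilder {
	return &dictBuilder{lookup: make(map[string]int)}
}

// PrependInitialDict copies an existing dictionary into the builder without
// inspecting its values. It must be called before anything else is inserted.
func (b *dictBuilder) PrependInitialDict(initial []string) error {
	if len(b.indices) > 0 || len(b.dict) > 0 {
		return errors.New("PrependInitialDict must be called on an empty builder")
	}
	b.dict = append(b.dict, initial...)
	return nil
}

// AppendIndex appends an index from the already dictionary-encoded source.
func (b *dictBuilder) AppendIndex(i int) error {
	if i < 0 || i >= len(b.dict) {
		return errors.New("index out of range")
	}
	b.indices = append(b.indices, i)
	return nil
}

// Append inserts a raw value. Only values previously added through Append
// are deduplicated, so a value that is also in the initial dictionary gets
// a second entry at position len(initialDict)+i.
func (b *dictBuilder) Append(v string) {
	idx, ok := b.lookup[v]
	if !ok {
		idx = len(b.dict)
		b.dict = append(b.dict, v)
		b.lookup[v] = idx
	}
	b.indices = append(b.indices, idx)
}

// buildExample mixes an encoded source (initial dictionary plus indices)
// with raw values from a non-encoded source.
func buildExample() (dict []string, indices []int, err error) {
	b := newDictBuilder()
	if err = b.PrependInitialDict([]string{"a", "b"}); err != nil {
		return nil, nil, err
	}
	for _, i := range []int{1, 0} { // indices from the encoded source
		if err = b.AppendIndex(i); err != nil {
			return nil, nil, err
		}
	}
	for _, v := range []string{"c", "a", "c"} { // raw values
		b.Append(v)
	}
	return b.dict, b.indices, nil
}
```

In this sketch `buildExample` yields the dictionary `[a b c a]` and the indices `[1 0 2 3 2]`: the raw `"a"` becomes a duplicate entry at index 3, while the second `"c"` reuses index 2.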
Alternative I have considered: prepending the dictionary after building the "new" dictionary and having any indices start at its length. I've found this to not really be workable, for three reasons:
- There would still have to be an API to set the initial index.
- It would rely on the user actually prepending the dictionary afterward (easy to misuse).
- It would be quite awkward to use in scenarios where there are deeply nested lists and structs, where building the final record is primarily done using a record builder, but only this array would be the exception.
cc @zeroshade
Component(s)
Go