Allow for dictionary encoded vector ingestion #85
Summary:
Allows already-encoded dictionaries to pass through the writer instead of being decoded and deduped again.
Specifically, we want to allow a passthrough without decoding when the schema indicates that we are ingesting an ArrayVector, but the vector we find is a DictionaryVector with ArrayVector values. When this is the case, we check whether the dictionary is a valid run-length encoding (meaning it has likely been deduped upstream). The way the vector is written to storage does not change: we still write the offsets and the ArrayVector elements.
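As a rough illustration of the passthrough decision described above, here is a minimal Python sketch. It is not the writer's actual code: the names `is_valid_run_length_encoding` and `ingest` are hypothetical, the dictionary is modeled as a plain list of indices plus a list of array values, and the fallback path (decode, then collapse adjacent duplicates) is an assumption about what "decoded + deduped" means here.

```python
from typing import List, Sequence, Tuple


def is_valid_run_length_encoding(indices: Sequence[int]) -> bool:
    """Return True if each index is >= the index before it.

    This mirrors the check described in the summary: it looks only at
    the indices, never at the dictionary values themselves.
    """
    return all(prev <= cur for prev, cur in zip(indices, indices[1:]))


def ingest(
    indices: Sequence[int], values: Sequence[List[int]]
) -> Tuple[List[int], List[List[int]]]:
    """Return the (offsets, elements) pair that would be written.

    Hypothetical model: `values` stands in for the ArrayVector held by
    the dictionary, `indices` for the dictionary's indices.
    """
    if is_valid_run_length_encoding(indices):
        # Passthrough: reuse the upstream encoding as-is.
        return list(indices), [list(v) for v in values]

    # Fallback (assumed behavior): decode every row, then collapse
    # adjacent duplicate rows into a single stored element.
    offsets: List[int] = []
    elements: List[List[int]] = []
    for i in indices:
        row = list(values[i])
        if not elements or elements[-1] != row:
            elements.append(row)
        offsets.append(len(elements) - 1)
    return offsets, elements
```

In both branches the output has the same shape, offsets plus array elements, which matches the point that the storage layout does not change; only the work done to produce it differs.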
A caveat is that, right now, the isDictionaryValidRunLengthEncoded function doesn't actually verify that there is a duplicate. It is possible for a DictionaryVector that has not been deduped to be passed to the ArrayVector field writer while its offsets still technically fulfill a run-length encoding (each offset is greater than or equal to the offset at index - 1).

Reviewed By: helfman
Differential Revision: D59863671