Add back config to toggle the preservation of timestamps in consolidated fragments #5515
The choice of default (i.e., maintain current behavior by default) seems reasonable to me.
Thanks!
I believe we should make a plan to disable this feature by default and eventually remove it. As a concept, consolidation is fundamentally incompatible with time travelling and adding a timestamps pseudo-attribute was a failed attempt to reconcile them.
Co-authored-by: Theodore Tsirpanis <[email protected]>
I don't know the background story on why this feature was introduced in the first place, but I imagine it was requested to meet some need. Whether the way it has been implemented is optimal is another story.
It's been three years since #3267 was merged so we should definitely be suspicious as to whether the behavior which is being re-exposed here is still correct. I'd really like to see some test cases which validate query results, and also validate that consolidation happened in the expected way using the fragment metadata.
@@ -409,6 +409,11 @@ class Config {
   */
  static const std::string SM_GROUP_TIMESTAMP_END;

  /**
   * Enable or disable consolidation with timestamps.
I think this comment gets pulled into docs, right?
This is not specific enough, especially if it is user-facing documentation. There's no description here of what the consolidation result actually looks like for the different options, which I think is really important given that one of the options results in something which might qualify as "data loss" for an unwitting customer.
Okay, I'm leaving my comment for posterity, but I see that there is more specific documentation in config_api_external.h.
However, those docs aren't specific enough for my liking: the options should be annotated with a brief description of what happens to duplicate coordinates in consolidated fragments.
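To make the request concrete, here is a toy model of what each option might do to duplicate coordinates. It is a sketch of the semantics as this thread describes them, not TileDB code: `Cell`, `Fragment`, and both function names are hypothetical, and the real consolidator works on tiles and fragment metadata rather than flat vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical cell model for illustration; not a TileDB type.
struct Cell {
  int64_t coord;       // 1-D coordinate, for brevity
  uint64_t timestamp;  // write time of the fragment the cell came from
  int value;
};

using Fragment = std::vector<Cell>;

// Assumed "with timestamps" behavior: every input cell is copied into the
// consolidated fragment together with its original write time.
Fragment consolidate_with_timestamps(const std::vector<Fragment>& fragments) {
  Fragment out;
  std::size_t total = 0;
  for (const Fragment& f : fragments) {
    total += f.size();
    out.insert(out.end(), f.begin(), f.end());
  }
  assert(out.size() == total);  // no cell is dropped in this mode
  return out;
}

// Assumed "without timestamps" behavior: only the newest cell per coordinate
// survives, so older duplicates are no longer recoverable.
Fragment consolidate_without_timestamps(const std::vector<Fragment>& fragments) {
  std::map<int64_t, Cell> newest;
  for (const Fragment& f : fragments) {
    for (const Cell& c : f) {
      auto it = newest.find(c.coord);
      if (it == newest.end()) {
        newest.emplace(c.coord, c);
      } else if (c.timestamp >= it->second.timestamp) {
        it->second = c;
      }
    }
  }
  Fragment out;
  for (const auto& kv : newest) out.push_back(kv.second);  // ordered by coord
  return out;
}
```

Under this model, a coordinate written at two different times keeps both cells in the first mode and only the later one in the second, which is the difference the docs should spell out.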
@@ -303,13 +303,17 @@ class ArrayDirectory {
   * [`timestamp_start`, `timestamp_end`] will be considered when
   * fetching URIs.
   * @param mode The mode to load the array directory in.
   * @param allow_partial_fragment_overlap If we want to allow matching
It looks like the only usage of this is "if false, consolidation_with_timestamps_supported returns false". What's the connection to overlapping fragments?
Consolidation is not only a question of data retention, but also data management. There might be several reasons that DBAs would prefer to have one larger consolidated fragment versus a larger number of non-consolidated smaller fragments. I am not a DBA so I can't really enumerate them. From the perspective of a query engine there is a difference: If you have a query which wants half of your fragments, and you are not consolidated, then you have to merge coordinates from all of the fragments. If you have a query which wants half of your fragments, and you are consolidated, then instead you have to de-duplicate a single stream on the max timestamp per coordinate. I would expect merge to be a lot more resource-intensive than a single-stream de-duplicate. Imagine the effort required to parallelize them, for example.
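A rough sketch of the two read paths being compared, to show why they differ in shape. Everything here is hypothetical (`Cell`, `read_by_merge`, `read_by_dedupe`), it assumes each fragment is sorted by coordinate with no internal duplicates, and it says nothing about how the actual readers are implemented or how they perform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical cell model for illustration; not a TileDB type.
struct Cell {
  int64_t coord;
  uint64_t timestamp;
  int value;
};

// Unconsolidated read: k-way merge across fragments, each sorted by coord.
// A heap of cursors is needed to find the next smallest coordinate.
std::vector<Cell> read_by_merge(const std::vector<std::vector<Cell>>& frags) {
  struct Cursor {
    std::size_t frag;
    std::size_t pos;
  };
  auto greater_coord = [&](const Cursor& a, const Cursor& b) {
    return frags[a.frag][a.pos].coord > frags[b.frag][b.pos].coord;
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater_coord)>
      heap(greater_coord);
  for (std::size_t f = 0; f < frags.size(); ++f) {
    if (!frags[f].empty()) heap.push(Cursor{f, 0});
  }
  std::vector<Cell> out;
  while (!heap.empty()) {
    Cursor c = heap.top();
    heap.pop();
    const Cell& cell = frags[c.frag][c.pos];
    if (out.empty() || out.back().coord != cell.coord) {
      out.push_back(cell);
    } else if (cell.timestamp > out.back().timestamp) {
      out.back() = cell;  // newer write to the same coordinate wins
    }
    if (c.pos + 1 < frags[c.frag].size()) heap.push(Cursor{c.frag, c.pos + 1});
  }
  return out;
}

// Consolidated read: one stream sorted by coord, so duplicates are adjacent
// and a single linear pass keeps the max timestamp per coordinate.
std::vector<Cell> read_by_dedupe(const std::vector<Cell>& stream) {
  assert(std::is_sorted(stream.begin(), stream.end(),
                        [](const Cell& a, const Cell& b) {
                          return a.coord < b.coord;
                        }));
  std::vector<Cell> out;
  for (const Cell& cell : stream) {
    if (out.empty() || out.back().coord != cell.coord) {
      out.push_back(cell);
    } else if (cell.timestamp > out.back().timestamp) {
      out.back() = cell;
    }
  }
  return out;
}
```

Both functions return the same cells for equivalent input. The merge path has to coordinate several cursors, while the de-duplicate path is a single scan that is easier to split into independent ranges.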
I'm a bit leery of this. The upshot of
If you consolidate with timestamps, and then decide you want to purge old data, does it work to re-run consolidation in the "purge" mode?
The underlying goal here is to optimize an array for efficient reads with time-traveling across the fragment history, and that remains a requirement which we will continue to support. The implementation may evolve or change to better realize the usage requirements.
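For readers following along, a minimal sketch of what a time-travel read over a consolidated fragment could look like, and why it depends on per-cell timestamps being retained. `Cell` and `read_as_of` are hypothetical names, and this is only a model of the requirement, not the reader implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical cell model for illustration; not a TileDB type.
struct Cell {
  int64_t coord;
  uint64_t timestamp;  // per-cell write time kept by consolidation
  int value;
};

// Returns coord -> value as the array looked at time `t`: cells written
// after `t` are ignored, and the newest remaining cell per coordinate wins.
std::map<int64_t, int> read_as_of(const std::vector<Cell>& cells, uint64_t t) {
  std::map<int64_t, Cell> newest;
  for (const Cell& c : cells) {
    if (c.timestamp > t) continue;  // written after the requested time
    auto it = newest.find(c.coord);
    if (it == newest.end()) {
      newest.emplace(c.coord, c);
    } else if (c.timestamp > it->second.timestamp) {
      it->second = c;
    }
  }
  std::map<int64_t, int> result;
  for (const auto& kv : newest) {
    assert(kv.second.timestamp <= t);  // a time-travel read never sees the future
    result[kv.first] = kv.second.value;
  }
  return result;
}
```

If the consolidated fragment still holds a cell written at time 10 and its overwrite at time 20, `read_as_of(cells, 15)` returns the older value. If the older cell was dropped during consolidation, the same call finds nothing for that coordinate, which is the trade-off the config option controls.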
#3267 erroneously removed the option to perform consolidation without timestamps [sc-18605]. This PR restores that config, as the option to not retain timestamps when consolidating is still a valuable one in many cases:
Note: I have left the default configuration as "with timestamps", but I have also implemented and tested changing the default to "without timestamps" and everything works fine. I can toggle the default to whatever the requirement is very easily. I just don't know what the requirement is :)
Fixes [CORE-134]
TYPE: IMPROVEMENT
DESC: Add back config to toggle the preservation of timestamps in consolidated fragments