Skip to content

Add back config to toggle the preservation of timestamps in consolidated fragments #5515

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

ypatia
Copy link
Member

@ypatia ypatia commented May 21, 2025

#3267 erroneously removed the option to perform consolidation without timestamps [sc-18605]. This PR restores that config as the option to not retain timestamps when consolidating is still a valuable one in many cases :

  • heavily modified datasets (many updates) which want to reclaim space and do not need time-travel
  • code that wants to benefit from cell count metadata on the fragments (and needs no-dups semantics)
  • the potential future ability to improve read performance of no-dup arrays by optimizing the reader (e.g., using the sparse allows-dups reader)

Note: I have left the default configuration as "with timestamps", but I have also implemented and tested changing the default to "without timestamps" and everything works fine. I can toggle the default to whatever the requirement is very easily. I just don't know what the requirement is :)

Fixes [CORE-134]


TYPE: IMPROVEMENT
DESC: Add back config to toggle the preservation of timestamps in consolidated fragments

@ypatia ypatia requested a review from a team as a code owner May 21, 2025 14:16
Copy link

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice of default (ie., maintain current behavior by default) seems reasonable to me.

Copy link
Member

@teo-tsirpanis teo-tsirpanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I believe we should make a plan to disable this feature by default and eventually remove it. As a concept, consolidation is fundamentally incompatible with time travelling and adding a timestamps pseudo-attribute was a failed attempt to reconcile them.

@ypatia
Copy link
Member Author

ypatia commented May 27, 2025

Thanks!

I believe we should make a plan to disable this feature by default and eventually remove it. As a concept, consolidation is fundamentally incompatible with time travelling and adding a timestamps pseudo-attribute was a failed attempt to reconcile them.

I don't know the background story on why this feature was introduced in the first place, I can imagine though that it was requested to meet some need. If the way it has been implemented is optimal or not is another story.
However, I agree in principle that consolidation means mostly I don't care about time-traveling, that's why I was wondering and asked to clarify what the default value of this config toggle should be. If 95% of the cases we don't want to have timestamps in consolidated fragments then it'd make sense to change the default value to false. cc @ihnorton

Copy link
Member

@rroelke rroelke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been three years since #3267 was merged so we should definitely be suspicious as to whether the behavior which is being re-exposed here is still correct. I'd really like to see some test cases which validate query results, and also validate that consolidation happened in the expected way using the fragment metadata.

@@ -409,6 +409,11 @@ class Config {
*/
static const std::string SM_GROUP_TIMESTAMP_END;

/**
* Enable or disable consolidation with timestamps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @nickvigilante

I think this comment gets pulled into docs, right?

This is not specific enough, especially if it is user-facing documentation. There's no description here of what the consolidation result actually looks like for the different options, which I think is really important given that one of the options results in something which might qualify as "data loss" for an unwitting customer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm leaving my comment for posterity, but I see that there is more specific documentation in config_api_external.h.

However, those docs aren't specific enough for my liking, the options should be annotated with a brief description of what happens to duplicate coordinates in consolidated fragments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -303,13 +303,17 @@ class ArrayDirectory {
* [`timestamp_start`, `timestamp_end`] will be considered when
* fetching URIs.
* @param mode The mode to load the array directory in.
* @param allow_partial_fragment_overlap If we want to allow matching
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the the only usage of this is "if false, consolidation_with_timestamps_supported returns false". What's the connection to overlapping fragments?

@rroelke
Copy link
Member

rroelke commented May 27, 2025

Thanks!
I believe we should make a plan to disable this feature by default and eventually remove it. As a concept, consolidation is fundamentally incompatible with time travelling and adding a timestamps pseudo-attribute was a failed attempt to reconcile them.

I don't know the background story on why this feature was introduced in the first place, I can imagine though that it was requested to meet some need. If the way it has been implemented is optimal or not is another story. However, I agree in principle that consolidation means mostly I don't care about time-traveling, that's why I was wondering and asked to clarify what the default value of this config toggle should be. If 95% of the cases we don't want to have timestamps in consolidated fragments then it'd make sense to change the default value to false. cc @ihnorton

Consolidation is not only a question of data retention, but also data management. There might be several reasons that DBAs would prefer to have one larger consolidated fragment versus a larger number of non-consolidated smaller fragments. I am not a DBA so I can't really enumerate them.

From the perspective of a query engine there is a difference:

If you have a query which wants half of your fragments, and you are not consolidated, then you have to merge coordinates from all of the fragments.

If you have a query which wants half of your fragments, and you are consolidated, then instead you have to de-duplicate a single stream on the max timestamp per coordinate.

I would expect merge to be a lot more resource-intensive than a single-stream de-duplicate. Imagine the effort required to parallelize them, for example.

If 95% of the cases we don't want to have timestamps in consolidated fragments then it'd make sense to change the default value to false

I'm a bit leery of this. The upshot of false is performance; the downside is what a customer doing the wrong thing would perceive as data loss.

@rroelke
Copy link
Member

rroelke commented May 27, 2025

If you consolidate with timestamps, and then decide you want to purge old data, does it work to re-run consolidation in the "purge" mode?

@ihnorton
Copy link
Member

The underlying goal here is to optimize an array for efficient reads with time-traveling across the fragment history, and that remains a requirement which we will continue to support. The implementation may evolve or change to better realize the usage requirements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants