As explained in this issue, it is worth exploring the use of other types of space filling curves when using Liquid Clustering.
Hilbert curves are a nice compromise: they treat all clustering columns as roughly equally important when ordering the data. In cases where some columns matter more than others (in terms of how often they are used in filters), this might lower read performance compared to other layout techniques like Hive-style partitioning.
An obvious choice of space-filling curve is one that gives utmost importance to one dimension, before incrementing the next one. It looks like this in 2 dimensions:

[Figure: the proposed curve in 2 dimensions]
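To make the trade-off a bit more concrete, here is a minimal Python sketch (not Delta code). It assumes an 8 x 8 grid of two clustering columns `x` and `y`, where `x` is the column that matters most, and it assumes each "file" holds 8 consecutive cells along the curve. `lexicographic_index` is my reading of the curve described above; `hilbert_index` is the classic 2-D Hilbert mapping. All names are made up for illustration.

```python
SIDE = 8            # grid is SIDE x SIDE; must be a power of two for the Hilbert index
CELLS_PER_FILE = 8  # assumption: each "file" holds 8 consecutive cells along the curve


def lexicographic_index(x, y):
    """Curve that runs through every y before incrementing x (x is the important column)."""
    return x * SIDE + y


def hilbert_index(x, y):
    """Classic iterative 2-D Hilbert index of cell (x, y) on a SIDE x SIDE grid."""
    d = 0
    s = SIDE // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curves connect
            if rx == 1:
                x, y = SIDE - 1 - x, SIDE - 1 - y
            x, y = y, x
        s //= 2
    return d


def files_touched(index_fn, column, value):
    """Count the files that hold at least one cell where `column` equals `value`."""
    files = set()
    for x in range(SIDE):
        for y in range(SIDE):
            if (x, y)[column] == value:
                files.add(index_fn(x, y) // CELLS_PER_FILE)
    return len(files)


if __name__ == "__main__":
    for name, fn in [("lexicographic", lexicographic_index), ("hilbert", hilbert_index)]:
        print(name,
              "| files read for x = 3:", files_touched(fn, 0, 3),
              "| files read for y = 3:", files_touched(fn, 1, 3))
```

In this toy setup the lexicographic curve reads a single file for a filter on `x` but every file for a filter on `y`, while the Hilbert curve lands somewhere in between for both columns. That is the compromise described above, and it is also why the proposed curve only pays off when one column clearly dominates the filters.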
As explained in the issue, the functionality could stay largely the same as what exists today. The only thing that should change is how the DataFrame is repartitioned. That means that in MultiDimClustering.cluster, a new case should be added that refers to a new object (next to ZOrderClustering and HilbertClustering). This is where we would implement the new curve.
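A rough sketch of that idea, in Python rather than Scala and not based on the actual Delta source: a `cluster` function picks a curve by name, and adding a curve only means adding one entry to the dispatch table. Everything here (`CURVES`, `zorder_key`, `lexicographic_key`, the 3-bit column width) is hypothetical, and Z-order simply stands in for the curves that exist today.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[int, ...]
BITS = 3  # assumption: every column value fits in 3 bits (0..7)


def zorder_key(row: Row) -> int:
    """Interleave the bits of all columns (stand-in for an existing curve)."""
    key = 0
    for bit in reversed(range(BITS)):
        for value in row:
            key = (key << 1) | ((value >> bit) & 1)
    return key


def lexicographic_key(row: Row) -> int:
    """Concatenate the columns, so the first column dominates the ordering."""
    key = 0
    for value in row:
        key = (key << BITS) | value
    return key


# The only place that changes when a new curve is added (hypothetical names).
CURVES: Dict[str, Callable[[Row], int]] = {
    "zorder": zorder_key,
    "lexicographic": lexicographic_key,
}


def cluster(rows: Sequence[Row], num_partitions: int, curve: str = "zorder") -> List[List[Row]]:
    """Order rows along the chosen curve and cut the result into range partitions."""
    if curve not in CURVES:
        raise ValueError(f"Unknown curve ({curve})")
    ordered = sorted(rows, key=CURVES[curve])
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

The point of the sketch is that the sorting and range-partitioning stay identical across curves; only the key function differs.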
The default would still be the Hilbert curve, but users could opt into another type of curve when they want to. That does mean we would have to change the SQL API. Some ideas could be to write something like:
```sql
ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>) WITH <curve-type>
```

or

```sql
ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>) USING <curve-type>
```
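To illustrate how the optional clause and the default could interact, here is a small Python sketch that parses either proposed form and falls back to the Hilbert curve when no curve is given. The grammar is only the hypothetical syntax from this post, and `parse_cluster_by` and `DEFAULT_CURVE` are made-up names; a real implementation would live in Delta's SQL parser instead of a regex.

```python
import re

DEFAULT_CURVE = "hilbert"  # assumption: Hilbert stays the default, as proposed above

# Hypothetical grammar: CLUSTER BY (<cols>) [WITH | USING <curve-type>]
_CLUSTER_BY = re.compile(
    r"CLUSTER\s+BY\s*\((?P<cols>[^)]*)\)"
    r"(?:\s+(?:WITH|USING)\s+(?P<curve>\w+))?\s*;?\s*$",
    re.IGNORECASE,
)


def parse_cluster_by(statement):
    """Return (clustering columns, curve name) for a CLUSTER BY statement."""
    match = _CLUSTER_BY.search(statement)
    if match is None:
        raise ValueError("no CLUSTER BY clause found")
    columns = [c.strip() for c in match.group("cols").split(",") if c.strip()]
    curve = (match.group("curve") or DEFAULT_CURVE).lower()
    return columns, curve
```

Existing statements without a curve keep working unchanged under this reading, which seems like the main thing either syntax has to guarantee.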
Interested to see what your ideas are on the topic!