Description
Is your feature request related to a problem? Please describe
I often want to run OSB against two clusters, one with a search-related change and one without, to see how the change affects search latency. However, the results are often noisy, and it can be hard to tell whether a difference really comes from my change or from how exactly the data was indexed. For example, the exact segment topology can significantly affect sort query performance.
It would be really useful if OSB had an optional, easy way to ensure the index is exactly the same on every run.
Describe the solution you'd like
The simplest solution might be to set a random seed through a workload flag such as --rng-seed. However, I don't know enough about the indexing flow to be sure this would actually produce identical indices every time.
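To illustrate what a seed could and could not pin down, here is a minimal sketch in plain Python. It is not OSB code: `shuffled_doc_ids` and its `seed` parameter are hypothetical names, and `--rng-seed` is only the flag proposed above. It shows that a seeded generator makes any client-side randomness (for example, document ordering) repeatable.

```python
import random


def shuffled_doc_ids(num_docs, seed=None):
    """Return document ids in pseudo-random order.

    Hypothetical helper: with the same seed, the order is the same on
    every call; with seed=None, the order varies between runs.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    ids = list(range(num_docs))
    rng.shuffle(ids)
    return ids


first = shuffled_doc_ids(1000, seed=42)
second = shuffled_doc_ids(1000, seed=42)
print("same order with same seed:", first == second)
```

A seed like this only controls randomness on the client side. Segment boundaries are also likely to depend on cluster-side timing (refreshes, background merges, concurrent bulk clients), so a seed alone may not be enough to produce identical segments.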
Describe alternatives you've considered
If the OSB random seed isn't enough to ensure identical indices, another option would be to host one "standard" snapshot for each dataset somewhere. Then, if the user specifies some flag, OSB would download this snapshot and restore it to the cluster instead of running the usual indexing operation.
To avoid hosting too many snapshots, this could be limited to the current OS version, and perhaps to only the important workloads such as nyc_taxis/big5/http_logs.
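As a rough sketch of what the restore step could look like, the snippet below builds (but does not send) the two OpenSearch snapshot REST calls involved: registering a repository and restoring an index from it. It assumes the snapshot has already been downloaded to a shared filesystem path listed in `path.repo`; the repository name, snapshot name, path, and the helper `build_restore_requests` are all hypothetical and not part of OSB.

```python
import json
import urllib.request


def build_restore_requests(base_url, repo, snapshot, location, index):
    """Build the REST requests to register a repo and restore one index.

    Hypothetical helper: nothing is sent here; the caller decides when
    to pass each request to urllib.request.urlopen().
    """
    def make_request(method, path, body):
        return urllib.request.Request(
            url=f"{base_url}{path}",
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method=method,
        )

    # Register a read-only shared-filesystem repository holding the snapshot.
    register = make_request(
        "PUT",
        f"/_snapshot/{repo}",
        {"type": "fs", "settings": {"location": location, "readonly": True}},
    )
    # Restore only the benchmark index, leaving cluster state untouched.
    restore = make_request(
        "POST",
        f"/_snapshot/{repo}/{snapshot}/_restore?wait_for_completion=true",
        {"indices": index, "include_global_state": False},
    )
    return [register, restore]


# All values below are placeholders for illustration.
requests_to_send = build_restore_requests(
    base_url="http://localhost:9200",
    repo="osb_standard",
    snapshot="nyc_taxis_snapshot",
    location="/mnt/snapshots/nyc_taxis",
    index="nyc_taxis",
)
for request in requests_to_send:
    print(request.get_method(), request.full_url)
```

Because a restore copies the snapshot's segment files as they are, both clusters would start from the same segment topology, which is the property the seed approach may not guarantee.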
Additional context
No response