Add pluggable backend architecture to cuvs-bench#1536
Conversation
| gbench_results = json.load(f) | ||
|
|
||
| # Merge with existing results (if any) | ||
| self._merge_build_results(output_json, temp_output_json, gbench_results) |
There was a problem hiding this comment.
One thing we need to verify is that in the case where google bench is stopped abruptly or otherwise fails during index build (or search), we don't lose any of the existing results that might have been generated on previous runs. We also want to start where we left off when possible so that a user doesn't have to exhaustively run anything again if it stopped halfway through.
For example, say a user runs a bunch of the benchmarks and then stops during the index build. It's possible the new json that was produced is invalid or corrupt because the user may have stopped the benchmarks prior to any results being written a file. We need to make sure that if the json isn't valid, we don't blindly overwrite any existing results with corrupt data.
For some context / background- I had this working last year because we were finding users were running into this issue quite frequently. I implemented a "two-phase commit" type of approach where we would create any new data as its own unique temporary file and then, upon a user stopping the program or the underlying c++ failing, the python layer would check to see if it can merge those results into existing results. If it could, it would do the merge. If it couldn't, it would just accept that the file is corrupt and not touch any existing "committed" results.
…arser requirements
…nd API. Add registry pattern for dynamic config loader registration, enabling future backends like Milvus without modifying core orchestration code
… the orchestrator to dynamically discover both components. The C++ Google Benchmark backend is now automatically registered when the module is imported, so users no longer need to manually register it. Future backends like Milvus will still require explicit registration.
dantegd
left a comment
There was a problem hiding this comment.
The BenchmarkBackend abstract class is well-designed, with clear build() and search() contracts, Optional lifecycle hooks and pre-flight checks for GPU/network requirements which makes things flexible, very nice!
Dataset, BuildResult, and SearchResult are well-thought-out and the plugin registry approach is a good fit for this use case.
The overall design approach is good and reasonble, though there could be opportunities for simplification if we really wanted to, but overall a very good PR.
| queries_per_second=0.0, | ||
| recall=0.0, | ||
| algorithm="", | ||
| search_params={}, |
There was a problem hiding this comment.
| search_params={}, | |
| search_params=[], |
In base.py we have: search_params: List[Dict[str, Any]] so changing this to match would be good.
There was a problem hiding this comment.
Good catch. I made that change in base.py when I realized that the legacy benchmark batches all indexes into one. I forgot to update the code in cpp_gbench.py for the early-return cases (no indexes or skipping). But during the actual run, it was already returning first_index.search_params directly, which is the correct List[Dict] type
| return Dataset( | ||
| name=dataset_config.name, | ||
| # Note: C++ backend loads vectors from files, so we pass empty arrays | ||
| base_vectors=np.array([]), |
There was a problem hiding this comment.
| base_vectors=np.array([]), | |
| base_vectors=np.empty((0, 0)), |
This will avoid a crash/error in Dataset with empty arrays, with the property in Dataset where we have:
@property
def dims(self) -> int:
"""Vector dimensionality."""
return self.base_vectors.shape[1]Or alternatively, make the property handle the empty array gracefully if you don't want to change the default value here.
There was a problem hiding this comment.
Right. Though the already implemented C++ backend doesn't call dims as it reads from files, it was meant to serve other backends that may need the actual vector arrays. I'll go with both approaches: use np.empty((0, 0)) for the empty arrays AND make the dims property handle empty arrays gracefully. This way we're defensive on both fronts.
|
|
||
| # Check defaults | ||
| assert backend.data_prefix == "" | ||
| assert backend.index_prefix == "" |
There was a problem hiding this comment.
| assert backend.index_prefix == "" |
CppGoogleBenchmarkBackend doesn't have an index_prefix attribute, does it? Sorry if I missed it
There was a problem hiding this comment.
Oh I haven't updated these tests since the major refactoring I made from to the backend. I initially had index_prefix which I ended up removing. I will refactor these tests today
| super().__init__(config) | ||
|
|
||
| self.executable_path = Path(config["executable_path"]) | ||
| self.data_prefix = config.get("data_prefix", "") |
There was a problem hiding this comment.
Here is where I don't see index_prefix but I see it all over in the tests, not sure if it's missing here or I missed where it's coming from
| temp_exec.unlink(missing_ok=True) | ||
|
|
||
| def test_backend_gpu_detection(self): | ||
| """Test GPU detection from backend name.""" |
There was a problem hiding this comment.
I'm a bit confused here, this test assumes GPU is detected from executable name, but supports_gpu actually reads from config:
# base.py line 524
return self.config.get("requires_gpu", False)The test creates executables with names like CUVS_IVF_FLAT but doesn't set requires_gpu in config, could use some documentation about the assumptions of GPU detection.
There was a problem hiding this comment.
I'm a bit confused here, this test assumes GPU is detected from executable name
Yes, this was the initial assumption because the C++ executables follow a naming pattern like CUVS_IVF_FLAT_ANN_BENCH where the CUVS prefix indicates CUDA/GPU-based algorithms. I initially thought parsing the executable name would be a quick way to infer GPU requirements. However, this approach was error-prone and redundant since algorithms.yaml already declares requires_gpu for each algorithm. To match the legacy run.py which validates algorithms using algorithms.yaml, I refactored to read requires_gpu from config instead.
| def supports_gpu(self) -> bool: | ||
| """ | ||
| Whether this backend uses GPU acceleration. | ||
|
|
||
| Reads from config["requires_gpu"], matching algorithms.yaml structure. | ||
|
|
||
| Returns | ||
| ------- | ||
| bool | ||
| True if backend uses GPU, False otherwise (default) | ||
| """ | ||
| return self.config.get("requires_gpu", False) |
There was a problem hiding this comment.
| def supports_gpu(self) -> bool: | |
| """ | |
| Whether this backend uses GPU acceleration. | |
| Reads from config["requires_gpu"], matching algorithms.yaml structure. | |
| Returns | |
| ------- | |
| bool | |
| True if backend uses GPU, False otherwise (default) | |
| """ | |
| return self.config.get("requires_gpu", False) | |
| def requires_gpu(self) -> bool: | |
| """ | |
| Whether this backend uses GPU acceleration. | |
| Reads from config["requires_gpu"], matching algorithms.yaml structure. | |
| Returns | |
| ------- | |
| bool | |
| True if backend uses GPU, False otherwise (default) | |
| """ | |
| return self.config.get("requires_gpu", False) |
The naming is confusing. If a backend "supports" GPU but doesn't "require" it, the pre-flight check would incorrectly skip it when GPU is unavailable. The semantic difference matters for writing correct plugins.
Suggestion: Use consistent naming: requires_gpu property that reads requires_gpu config.
There was a problem hiding this comment.
You're right. The naming creates a semantic mismatch. The current behavior is 'requires' (skip if unavailable), not 'supports' (run anyway). I'll rename the property to requires_gpu to match both the config key and the actual behavior. This also makes it clearer for future plugin authors that if they set requires_gpu=True, the backend will be skipped when GPU is unavailable. I also explicitly mentioned it in the docstring.
| # ========================================================================= | ||
| # Helper methods (copied from run.py as-is) | ||
| # ========================================================================= |
There was a problem hiding this comment.
Are the methods copied here to avoid circular imports or what's the reason? If not, we should import these from run, otherwise it would be better to extract the shared utilities to a common module.
There was a problem hiding this comment.
Those methods were copied to ensure backward compatibility with the legacy output format (same JSON structure, same file paths). My initial thought was to eventually remove run.py and runners.py since the new pluggable backend is self-contained and replicates their functionality. Importing from run.py would create a dependency on code I intended to deprecate. That said, if we decide to keep both workflows long-term, extracting the shared utilities to a common module would be cleaner than duplication. If you have any suggestion please let me know.
There was a problem hiding this comment.
That makes total sense, I would just suggest adding a comment to the old run.py/runners.py that they are deprecated (it's not in the PR yet, is it? If it is, sorry for missing it!)
|
|
||
| # Dry run: print command and return without executing | ||
| if dry_run: | ||
| print(f"Benchmark command for {self.output_filename[0]}:\n{' '.join(cmd)}\n") |
There was a problem hiding this comment.
There are multiple prints in the code that seem purposeful and useful for the user, I think this would be better served if we start using Python logging module
There was a problem hiding this comment.
Right. Working on this
| @@ -0,0 +1,47 @@ | |||
| # | |||
There was a problem hiding this comment.
This is a general comment, so adding it here. The module structure has potential circular dependency:
backends/__init__.pyimports fromregistryorchestrator/__init__.pyimports frombackends.registryorchestrator/registry.pyre-exports andcalls _register_builtin_loaders()
While Python handles this with caching, the architecture is fragile, I wonder if there might be a better way to architect that part.
There was a problem hiding this comment.
The auto-registration of the C++ backend at import time was convenient but creates this fragility. Let me think through this.
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """ | ||
| Unit tests for the C++ Google Benchmark backend. |
There was a problem hiding this comment.
I think a good improvement would be to add integration tests with mock subprocess for CppGoogleBenchmarkBackend
…equires_gpu for semantic clarity. Also update docstring
…to be more descriptive
dantegd
left a comment
There was a problem hiding this comment.
Great improvements, just have some additional comments and requests
| "real_time": self.search_time_ms, | ||
| "time_unit": "s", |
There was a problem hiding this comment.
| "real_time": self.search_time_ms, | |
| "time_unit": "s", | |
| "real_time": self.search_time_ms, | |
| "time_unit": "ms", |
Unlike build, we are looking at self.search_time_ms instead of self.search_time_s here, so either the label or the variable need a small correction
There was a problem hiding this comment.
Good catch that was a typo. the The C++ Google Benchmark outputs build time in s and search time in ms
| queries_per_second=20.0, | ||
| recall=0.95, | ||
| algorithm="test_algo", | ||
| search_params={"nprobe": 10} |
There was a problem hiding this comment.
The type annotation states search_params as List[Dict[str, Any]], just a minor niptick here is that this is a straight dict.
But also will make the assertion below json_result["nprobe"] == 10 throw an error, no? Because to_json will return
result = {
...
"search_params": {"nprobe": 10}, # stored as a nested dict under key "search_params"
...
}
There was a problem hiding this comment.
Thanks for pointing this out. This is a bug in the test. Initially when writing the test, I had search_params written as a dictionary. However, to match the legacy benchmark API (run.py + runners.py) which enables more than one search parameter in a single run, I updated the SearchResult type annotation to List[Dict[str, Any]] but forgot to update the test data accordingly.
I didn't catch this earlier because dataclasses don't enforce type annotations at runtime, and since the test doesn't execute a code path that indexes into the list (x[0]) or iterates expecting dicts, the bug flew under the radar.
But also will make the assertion below json_result["nprobe"] == 10 throw an error, no?
Yes, this will fail. I forgot to update both:
The test data: still passes {"nprobe": 10} instead of [{"nprobe": 10}]
The assertion: still accesses json_result["nprobe"] instead of json_result["search_params"][0]["nprobe"]
| index_name = f"{algo}_{group}" if group != "base" else f"{algo}" | ||
| for i in range(len(params)): | ||
| index["build_param"][param_names[i]] = params[i] | ||
| index_name += "." + f"{param_names[i]}{params[i]}" |
There was a problem hiding this comment.
Minor nitpick:
The loop's primary purpose is building the index_name string, but it also re-assigns the same key-value pairs that were already written into build_param before, no? Since param_names and params are derived from the same source, the build_param writes are no-ops.
Consider separating the two concerns for clarity:
| index_name = f"{algo}_{group}" if group != "base" else f"{algo}" | |
| for i in range(len(params)): | |
| index["build_param"][param_names[i]] = params[i] | |
| index_name += "." + f"{param_names[i]}{params[i]}" | |
| index = { | |
| "algo": algo, | |
| "build_param": dict(zip(param_names, params)), | |
| } | |
| index_name = f"{algo}_{group}" if group != "base" else f"{algo}" | |
| for name, val in zip(param_names, params): | |
| index_name += f".{name}{val}" |
There was a problem hiding this comment.
Thanks for catching this. As you may have noticed, config_loaders is an exact replica of config methods in run.py (legacy benchmark API) to ensure full backward compatibility. The redundant assignment was inherited from the original code. I'll clean this up in a follow-up commit since it's a no-op with no functional impact
| force: bool = False, | ||
| search_threads: Optional[int] = None, | ||
| dry_run: bool = False, | ||
| append_results: bool = False |
There was a problem hiding this comment.
This parameter doesn't exist in the base class search() signature, unless I missed it. If a future Python backend is used in tune mode, it will get an unexpected keyword argument. Either add append_results to the base class or handle it via the config dict.
There was a problem hiding this comment.
I only added it to cpp_gbench because it's specific to that backend. My thought process was that other backends like Milvus might write to a DB, not JSON. However, as you accurately pointed out, since it's not defined in the base class, a user implementing another backend might not be aware it's required, which could lead to a TypeError. I'll go with the config dict option to avoid exposing a backend-specific parameter that would be ignored by others.
There was a problem hiding this comment.
I updated the config passed at cpp_gbench instantiation to include an additional key (append_results) which is specific to that backend and will be read internally via self.config.get("append_results", False). Another backend like Milvus would simply ignore this config key since it doesn't read it.
There was a problem hiding this comment.
Actually, after thoroughly thinking about it, I think it's better for now to always append results in tune mode and not expose an argument for it, keeping things simple. This enables analysis of convergence and Pareto frontiers across all trials. The append logic is handled internally by the orchestrator based on trial number. In the future, if there's a use case, we can add an option to retain only the last iteration or the best result (champion)
- The base class interface stays clean (no backend-specific params)
- CppGoogleBenchmarkBackend reads append_results from its config dict
- Other backends (e.g., Milvus) can ignore this config key or implement their own storage logic
| temp_conf_filename = f"{self.dataset_name}_{self.output_filename[0]}_{uuid.uuid1()}.json" | ||
| with open(temp_conf_filename, 'w') as f: |
There was a problem hiding this comment.
Both build() and search() write temp JSON configs to the current working directory. The import tempfile at the top is unused I think? Using tempfile.NamedTemporaryFile or at least tempfile.gettempdir() would be a nice improvement.
There was a problem hiding this comment.
Right, this is legacy code from runners.py which manually creates temp files in the current working directory. I updated it to use tempfile.NamedTemporaryFile which writes to the system temp directory instead. However, I set delete=False because the C++ subprocess needs to read the file after Python closes it, so manual cleanup is still required. This is likely why the legacy code avoided tempfile in the first place to prevent automatic deletion before the subprocess could read the file.
| append_results=append_results | ||
| ) | ||
|
|
||
| return result |
There was a problem hiding this comment.
If build=False and search=False, result remains None. run_benchmark sets build=True, search=True when both are False, but _run_trial doesn't have that guard. Also, if build=True and search=False and the build succeeds, a BuildResult is returned, but the caller (objective()) tries to access result.recall and result.search_time_ms which don't exist on BuildResult, right?
There was a problem hiding this comment.
In tune mode, if build=False and search=False were passed to run_benchmark(), the guard converts both to True before calling _run_tune(). So _run_trial always receives valid values and returns a populated result. This edge case is already handled.
For the build=True, search=False edge case in tune mode, I added validation at the start of _run_tune() that raises a ValueError if search=False. This is intentional because tune mode requires search metrics (recall, latency, throughput) for Optuna to optimize
bdice
left a comment
There was a problem hiding this comment.
Approving packaging/dependency changes.
dantegd
left a comment
There was a problem hiding this comment.
Thanks for all the changes, large but very good PR! I think there might be some minor things or improvement we might find out as more benchmarking is done by others using the tool, so let's merge it so we can have tests during 26.04 early.
|
/merge |
Summary
This PR introduces a pluggable backend architecture for
cuvs-bench, enabling extensibility to support multiple benchmark backend types beyond the current C++ Google Benchmark executables.Motivation
Currently,
cuvs-benchis tightly coupled to C++ Google Benchmark executables via subprocess calls. This limits the ability to:This PR lays the foundation for a plugin system that addresses these limitations.
Changes
Phase 1: Core Infrastructure
backends/base.py: Defines theBenchmarkBackendabstract interface and data structures (Dataset,BuildResult,SearchResult)backends/registry.py: ImplementsBackendRegistryfor centralized plugin management with support for:backends/__init__.py: Package initialization with auto-registration of built-in backendsPhase 2: C++ Backend Wrapper
backends/cpp_gbench.py:CppGoogleBenchmarkBackendwraps existing C++ executables to maintain 100% backward compatibility"cpp_gbench"Testing
tests/test_registry.py: Comprehensive unit tests for registry and base infrastructure (20 tests)tests/test_cpp_gbench.py: Unit tests for C++ backend wrapper (8 tests)Backward Compatibility
✅ Existing C++ benchmarks continue to work without modification
✅ No changes required to existing benchmark configurations
✅ Current CLI and orchestration remain functional
Future Work
This PR enables future phases:
cuvs_bench.runorchestration to use the plugin systemTesting Instructions
cd python/cuvs_bench python -m pytest cuvs_bench/tests/test_registry.py cuvs_bench/tests/test_cpp_gbench.py -vCloses #1100