Add pluggable backend architecture to cuvs-bench by jnke2016 · Pull Request #1536 · rapidsai/cuvs

jnke2016 · 2025-11-12T17:12:29Z

Summary

This PR introduces a pluggable backend architecture for cuvs-bench, enabling extensibility to support multiple benchmark backend types beyond the current C++ Google Benchmark executables.

Motivation

Currently, cuvs-bench is tightly coupled to C++ Google Benchmark executables via subprocess calls. This limits the ability to:

Benchmark Python-native libraries (e.g., hnswlib-python, faiss-python)
Test network-based vector databases (e.g., Milvus, Qdrant, Elasticsearch)
Allow users to add custom backends without modifying core cuvs-bench code

This PR lays the foundation for a plugin system that addresses these limitations.

Changes

Phase 1: Core Infrastructure

backends/base.py: Defines the BenchmarkBackend abstract interface and data structures (Dataset, BuildResult, SearchResult)
backends/registry.py: Implements BackendRegistry for centralized plugin management with support for:
- Dynamic loading from Python modules
- Configuration-based loading from YAML files
- Runtime plugin registration
backends/__init__.py: Package initialization with auto-registration of built-in backends

Phase 2: C++ Backend Wrapper

backends/cpp_gbench.py: CppGoogleBenchmarkBackend wraps existing C++ executables to maintain 100% backward compatibility
Handles subprocess execution, temporary file management, and Google Benchmark JSON parsing
Auto-registered on import as "cpp_gbench"

Testing

tests/test_registry.py: Comprehensive unit tests for registry and base infrastructure (20 tests)
tests/test_cpp_gbench.py: Unit tests for C++ backend wrapper (8 tests)
All 28 tests passing ✅

Backward Compatibility

✅ Existing C++ benchmarks continue to work without modification
✅ No changes required to existing benchmark configurations
✅ Current CLI and orchestration remain functional

Future Work

This PR enables future phases:

Phase 3: Python in-process backend (hnswlib-python, numpy-based algorithms)
Phase 4: Network VDB backend (Milvus, Qdrant, Elasticsearch)
Phase 5: Migration of existing cuvs_bench.run orchestration to use the plugin system

Testing Instructions

cd python/cuvs_bench
python -m pytest cuvs_bench/tests/test_registry.py cuvs_bench/tests/test_cpp_gbench.py -v

Closes #1100

cjnolet · 2025-11-17T18:09:24Z

+                gbench_results = json.load(f)
+
+            # Merge with existing results (if any)
+            self._merge_build_results(output_json, temp_output_json, gbench_results)


One thing we need to verify is that in the case where google bench is stopped abruptly or otherwise fails during index build (or search), we don't lose any of the existing results that might have been generated on previous runs. We also want to start where we left off when possible so that a user doesn't have to exhaustively run anything again if it stopped halfway through.

For example, say a user runs a bunch of the benchmarks and then stops during the index build. It's possible the new json that was produced is invalid or corrupt because the user may have stopped the benchmarks prior to any results being written a file. We need to make sure that if the json isn't valid, we don't blindly overwrite any existing results with corrupt data.

For some context / background- I had this working last year because we were finding users were running into this issue quite frequently. I implemented a "two-phase commit" type of approach where we would create any new data as its own unique temporary file and then, upon a user stopping the program or the underlying c++ failing, the python layer would check to see if it can merge those results into existing results. If it could, it would do the merge. If it couldn't, it would just accept that the file is corrupt and not touch any existing "committed" results.

…arser requirements

…nd API. Add registry pattern for dynamic config loader registration, enabling future backends like Milvus without modifying core orchestration code

… the orchestrator to dynamically discover both components. The C++ Google Benchmark backend is now automatically registered when the module is imported, so users no longer need to manually register it. Future backends like Milvus will still require explicit registration.

dantegd

The BenchmarkBackend abstract class is well-designed, with clear build() and search() contracts, Optional lifecycle hooks and pre-flight checks for GPU/network requirements which makes things flexible, very nice!

Dataset, BuildResult, and SearchResult are well-thought-out and the plugin registry approach is a good fit for this use case.

The overall design approach is good and reasonble, though there could be opportunities for simplification if we really wanted to, but overall a very good PR.

dantegd · 2026-01-15T15:14:59Z

+                queries_per_second=0.0,
+                recall=0.0,
+                algorithm="",
+                search_params={},


Suggested change

search_params={},

search_params=[],

In base.py we have: search_params: List[Dict[str, Any]] so changing this to match would be good.

Good catch. I made that change in base.py when I realized that the legacy benchmark batches all indexes into one. I forgot to update the code in cpp_gbench.py for the early-return cases (no indexes or skipping). But during the actual run, it was already returning first_index.search_params directly, which is the correct List[Dict] type

dantegd · 2026-01-15T15:25:18Z

+        return Dataset(
+            name=dataset_config.name,
+            # Note: C++ backend loads vectors from files, so we pass empty arrays
+            base_vectors=np.array([]),


Suggested change

base_vectors=np.array([]),

base_vectors=np.empty((0, 0)),

This will avoid a crash/error in Dataset with empty arrays, with the property in Dataset where we have:

@property def dims(self) -> int: """Vector dimensionality.""" return self.base_vectors.shape[1]

Or alternatively, make the property handle the empty array gracefully if you don't want to change the default value here.

Right. Though the already implemented C++ backend doesn't call dims as it reads from files, it was meant to serve other backends that may need the actual vector arrays. I'll go with both approaches: use np.empty((0, 0)) for the empty arrays AND make the dims property handle empty arrays gracefully. This way we're defensive on both fronts.

dantegd · 2026-01-15T15:26:47Z

+
+            # Check defaults
+            assert backend.data_prefix == ""
+            assert backend.index_prefix == ""


Suggested change

assert backend.index_prefix == ""

CppGoogleBenchmarkBackend doesn't have an index_prefix attribute, does it? Sorry if I missed it

Oh I haven't updated these tests since the major refactoring I made from to the backend. I initially had index_prefix which I ended up removing. I will refactor these tests today

dantegd · 2026-01-15T15:29:36Z

+        super().__init__(config)
+
+        self.executable_path = Path(config["executable_path"])
+        self.data_prefix = config.get("data_prefix", "")


Here is where I don't see index_prefix but I see it all over in the tests, not sure if it's missing here or I missed where it's coming from

Yep you are right

dantegd · 2026-01-15T15:39:45Z

+            temp_exec.unlink(missing_ok=True)
+
+    def test_backend_gpu_detection(self):
+        """Test GPU detection from backend name."""


I'm a bit confused here, this test assumes GPU is detected from executable name, but supports_gpu actually reads from config:

# base.py line 524 return self.config.get("requires_gpu", False)

The test creates executables with names like CUVS_IVF_FLAT but doesn't set requires_gpu in config, could use some documentation about the assumptions of GPU detection.

I'm a bit confused here, this test assumes GPU is detected from executable name

Yes, this was the initial assumption because the C++ executables follow a naming pattern like CUVS_IVF_FLAT_ANN_BENCH where the CUVS prefix indicates CUDA/GPU-based algorithms. I initially thought parsing the executable name would be a quick way to infer GPU requirements. However, this approach was error-prone and redundant since algorithms.yaml already declares requires_gpu for each algorithm. To match the legacy run.py which validates algorithms using algorithms.yaml, I refactored to read requires_gpu from config instead.

dantegd · 2026-01-15T15:42:09Z

+    def supports_gpu(self) -> bool:
+        """
+        Whether this backend uses GPU acceleration.
+
+        Reads from config["requires_gpu"], matching algorithms.yaml structure.
+
+        Returns
+        -------
+        bool
+            True if backend uses GPU, False otherwise (default)
+        """
+        return self.config.get("requires_gpu", False)


Suggested change

def supports_gpu(self) -> bool:

"""

Whether this backend uses GPU acceleration.

Reads from config["requires_gpu"], matching algorithms.yaml structure.

Returns

-------

bool

True if backend uses GPU, False otherwise (default)

"""

return self.config.get("requires_gpu", False)

def requires_gpu(self) -> bool:

"""

Whether this backend uses GPU acceleration.

Reads from config["requires_gpu"], matching algorithms.yaml structure.

Returns

-------

bool

True if backend uses GPU, False otherwise (default)

"""

return self.config.get("requires_gpu", False)

The naming is confusing. If a backend "supports" GPU but doesn't "require" it, the pre-flight check would incorrectly skip it when GPU is unavailable. The semantic difference matters for writing correct plugins.
Suggestion: Use consistent naming: requires_gpu property that reads requires_gpu config.

You're right. The naming creates a semantic mismatch. The current behavior is 'requires' (skip if unavailable), not 'supports' (run anyway). I'll rename the property to requires_gpu to match both the config key and the actual behavior. This also makes it clearer for future plugin authors that if they set requires_gpu=True, the backend will be skipped when GPU is unavailable. I also explicitly mentioned it in the docstring.

dantegd · 2026-01-15T15:44:07Z

+    # =========================================================================
+    # Helper methods (copied from run.py as-is)
+    # =========================================================================


Are the methods copied here to avoid circular imports or what's the reason? If not, we should import these from run, otherwise it would be better to extract the shared utilities to a common module.

Those methods were copied to ensure backward compatibility with the legacy output format (same JSON structure, same file paths). My initial thought was to eventually remove run.py and runners.py since the new pluggable backend is self-contained and replicates their functionality. Importing from run.py would create a dependency on code I intended to deprecate. That said, if we decide to keep both workflows long-term, extracting the shared utilities to a common module would be cleaner than duplication. If you have any suggestion please let me know.

That makes total sense, I would just suggest adding a comment to the old run.py/runners.py that they are deprecated (it's not in the PR yet, is it? If it is, sorry for missing it!)

dantegd · 2026-01-15T15:45:57Z

+
+        # Dry run: print command and return without executing
+        if dry_run:
+            print(f"Benchmark command for {self.output_filename[0]}:\n{' '.join(cmd)}\n")


There are multiple prints in the code that seem purposeful and useful for the user, I think this would be better served if we start using Python logging module

Right. Working on this

dantegd · 2026-01-15T15:47:46Z

@@ -0,0 +1,47 @@
+#


This is a general comment, so adding it here. The module structure has potential circular dependency:

backends/__init__.py imports from registry

orchestrator/__init__.py imports from backends.registry

orchestrator/registry.py re-exports and calls _register_builtin_loaders()
While Python handles this with caching, the architecture is fragile, I wonder if there might be a better way to architect that part.

The auto-registration of the C++ backend at import time was convenient but creates this fragility. Let me think through this.

dantegd · 2026-01-15T15:49:26Z

+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Unit tests for the C++ Google Benchmark backend.


I think a good improvement would be to add integration tests with mock subprocess for CppGoogleBenchmarkBackend

…equires_gpu for semantic clarity. Also update docstring

…to be more descriptive

dantegd

Great improvements, just have some additional comments and requests

dantegd · 2026-02-11T16:11:47Z

+            "real_time": self.search_time_ms,
+            "time_unit": "s",


Suggested change

"real_time": self.search_time_ms,

"time_unit": "s",

"real_time": self.search_time_ms,

"time_unit": "ms",

Unlike build, we are looking at self.search_time_ms instead of self.search_time_s here, so either the label or the variable need a small correction

Good catch that was a typo. the The C++ Google Benchmark outputs build time in s and search time in ms

dantegd · 2026-02-11T16:28:36Z

+            queries_per_second=20.0,
+            recall=0.95,
+            algorithm="test_algo",
+            search_params={"nprobe": 10}


The type annotation states search_params as List[Dict[str, Any]], just a minor niptick here is that this is a straight dict.

But also will make the assertion below json_result["nprobe"] == 10 throw an error, no? Because to_json will return

result = { ... "search_params": {"nprobe": 10}, # stored as a nested dict under key "search_params" ... }

Thanks for pointing this out. This is a bug in the test. Initially when writing the test, I had search_params written as a dictionary. However, to match the legacy benchmark API (run.py + runners.py) which enables more than one search parameter in a single run, I updated the SearchResult type annotation to List[Dict[str, Any]] but forgot to update the test data accordingly.
I didn't catch this earlier because dataclasses don't enforce type annotations at runtime, and since the test doesn't execute a code path that indexes into the list (x[0]) or iterates expecting dicts, the bug flew under the radar.

But also will make the assertion below json_result["nprobe"] == 10 throw an error, no?

Yes, this will fail. I forgot to update both:
The test data: still passes {"nprobe": 10} instead of [{"nprobe": 10}]
The assertion: still accesses json_result["nprobe"] instead of json_result["search_params"][0]["nprobe"]

dantegd · 2026-02-11T16:36:44Z

+            index_name = f"{algo}_{group}" if group != "base" else f"{algo}"
+            for i in range(len(params)):
+                index["build_param"][param_names[i]] = params[i]
+                index_name += "." + f"{param_names[i]}{params[i]}"


Minor nitpick:

The loop's primary purpose is building the index_name string, but it also re-assigns the same key-value pairs that were already written into build_param before, no? Since param_names and params are derived from the same source, the build_param writes are no-ops.

Consider separating the two concerns for clarity:

Suggested change

index_name = f"{algo}_{group}" if group != "base" else f"{algo}"

for i in range(len(params)):

index["build_param"][param_names[i]] = params[i]

index_name += "." + f"{param_names[i]}{params[i]}"

index = {

"algo": algo,

"build_param": dict(zip(param_names, params)),

}

index_name = f"{algo}_{group}" if group != "base" else f"{algo}"

for name, val in zip(param_names, params):

index_name += f".{name}{val}"

Thanks for catching this. As you may have noticed, config_loaders is an exact replica of config methods in run.py (legacy benchmark API) to ensure full backward compatibility. The redundant assignment was inherited from the original code. I'll clean this up in a follow-up commit since it's a no-op with no functional impact

dantegd · 2026-02-11T16:42:09Z

+        force: bool = False,
+        search_threads: Optional[int] = None,
+        dry_run: bool = False,
+        append_results: bool = False


This parameter doesn't exist in the base class search() signature, unless I missed it. If a future Python backend is used in tune mode, it will get an unexpected keyword argument. Either add append_results to the base class or handle it via the config dict.

I only added it to cpp_gbench because it's specific to that backend. My thought process was that other backends like Milvus might write to a DB, not JSON. However, as you accurately pointed out, since it's not defined in the base class, a user implementing another backend might not be aware it's required, which could lead to a TypeError. I'll go with the config dict option to avoid exposing a backend-specific parameter that would be ignored by others.

I updated the config passed at cpp_gbench instantiation to include an additional key (append_results) which is specific to that backend and will be read internally via self.config.get("append_results", False). Another backend like Milvus would simply ignore this config key since it doesn't read it.

Actually, after thoroughly thinking about it, I think it's better for now to always append results in tune mode and not expose an argument for it, keeping things simple. This enables analysis of convergence and Pareto frontiers across all trials. The append logic is handled internally by the orchestrator based on trial number. In the future, if there's a use case, we can add an option to retain only the last iteration or the best result (champion)

The base class interface stays clean (no backend-specific params)

CppGoogleBenchmarkBackend reads append_results from its config dict

Other backends (e.g., Milvus) can ignore this config key or implement their own storage logic

dantegd · 2026-02-11T16:43:52Z

+        temp_conf_filename = f"{self.dataset_name}_{self.output_filename[0]}_{uuid.uuid1()}.json"
+        with open(temp_conf_filename, 'w') as f:


Both build() and search() write temp JSON configs to the current working directory. The import tempfile at the top is unused I think? Using tempfile.NamedTemporaryFile or at least tempfile.gettempdir() would be a nice improvement.

Right, this is legacy code from runners.py which manually creates temp files in the current working directory. I updated it to use tempfile.NamedTemporaryFile which writes to the system temp directory instead. However, I set delete=False because the C++ subprocess needs to read the file after Python closes it, so manual cleanup is still required. This is likely why the legacy code avoided tempfile in the first place to prevent automatic deletion before the subprocess could read the file.

dantegd · 2026-02-11T16:48:52Z

+                append_results=append_results
+            )
+
+        return result


If build=False and search=False, result remains None. run_benchmark sets build=True, search=True when both are False, but _run_trial doesn't have that guard. Also, if build=True and search=False and the build succeeds, a BuildResult is returned, but the caller (objective()) tries to access result.recall and result.search_time_ms which don't exist on BuildResult, right?

In tune mode, if build=False and search=False were passed to run_benchmark(), the guard converts both to True before calling _run_tune(). So _run_trial always receives valid values and returns a populated result. This edge case is already handled.

For the build=True, search=False edge case in tune mode, I added validation at the start of _run_tune() that raises a ValueError if search=False. This is intentional because tune mode requires search metrics (recall, latency, throughput) for Optuna to optimize

…te docstrings

bdice

Approving packaging/dependency changes.

dantegd

Thanks for all the changes, large but very good PR! I think there might be some minor things or improvement we might find out as more benchmarking is done by others using the tool, so let's merge it so we can have tests during 26.04 early.

cjnolet · 2026-02-17T15:34:16Z

/merge

add pluggable backend architecture to cuvs-bench

d37c290

jnke2016 requested a review from a team as a code owner November 12, 2025 17:12

github-project-automation Bot added this to Unstructured Data Processing Nov 12, 2025

github-project-automation Bot moved this to Todo in Unstructured Data Processing Nov 12, 2025

cjnolet reviewed Nov 17, 2025

View reviewed changes

cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Nov 17, 2025

cjnolet moved this from Todo to In Progress in Unstructured Data Processing Jan 5, 2026

cjnolet assigned jnke2016 Jan 5, 2026

jnke2016 added 12 commits January 6, 2026 23:52

refactor backends by adding paths to Dataset

c7ad0a6

refactor cpp_gbench by aligning JSON config with runners.py and C++ p…

11b7501

…arser requirements

Add plugin interface with GPU/network pre-flight checks

b49d7c3

Introduce BenchmarkOrchestrator to replace run.py using the new backe…

0c28f30

…nd API. Add registry pattern for dynamic config loader registration, enabling future backends like Milvus without modifying core orchestration code

batch multiple indexes per command (matches runners.py)

6df87c6

batch indexes per command, read algo from config

8856485

remove auto-registration at import time

be459aa

group indexes by executable, match runners.py batching

4664e4b

update docstrings

b2c2732

update docstrings

6c39af9

update docstrings based on modifications in the API

ab74913

dantegd reviewed Jan 15, 2026

View reviewed changes

jnke2016 added 8 commits January 20, 2026 02:20

return List[Dict] for search_params to match base.py type

70060e3

align search_params type and handle empty Dataset arrays

22e5f8d

update tests based on changes to backend and rename supports_gpu to r…

8870c52

…equires_gpu for semantic clarity. Also update docstring

update docstring

7f3e339

remove unnecessary indirection to the config

2b4405d

simplify path resolution and consolidate registry

ccbc0ea

update docstring

32d8d9d

rename variable

426e646

jnke2016 added 5 commits February 5, 2026 06:58

update docstrings

83835a4

Add tune mode support in config_loaders for single-config Optuna trials

45256f5

add fixme regarding YAML file for tune mode

28dbb77

accumulate all trial results in single JSON file and rename variable …

1911aa3

…to be more descriptive

Merge remote-tracking branch 'upstream/main' into pluggable-backend

1b8b3a4

dantegd requested changes Feb 11, 2026

View reviewed changes

cjnolet and others added 15 commits February 13, 2026 11:15

Merge branch 'main' into pluggable-backend

8692b4e

Remove unused subprocess output capture

26290c7

undo changes to copyright

5738316

fix style

792bbed

resolve search_params type and time_unit bugs in tests and base.py

b114c10

remove redundant build_param assignment in config_loaders loop

5d2d9fb

pass append_results via backend config instead of method arg and upda…

361d3b6

…te docstrings

leverage temp files instead of writing to the current dir

fb70018

throw an exception search=false on tune mode

56f360a

fix tests by adding missing algo property to DummyBackend

f0c5743

Merge remote-tracking branch 'upstream/main' into pluggable-backend

a82cded

deprecate legacy benchmark API

24cc511

fix style

93fee10

update copyright

4983c7e

update scikit-learn version

f780e5f

jnke2016 requested a review from a team as a code owner February 14, 2026 08:53

jnke2016 requested a review from bdice February 14, 2026 08:53

jnke2016 added 2 commits February 14, 2026 01:57

upgrade scikit-learn

323e4a2

remove tmp file

342a935

bdice approved these changes Feb 15, 2026

View reviewed changes

dantegd approved these changes Feb 17, 2026

View reviewed changes

rapids-bot Bot merged commit 8fab5b3 into rapidsai:main Feb 17, 2026
78 checks passed

github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Feb 17, 2026

		temp_conf_filename = f"{self.dataset_name}_{self.output_filename[0]}_{uuid.uuid1()}.json"
		with open(temp_conf_filename, 'w') as f:

Conversation

jnke2016 commented Nov 12, 2025 • edited by cjnolet Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Changes

Phase 1: Core Infrastructure

Phase 2: C++ Backend Wrapper

Testing

Backward Compatibility

Future Work

Testing Instructions

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dantegd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnke2016 Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dantegd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jnke2016 commented Nov 12, 2025 •

edited by cjnolet

Loading

jnke2016 Jan 20, 2026 •

edited

Loading

jnke2016 Feb 14, 2026 •

edited

Loading