Skip to content

Add pluggable backend architecture to cuvs-bench#1536

Merged
rapids-bot[bot] merged 49 commits into
rapidsai:mainfrom
jnke2016:pluggable-backend
Feb 17, 2026
Merged

Add pluggable backend architecture to cuvs-bench#1536
rapids-bot[bot] merged 49 commits into
rapidsai:mainfrom
jnke2016:pluggable-backend

Conversation

@jnke2016

@jnke2016 jnke2016 commented Nov 12, 2025

Copy link
Copy Markdown
Contributor

Summary

This PR introduces a pluggable backend architecture for cuvs-bench, enabling extensibility to support multiple benchmark backend types beyond the current C++ Google Benchmark executables.

Motivation

Currently, cuvs-bench is tightly coupled to C++ Google Benchmark executables via subprocess calls. This limits the ability to:

  • Benchmark Python-native libraries (e.g., hnswlib-python, faiss-python)
  • Test network-based vector databases (e.g., Milvus, Qdrant, Elasticsearch)
  • Allow users to add custom backends without modifying core cuvs-bench code

This PR lays the foundation for a plugin system that addresses these limitations.

Changes

Phase 1: Core Infrastructure

  • backends/base.py: Defines the BenchmarkBackend abstract interface and data structures (Dataset, BuildResult, SearchResult)
  • backends/registry.py: Implements BackendRegistry for centralized plugin management with support for:
    • Dynamic loading from Python modules
    • Configuration-based loading from YAML files
    • Runtime plugin registration
  • backends/__init__.py: Package initialization with auto-registration of built-in backends

Phase 2: C++ Backend Wrapper

  • backends/cpp_gbench.py: CppGoogleBenchmarkBackend wraps existing C++ executables to maintain 100% backward compatibility
  • Handles subprocess execution, temporary file management, and Google Benchmark JSON parsing
  • Auto-registered on import as "cpp_gbench"

Testing

  • tests/test_registry.py: Comprehensive unit tests for registry and base infrastructure (20 tests)
  • tests/test_cpp_gbench.py: Unit tests for C++ backend wrapper (8 tests)
  • All 28 tests passing

Backward Compatibility

✅ Existing C++ benchmarks continue to work without modification
✅ No changes required to existing benchmark configurations
✅ Current CLI and orchestration remain functional

Future Work

This PR enables future phases:

  • Phase 3: Python in-process backend (hnswlib-python, numpy-based algorithms)
  • Phase 4: Network VDB backend (Milvus, Qdrant, Elasticsearch)
  • Phase 5: Migration of existing cuvs_bench.run orchestration to use the plugin system

Testing Instructions

cd python/cuvs_bench
python -m pytest cuvs_bench/tests/test_registry.py cuvs_bench/tests/test_cpp_gbench.py -v

Closes #1100

gbench_results = json.load(f)

# Merge with existing results (if any)
self._merge_build_results(output_json, temp_output_json, gbench_results)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we need to verify is that in the case where google bench is stopped abruptly or otherwise fails during index build (or search), we don't lose any of the existing results that might have been generated on previous runs. We also want to start where we left off when possible so that a user doesn't have to exhaustively run anything again if it stopped halfway through.

For example, say a user runs a bunch of the benchmarks and then stops during the index build. It's possible the new json that was produced is invalid or corrupt because the user may have stopped the benchmarks prior to any results being written a file. We need to make sure that if the json isn't valid, we don't blindly overwrite any existing results with corrupt data.

For some context / background- I had this working last year because we were finding users were running into this issue quite frequently. I implemented a "two-phase commit" type of approach where we would create any new data as its own unique temporary file and then, upon a user stopping the program or the underlying c++ failing, the python layer would check to see if it can merge those results into existing results. If it could, it would do the merge. If it couldn't, it would just accept that the file is corrupt and not touch any existing "committed" results.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Nov 17, 2025
@cjnolet cjnolet moved this from Todo to In Progress in Unstructured Data Processing Jan 5, 2026
…nd API. Add registry pattern for dynamic config loader registration, enabling future backends like Milvus without modifying core orchestration code
… the orchestrator to dynamically discover both components. The C++ Google Benchmark backend is now automatically registered when the module is imported, so users no longer need to manually register it. Future backends like Milvus will still require explicit registration.

@dantegd dantegd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BenchmarkBackend abstract class is well-designed, with clear build() and search() contracts, Optional lifecycle hooks and pre-flight checks for GPU/network requirements which makes things flexible, very nice!

Dataset, BuildResult, and SearchResult are well-thought-out and the plugin registry approach is a good fit for this use case.

The overall design approach is good and reasonble, though there could be opportunities for simplification if we really wanted to, but overall a very good PR.

queries_per_second=0.0,
recall=0.0,
algorithm="",
search_params={},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
search_params={},
search_params=[],

In base.py we have: search_params: List[Dict[str, Any]] so changing this to match would be good.

@jnke2016 jnke2016 Jan 20, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I made that change in base.py when I realized that the legacy benchmark batches all indexes into one. I forgot to update the code in cpp_gbench.py for the early-return cases (no indexes or skipping). But during the actual run, it was already returning first_index.search_params directly, which is the correct List[Dict] type

return Dataset(
name=dataset_config.name,
# Note: C++ backend loads vectors from files, so we pass empty arrays
base_vectors=np.array([]),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
base_vectors=np.array([]),
base_vectors=np.empty((0, 0)),

This will avoid a crash/error in Dataset with empty arrays, with the property in Dataset where we have:

    @property
    def dims(self) -> int:
        """Vector dimensionality."""
        return self.base_vectors.shape[1]

Or alternatively, make the property handle the empty array gracefully if you don't want to change the default value here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Though the already implemented C++ backend doesn't call dims as it reads from files, it was meant to serve other backends that may need the actual vector arrays. I'll go with both approaches: use np.empty((0, 0)) for the empty arrays AND make the dims property handle empty arrays gracefully. This way we're defensive on both fronts.


# Check defaults
assert backend.data_prefix == ""
assert backend.index_prefix == ""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert backend.index_prefix == ""

CppGoogleBenchmarkBackend doesn't have an index_prefix attribute, does it? Sorry if I missed it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I haven't updated these tests since the major refactoring I made from to the backend. I initially had index_prefix which I ended up removing. I will refactor these tests today

super().__init__(config)

self.executable_path = Path(config["executable_path"])
self.data_prefix = config.get("data_prefix", "")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where I don't see index_prefix but I see it all over in the tests, not sure if it's missing here or I missed where it's coming from

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you are right

temp_exec.unlink(missing_ok=True)

def test_backend_gpu_detection(self):
"""Test GPU detection from backend name."""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here, this test assumes GPU is detected from executable name, but supports_gpu actually reads from config:

# base.py line 524
return self.config.get("requires_gpu", False)

The test creates executables with names like CUVS_IVF_FLAT but doesn't set requires_gpu in config, could use some documentation about the assumptions of GPU detection.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here, this test assumes GPU is detected from executable name

Yes, this was the initial assumption because the C++ executables follow a naming pattern like CUVS_IVF_FLAT_ANN_BENCH where the CUVS prefix indicates CUDA/GPU-based algorithms. I initially thought parsing the executable name would be a quick way to infer GPU requirements. However, this approach was error-prone and redundant since algorithms.yaml already declares requires_gpu for each algorithm. To match the legacy run.py which validates algorithms using algorithms.yaml, I refactored to read requires_gpu from config instead.

Comment on lines +454 to +465
def supports_gpu(self) -> bool:
"""
Whether this backend uses GPU acceleration.

Reads from config["requires_gpu"], matching algorithms.yaml structure.

Returns
-------
bool
True if backend uses GPU, False otherwise (default)
"""
return self.config.get("requires_gpu", False)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def supports_gpu(self) -> bool:
"""
Whether this backend uses GPU acceleration.
Reads from config["requires_gpu"], matching algorithms.yaml structure.
Returns
-------
bool
True if backend uses GPU, False otherwise (default)
"""
return self.config.get("requires_gpu", False)
def requires_gpu(self) -> bool:
"""
Whether this backend uses GPU acceleration.
Reads from config["requires_gpu"], matching algorithms.yaml structure.
Returns
-------
bool
True if backend uses GPU, False otherwise (default)
"""
return self.config.get("requires_gpu", False)

The naming is confusing. If a backend "supports" GPU but doesn't "require" it, the pre-flight check would incorrectly skip it when GPU is unavailable. The semantic difference matters for writing correct plugins.
Suggestion: Use consistent naming: requires_gpu property that reads requires_gpu config.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The naming creates a semantic mismatch. The current behavior is 'requires' (skip if unavailable), not 'supports' (run anyway). I'll rename the property to requires_gpu to match both the config key and the actual behavior. This also makes it clearer for future plugin authors that if they set requires_gpu=True, the backend will be skipped when GPU is unavailable. I also explicitly mentioned it in the docstring.

Comment on lines +397 to +399
# =========================================================================
# Helper methods (copied from run.py as-is)
# =========================================================================

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the methods copied here to avoid circular imports or what's the reason? If not, we should import these from run, otherwise it would be better to extract the shared utilities to a common module.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those methods were copied to ensure backward compatibility with the legacy output format (same JSON structure, same file paths). My initial thought was to eventually remove run.py and runners.py since the new pluggable backend is self-contained and replicates their functionality. Importing from run.py would create a dependency on code I intended to deprecate. That said, if we decide to keep both workflows long-term, extracting the shared utilities to a common module would be cleaner than duplication. If you have any suggestion please let me know.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes total sense, I would just suggest adding a comment to the old run.py/runners.py that they are deprecated (it's not in the PR yet, is it? If it is, sorry for missing it!)


# Dry run: print command and return without executing
if dry_run:
print(f"Benchmark command for {self.output_filename[0]}:\n{' '.join(cmd)}\n")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple prints in the code that seem purposeful and useful for the user, I think this would be better served if we start using Python logging module

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Working on this

@@ -0,0 +1,47 @@
#

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general comment, so adding it here. The module structure has potential circular dependency:

  • backends/__init__.py imports from registry
  • orchestrator/__init__.py imports from backends.registry
  • orchestrator/registry.py re-exports and calls _register_builtin_loaders()
    While Python handles this with caching, the architecture is fragile, I wonder if there might be a better way to architect that part.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto-registration of the C++ backend at import time was convenient but creates this fragility. Let me think through this.

# SPDX-License-Identifier: Apache-2.0

"""
Unit tests for the C++ Google Benchmark backend.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a good improvement would be to add integration tests with mock subprocess for CppGoogleBenchmarkBackend

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

@dantegd dantegd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements, just have some additional comments and requests

Comment on lines +192 to +193
"real_time": self.search_time_ms,
"time_unit": "s",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"real_time": self.search_time_ms,
"time_unit": "s",
"real_time": self.search_time_ms,
"time_unit": "ms",

Unlike build, we are looking at self.search_time_ms instead of self.search_time_s here, so either the label or the variable need a small correction

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch that was a typo. the The C++ Google Benchmark outputs build time in s and search time in ms

queries_per_second=20.0,
recall=0.95,
algorithm="test_algo",
search_params={"nprobe": 10}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type annotation states search_params as List[Dict[str, Any]], just a minor niptick here is that this is a straight dict.

But also will make the assertion below json_result["nprobe"] == 10 throw an error, no? Because to_json will return

result = {
...
    "search_params": {"nprobe": 10},  # stored as a nested dict under key "search_params"
...
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. This is a bug in the test. Initially when writing the test, I had search_params written as a dictionary. However, to match the legacy benchmark API (run.py + runners.py) which enables more than one search parameter in a single run, I updated the SearchResult type annotation to List[Dict[str, Any]] but forgot to update the test data accordingly.
I didn't catch this earlier because dataclasses don't enforce type annotations at runtime, and since the test doesn't execute a code path that indexes into the list (x[0]) or iterates expecting dicts, the bug flew under the radar.

But also will make the assertion below json_result["nprobe"] == 10 throw an error, no?

Yes, this will fail. I forgot to update both:
The test data: still passes {"nprobe": 10} instead of [{"nprobe": 10}]
The assertion: still accesses json_result["nprobe"] instead of json_result["search_params"][0]["nprobe"]

Comment on lines +752 to +755
index_name = f"{algo}_{group}" if group != "base" else f"{algo}"
for i in range(len(params)):
index["build_param"][param_names[i]] = params[i]
index_name += "." + f"{param_names[i]}{params[i]}"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpick:

The loop's primary purpose is building the index_name string, but it also re-assigns the same key-value pairs that were already written into build_param before, no? Since param_names and params are derived from the same source, the build_param writes are no-ops.

Consider separating the two concerns for clarity:

Suggested change
index_name = f"{algo}_{group}" if group != "base" else f"{algo}"
for i in range(len(params)):
index["build_param"][param_names[i]] = params[i]
index_name += "." + f"{param_names[i]}{params[i]}"
index = {
"algo": algo,
"build_param": dict(zip(param_names, params)),
}
index_name = f"{algo}_{group}" if group != "base" else f"{algo}"
for name, val in zip(param_names, params):
index_name += f".{name}{val}"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. As you may have noticed, config_loaders is an exact replica of config methods in run.py (legacy benchmark API) to ensure full backward compatibility. The redundant assignment was inherited from the original code. I'll clean this up in a follow-up commit since it's a no-op with no functional impact

force: bool = False,
search_threads: Optional[int] = None,
dry_run: bool = False,
append_results: bool = False

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter doesn't exist in the base class search() signature, unless I missed it. If a future Python backend is used in tune mode, it will get an unexpected keyword argument. Either add append_results to the base class or handle it via the config dict.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added it to cpp_gbench because it's specific to that backend. My thought process was that other backends like Milvus might write to a DB, not JSON. However, as you accurately pointed out, since it's not defined in the base class, a user implementing another backend might not be aware it's required, which could lead to a TypeError. I'll go with the config dict option to avoid exposing a backend-specific parameter that would be ignored by others.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the config passed at cpp_gbench instantiation to include an additional key (append_results) which is specific to that backend and will be read internally via self.config.get("append_results", False). Another backend like Milvus would simply ignore this config key since it doesn't read it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, after thoroughly thinking about it, I think it's better for now to always append results in tune mode and not expose an argument for it, keeping things simple. This enables analysis of convergence and Pareto frontiers across all trials. The append logic is handled internally by the orchestrator based on trial number. In the future, if there's a use case, we can add an option to retain only the last iteration or the best result (champion)

  • The base class interface stays clean (no backend-specific params)
  • CppGoogleBenchmarkBackend reads append_results from its config dict
  • Other backends (e.g., Milvus) can ignore this config key or implement their own storage logic

Comment on lines +147 to +148
temp_conf_filename = f"{self.dataset_name}_{self.output_filename[0]}_{uuid.uuid1()}.json"
with open(temp_conf_filename, 'w') as f:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both build() and search() write temp JSON configs to the current working directory. The import tempfile at the top is unused I think? Using tempfile.NamedTemporaryFile or at least tempfile.gettempdir() would be a nice improvement.

@jnke2016 jnke2016 Feb 14, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is legacy code from runners.py which manually creates temp files in the current working directory. I updated it to use tempfile.NamedTemporaryFile which writes to the system temp directory instead. However, I set delete=False because the C++ subprocess needs to read the file after Python closes it, so manual cleanup is still required. This is likely why the legacy code avoided tempfile in the first place to prevent automatic deletion before the subprocess could read the file.

append_results=append_results
)

return result

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If build=False and search=False, result remains None. run_benchmark sets build=True, search=True when both are False, but _run_trial doesn't have that guard. Also, if build=True and search=False and the build succeeds, a BuildResult is returned, but the caller (objective()) tries to access result.recall and result.search_time_ms which don't exist on BuildResult, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In tune mode, if build=False and search=False were passed to run_benchmark(), the guard converts both to True before calling _run_tune(). So _run_trial always receives valid values and returns a populated result. This edge case is already handled.

For the build=True, search=False edge case in tune mode, I added validation at the start of _run_tune() that raises a ValueError if search=False. This is intentional because tune mode requires search metrics (recall, latency, throughput) for Optuna to optimize

@jnke2016 jnke2016 requested a review from a team as a code owner February 14, 2026 08:53
@jnke2016 jnke2016 requested a review from bdice February 14, 2026 08:53

@bdice bdice left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving packaging/dependency changes.

@dantegd dantegd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes, large but very good PR! I think there might be some minor things or improvement we might find out as more benchmarking is done by others using the tool, so let's merge it so we can have tests during 26.04 early.

@cjnolet

cjnolet commented Feb 17, 2026

Copy link
Copy Markdown
Member

/merge

@rapids-bot rapids-bot Bot merged commit 8fab5b3 into rapidsai:main Feb 17, 2026
78 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Feb 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[FEA] Create new cuvs-vdb-bench package

4 participants