Experimental cppyy #1769
@mstimberg this is the initial PR. I'll keep working on this branch and eventually get synapses working on it as well, cheers :)
@mstimberg I added an introspection tool for what we discussed yesterday: viewing the generated cppyy code and the code objects, and even changing them on the fly. I have attached a sample Jupyter notebook for reference that you can play around with :) Attaching a few screenshots of how it looks:
|
self.namespace[name] = value

# ── Dynamic arrays: store BOTH the data view AND the capsule ──
# The data view (_ptr_array_*) gives C++ direct pointer access
So I realised something while coding this: the way we implemented dynamic arrays for runtime mode will have to change for cppyy, because RuntimeDevice currently creates its dynamic arrays through Cython wrappers, and those wrappers own the underlying DynamicArray1D<T>*. If we want to go fully cppyy-native, or support multiple backends, we'll need to change how RuntimeDevice stores dynamic arrays for the cppyy codegen backend, for example by replacing the Cython wrappers with cppyy-managed objects, which I think would be the best option ...
|
Here is a snippet I used to test things:
import time
import numpy as np
from brian2 import *
prefs.codegen.target = 'cppyy'
prefs.codegen.runtime.cppyy.enable_introspection = True # Enable introspection
# prefs.codegen.target = 'cython'
# prefs.codegen.runtime.cython.cache_dir = 'cythontmp/'
# prefs.codegen.runtime.cython.delete_source_files = False
# Hodgkin-Huxley neuron model
num_neurons = 100
duration = 500*ms
# Parameters
area = 20000*umetre**2
Cm = 1*ufarad*cm**-2 * area
gl = 5e-5*siemens*cm**-2 * area
El = -65*mV
EK = -90*mV
ENa = 50*mV
g_na = 100*msiemens*cm**-2 * area
g_kd = 30*msiemens*cm**-2 * area
VT = -63*mV
eqs = Equations('''
dv/dt = (gl*(El-v) - g_na*(m*m*m)*h*(v-ENa) - g_kd*(n*n*n*n)*(v-EK) + I)/Cm : volt
dm/dt = 0.32*(mV**-1)*4*mV/exprel((13.*mV-v+VT)/(4*mV))/ms*(1-m)-0.28*(mV**-1)*5*mV/exprel((v-VT-40.*mV)/(5*mV))/ms*m : 1
dn/dt = 0.032*(mV**-1)*5*mV/exprel((15.*mV-v+VT)/(5*mV))/ms*(1.-n)-.5*exp((10.*mV-v+VT)/(40.*mV))/ms*n : 1
dh/dt = 0.128*exp((17.*mV-v+VT)/(18.*mV))/ms*(1.-h)-4./(1+exp((40.*mV-v+VT)/(5.*mV)))/ms*h : 1
I : amp
''')
group = NeuronGroup(num_neurons, eqs,
                    threshold='v > -40*mV',
                    refractory='v > -40*mV',
                    method='exponential_euler')
group.v = El
group.I = '0.7*nA * i / num_neurons'
# SpikeMonitor: records spike times and indices (dynamic arrays)
spike_mon = SpikeMonitor(group)
# StateMonitor: records v for a few neurons every timestep (2D dynamic array)
state_mon = StateMonitor(group, 'v', record=[0, 25, 50, 75, 99])
print(f"Running {num_neurons} HH neurons for {duration}...")
t_start = time.perf_counter()
run(duration)
t_elapsed = time.perf_counter() - t_start
print(f"Done in {t_elapsed:.2f}s")
print(f"\nTotal spikes: {spike_mon.num_spikes}")
print(f"StateMonitor recorded {state_mon.t.shape[0]} timesteps "
      f"for {len(state_mon.record)} neurons")
# --- Now use the introspector ---
from brian2.codegen.runtime.cppyy_rt.introspector import get_introspector
intro = get_introspector()
# ---- 1. List all compiled code objects ----
print("=" * 60)
print("LIST OBJECTS")
print("=" * 60)
print(intro.list_objects())
# ---- 2. Inspect the state updater ----
print("\n" + "=" * 60)
print("INSPECT STATE UPDATER")
print("=" * 60)
# Using a glob pattern: "*stateupdater*" matches the full name
print(intro.inspect("*stateupdater*"))
# ---- 3. View just the params ----
print("\n" + "=" * 60)
print("PARAMS")
print("=" * 60)
print(intro.params("*stateupdater*"))
# ---- 4. View the namespace ----
print("\n" + "=" * 60)
print("NAMESPACE")
print("=" * 60)
print(intro.namespace("*stateupdater*"))
# ---- 5. View C++ globals ----
print("\n" + "=" * 60)
print("C++ GLOBALS")
print("=" * 60)
print(intro.cpp_globals())
# ---- 6. Evaluate a C++ expression ----
print("\n" + "=" * 60)
print("EVAL C++")
print("=" * 60)
print(f"M_PI = {intro.eval_cpp('M_PI')}")
print(f"sizeof(double) = {intro.eval_cpp('sizeof(double)', 'size_t')}")
print(f"_brian_mod(7, 3) = {intro.eval_cpp('_brian_mod(7, 3)', 'int32_t')}")
- Rewrite ratemonitor.cpp to use capsule-based resize pattern (was using nonexistent .push_back() on DynamicArray)
- Add _brian_cppyy_seed/_brian_cppyy_seed_random to support code and wire into RuntimeDevice.seed() for reproducible simulations
- Add parameter count logging in run_block() for debugging
- Add subgroup filtering to ratemonitor (matching Cython behavior)
Add CppyyDynamicArray1D/2D as drop-in replacements for Cython wrappers. dynamicarray.py now tries Cython first, falls back to cppyy if Cython extensions aren't compiled. Same API: .data, .resize(), .get_capsule(). PyCapsule names are identical so templates work with either backend.
- cppyy-backed SpikeQueue as drop-in Cython replacement
- Synapse templates: synapses, push_spikes, create_array, create_generator
- Capsule-based parameter passing for queue and dynamic arrays
- Python-side synapse bookkeeping after cppyy code object runs
- Fallback chain in spikequeue.py: Cython → cppyy
…xtraction, consolidated helpers
- synapses_create_generator: 1024-element buffer for pre/post arrays (O(n/1024) resizes vs O(n))
- spikemonitor: extract capsules once before spike loop, cache data pointers
- statemonitor: extract 2D capsules once before per-neuron loop
- ratemonitor: use get_array_name() instead of hardcoded _dynamic_array_ prefix
- synapses/synapses_push_spikes: move _extract_spike_queue to global support code in cppyy_rt.py
- test-cppyy-audit.py: 16-test subprocess-isolated suite (all passing)
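The buffering idea behind the synapses_create_generator item can be sketched in plain Python. This is only an illustration of why staging values in a fixed-size buffer reduces the number of resizes, not the template code: `GrowableArray` and `BufferedAppender` are hypothetical stand-ins for the dynamic array and the 1024-element staging buffer.

```python
class GrowableArray:
    """Hypothetical stand-in for a dynamic array that counts resize calls."""

    def __init__(self):
        self.data = []
        self.resize_calls = 0

    def resize(self, new_size):
        self.resize_calls += 1
        if new_size < len(self.data):
            del self.data[new_size:]
        else:
            self.data.extend([0] * (new_size - len(self.data)))


class BufferedAppender:
    """Stage values in a fixed-size buffer and flush them with one resize."""

    def __init__(self, target, buffer_size=1024):
        self.target = target
        self.buffer_size = buffer_size
        self.buffer = []

    def append(self, value):
        self.buffer.append(value)
        if len(self.buffer) == self.buffer_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        old_size = len(self.target.data)
        self.target.resize(old_size + len(self.buffer))
        # Overwrite the freshly added slots with the staged values.
        self.target.data[old_size:] = self.buffer
        self.buffer = []


def fill(n, buffer_size=1024):
    """Append 0..n-1 through the buffer and return the resulting array."""
    arr = GrowableArray()
    appender = BufferedAppender(arr, buffer_size)
    for value in range(n):
        appender.append(value)
    appender.flush()  # write out the final, partially filled buffer
    return arr
```

With `buffer_size=1` the sketch degenerates to one resize per element, which corresponds to the O(n) case mentioned in the commit message.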
Rewrites docs/cppyy-backend.md with full architecture visualization:
- End-to-end flow, three naming worlds, parameter sync invariant
- Template architecture, zero-copy data bridge, synapse lifecycle
- DynamicArray/SpikeQueue backends, monitor data flow
- Guard code, global support code, compilation lifecycle
- Updated limitations and next steps
…er protocol
- Port spikegenerator.cpp and spatialstateupdate.cpp from Cython templates
- Use bare N (not {{ N }}) for Constant variables in templates
- Fix cppyy int64_t buffer protocol on LP64 platforms: map int64_t→long
in _cppyy_c_data_type() since cppyy rejects int64_t* (long long*)
but accepts long* for numpy int64 arrays
- Add SpikeGeneratorGroup tests (basic + periodic) to test suite
- All 18 tests pass
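The int64_t→long mapping above depends on the platform's data model, which can be checked from Python with `ctypes`. The following is a rough sketch of that decision, not the code of `_cppyy_c_data_type()`: the helper names and the small dtype table are assumptions made for illustration.

```python
import ctypes


def c_type_for_int64():
    """Pick a C type name for 64-bit integer buffers (illustrative helper)."""
    # On LP64 platforms `long` is 8 bytes, so an int64 buffer can be passed
    # as long*. Where `long` is 4 bytes (e.g. LLP64), keep the fixed-width
    # name instead.
    if ctypes.sizeof(ctypes.c_long) == 8:
        return "long"
    return "int64_t"


def c_data_type(dtype_name):
    """Map a NumPy dtype name to a C type name (assumed, partial table)."""
    table = {
        "float64": "double",
        "int32": "int32_t",
        "int64": c_type_for_int64(),
    }
    return table[dtype_name]
```

On a typical 64-bit Linux or macOS build this returns `long` for `"int64"`; the check is done at call time so the same code keeps `int64_t` on platforms where `long` is narrower.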
- Remove cppyy_dynamicarray.py and cppyy_spikequeue.py: DynamicArray and SpikeQueue are compiled from Cython at install time, no runtime fallback needed. Revert dynamicarray.py and spikequeue.py to Cython-only with hard ImportError.
- Fix 12 standalone test failures (NotImplementedError before run()): replaced self.variables["_source_offset"].get_value() with int(getattr(self.source, "start", 0)) in both _add_synapses_from_arrays and _add_synapses_generator. CPPStandaloneDevice rejects get_value() before run(); the offset values are Python-time constants. getattr(..., 0) also handles Synapses-as-source (no .start attribute).
- Fix test_synapses_state_monitor (Python-side size desync): the new Cython synapse creation templates update C++ m_size directly but Python-side .size was only synced for cppyy code objects. Call _resize() unconditionally for all backends. Keep _update_synapse_numbers() cppyy-only: Cython templates already update N_outgoing/N_incoming in C++; calling it again doubles the counts.
- Fix SyntaxWarning in introspector.py: invalid escape sequence \d -> \\d in docstring.
_resize() and get_value() on _synaptic_pre cannot be called during connect() under CPPStandaloneDevice — the C++ code is only scheduled, not executed, so synapse counts are not yet known. Guard both blocks in _add_synapses_from_arrays and _add_synapses_generator with isinstance(get_device(), RuntimeDevice) so standalone tests pass while the Cython/cppyy runtime fixes from the previous commit are preserved.
…one failures
len(self) calls get_value() which raises NotImplementedError on CPPStandaloneDevice before run(). Move old_num_synapses capture inside the RuntimeDevice guard so it is only evaluated on runtime (numpy/cython/cppyy) devices.
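The guard pattern described in the last two commit messages can be sketched without Brian2. The classes below are hypothetical stand-ins (they are not Brian2's `RuntimeDevice` or `CPPStandaloneDevice`); they only model the fact that one kind of device can report values immediately while the other raises before `run()`.

```python
class FakeDevice:
    """Hypothetical base class for the two stand-in devices."""


class FakeRuntimeDevice(FakeDevice):
    """Stand-in for a runtime device: values are available immediately."""

    def get_value(self, values):
        return list(values)


class FakeStandaloneDevice(FakeDevice):
    """Stand-in for a standalone device: code is only scheduled before run()."""

    def get_value(self, values):
        raise NotImplementedError("values are not available before run()")


def count_synapses(device, synaptic_pre):
    """Return the synapse count if the device can report it, else None."""
    # Only runtime devices have executed the creation code at this point,
    # so the bookkeeping is skipped entirely for everything else.
    if not isinstance(device, FakeRuntimeDevice):
        return None
    return len(device.get_value(synaptic_pre))
```

Checking the device type up front, rather than catching `NotImplementedError`, keeps every `get_value()` call (including ones hidden behind `len(self)`) inside the guarded branch.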
- Add group_get_indices.cpp template: loops over N neurons, evaluates the condition expression, and collects matching indices into a pre-allocated output buffer (_return_values_buf) with a count in _return_values_n.
- CppyyCodeGenerator.determine_keywords(): detect group_get_indices by checking that both _cond and _indices are AuxiliaryVariables (unique to the IndexWrapper.__getitem__ path), then append the two output-buffer params to function_params so the C++ signature includes them.
- CppyyCodeObject.variables_to_namespace(): inject _return_values_buf and _return_values_n numpy arrays when template_name == 'group_get_indices'.
- CppyyCodeObject._build_param_mapping(): mirror the two extra entries so the Python call-site args match the C++ signature.
- CppyyCodeObject.run_block(): after compiled_func(*args), if this is a group_get_indices codeobj return the sliced result array.
- conftest.py: add cppyy implementation of fake_randn so tests using the fake_randn_randn_fixture work under the cppyy target.
- tests/__init__.py: auto-detect cppyy alongside numpy/cython so calling brian2.test() without explicit targets also runs the cppyy suite.
- run_test_suite.py: detect cppyy availability and add it to in_parallel so CI standalone:false jobs also exercise the cppyy target.
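The output-buffer protocol used for group_get_indices (a pre-allocated buffer plus a count, sliced by the caller) can be illustrated in pure Python. This is a sketch of the calling convention only; `collect_indices` and `run_get_indices` are hypothetical names, and plain lists stand in for the numpy arrays passed to C++.

```python
def collect_indices(values, condition, out_buf, out_n):
    """Write matching indices into out_buf and their count into out_n[0].

    Mimics a C-style function that returns results through output
    parameters instead of a return value.
    """
    n = 0
    for idx, value in enumerate(values):
        if condition(value):
            out_buf[n] = idx
            n += 1
    out_n[0] = n


def run_get_indices(values, condition):
    """Caller side: allocate the worst-case buffer, call, then slice."""
    out_buf = [0] * len(values)  # worst case: every element matches
    out_n = [0]                  # one-element container acts as an int*
    collect_indices(values, condition, out_buf, out_n)
    return out_buf[:out_n[0]]
```

For example, `run_get_indices([-70, -30, -50, -10], lambda v: v > -40)` yields the indices of the two entries above the threshold.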
initialise_queue() calls get_value() on eventspace, _delays and synapse_sources, which raises NotImplementedError under CPPStandaloneDevice before run(). The before_run() override that calls it was added for cppyy (C++ before_code blocks can't invoke Python), but the guard was missing. Under standalone mode the queue is set up in the generated C++ code, so Python must not try to initialise it during before_run().
- Add cppyy>=3.1 as optional dependency (pip install .[cppyy])
- Install cppyy on all non-standalone runners (Linux, macOS, Windows)
- Add ilammy/msvc-dev-cmd step on Windows so Cling can find cl.exe at JIT time
- Add DYLD_LIBRARY_PATH for macOS runners to resolve cppyy's hardcoded MacPorts zstd path against Homebrew locations (arm64 + Intel)
- Soft-fail the install step so CI is not broken if cppyy is unavailable
CPyCppyy has no pre-built wheel for Python 3.14+ on Windows. Building from source fails: the pre-built cppyy_backend-1.15.3 .lib is missing Cppyy::GetNumBasesLongestBranch which CPyCppyy 1.13.0 requires at link time. Re-enable once cppyy publishes compatible Windows wheels.
…names
When Brian2 GC's a TimedArray (e.g. at test teardown), its Python name becomes available for reuse. A subsequent test can create a new TimedArray with the same name but different K/N parameters, generating a different C++ function body under the same symbol (e.g. `_timedarray`). The previous #ifndef guard was keyed on the body content-hash, so two bodies with the same symbol but different hashes would both try to define the same C++ symbol in Cling, causing a "redefinition" error.
Fix strategy:
- cppyy_generator: wrap each user-function support code piece in a guard keyed by the C++ *symbol name* (not body hash) so Cling only compiles the first occurrence of any given name. Fix _extract_primary_cpp_symbol to only inspect the first declaration line (not function body lines).
- cppyy_rt: add _rename_conflicting_user_functions() that detects when a function name is reused with a different body (different content hash) and renames both the function and its _namespace_*_values global in the code string. This prevents both the Cling redefinition error and the cppyy "buffer too large for value" error from reassigning a double* global to an array of a different size.
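The rename-on-conflict strategy can be sketched with a small registry keyed by symbol name. This is a simplified illustration, not `_rename_conflicting_user_functions()` itself: the function name, the registry layout, and the hash-suffix naming scheme are assumptions, and the sketch renames only the function symbol, not the associated `_namespace_*_values` global.

```python
import hashlib
import re


def register_user_function(name, code, registry):
    """Decide under which symbol a user function should be compiled.

    Returns (symbol, code_to_compile). code_to_compile is None when an
    identical body has already been registered under that symbol.
    """
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    known = registry.get(name)
    if known is None:
        registry[name] = digest
        return name, code
    if known == digest:
        return name, None  # same symbol, same body: nothing to compile

    # Same symbol but a different body: derive a fresh symbol from the hash.
    new_name = f"{name}_{digest[:8]}"
    if new_name in registry:
        return new_name, None
    renamed = re.sub(rf"\b{re.escape(name)}\b", new_name, code)
    registry[new_name] = digest
    return new_name, renamed
```

Keying the registry on the symbol rather than on the body hash is what lets the second, different body be detected as a conflict instead of being compiled under the existing name.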
…apses_create_generator
When result_index_condition=True and if_expression is set (e.g. S.connect("i==j")),
both create_cond and update sections independently declare `const int32_t _post_idx =
_raw_post_idx;` in the same C++ scope. Cling rejects the second declaration as a
redefinition.
Fix: wrap the create_cond code section in a braced scope `{}` with the condition
result captured to `_create_cond_result`. The update section then declares _post_idx
first in the outer scope, which is also available for the buffer-filling loop.
This fixes ~14 test_subgroup.py and test_synapses.py failures (test_synaptic_propagation,
test_synapse_creation_generator_*, test_spike_monitor, test_no_reference_*, etc.).
The cppyy group_variable_set.cpp and group_variable_set_conditional.cpp
templates were missing the {# ALLOWS_SCALAR_WRITE #} directive that Cython
equivalents have. Without it, the code generator raises "Writing to scalar
variable X not allowed in this context" when setting shared variables like
G.E_L = "expression", S.delay = 1*ms, etc.
Fixes test_scalar_variable, test_delay_specification, test_delays_pathways,
test_scalar_parameter_access, and related tests.
…ator to support Synapses-as-target
…ator; use mutable _uiter_size for fixed-size sample
… timedarray/binomial, fix introspector SyntaxWarning
… GSL skipping
Three bugs caused CI failures for the cppyy runtime target:
1. `static std::mt19937 _brian_cppyy_rng` had internal linkage, so each
new Cling translation unit (compiled per network.run() call) got a fresh
default-seeded copy — all runs produced identical random values.
Fix: remove `static` to give external linkage; one shared instance across
all TUs. Also move `_dist_rand` to file scope (no static).
2. `seed()` checked `hasattr(cppyy.gbl, "_brian_cppyy_seed")` before the
support code was compiled, so pre-run seed() calls were silent no-ops.
Fix: call `_ensure_support_code()` eagerly inside `seed()`.
3. `get/set_random_state()` ignored C++ RNG state entirely, so
`restore(restore_random_state=True)` could not reproduce identical runs.
Fix: expose `_brian_cppyy_get/set_rng_state()` C++ functions (using
std::ostringstream/istringstream) and integrate into get/set_random_state().
Additionally, `std::normal_distribution` has an internal cache that cannot
be serialized. Replace with a custom Marsaglia polar method using explicit
`_brian_randn_has_spare` / `_brian_randn_spare` file-scope variables that
round-trip cleanly through the state string.
GSL tests were also failing because `skip_if_not_implemented` only skipped
for the numpy target, not cppyy. Fix: check `effective in ("numpy", "cppyy")`.
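The reasoning behind replacing `std::normal_distribution` (its hidden cache cannot be serialized) can be illustrated with a Python sketch of the Marsaglia polar method that keeps the spare value as explicit state. This is an illustration under stated assumptions, not the C++ support code: `PolarNormal` is a hypothetical class, Python's `random.Random` stands in for the shared C++ engine, and the state is a tuple rather than a string.

```python
import math
import random


class PolarNormal:
    """Normal variates via the Marsaglia polar method with explicit state."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._has_spare = False
        self._spare = 0.0

    def randn(self):
        # The method produces two variates per accepted point; hand out
        # the stored one first.
        if self._has_spare:
            self._has_spare = False
            return self._spare
        while True:
            u = 2.0 * self._rng.random() - 1.0
            v = 2.0 * self._rng.random() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:
                break
        factor = math.sqrt(-2.0 * math.log(s) / s)
        self._spare = v * factor
        self._has_spare = True
        return u * factor

    def get_state(self):
        """Capture the engine state together with the spare variate."""
        return (self._rng.getstate(), self._has_spare, self._spare)

    def set_state(self, state):
        """Restore a state previously returned by get_state()."""
        rng_state, self._has_spare, self._spare = state
        self._rng.setstate(rng_state)
```

Because the spare variate is part of the saved state, restoring after an odd number of draws reproduces the same sequence; saving only the engine state would lose that pending value.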
…ay in run_block
Three per-call savings in CppyyCodeObject.run_block, the hot path that every
code object hits on every timestep. With ~14 code objects/timestep over
50k+ timesteps for a Kremer-class run, micro-overhead compounds heavily.
- Remove per-call logger.diagnostic(): each call formatted a debug string
and invoked BrianLogger._log even when the level was filtered out.
Single biggest win (~30-40% reduction on warm sim).
- Guard np.ascontiguousarray() behind not val.flags.c_contiguous: Brian2
arrays are virtually always C-contiguous, so the unconditional call was
a ~0.1 µs/array no-op. Also cache the 1-element empty-array dummy at
module level instead of np.zeros'ing it per call.
- Cache the normalized args tuple per block. _build_args() runs once per
cache miss; run_block then dispatches the cached tuple directly. The
cache is cleared by update_namespace() only when nonconstant_values
(dynamic-array references) are present — static-namespace blocks keep
the cache for the entire run. The val-is-None fallback still allocates
a fresh np.zeros (C++ may write to it).
Measured on EXTRA_CLING_ARGS=" -O2", arm64, Py 3.13, cppyy 3.5.0:
warm sim ratio cppyy/cython
            before   after
small_lif   1.93x    1.61x
coba        2.14x    1.25x
kremer3     1.79x    1.10x
cppyy's cold-compile advantage is preserved: 19-47x faster end-to-end on
a cold Cython cache. 193 tests pass across test_neurongroup, test_monitor,
test_synapses, test_subgroup, test_spikegenerator, test_poissongroup,
test_poissoninput, test_refractory, test_thresholder with target=cppyy.
Single file, ~30 net lines, no template or ABI change.
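The args-tuple cache and its invalidation rule can be sketched as follows. This is a minimal model of the idea, not `CppyyCodeObject`: the class name, constructor arguments, and the `build_count` counter are hypothetical and exist only to make the caching behavior observable.

```python
class CachedArgsCodeObject:
    """Toy code object that caches the argument tuple per block."""

    def __init__(self, namespace, param_names, nonconstant_names=()):
        self.namespace = dict(namespace)
        self.param_names = tuple(param_names)
        # Names whose values may be replaced between runs
        # (the role played by dynamic-array references).
        self.nonconstant_names = tuple(nonconstant_names)
        self._args_cache = {}
        self.build_count = 0

    def _build_args(self, block):
        self.build_count += 1
        return tuple(self.namespace[name] for name in self.param_names)

    def update_namespace(self, new_values):
        if not self.nonconstant_names:
            return  # static namespace: cached tuples stay valid
        for name in self.nonconstant_names:
            self.namespace[name] = new_values[name]
        self._args_cache.clear()

    def run_block(self, block, compiled_func):
        args = self._args_cache.get(block)
        if args is None:
            args = self._args_cache[block] = self._build_args(block)
        return compiled_func(*args)
```

For a static namespace the tuple is built once and reused on every timestep; only objects with nonconstant values pay for a rebuild after `update_namespace()`.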
…cache
Two changes that together make cppyy beat Cython on warm sim and on multi-run (parameter-sweep / store-restore) workflows, on top of the diagnostic / ascontiguousarray / args-tuple-cache work in dd21662.
1. Per-block fast-dispatch (CppyyCodeObject)
   At the end of compile_block, for code objects whose namespace is fully static (nonconstant_values is empty, i.e. all stateupdate / threshold / reset / push_spikes / synapses run blocks once connect() is done), eagerly call _build_args(block) and store (compiled_func, args_tuple) in self._fast_dispatch[block]. run_block then short-circuits with a single dict.get and one tuple unpack, skipping the cache-miss check and the per-call template_name string compares.
   The three template_name == "..." string compares for return-value templates (group_get_indices, group_variable_get, group_variable_get_conditional) are replaced by a single self._return_kind: str | None set once in __init__ and consulted once per call.
   update_namespace clears _fast_dispatch defensively (no-op for static blocks; matters only if a subclass later opts in to nonconstant_values).
2. Process-level Cling compile cache
   Module-level _compiled_block_cache: dict[sha256, (compiled_func, unique_func_name)] keyed on the canonical post-rename / post-guard / pre-counter-rename source. In compile_block, before allocating a new counter suffix and calling cppyy.cppdef, look up the cache. On hit, reuse the previously-compiled cppyy proxy; on miss, do the existing flow and store.
   cppyy proxies are bound to cppyy.gbl, not to a code object, so sharing across CppyyCodeObject instances is safe.
   Per-codeobject globals (e.g. _namespace_timedarray_values) are still re-pointed by _set_user_func_globals on every compile_block call, hit or miss.
   _rename_conflicting_user_functions already disambiguates bodies that would collide, so the cache key only matches when reuse is correct.
Zero impact on workloads with unique codeobj names per iteration (Brian2's default for unnamed objects); 2.6-4.6x faster setup on repeated Network.run() with stable names.
Measured on EXTRA_CLING_ARGS=" -O2", arm64, Py 3.13, cppyy 3.5.0, Cython 3.1.3 (median of 8+ samples, subprocess-isolated):
warm sim ratio cppyy/cython
            before dd21662   after dd21662   after THIS
small_lif   1.93x            1.61x           0.87x
coba        2.14x            1.25x           0.85x
kremer3     1.79x            1.10x           0.87x
5-iteration parameter-sweep total (stable names, fresh cython cache):
cython: 15.47 s
cppyy: 0.69 s = 22x faster end-to-end
193 tests pass across test_neurongroup, test_monitor, test_synapses, test_subgroup, test_spikegenerator, test_poissongroup, test_poissoninput, test_refractory, test_thresholder with target=cppyy.
Single file, ~120 net lines, no template / generator / ABI change.


Problem
The current Brian2 code generation pipeline suffers from a fundamental performance bottleneck. The issue is not tied to the specific tools we use, but rather to the Ahead-of-Time (AOT) compilation paradigm itself.
Regardless of whether we use Cython (our current approach) or manual C-extensions, the workflow remains slow and cumbersome:
In other words, the bottleneck lies in the file-based, external-compiler, AOT workflow.
Proposed Solution: JIT Compilation with cppyy
This PR introduces cppyy as a new runtime code generation target, shifting from AOT to Just-in-Time (JIT) compilation.
With cppyy, C++ code is compiled in-memory using the Cling C++ interpreter, which eliminates:
Current Status
Next Steps
Fix dynamic arrays, SpikeQueue and synapses
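For readers unfamiliar with cppyy, the in-memory JIT workflow described under the proposed solution looks roughly like the sketch below. It uses only `cppyy.cppdef` and `cppyy.gbl`; the C++ function `sketch_leak_step` and the Python helper `jit_leak_step` are made up for illustration and are not part of this PR. The import is guarded because cppyy is an optional dependency.

```python
try:
    import cppyy  # optional dependency
except ImportError:
    cppyy = None

# Illustrative C++ source; in the PR the source comes from the code generator.
CPP_SOURCE = """
double sketch_leak_step(double v, double v_rest, double dt_over_tau) {
    return v + (v_rest - v) * dt_over_tau;
}
"""


def jit_leak_step():
    """Compile CPP_SOURCE in memory and return the callable.

    Returns None when cppyy is not installed.
    """
    if cppyy is None:
        return None
    # Define the function only once per process; Cling rejects redefinitions.
    if not hasattr(cppyy.gbl, "sketch_leak_step"):
        cppyy.cppdef(CPP_SOURCE)
    return cppyy.gbl.sketch_leak_step
```

No source file, build directory, or external compiler invocation is involved: the string is handed to Cling and the resulting function is callable from Python right away.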