Experimental cppyy by Legend101Zz · Pull Request #1769 · brian-team/brian2

Legend101Zz · 2026-02-13T19:30:33Z

Problem

The current Brian2 code generation pipeline suffers from a fundamental performance bottleneck. The issue is not tied to the specific tools we use, but rather to the Ahead-of-Time (AOT) compilation paradigm itself.

Regardless of whether we use Cython (our current approach) or manual C-extensions, the workflow remains slow and cumbersome:

1. Generate large C++ source files on disk
2. Invoke an external compiler (e.g., g++, clang) with significant overhead
3. Wait for compilation to complete (often 15–40 seconds, which disrupts interactivity)
4. Dynamically load the compiled result through a complex process

In other words, the bottleneck lies in the file-based, external-compiler, AOT workflow.

Proposed Solution: JIT Compilation with `cppyy`

This PR introduces cppyy as a new runtime code generation target, shifting from AOT to Just-in-Time (JIT) compilation.

With cppyy, C++ code is compiled in-memory using the Cling C++ interpreter, which eliminates:

- File I/O overhead
- External compiler process spawning
- Long compilation waiting times
- Complex dynamic loading procedures
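
As a rough illustration of the in-memory path, the sketch below compiles a small C++ function with `cppyy.cppdef` and calls it through `cppyy.gbl`. The function name `brian_demo_square` is made up for this example and is not part of the PR. The import is guarded so the snippet still runs (and simply returns `None`) where cppyy is not installed.

```python
try:
    import cppyy  # optional dependency; may be missing on this machine
except ImportError:
    cppyy = None

# Hypothetical demo function, not taken from the PR.
CPP_SOURCE = "double brian_demo_square(double x) { return x * x; }"


def jit_square():
    """Return a callable compiled in memory by Cling, or None without cppyy."""
    if cppyy is None:
        return None
    # Define the symbol only once: Cling rejects redefinitions.
    if not hasattr(cppyy.gbl, "brian_demo_square"):
        cppyy.cppdef(CPP_SOURCE)
    return cppyy.gbl.brian_demo_square


square = jit_square()
if square is not None:
    print(square(3.0))
```

No source file is written and no external compiler process is started here; the C++ string is handed directly to the interpreter.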

Current Status

- End-to-end JIT compilation pipeline
- Basic neuron group simulations
- State updates, thresholds, and resets
- Template system for different operations
- Integration with the device layer

Next Steps

Fix dynamic arrays, the spike queue, and synapses

Legend101Zz · 2026-02-13T19:32:11Z

@mstimberg this is the initial PR. I'll keep working on this branch and will eventually get synapses working on it too, cheers :)

Legend101Zz · 2026-02-14T10:34:42Z

@mstimberg I added an introspector for what we discussed yesterday: you can view the cppyy code and the code objects, and even change them on the fly. I have attached a sample Jupyter notebook for reference that you can play around with :)

Attaching a few screenshots of how it looks:

Legend101Zz · 2026-02-14T10:40:59Z

+                self.namespace[name] = value
+
+            # ── Dynamic arrays: store BOTH the data view AND the capsule ──
+            # The data view (_ptr_array_*) gives C++ direct pointer access


I realised something while coding this: the way dynamic arrays are made available in runtime mode will have to change for cppyy, because RuntimeDevice currently creates its dynamic arrays through Cython wrappers, and those wrappers own the underlying DynamicArray1D<T>*. If we want to go fully cppyy-native, or support multiple backends, we'll need to change how RuntimeDevice stores dynamic arrays for the cppyy codegen backend, most likely by replacing the Cython wrappers with cppyy-managed objects, which seems like the best option ...

Legend101Zz · 2026-02-14T10:42:49Z

Here is a snippet I used to test things:

import time

import numpy as np

from brian2 import *

prefs.codegen.target = 'cppyy'
prefs.codegen.runtime.cppyy.enable_introspection = True # Enable introspection

# prefs.codegen.target = 'cython'
# prefs.codegen.runtime.cython.cache_dir = 'cythontmp/'
# prefs.codegen.runtime.cython.delete_source_files = False



# Hodgkin-Huxley neuron model

num_neurons = 100
duration = 500*ms

# Parameters
area = 20000*umetre**2
Cm = 1*ufarad*cm**-2 * area
gl = 5e-5*siemens*cm**-2 * area
El = -65*mV
EK = -90*mV
ENa = 50*mV
g_na = 100*msiemens*cm**-2 * area
g_kd = 30*msiemens*cm**-2 * area
VT = -63*mV

eqs = Equations('''
dv/dt = (gl*(El-v) - g_na*(m*m*m)*h*(v-ENa) - g_kd*(n*n*n*n)*(v-EK) + I)/Cm : volt
dm/dt = 0.32*(mV**-1)*4*mV/exprel((13.*mV-v+VT)/(4*mV))/ms*(1-m)-0.28*(mV**-1)*5*mV/exprel((v-VT-40.*mV)/(5*mV))/ms*m : 1
dn/dt = 0.032*(mV**-1)*5*mV/exprel((15.*mV-v+VT)/(5*mV))/ms*(1.-n)-.5*exp((10.*mV-v+VT)/(40.*mV))/ms*n : 1
dh/dt = 0.128*exp((17.*mV-v+VT)/(18.*mV))/ms*(1.-h)-4./(1+exp((40.*mV-v+VT)/(5.*mV)))/ms*h : 1
I : amp
''')

group = NeuronGroup(num_neurons, eqs,
                    threshold='v > -40*mV',
                    refractory='v > -40*mV',
                    method='exponential_euler')
group.v = El
group.I = '0.7*nA * i / num_neurons'


# SpikeMonitor: records spike times and indices (dynamic arrays)
spike_mon = SpikeMonitor(group)

# StateMonitor: records v for a few neurons every timestep (2D dynamic array)
state_mon = StateMonitor(group, 'v', record=[0, 25, 50, 75, 99])


print(f"Running {num_neurons} HH neurons for {duration}...")
t_start = time.perf_counter()
run(duration)
t_elapsed = time.perf_counter() - t_start
print(f"Done in {t_elapsed:.2f}s")

print(f"\nTotal spikes: {spike_mon.num_spikes}")
print(f"StateMonitor recorded {state_mon.t.shape[0]} timesteps "
      f"for {len(state_mon.record)} neurons")

# --- Now use the introspector ---
from brian2.codegen.runtime.cppyy_rt.introspector import get_introspector

intro = get_introspector()

# ---- 1. List all compiled code objects ----
print("=" * 60)
print("LIST OBJECTS")
print("=" * 60)
print(intro.list_objects())

# ---- 2. Inspect the state updater ----
print("\n" + "=" * 60)
print("INSPECT STATE UPDATER")
print("=" * 60)
# Using a glob pattern: "*stateupdater*" matches the full name
print(intro.inspect("*stateupdater*"))

# ---- 3. View just the params ----
print("\n" + "=" * 60)
print("PARAMS")
print("=" * 60)
print(intro.params("*stateupdater*"))

# ---- 4. View the namespace ----
print("\n" + "=" * 60)
print("NAMESPACE")
print("=" * 60)
print(intro.namespace("*stateupdater*"))

# ---- 5. View C++ globals ----
print("\n" + "=" * 60)
print("C++ GLOBALS")
print("=" * 60)
print(intro.cpp_globals())

# ---- 6. Evaluate a C++ expression ----
print("\n" + "=" * 60)
print("EVAL C++")
print("=" * 60)
print(f"M_PI = {intro.eval_cpp('M_PI')}")
print(f"sizeof(double) = {intro.eval_cpp('sizeof(double)', 'size_t')}")
print(f"_brian_mod(7, 3) = {intro.eval_cpp('_brian_mod(7, 3)', 'int32_t')}")

- Rewrite ratemonitor.cpp to use capsule-based resize pattern (was using nonexistent .push_back() on DynamicArray)
- Add _brian_cppyy_seed/_brian_cppyy_seed_random to support code and wire into RuntimeDevice.seed() for reproducible simulations
- Add parameter count logging in run_block() for debugging
- Add subgroup filtering to ratemonitor (matching Cython behavior)

Add CppyyDynamicArray1D/2D as drop-in replacements for Cython wrappers. dynamicarray.py now tries Cython first, falls back to cppyy if Cython extensions aren't compiled. Same API: .data, .resize(), .get_capsule(). PyCapsule names are identical so templates work with either backend.

- cppyy-backed SpikeQueue as drop-in Cython replacement
- Synapse templates: synapses, push_spikes, create_array, create_generator
- Capsule-based parameter passing for queue and dynamic arrays
- Python-side synapse bookkeeping after cppyy code object runs
- Fallback chain in spikequeue.py: Cython → cppyy
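
A fallback chain of this kind could look roughly like the sketch below. It is a generic pattern, not the code in spikequeue.py: the helper name `import_first_available` and the module name `_hypothetical_cython_queue` are invented for illustration, and the standard-library `json` module merely stands in for a second backend that is always importable.

```python
import importlib


def import_first_available(candidates):
    """Import and return the first module in `candidates` that is importable."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this backend was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no backend available: " + "; ".join(errors))


# The first (made-up) module does not exist, so the chain falls through.
backend = import_first_available(["_hypothetical_cython_queue", "json"])
print(backend.__name__)
```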

…xtraction, consolidated helpers

- synapses_create_generator: 1024-element buffer for pre/post arrays (O(n/1024) resizes vs O(n))
- spikemonitor: extract capsules once before spike loop, cache data pointers
- statemonitor: extract 2D capsules once before per-neuron loop
- ratemonitor: use get_array_name() instead of hardcoded _dynamic_array_ prefix
- synapses/synapses_push_spikes: move _extract_spike_queue to global support code in cppyy_rt.py
- test-cppyy-audit.py: 16-test subprocess-isolated suite (all passing)
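
The buffering idea behind the O(n/1024) figure can be sketched in plain Python. This is only an illustration of the technique under simplifying assumptions: the class `ChunkedAppender` is hypothetical, a Python list stands in for the dynamic array, and each `extend` call stands in for one resize of the underlying storage.

```python
class ChunkedAppender:
    """Collect values in a fixed-size buffer and flush them in chunks."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size
        self.data = []          # stands in for the dynamic array
        self.buffer = []        # temporary buffer of at most chunk_size items
        self.resize_count = 0   # number of (simulated) resizes

    def append(self, value):
        self.buffer.append(value)
        if len(self.buffer) == self.chunk_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.data.extend(self.buffer)  # one "resize" per flush
            self.resize_count += 1
            self.buffer.clear()


appender = ChunkedAppender()
for i in range(5000):
    appender.append(i)
appender.flush()  # write out the partially filled last chunk
print(appender.resize_count)
```

With 5000 appended values and a 1024-element buffer, the storage is grown once per full chunk plus once for the remainder, instead of once per value.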

Rewrites docs/cppyy-backend.md with full architecture visualization:

- End-to-end flow, three naming worlds, parameter sync invariant
- Template architecture, zero-copy data bridge, synapse lifecycle
- DynamicArray/SpikeQueue backends, monitor data flow
- Guard code, global support code, compilation lifecycle
- Updated limitations and next steps

…er protocol

- Port spikegenerator.cpp and spatialstateupdate.cpp from Cython templates
- Use bare N (not {{ N }}) for Constant variables in templates
- Fix cppyy int64_t buffer protocol on LP64 platforms: map int64_t→long in _cppyy_c_data_type() since cppyy rejects int64_t* (long long*) but accepts long* for numpy int64 arrays
- Add SpikeGeneratorGroup tests (basic + periodic) to test suite
- All 18 tests pass
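
The int64 mapping can be pictured with a small sketch. It is not the PR's `_cppyy_c_data_type()`: the function `c_pointer_type` and the type table are assumptions made for this example, and the platform check simply asks `ctypes` whether C `long` is 8 bytes wide, which is the case on LP64 systems.

```python
import ctypes

# Hypothetical dtype-name to C-type table, for illustration only.
_BASE_TYPES = {
    "float64": "double",
    "float32": "float",
    "int32": "int32_t",
    "int64": "int64_t",
}


def c_pointer_type(dtype_name):
    """Return the C pointer type to use for an array of the given dtype."""
    c_type = _BASE_TYPES[dtype_name]
    long_is_64_bit = ctypes.sizeof(ctypes.c_long) == 8  # true on LP64
    if c_type == "int64_t" and long_is_64_bit:
        # Per the commit message, long* is accepted where int64_t* is not.
        c_type = "long"
    return c_type + "*"


print(c_pointer_type("int64"))
```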

- Remove cppyy_dynamicarray.py and cppyy_spikequeue.py: DynamicArray and SpikeQueue are compiled from Cython at install time, no runtime fallback needed. Revert dynamicarray.py and spikequeue.py to Cython-only with hard ImportError.
- Fix 12 standalone test failures (NotImplementedError before run()): replaced self.variables["_source_offset"].get_value() with int(getattr(self.source, "start", 0)) in both _add_synapses_from_arrays and _add_synapses_generator. CPPStandaloneDevice rejects get_value() before run(); the offset values are Python-time constants. getattr(..., 0) also handles Synapses-as-source (no .start attribute).
- Fix test_synapses_state_monitor (Python-side size desync): the new Cython synapse creation templates update C++ m_size directly but Python-side .size was only synced for cppyy code objects. Call _resize() unconditionally for all backends. Keep _update_synapse_numbers() cppyy-only: Cython templates already update N_outgoing/N_incoming in C++; calling it again doubles the counts.
- Fix SyntaxWarning in introspector.py: invalid escape sequence \d -> \\d in docstring.

_resize() and get_value() on _synaptic_pre cannot be called during connect() under CPPStandaloneDevice — the C++ code is only scheduled, not executed, so synapse counts are not yet known. Guard both blocks in _add_synapses_from_arrays and _add_synapses_generator with isinstance(get_device(), RuntimeDevice) so standalone tests pass while the Cython/cppyy runtime fixes from the previous commit are preserved.

…one failures

len(self) calls get_value(), which raises NotImplementedError on CPPStandaloneDevice before run(). Move old_num_synapses capture inside the RuntimeDevice guard so it is only evaluated on runtime (numpy/cython/cppyy) devices.

- Add group_get_indices.cpp template: loops over N neurons, evaluates the condition expression, and collects matching indices into a pre-allocated output buffer (_return_values_buf) with a count in _return_values_n.
- CppyyCodeGenerator.determine_keywords(): detect group_get_indices by checking that both _cond and _indices are AuxiliaryVariables (unique to the IndexWrapper.__getitem__ path), then append the two output-buffer params to function_params so the C++ signature includes them.
- CppyyCodeObject.variables_to_namespace(): inject _return_values_buf and _return_values_n numpy arrays when template_name == 'group_get_indices'.
- CppyyCodeObject._build_param_mapping(): mirror the two extra entries so the Python call-site args match the C++ signature.
- CppyyCodeObject.run_block(): after compiled_func(*args), if this is a group_get_indices codeobj return the sliced result array.
- conftest.py: add cppyy implementation of fake_randn so tests using the fake_randn_randn_fixture work under the cppyy target.
- tests/__init__.py: auto-detect cppyy alongside numpy/cython so calling brian2.test() without explicit targets also runs the cppyy suite.
- run_test_suite.py: detect cppyy availability and add it to in_parallel so CI standalone:false jobs also exercise the cppyy target.
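
The buffer-plus-count pattern used by the group_get_indices template can be mimicked in Python. This is a sketch of the idea only, not a translation of the C++ template: the function `get_indices` is hypothetical, and an `array('i')` stands in for the pre-allocated `_return_values_buf`.

```python
from array import array


def get_indices(values, condition):
    """Collect indices where `condition` holds into a pre-allocated buffer."""
    n = len(values)
    buf = array("i", [0]) * n  # output buffer sized for the worst case
    count = 0                  # number of valid entries written so far
    for idx in range(n):
        if condition(values[idx]):
            buf[count] = idx
            count += 1
    # Only the first `count` entries are meaningful; slice them off.
    return buf[:count]


voltages_mv = [-70, -30, -50, -10]
print(list(get_indices(voltages_mv, lambda v: v > -40)))
```

Sizing the buffer for all N elements up front means the loop never has to grow it, and the caller slices the result using the returned count.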

initialise_queue() calls get_value() on eventspace, _delays and synapse_sources, which raises NotImplementedError under CPPStandaloneDevice before run(). The before_run() override that calls it was added for cppyy (C++ before_code blocks can't invoke Python), but the guard was missing. Under standalone mode the queue is set up in the generated C++ code, so Python must not try to initialise it during before_run().

- Add cppyy>=3.1 as optional dependency (pip install .[cppyy])
- Install cppyy on all non-standalone runners (Linux, macOS, Windows)
- Add ilammy/msvc-dev-cmd step on Windows so Cling can find cl.exe at JIT time
- Add DYLD_LIBRARY_PATH for macOS runners to resolve cppyy's hardcoded MacPorts zstd path against Homebrew locations (arm64 + Intel)
- Soft-fail the install step so CI is not broken if cppyy is unavailable

CPyCppyy has no pre-built wheel for Python 3.14+ on Windows. Building from source fails: the pre-built cppyy_backend-1.15.3 .lib is missing Cppyy::GetNumBasesLongestBranch which CPyCppyy 1.13.0 requires at link time. Re-enable once cppyy publishes compatible Windows wheels.

…names

When Brian2 GC's a TimedArray (e.g. at test teardown), its Python name becomes available for reuse. A subsequent test can create a new TimedArray with the same name but different K/N parameters, generating a different C++ function body under the same symbol (e.g. `_timedarray`). The previous #ifndef guard was keyed on the body content-hash, so two bodies with the same symbol but different hashes would both try to define the same C++ symbol in Cling, causing a "redefinition" error.

Fix strategy:

- cppyy_generator: wrap each user-function support code piece in a guard keyed by the C++ *symbol name* (not body hash) so Cling only compiles the first occurrence of any given name. Fix _extract_primary_cpp_symbol to only inspect the first declaration line (not function body lines).
- cppyy_rt: add _rename_conflicting_user_functions() that detects when a function name is reused with a different body (different content hash) and renames both the function and its _namespace_*_values global in the code string. This prevents both the Cling redefinition error and the cppyy "buffer too large for value" error from reassigning a double* global to an array of a different size.
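
The two halves of this fix can be sketched in Python as string handling. Everything named here is hypothetical (`guard_by_symbol`, `SymbolRegistry`, the `_BRIAN_DEMO_GUARD_` prefix, the hash-suffix naming scheme); the PR's actual helpers also rewrite the code string itself, which this sketch leaves out.

```python
import hashlib


def guard_by_symbol(symbol, body):
    """Wrap C++ source in an include-style guard keyed by the symbol name."""
    macro = f"_BRIAN_DEMO_GUARD_{symbol}"
    return f"#ifndef {macro}\n#define {macro}\n{body}\n#endif\n"


class SymbolRegistry:
    """Detect reuse of a symbol name with a different function body."""

    def __init__(self):
        self._first_body = {}  # symbol -> sha256 of the first body seen

    def resolve(self, symbol, body):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first = self._first_body.setdefault(symbol, digest)
        if first == digest:
            return symbol  # same body as before: safe to reuse the name
        # Different body under the same name: derive a distinct name.
        return f"{symbol}_{digest[:8]}"


registry = SymbolRegistry()
body_a = "double _timedarray(double t) { return 1.0; }"
body_b = "double _timedarray(double t) { return 2.0; }"
print(registry.resolve("_timedarray", body_a))
print(registry.resolve("_timedarray", body_b))
print(guard_by_symbol("_timedarray", body_a))
```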

…apses_create_generator

When result_index_condition=True and if_expression is set (e.g. S.connect("i==j")), both create_cond and update sections independently declare `const int32_t _post_idx = _raw_post_idx;` in the same C++ scope. Cling rejects the second declaration as a redefinition.

Fix: wrap the create_cond code section in a braced scope `{}` with the condition result captured to `_create_cond_result`. The update section then declares _post_idx first in the outer scope, which is also available for the buffer-filling loop.

This fixes ~14 test_subgroup.py and test_synapses.py failures (test_synaptic_propagation, test_synapse_creation_generator_*, test_spike_monitor, test_no_reference_*, etc.).

The cppyy group_variable_set.cpp and group_variable_set_conditional.cpp templates were missing the {# ALLOWS_SCALAR_WRITE #} directive that Cython equivalents have. Without it, the code generator raises "Writing to scalar variable X not allowed in this context" when setting shared variables like G.E_L = "expression", S.delay = 1*ms, etc. Fixes test_scalar_variable, test_delay_specification, test_delays_pathways, test_scalar_parameter_access, and related tests.

… GSL skipping

Three bugs caused CI failures for the cppyy runtime target:

1. `static std::mt19937 _brian_cppyy_rng` had internal linkage, so each new Cling translation unit (compiled per network.run() call) got a fresh default-seeded copy, and all runs produced identical random values. Fix: remove `static` to give external linkage; one shared instance across all TUs. Also move `_dist_rand` to file scope (no static).
2. `seed()` checked `hasattr(cppyy.gbl, "_brian_cppyy_seed")` before the support code was compiled, so pre-run seed() calls were silent no-ops. Fix: call `_ensure_support_code()` eagerly inside `seed()`.
3. `get/set_random_state()` ignored C++ RNG state entirely, so `restore(restore_random_state=True)` could not reproduce identical runs. Fix: expose `_brian_cppyy_get/set_rng_state()` C++ functions (using std::ostringstream/istringstream) and integrate into get/set_random_state().

Additionally, `std::normal_distribution` has an internal cache that cannot be serialized. Replace with a custom Marsaglia polar method using explicit `_brian_randn_has_spare` / `_brian_randn_spare` file-scope variables that round-trip cleanly through the state string.

GSL tests were also failing because `skip_if_not_implemented` only skipped for the numpy target, not cppyy. Fix: check `effective in ("numpy", "cppyy")`.
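
The Marsaglia polar method with an explicit spare value can be sketched in Python to show why the state round-trips. This is an illustrative stand-in rather than the PR's C++ code: the class `PolarNormal` is hypothetical, and Python's `random.Random` plays the role of `std::mt19937`.

```python
import math
import random


class PolarNormal:
    """Standard-normal generator whose full state can be saved and restored."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._has_spare = False  # analogous to _brian_randn_has_spare
        self._spare = 0.0        # analogous to _brian_randn_spare

    def randn(self):
        if self._has_spare:
            self._has_spare = False
            return self._spare
        # Draw points in the square until one falls inside the unit circle.
        while True:
            u = 2.0 * self._rng.random() - 1.0
            v = 2.0 * self._rng.random() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:
                break
        factor = math.sqrt(-2.0 * math.log(s) / s)
        self._spare = v * factor  # keep the second sample for the next call
        self._has_spare = True
        return u * factor

    def get_state(self):
        return (self._rng.getstate(), self._has_spare, self._spare)

    def set_state(self, state):
        rng_state, self._has_spare, self._spare = state
        self._rng.setstate(rng_state)


gen = PolarNormal(seed=42)
gen.randn()                 # leaves one spare value cached
saved = gen.get_state()
first = [gen.randn() for _ in range(5)]
gen.set_state(saved)
second = [gen.randn() for _ in range(5)]
print(first == second)
```

Because the spare value and its flag are ordinary fields, saving them alongside the uniform generator's state is enough to replay the same sequence, which a hidden cache inside a distribution object would not allow.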

…ay in run_block

Three per-call savings in CppyyCodeObject.run_block, the hot path that every code object hits on every timestep. With ~14 code objects/timestep over 50k+ timesteps for a Kremer-class run, micro-overhead compounds heavily.

- Remove per-call logger.diagnostic(): each call formatted a debug string and invoked BrianLogger._log even when the level was filtered out. Single biggest win (~30-40% reduction on warm sim).
- Guard np.ascontiguousarray() behind not val.flags.c_contiguous: Brian2 arrays are virtually always C-contiguous, so the unconditional call was a ~0.1 µs/array no-op. Also cache the 1-element empty-array dummy at module level instead of np.zeros'ing it per call.
- Cache the normalized args tuple per block. _build_args() runs once per cache miss; run_block then dispatches the cached tuple directly. The cache is cleared by update_namespace() only when nonconstant_values (dynamic-array references) are present; static-namespace blocks keep the cache for the entire run. The val-is-None fallback still allocates a fresh np.zeros (C++ may write to it).

Measured on EXTRA_CLING_ARGS=" -O2", arm64, Py 3.13, cppyy 3.5.0:

| warm sim ratio cppyy/cython | before | after |
| --- | --- | --- |
| small_lif | 1.93x | 1.61x |
| coba | 2.14x | 1.25x |
| kremer3 | 1.79x | 1.10x |

cppyy's cold-compile advantage is preserved: 19-47x faster end-to-end on a cold Cython cache.

193 tests pass across test_neurongroup, test_monitor, test_synapses, test_subgroup, test_spikegenerator, test_poissongroup, test_poissoninput, test_refractory, test_thresholder with target=cppyy.

Single file, ~30 net lines, no template or ABI change.
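
The args-tuple cache and its invalidation rule can be sketched as follows. This is a simplified model, not CppyyCodeObject: the class `CachedArgsBlock` and its fields are hypothetical, and `nonconstant` is assumed to map a name to a getter returning the current value, loosely mirroring the role of nonconstant_values.

```python
class CachedArgsBlock:
    """Call `func` with arguments taken from a namespace, caching the tuple."""

    def __init__(self, func, param_names, namespace, nonconstant=None):
        self.func = func
        self.param_names = tuple(param_names)
        self.namespace = dict(namespace)
        self.nonconstant = dict(nonconstant or {})  # name -> getter
        self._args = None
        self.build_count = 0  # how often the tuple had to be rebuilt

    def _build_args(self):
        self.build_count += 1
        return tuple(self.namespace[name] for name in self.param_names)

    def update_namespace(self):
        for name, getter in self.nonconstant.items():
            self.namespace[name] = getter()
        if self.nonconstant:
            self._args = None  # references may have changed: drop the cache

    def run_block(self):
        if self._args is None:  # cache miss
            self._args = self._build_args()
        return self.func(*self._args)


static_block = CachedArgsBlock(lambda a, b: a + b, ["a", "b"], {"a": 1, "b": 2})
for _ in range(3):
    static_block.update_namespace()
    static_block.run_block()
print(static_block.build_count)
```

A block with no non-constant entries builds its tuple once and reuses it on every later call, while a block that tracks dynamic references rebuilds it after each `update_namespace()`.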

…cache

Two changes that together make cppyy beat Cython on warm sim and on multi-run (parameter-sweep / store-restore) workflows, on top of the diagnostic / ascontiguousarray / args-tuple-cache work in dd21662.

1. Per-block fast-dispatch (CppyyCodeObject)

   At the end of compile_block, for code objects whose namespace is fully static (nonconstant_values is empty, i.e. all stateupdate / threshold / reset / push_spikes / synapses run blocks once connect() is done), eagerly call _build_args(block) and store (compiled_func, args_tuple) in self._fast_dispatch[block]. run_block then short-circuits with a single dict.get and one tuple unpack, skipping the cache-miss check and the per-call template_name string compares.

   The three template_name == "..." string compares for return-value templates (group_get_indices, group_variable_get, group_variable_get_conditional) are replaced by a single self._return_kind: str | None set once in __init__ and consulted once per call.

   update_namespace clears _fast_dispatch defensively (no-op for static blocks; matters only if a subclass later opts in to nonconstant_values).

2. Process-level Cling compile cache

   Module-level _compiled_block_cache: dict[sha256, (compiled_func, unique_func_name)] keyed on the canonical post-rename / post-guard / pre-counter-rename source. In compile_block, before allocating a new counter suffix and calling cppyy.cppdef, look up the cache. On hit, reuse the previously-compiled cppyy proxy; on miss, do the existing flow and store.

   cppyy proxies are bound to cppyy.gbl, not to a code object; sharing across CppyyCodeObject instances is safe. Per-codeobject globals (e.g. _namespace_timedarray_values) are still re-pointed by _set_user_func_globals on every compile_block call, hit or miss. _rename_conflicting_user_functions already disambiguates bodies that would collide, so the cache key only matches when reuse is correct.

   Zero impact on workloads with unique codeobj names per iteration (Brian2's default for unnamed objects); 2.6-4.6x faster setup on repeated Network.run() with stable names.

Measured on EXTRA_CLING_ARGS=" -O2", arm64, Py 3.13, cppyy 3.5.0, Cython 3.1.3 (median of 8+ samples, subprocess-isolated):

| warm sim ratio cppyy/cython | before dd21662 | after dd21662 | after THIS |
| --- | --- | --- | --- |
| small_lif | 1.93x | 1.61x | 0.87x |
| coba | 2.14x | 1.25x | 0.85x |
| kremer3 | 1.79x | 1.10x | 0.87x |

5-iteration parameter-sweep total (stable names, fresh cython cache):

- cython: 15.47 s
- cppyy: 0.69 s = 22x faster end-to-end

193 tests pass across test_neurongroup, test_monitor, test_synapses, test_subgroup, test_spikegenerator, test_poissongroup, test_poissoninput, test_refractory, test_thresholder with target=cppyy.

Single file, ~120 net lines, no template / generator / ABI change.

Legend101Zz added 3 commits February 14, 2026 00:55

feat: cppyy codegen backend working changes

f7c9956

remove: unneeded files

a16280b

fix: template

d390875

Legend101Zz mentioned this pull request Feb 13, 2026

Add cppyy Runtime Code Generation Target for Brian2 #1674

Closed

Legend101Zz marked this pull request as draft February 13, 2026 19:33

Legend101Zz added 2 commits February 14, 2026 15:18

feat: add cppyy introspector and dynamic array fix

9431c44

add: jupyter notebook for tests

69fe747

fix: remove unneeded code

5a2320b

Legend101Zz mentioned this pull request Feb 14, 2026

[Proposal] Introspection Engine for Automatic Backend Tuning & Advisory (GSoC 2026) #1753

Open

Legend101Zz added 9 commits March 19, 2026 22:55

Merge branch 'master' into experimental-cppyy

ca3daf0

chore: update cppyy docs

5c26a88

chore: delete old test

837d1ab

mstimberg mentioned this pull request Mar 25, 2026

Investigate implementing rand/randn in pure Cython #1264

Open

mushkanrana73 added a commit to mushkanrana73/brian2 that referenced this pull request Mar 28, 2026

test:add cython synapses regression coverage for brian-team#1769

2cb28f7

mushkanrana73 added a commit to mushkanrana73/brian2 that referenced this pull request Mar 28, 2026

test:add cython synapses regression coverage for brian-team#1769

06ab8a6

mushkanrana73 mentioned this pull request Mar 28, 2026

test:add cython synapses regression coverage for #1769 #1805

Closed

Legend101Zz added 3 commits April 15, 2026 00:10

Legend101Zz added 12 commits April 15, 2026 22:32

fix: use constant_or_scalar for N_pre/N_post in synapses_create_generator to support Synapses-as-target

40c8fa3

fix: raise IndexError from C++ bounds errors in synapses_create_generator; use mutable _uiter_size for fixed-size sample

78a1d59

fix: rewrite threshold/group_variable_get templates, add cppyy key to timedarray/binomial, fix introspector SyntaxWarning

801a9b4

chore: remove dev-only scratch files and draft docs from PR

bb73bac

This was referenced May 6, 2026

Improve compilation speed in standalone mode #1825

Open

Investigate our use of update_namespace #1831

Open

Legend101Zz added 2 commits May 25, 2026 23:56

Legend101Zz wants to merge 32 commits into brian-team:master from Legend101Zz:experimental-cppyy
