Description
The current implementation of parallel_cache is quite slow.
As an example, running test/tracerEq/test_consistency_2d.py::test_nonconst_tracer[DIRK33]
in Thetis (with the changes from #3982) with a hot cache produces a flamegraph like:
All the parts showing PyOP2 Cache... are where we are accessing the cache.
I think that rewriting this in C/Cython would give a significant speedup for problems where we call a lot of very small parloops, since there the cache lookup is a large fraction of the cost of each call.