
Defer Python kernel compilation until invocation#3948

Merged
lmondada merged 7 commits into NVIDIA:main from lmondada:lm/aot-cache
Feb 25, 2026
Conversation

Collaborator

@lmondada lmondada commented Feb 13, 2026

This PR introduces "deferred compilation" in Python. By default, kernels are no longer compiled to MLIR (aka pre-compiled) at kernel definition time, but only when the kernel is invoked for the first time. As before, the MLIR module is then cached for further invocation. For example,

import cudaq

@cudaq.kernel
def foo():
    pass

would not trigger any compilation, until foo() is called for the first time. This enables:

  1. Faster package load times: import cudaq takes ~4s now, versus ~8s before, as the kernels included with cudaq are no longer compiled when the package is loaded. This will also apply to any other library that provides kernel definitions.
  2. Kernels can be defined out-of-order: previously, calling kernel B from kernel A would error if B was defined after A. This is now supported (see tests).
  3. Captured variables that aren't defined at kernel definition time but are defined at invocation time are now supported.

Compilation can be forced (thus recovering the old behaviour) using the defer_compilation=False flag. The following:

@cudaq.kernel(defer_compilation=False)
def foo():
    pass

triggers compilation immediately.
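The mechanics can be sketched in plain Python. The following is a toy model, not cudaq's implementation; compile_to_mlir, Kernel, and the kernel decorator here are illustrative stand-ins for the real AST-bridge and caching machinery:

```python
compile_log = []  # records when "compilation" happens, for illustration

def compile_to_mlir(func):
    """Hypothetical stand-in for lowering a Python function to MLIR."""
    compile_log.append(func.__name__)
    return f"module @{func.__name__}"  # placeholder for the cached MLIR module

class Kernel:
    def __init__(self, func, defer_compilation=True):
        self.func = func
        self._module = None
        if not defer_compilation:
            self.compile()  # old behaviour: compile at definition time

    def compile(self):
        if self._module is None:  # cache: compile at most once
            self._module = compile_to_mlir(self.func)
        return self._module

    def __call__(self, *args):
        self.compile()  # deferred: first invocation triggers compilation
        return self.func(*args)

def kernel(func=None, *, defer_compilation=True):
    if func is None:  # used as @kernel(defer_compilation=...)
        return lambda f: Kernel(f, defer_compilation)
    return Kernel(func)  # used as bare @kernel

@kernel
def foo():
    pass

assert compile_log == []        # nothing compiled at definition time
foo()
assert compile_log == ["foo"]   # first call compiles
foo()
assert compile_log == ["foo"]   # cached: not recompiled

@kernel(defer_compilation=False)
def bar():
    pass

assert compile_log == ["foo", "bar"]  # compiled immediately
```

The cached module plays the role of the MLIR module the PR description mentions: built at most once, on first use.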

Limitations

To limit the scope of this PR, I introduced two limitations (see tests for examples):

  1. Recursive kernel calls are still not supported. The way kernels are captured and lifted into arguments would have to handle recursive calls specially.
  2. A kernel builder A cannot call another kernel B using apply_call(B) if B wasn't compiled beforehand. In that case, the user receives a clear error suggesting either to set defer_compilation=False or to call B.compile() directly, e.g.
@cudaq.kernel(defer_compilation=True)
def notPrecompiledKernel():
    pass

kernel = cudaq.make_kernel()
kernel.apply_call(notPrecompiledKernel)  # fails. User must set `defer_compilation=False`
                                         # or call `notPrecompiledKernel.compile()`
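A toy model of this limitation (hypothetical class and method names, not cudaq's code) shows the intended error behaviour: apply_call needs an already-built module, so it fails loudly when the callee is still deferred.

```python
class DeferredKernel:
    def __init__(self, name):
        self.name = name
        self.module = None  # not compiled yet (deferred)

    def compile(self):
        self.module = f"module @{self.name}"  # stand-in for MLIR lowering
        return self.module

class KernelBuilder:
    def apply_call(self, kernel):
        if kernel.module is None:
            raise RuntimeError(
                f"'{kernel.name}' has no compiled module; set "
                "defer_compilation=False or call .compile() first")
        return kernel.module

builder = KernelBuilder()
k = DeferredKernel("notPrecompiledKernel")
try:
    builder.apply_call(k)   # fails: k was never compiled
except RuntimeError as e:
    print("error:", e)

k.compile()                 # explicit compile, as the error message suggests
assert builder.apply_call(k) == "module @notPrecompiledKernel"
```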

@lmondada lmondada changed the title to [wip] [python] Delay MLIR compilation until kernel invocation Feb 13, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


Signed-off-by: Luca Mondada <luca@mondada.net>

@schweitzpgi
Collaborator

> Captured variables that aren't defined at kernel definition time but are defined at invocation time are now supported.

This was already true. Symbols from EGB scopes were lambda lifted to be arguments to the kernel. This change allows EGB scoped variables to be undefined at the point the kernel is defined. Since the kernel's code is only generated later, the type(s) of these symbol(s) isn't required and no prior declaration is needed.

@schweitzpgi
Collaborator

schweitzpgi commented Feb 18, 2026

> Recursive kernel calls are still not supported. The way kernels are captured and lifted into arguments would have to handle recursive calls specially.

I'm not clear on why this is any different than any other kernel call.

@cudaq.kernel
def foo():
  foo()

should record the call to itself as a lambda lifted callable argument when we build the code for foo in the AST bridge step. This is required to come first since we'd have no idea what symbols appear in the body of foo if we tried to resolve those symbols before building the kernel code.

Given that, we then must resolve all the lambda lifted arguments. In this example, that's the symbol foo. That symbol must resolve to the kernel decorator we're building which is bound to the same symbol. heh. But we just built that code, so there is nothing to do.

Even if we inject several layers of calls ($\ge 1$) in the middle, foo is foo and the recursion naturally terminates here.

@lmondada
Collaborator Author

lmondada commented Feb 19, 2026

> I'm not clear on why this is any different than any other kernel call.
> [...]
> Given that, we then must resolve all the lambda lifted arguments. In this example, that's the symbol foo. That symbol must resolve to the kernel decorator we're building which is bound to the same symbol. heh. But we just built that code, so there is nothing to do.

First of all, recursion is currently not supported in main. This is because the symbol foo within the kernel definition:

@cudaq.kernel
def foo():
    foo()

is undefined at the definition time of foo (which is when it currently gets AOT compiled). Similarly, foo calling bar calling foo (or any other recursive setup) would not compile, given that one of the kernels would have to reference a not-yet-defined kernel.


That being said, I agree that with this code change, supporting recursion becomes easier. The execution you outline is how it should (and will) work, but the reason it currently does not work is that compilation and execution proceed as follows:

  1. Compile foo.
  2. Find a reference to foo within the body. Lambda lift that into an argument. Now foo takes a (captured) callable argument.
  3. Compilation completes successfully.
  4. Kernel foo is invoked, so the captured arguments must be resolved. Captured foo is resolved to the same decorator as being invoked.
  5. Now we need to (recursively) resolve the captured arguments of captured foo. Here foo gets resolved again to the decorator... this is how we get stuck in a loop of repeatedly capturing foo.

We need to add a conditional in the resolution of captured arguments that checks whether foo has been resolved before (and keep a list of all resolved callables somewhere), so that we can resolve it without recursing. I've already agreed with Bettina that I will work on a PR for this next.
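The conditional described above can be sketched as a memoised resolution pass over captured callables (hypothetical names, not cudaq's code). Without the memo, resolving foo -> foo would recurse forever; with it, each kernel is resolved at most once and the recursion terminates:

```python
def resolve_captures(kernel, captures, resolved=None):
    """Resolve captured callable names, visiting each kernel at most once.

    `captures` maps each kernel name to the list of kernel names it
    lambda-lifts into arguments. `resolved` is the memo that breaks cycles.
    """
    if resolved is None:
        resolved = {}
    if kernel in resolved:
        return resolved  # already handled: stop, instead of recursing again
    resolved[kernel] = captures.get(kernel, [])
    for callee in resolved[kernel]:
        resolve_captures(callee, captures, resolved)
    return resolved

# foo captures itself (direct recursion) and bar, which calls back into foo.
captures = {"foo": ["foo", "bar"], "bar": ["foo"]}
result = resolve_captures("foo", captures)
assert set(result) == {"foo", "bar"}  # terminates despite the cycles
```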

@lmondada
Collaborator Author

lmondada commented Feb 19, 2026

> Captured variables that aren't defined at kernel definition time but are defined at invocation time are now supported.
>
> This was already true. Symbols from EGB scopes were lambda lifted to be arguments to the kernel. This change allows EGB scoped variables to be undefined at the point the kernel is defined. Since the kernel's code is only generated later, the type(s) of these symbol(s) isn't required and no prior declaration is needed.

What I meant is that the following previously threw an error

@cudaq.kernel
def bar():
    q = cudaq.qvector(n)

n = 3

bar()
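One way to see why deferral makes this work is a plain-Python model of definition-time versus invocation-time name resolution. This is illustrative only, not cudaq's mechanism; the dict env stands in for the enclosing scope:

```python
env = {}  # stands in for the scope enclosing the kernel definition

def capture_at_definition(name):
    value = env.get(name)          # resolved now: n is not defined yet
    return lambda: value

def capture_at_invocation(name):
    return lambda: env.get(name)   # resolved only when the kernel is called

eager = capture_at_definition("n")
lazy = capture_at_invocation("n")

env["n"] = 3  # `n = 3` appears only after the kernel definition

assert eager() is None  # definition-time capture missed it
assert lazy() == 3      # invocation-time capture sees it
```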


Collaborator

@bettinaheim bettinaheim left a comment


All in all, looks great. I would at least pull the change to the behavior of print out of this PR - then you also don't have all that noise from test edits.

@schweitzpgi
Collaborator

schweitzpgi commented Feb 24, 2026

Is this a new PR? I thought I already reviewed something like this and left comments.
Nevermind. The GUI appears to be confused as I see my comments are still here.

Signed-off-by: Luca Mondada <luca@mondada.net>
Signed-off-by: Luca Mondada <luca@mondada.net>
Collaborator

@schweitzpgi schweitzpgi left a comment


#3965 exposes a serious issue with the resolution approach that I believe this is proposing. The example in #3965 does not call the kernel outer but rather references it. This exposes the following problem with the approach here: with deferred compilation, the kernel outer escapes the context of its creation with its content unprocessed. This means that we will have lost the symbol inner and what it is bound to.

We really need to make sure we're not letting outer escape without figuring out what enclosed scope symbols it refers to and what they resolve to at the point the reference escapes.

If you read this carefully, it implies that neither call site nor definition site resolution is sufficient.

Note: Discussed with Luca and resolved my questions.

@lmondada
Collaborator Author

lmondada commented Feb 24, 2026

@schweitzpgi #4005 resolves the issue you bring up.

It was already an issue before the changes proposed here: whilst compilation worked fine, the captured arguments could not be resolved at call time. So, either way, we cannot avoid keeping a reference to the frame defining the kernel, independently of when we compile it.

It's thus unrelated to the changes here.

Collaborator

@schweitzpgi schweitzpgi left a comment


Where did we land wrt setting the default of deferred to False? Since it still defaults to True I can infer the answer, but I missed the explanation.

Since #4005 is also in play here, we want to have tests that use deferred and not deferred code paths and validation that both cases work correctly. Deferring the instantiation of the IR should be completely orthogonal to tracking contexts to resolving arguments.


Signed-off-by: Luca Mondada <luca@mondada.net>

@lmondada
Collaborator Author

lmondada commented Feb 25, 2026

> Where did we land wrt setting the default of deferred to False? Since it still defaults to True I can infer the answer, but I missed the explanation.

Here are the pros and cons of having deferred compilation on by default:

  • pro: speeds up loading of scripts, libraries, etc. that define kernels but might not use them immediately.
  • pro: allows out-of-order definition, which matches the usual Python semantics.
  • pro: easier support for recursion (it works naturally when deferring compilation, since the kernel reference exists by the time it is compiled).
  • con: does not work with the kernel builder. This is temporary and will be fixed by lambda lifting.

The only disadvantage is temporary and will go away. Being closer to Python semantics is a serious pro.

> I'm not a huge fan of reverting all the test changes. None of the existing tests require delayed compilation, so why reintroduce that?

I've reverted all the test changes.

> I think printing the kernel is now compiling it.

Yes, I had made that change on Bettina's request. As a result of this, most tests could be reverted to their original. Thanks for spotting that, it's addressed in the latest commit.

> I think compile is misleading. What this is really doing is building the IR, if it doesn't yet exist. Maybe we should call it build_ir or instantiate_ir for clarity?

Previously, this was done in a function named pre_compile, but a compile function existed alongside it that was a no-op. I've consolidated both into one but agree that compile is ambiguous. I ask for permission to resolve this in a separate PR as getting rid of compile altogether will change a lot of unrelated code (see e.g. the uses in python/tests/kernel/test_kernel_features.py)

> We want to have tests that use deferred and not deferred code paths and validation that both cases work correctly.

I have reverted more tests to defer_compilation=False, so there should be a good mix of both now. If you want to go beyond this, we could parametrise many tests in python/tests/kernel/test_assignments.py and elsewhere on whether they should defer compilation or not. We could then run all tests twice, once for each setting. However, this gets expensive quickly. Can we explore the options in a separate PR?

> Deferring the instantiation of the IR should be completely orthogonal to tracking contexts to resolving arguments.

Correct, this PR does not affect how contexts are resolved. Delaying compilation, however, surfaced more of the existing bugs around symbol resolution in the incorrect scope, as compilation was no longer happening in the scope of definition.

Collaborator

@schweitzpgi schweitzpgi left a comment


Discussed follow-up actions with Luca.


Labels

release notes Changes need to be captured in the release notes
