[python] Unify specialize and launch in ModuleLauncher #4052

Open
lmondada wants to merge 3 commits into NVIDIA:main from lmondada:lm/launch-specialize

Conversation

lmondada (Collaborator) commented Feb 26, 2026

This PR is the start of a line of work aiming to unify the "specialize kernel" and "launch kernel" execution paths for kernels defined (and executed from) Python.

I have started by creating a CompiledKernel class that can be shared by the two execution paths at the bottom of their respective call stacks (see diagram). As we progress, we should be able to progressively unify the two call stacks into one.

```mermaid
flowchart TB
    subgraph unify["This PR"]
        module["ModuleLauncher::compileModule"]
        module --> compiled[CompiledKernel]
    end
    qpu["QPU::specializeModule"]
    qpu2["QPU::launchModule"]
    platform["quantum_platform::specializeModule"]
    platform2["quantum_platform::launchModule"]
    strm["cudaq::streamlinedSpecializeModule"]
    strm2["cudaq::streamlinedLaunchModule"]
    pyLaunchModule ~~~ strm
    pyLaunchModule["cudaq::pyLaunchModule"]
    clm["cudaq::clean_launch_module"]
    marshal["marshal_and_retain_module"]
    marshal2["marshal_and_launch_module"]

    subgraph cls["PyKernelDecorator"]
        python["beta_reduction"]
        python2["__call__"]
    end

    python --> marshal --> strm --> platform --> qpu --> module
    python2 --> marshal2 --> clm --> pyLaunchModule --> strm2 --> platform2 --> qpu2 --> module

    platform2 --> otherqpu
    subgraph other[other QPUs]
        otherqpu[...]
    end
```

lmondada (Collaborator, Author) commented Feb 27, 2026

Thanks @Renaud-K for the early review!

I have settled on a design of CompiledKernel that does not depend on MLIR, so it can live in its own file and be compiled into the cudaq-common library. The drawback is that we need to keep track of more data (e.g. the JITEngine AND the function pointer), but I think it is well worth it to remove the dependency.

To keep it easy to construct instances of this type, I've created a factory function that needs MLIR and lives in JIT.cpp. I'm making the class constructor private to make it impossible to create invalid class instances. The only way to create CompiledKernel instances is thus through the factory, which is a friend of the class.
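The private-constructor-plus-friend-factory pattern described above can be sketched roughly as follows. All types and names here are illustrative stand-ins, not the actual CUDA-Q declarations; the real factory lives in JIT.cpp and does MLIR-dependent lookup work.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the real JIT engine state.
struct JitEngine {
  std::string moduleIR;
};

class CompiledKernel {
public:
  const std::string &name() const { return kernelName; }

private:
  // Private constructor: only the friend factory below can create
  // instances, so every CompiledKernel is guaranteed to be valid.
  CompiledKernel(JitEngine eng, std::string name, void (*entry)())
      : engine(std::move(eng)), kernelName(std::move(name)),
        entryPoint(entry) {}

  friend CompiledKernel makeCompiledKernel(JitEngine, std::string);

  JitEngine engine;     // JIT state kept alive alongside the pointer
  std::string kernelName;
  void (*entryPoint)(); // entry point looked up in the engine
};

// Factory function, declared as a friend of the class. In the real code
// this would perform the (MLIR-dependent) symbol lookup.
CompiledKernel makeCompiledKernel(JitEngine engine, std::string name) {
  void (*entry)() = nullptr; // real code: look up the symbol in the engine
  return CompiledKernel(std::move(engine), std::move(name), entry);
}
```

Because the copy/move constructors remain implicitly public, the factory can return instances by value while outside code still cannot construct one from scratch.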

My hope is that over time we can start using this class in other parts of the runtime, and merge execution paths between C++ and Python.

Signed-off-by: Luca Mondada <luca@mondada.net>
github-actions commented:

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

Signed-off-by: Luca Mondada <luca@mondada.net>
lmondada (Collaborator, Author) commented Mar 3, 2026

@Renaud-K You were right that JitEngine can be stored by value. I am not sure what compilation error I had stumbled across, but I must have misinterpreted it.

Thanks a lot for the prototyping!

I've made that change in the latest commit. Does this go in the right direction? I would look into merging JitEngine and CompiledKernel in a separate PR, as that would involve changes to QPUs, etc.

Renaud-K (Collaborator) commented Mar 3, 2026

I am comfortable with the CompiledKernel changes; however, the QPU.cpp changes should perhaps be reviewed by @schweitzpgi.

Note that I would still like to break up cudaq-common, because it has a dependency on cudaq-operator, which the compiler does not need. So what would become cudaq-device-code would have:
• DeviceCodeRegistry.h/.cpp
• CodeGenConfig{.h,.cpp}
• Resources{.h,.cpp}
• ThunkInterface.h
• Kernel_utils.h
• CompiledKernel{.h,.cpp}

That will also be a task for the next cycle.

lmondada (Collaborator, Author) commented Mar 4, 2026

@schweitzpgi Hey Eric, I'm starting to remove duplication between the specializeModule and launchModule paths.

After talking to Renaud, the CompiledKernel type this PR introduces seems like a good way to do that. It stores the function pointer and JitEngine in one place (and hence manages their lifetimes together), without any MLIR dependency. In a future PR I can replace the raw function pointers returned along the specializeModule path with the CompiledKernel type.

I think there is also an overlap with your "universal cache" idea. In the future, the compiled kernel type could be expanded to keep a hash of all launch arguments and be the main data type that we cache. Of course, more thinking will be required on this.
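To make the cache idea above concrete, here is a rough sketch of what keying compiled kernels by name plus an argument hash could look like. Every name here is hypothetical; nothing like this exists in the PR or the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical cache entry: a compiled kernel identified by its name and
// a hash of the launch arguments it was specialized for.
struct CachedKernel {
  std::string kernelName;
  std::size_t argsHash;
};

class KernelCache {
public:
  void put(const std::string &name, std::size_t argsHash) {
    cache[key(name, argsHash)] = CachedKernel{name, argsHash};
  }
  bool contains(const std::string &name, std::size_t argsHash) const {
    return cache.count(key(name, argsHash)) != 0;
  }

private:
  // Combine kernel name and argument hash into a single lookup key.
  static std::string key(const std::string &name, std::size_t h) {
    return name + "#" + std::to_string(h);
  }
  std::unordered_map<std::string, CachedKernel> cache;
};
```

A cache hit would then skip recompilation for a repeated launch with identical arguments; a miss would fall through to the JIT path.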

In terms of observable functionality, this PR is a no-op. The code that has been de-duplicated was identical across the two execution paths.


```cpp
#include "CompiledKernel.h"

namespace cudaq {
```
Collaborator commented:

Suggested change:
```diff
- namespace cudaq {
```

We should move to the LLVM style so as to better catch bugs, etc.

lmondada (Collaborator, Author) commented Mar 10, 2026:

Understood, I'm removing the namespace {} block.


```cpp
namespace cudaq {

CompiledKernel::CompiledKernel(JitEngine engine, std::string kernelName,
```
Collaborator commented:

Suggested change:
```diff
- CompiledKernel::CompiledKernel(JitEngine engine, std::string kernelName,
+ cudaq::CompiledKernel::CompiledKernel(JitEngine engine, std::string kernelName,
```

```cpp
    void *buff = const_cast<void *>(rawArgs.back());
    return reinterpret_cast<KernelThunkResultType (*)(void *, bool)>(funcPtr)(
        buff, /*client_server=*/false);
  } else {
```
lmondada (Collaborator, Author) replied:

Fixed, thanks.


```cpp
void (*CompiledKernel::getEntryPoint() const)() { return entryPoint; }

const JitEngine CompiledKernel::getEngine() const { return engine; }
```
Collaborator commented:

Why const on the return value? It's rather useless, eh?

```cpp
const int f() { return 0; }
int i = f();
```

is perfectly valid C++.


```cpp
// TODO: remove these two methods once the CompiledKernel is returned to
// Python.
void (*getEntryPoint() const)();
```
Collaborator commented:

I'm dubious of this. CUDA-Q does not define entry point kernels as always having void(void) signatures. So this seems like a cul-de-sac.

```cpp
                            std::string kernelName,
                            bool hasResult) {
  std::string fullName = cudaq::runtime::cudaqGenPrefixName + kernelName;
  std::string entryName = hasResult ? kernelName + ".thunk" : fullName;
```
Collaborator commented:

This code looks like the efficiency hack used in Python, i.e., unrelated to the C++ implementation. But it is appearing here in what looks like common code?

schweitzpgi (Collaborator) commented:

> @schweitzpgi Hey Eric, I'm starting to remove duplication between the specializeModule and launchModule paths.
>
> After talking to Renaud, the CompiledKernel type this PR introduces seems like a good way to do that. It stores the function pointer and JitEngine in one place (and hence manages their lifetimes together), without any MLIR dependency. In a future PR I can replace the raw function pointers returned along the specializeModule path with the CompiledKernel type.
>
> I think there is also an overlap with your "universal cache" idea. In the future, the compiled kernel type could be expanded to keep a hash of all launch arguments and be the main data type that we cache. Of course, more thinking will be required on this.
>
> In terms of observable functionality, this PR is a no-op. The code that has been de-duplicated was identical across the two execution paths.

I'm trying to play catchup here, so I appreciate your patience.

  • I don't understand the need to divorce "Compiled Kernels" from MLIR. In my mind, a CUDA-Q kernel is Quake is MLIR. The daylight between these things should be zero.
  • The notion that a kernel can be reduced to a function pointer is just plain wrong. (Yes, I know the specialization path used that, but that was a pre-existing artifact of the prototype and not good practice.) A kernel is quite literally a collection of functions, and trying to be reductive and pretend it is a single function pointer is only going to lead to future problems. At any rate, this is precisely why Python now threads Modules through the runtime rather than cavalierly and incorrectly assuming it can lop off functions and data structures however it might fancy. In short, we should keep the entire module of code and select the correct function for the circumstances. That is not what we do, even now...

Thanks for combining the mostly duplicative paths. That seems like a good thing for sure.

Renaud-K (Collaborator) commented Mar 6, 2026

> I don't understand the need to divorce "Compiled Kernels" from MLIR.

We have moved the compiler (cudaq-runtime-mlir) out of the QPUs, and we will eventually want to remove the dependency that all QPUs have on it. The QPUs should not have to link against cudaq-runtime-mlir. In modularizing the runtime, the QPU and the compiler would be two independent entities that share information in a manner that does not couple them. This can be achieved with lower-level libraries.

In many ways, we already do that in the remote QPU case, where we really only pass the QPU a string representation of the kernel and other STL-based containers. In the emulation case, however, the QPU currently does the name lookup in the jitted code, and that lookup creates a dependency. We would like to move to an exchange format between compiler and QPU that does not. So, just like std::string, std::function is low level and decouples the two. This is the motivation.
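The decoupling described above can be illustrated with a minimal sketch, assuming hypothetical function names: the "compiler" side hands the "QPU" side a plain std::function, so the QPU never links against the JIT machinery or performs symbol lookup itself.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Compiler side (illustrative): after JIT-compiling the kernel IR, wrap
// the entry point in a std::function. Real code would JIT the Quake/MLIR
// and look up the thunk; here we return a stub.
std::function<int(int)> compileKernel(const std::string & /*quakeIR*/) {
  return [](int shots) { return shots; };
}

// QPU side (illustrative): only sees STL types (std::string,
// std::function), so it needs no dependency on the compiler libraries.
int launchOnQpu(const std::function<int(int)> &kernel, int shots) {
  return kernel(shots);
}
```

The std::function acts as the exchange format: the type erasure hides whether the callable came from a JIT, a shared library, or a remote stub.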

Co-authored-by: Eric Schweitz <eschweitz@nvidia.com>
Signed-off-by: Luca Mondada <72734770+lmondada@users.noreply.github.com>
github-actions commented:

CUDA Quantum Docs Bot: A preview of the documentation can be found here.
