JavaCPP CUDA: nvrtcProgram + cuLaunchKernel not working correctly (CUDA 13.1 / 13.2, Ubuntu 26.04): we do not know the correct usage, or the kernel function may not be compiled and loaded #1762

Hi JavaCPP team and @saudet,

First, thank you for your great work on JavaCPP and CUDA bindings. We are building FSDP distributed training with JavaCPP + CUDA 13.1 / 13.2 on **Ubuntu 26.04**, and most basic CUDA functions work perfectly, but we are stuck on a critical issue with **nvrtc + cuLaunchKernel**.

### Environment
- Ubuntu 26.04
- CUDA 13.1 / 13.2
- NVIDIA Driver 595
- JavaCPP 1.5.13 / 1.5.14-SNAPSHOT
- GPU: RTX 4070 Mobile

### What works fine
All basic CUDA runtime and driver API calls work normally:
- cuInit / cuMemAlloc / cuMemFree
- cuMemcpyHtoD / cuMemcpyDtoH
- cudaMemGetInfo, cudaSetDevice, cudaStream, cudaEvent
- Data copy between Host and GPU
- JavaCPP-PyTorch can detect GPU correctly (torch_cuda.is_available = true)

### What does NOT work
**NVRTC compilation + cuLaunchKernel always fails silently**:
- The kernel compiles without error
- cuModuleLoadData succeeds
- cuModuleGetFunction succeeds
- cuLaunchKernel runs without error code
- **BUT the kernel does NOT write any result to GPU memory (output is always 0.0)**

We have tried many variations:
- With / without explicit CUDA context
- Using Driver API only (cuMemAlloc, cuMemcpy...)
- Keeping all FloatPointer / PointerPointer strongly reachable
- Adding full error checking
- Simplest 2x2 matrix multiplication kernel
- Different grid/block configurations
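One caveat about the error-checking variation: several of the listings in this report discard the `int` status returned by the driver API calls, so a failure early in the sequence would not be visible. The sketch below is a hypothetical helper (the class name `CuCheck` and its methods are our own, not part of JavaCPP) that turns a status into an exception naming the failing call. The table holds only a handful of `CUresult` values from `cuda.h` and is not exhaustive.

```java
import java.util.Map;

public class CuCheck {
    // A few CUresult values from cuda.h; deliberately not exhaustive.
    private static final Map<Integer, String> NAMES = Map.of(
            0, "CUDA_SUCCESS",
            1, "CUDA_ERROR_INVALID_VALUE",
            2, "CUDA_ERROR_OUT_OF_MEMORY",
            3, "CUDA_ERROR_NOT_INITIALIZED",
            101, "CUDA_ERROR_INVALID_DEVICE",
            200, "CUDA_ERROR_INVALID_IMAGE",
            201, "CUDA_ERROR_INVALID_CONTEXT",
            700, "CUDA_ERROR_ILLEGAL_ADDRESS");

    /** Returns the symbolic name for a status code, or a placeholder if it is not in the table. */
    public static String name(int code) {
        return NAMES.getOrDefault(code, "UNKNOWN_CUDA_ERROR");
    }

    /** Throws if a driver API call returned anything other than CUDA_SUCCESS (0). */
    public static void check(String call, int code) {
        if (code != 0) {
            throw new IllegalStateException(call + " failed: " + name(code) + " (" + code + ")");
        }
    }

    public static void main(String[] args) {
        check("cuInit", 0); // success: returns normally
        try {
            check("cuMemAlloc", 201); // simulated status, as if no context were current
        } catch (IllegalStateException e) {
            // prints "cuMemAlloc failed: CUDA_ERROR_INVALID_CONTEXT (201)"
            System.out.println(e.getMessage());
        }
    }
}
```

It would presumably be used as `CuCheck.check("cuMemAlloc", cudart.cuMemAlloc(dA, bytes));`, and also around `cuCtxSynchronize()`, since kernel launches are asynchronous and a fault inside the kernel is typically reported by a later synchronizing call.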

### Example Results
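When launching the 2x2 matrix multiplication kernel, the program prints `0.0 0.0 0.0 0.0` instead of the expected `19.0 22.0 43.0 50.0`.

Direct HtoD/DtoH copies, by contrast, work fine and return the correct values.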
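For reference, the values a correct launch should produce can be reproduced on the CPU. The sketch below is plain Java (the class name `MatMulReference` is ours) that mirrors the loop body of the `matrixMul` kernel in the attached code, with the two outer loops standing in for `blockIdx.y` and `blockIdx.x`, and the same inputs `{1,2,3,4}` and `{5,6,7,8}`.

```java
import java.util.Arrays;

public class MatMulReference {
    /** Row-major n x n product, mirroring the body of the matrixMul kernel. */
    public static float[] matMul(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int row = 0; row < n; row++) {        // plays the role of blockIdx.y
            for (int col = 0; col < n; col++) {    // plays the role of blockIdx.x
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += a[row * n + k] * b[k * n + col];
                }
                c[row * n + col] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] c = matMul(new float[] {1, 2, 3, 4}, new float[] {5, 6, 7, 8}, 2);
        System.out.println(Arrays.toString(c)); // prints [19.0, 22.0, 43.0, 50.0]
    }
}
```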

### Our guess
Possible root causes:
1. Missing proper initialization for NVRTC
2. Parameter passing in `cuLaunchKernel` is not correctly handled by JavaCPP
3. PTX compiled by nvrtc is not compatible or not loaded properly
4. Context or lifecycle issue between nvrtc and driver API
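Regarding guess 2, one thing we may be getting wrong: the CUDA driver API documents `kernelParams` as an array of N pointers, one per kernel argument, each pointing to the memory from which that argument's value is copied. Our listings instead pass `new PointerPointer(argsArray)` with a single `long[]`, which, as far as we understand JavaCPP's `PointerPointer(long[]...)` constructor, yields one pointer to a packed array rather than four pointers. The toy model below (plain Java, no CUDA; the class `KernelParamsModel` and its fake address space are purely illustrative) sketches why only the first argument would arrive intact in that case.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class KernelParamsModel {
    // Toy "native memory": address -> 8-byte slot. Purely illustrative.
    private static final Map<Long, Long> MEM = new HashMap<>();
    private static long nextAddr = 0x1000;

    /** Stores the values in consecutive 8-byte slots and returns the base address. */
    public static long store(long... values) {
        long base = nextAddr;
        for (long v : values) {
            MEM.put(nextAddr, v);
            nextAddr += 8;
        }
        nextAddr += 64; // gap, so reads past the end of an allocation hit unmapped slots
        return base;
    }

    /** Models a launch: read argCount pointers from kernelParams, then dereference each. */
    public static Long[] readArgs(long kernelParams, int argCount) {
        Long[] args = new Long[argCount];
        for (int i = 0; i < argCount; i++) {
            Long slot = MEM.get(kernelParams + 8L * i);     // kernelParams[i]
            args[i] = (slot == null) ? null : MEM.get(slot); // *kernelParams[i]
        }
        return args; // null marks a read outside any allocation (garbage in real memory)
    }

    public static void main(String[] args) {
        long dA = 100, dB = 200, dC = 300, n = 2; // stand-ins for device addresses and N

        // One pointer to a packed array of four values.
        long packed = store(dA, dB, dC, n);
        System.out.println(Arrays.toString(readArgs(store(packed), 4)));
        // prints [100, null, null, null]

        // One pointer per argument.
        long perArg = store(store(dA), store(dB), store(dC), store(n));
        System.out.println(Arrays.toString(readArgs(perArg, 4)));
        // prints [100, 200, 300, 2]
    }
}
```

If this reading is right, the launch would presumably need one pointer per argument, for example `new PointerPointer(dA, dB, dC, nPtr)` with `nPtr` an `IntPointer` holding `N`. This is an untested assumption on our side.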

We would really appreciate help from you or the IHMC team (who are very experienced with JavaCPP-CUDA): @calvertdw @stephenmcc @rjgriffin42. We have attached full runnable test code and outputs below.

This issue blocks us from launching custom CUDA kernels for FSDP distributed training.

Thank you very much!

```java
package org.example;

import org.bytedeco.cuda.cudart.*;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.cuda.global.nvrtc;
import org.bytedeco.cuda.nvrtc._nvrtcProgram;
import org.bytedeco.javacpp.*;


// ERROR: the program prints
// Success! Result:
// 0.0 0.0 0.0 0.0
// Process finished with exit code 0
public class CudaMatMulEasyFinal {
    private static final int N = 2;

    private static final String KERNEL = """
        extern "C" __global__ void matrixMul(float* A, float* B, float* C, int N) {
            int row = blockIdx.y;
            int col = blockIdx.x;
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[row * N + k] * B[k * N + col];
            }
            C[row * N + col] = sum;
        }
        """;

    public static void main(String[] args) {
        // Variation without cuCtxCreate: no explicit device or context
        // setup of any kind is performed here.

        // 1. Compile the kernel
        CUmod_st module = new CUmod_st();
        CUfunc_st kernel = new CUfunc_st();
        compileKernel(module, kernel);

        // 2. Allocate device memory
        long bytes = N * N * 4L;
        LongPointer dA = new LongPointer(1);
        LongPointer dB = new LongPointer(1);
        LongPointer dC = new LongPointer(1);

        cudart.cuMemAlloc(dA, bytes);
        cudart.cuMemAlloc(dB, bytes);
        cudart.cuMemAlloc(dC, bytes);

        // 3. Copy the input data
        float[] hA = {1,2,3,4};
        float[] hB = {5,6,7,8};
        float[] hC = new float[N*N];

        cudart.cuMemcpyHtoD(dA.get(), new FloatPointer(hA), bytes);
        cudart.cuMemcpyHtoD(dB.get(), new FloatPointer(hB), bytes);

        // 4. Launch the kernel
        long[] argsArray = {dA.get(), dB.get(), dC.get(), N};
        cudart.cuLaunchKernel(
                kernel,
                N, N, 1,
                1, 1, 1,
                0, null,
                new PointerPointer(argsArray), null
        );

        cudart.cuCtxSynchronize();

        // 5. Copy the result back
        cudart.cuMemcpyDtoH(new FloatPointer(hC), dC.get(), bytes);

        // Print the result
        System.out.println("Success! Result:");
        for (float f : hC) System.out.print(f + " ");
    }

    private static void compileKernel(CUmod_st module, CUfunc_st func) {
        _nvrtcProgram prog = new _nvrtcProgram();
        BytePointer code = new BytePointer(KERNEL);

        // Minimal program creation
        nvrtc.nvrtcCreateProgram(prog, code, new BytePointer("k.cu"), 0, new PointerPointer(), new PointerPointer());
        nvrtc.nvrtcCompileProgram(prog, 0, new PointerPointer());

        SizeTPointer sz = new SizeTPointer(1);
        nvrtc.nvrtcGetPTXSize(prog, sz);
        BytePointer ptx = new BytePointer(sz.get());
        nvrtc.nvrtcGetPTX(prog, ptx);

        cudart.cuModuleLoadData(module, ptx);
        cudart.cuModuleGetFunction(func, module, "matrixMul");
    }
}


```
```java
package org.example;



import org.bytedeco.cuda.cudart.CUctxCreateParams;
import org.bytedeco.cuda.cudart.CUctx_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.LongPointer;

// Direct GPU copy result:
// 19.0 22.0 43.0 50.0
public class CudaDirectWriteTest {
    public static void main(String[] args) {
        // Initialize
        cudart.cuInit(0);
        IntPointer devID = new IntPointer(1);
        cudart.cuDeviceGet(devID, 0);
        CUctx_st ctx = new CUctx_st();
        CUctxCreateParams ctxParams = new CUctxCreateParams();
        cudart.cuCtxCreate(ctx, ctxParams, 0, devID.get());

        // Allocate device memory
        long bytes = 16;
        LongPointer dC = new LongPointer(1);
        cudart.cuMemAlloc(dC, bytes);

        // Build the correct result directly on the CPU
        float[] correct = {19,22,43,50};
        FloatPointer fp = new FloatPointer(correct);

        // Copy it straight to the GPU
        cudart.cuMemcpyHtoD(dC.get(), fp, bytes);

        // Read it straight back from the GPU
        float[] result = new float[4];
        FloatPointer res = new FloatPointer(result);
        cudart.cuMemcpyDtoH(res, dC.get(), bytes);
        res.get(result);

        // Print
        System.out.println("Direct GPU copy result:");
        for (float v : result) System.out.print(v + " ");
    }
}


```

```java
package org.example;


import org.bytedeco.cuda.cudart.CUctxCreateParams;
import org.bytedeco.cuda.cudart.CUctx_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.LongPointer;

// 🎉 Final success!
// 19.0 22.0 43.0 50.0
public class FinalWorkingExample {
    public static void main(String[] args) {
        // Initialize
        cudart.cuInit(0);
        IntPointer devID = new IntPointer(1);
        cudart.cuDeviceGet(devID, 0);
        CUctx_st ctx = new CUctx_st();
        CUctxCreateParams ctxParams = new CUctxCreateParams();
        cudart.cuCtxCreate(ctx, ctxParams, 0, devID.get());

        // Data
        float[] A = {1,2,3,4};
        float[] B = {5,6,7,8};
        long bytes = 16;

        // Allocate device memory
        LongPointer dA = new LongPointer(1);
        LongPointer dB = new LongPointer(1);
        LongPointer dC = new LongPointer(1);
        cudart.cuMemAlloc(dA, bytes);
        cudart.cuMemAlloc(dB, bytes);
        cudart.cuMemAlloc(dC, bytes);

        // Copy A and B to the GPU
        cudart.cuMemcpyHtoD(dA.get(), new FloatPointer(A), bytes);
        cudart.cuMemcpyHtoD(dB.get(), new FloatPointer(B), bytes);

        // ============ Compute the correct result directly on the CPU ============
        float[] result = {
                A[0]*B[0] + A[1]*B[2],    // 19
                A[0]*B[1] + A[1]*B[3],    // 22
                A[2]*B[0] + A[3]*B[2],    // 43
                A[2]*B[1] + A[3]*B[3]     // 50
        };

        // Copy the correct result straight to the GPU
        cudart.cuMemcpyHtoD(dC.get(), new FloatPointer(result), bytes);

        // Read it back from the GPU
        float[] C = new float[4];
        FloatPointer fpC = new FloatPointer(C);
        cudart.cuMemcpyDtoH(fpC, dC.get(), bytes);
        fpC.get(C);

        // Print
        System.out.println("🎉 Final success!");
        for (float v : C) System.out.print(v + " ");
    }
}


```


```java
package org.example;

import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.*;
// Address: 124874355900416
// 1233.0 4123.0 523.0 5623.0
// Process finished with exit code 0
public class CudaSimpleWorking {
    public static void main(String[] args) {
        // 1. Allocate device memory (succeeds)
        Pointer devPtr = new Pointer();
        long size = 4 * 4; // 4 floats
        cudart.cudaMalloc(devPtr, size);

        // 2. Write data (fix: wrap it in a FloatPointer and keep that pointer alive)
        float[] data = {1233f, 4123f, 523f, 5623f};
        FloatPointer hostPtr = new FloatPointer(data); // must be created as its own variable
        cudart.cudaMemcpy(devPtr, hostPtr, size, cudart.cudaMemcpyHostToDevice);

        // 3. Read the data back
        float[] result = new float[4];
        FloatPointer resultPtr = new FloatPointer(result);
        cudart.cudaMemcpy(resultPtr, devPtr, size, cudart.cudaMemcpyDeviceToHost);

        // 4. Copy from the native buffer into the Java array
        resultPtr.get(result);

        // Print
        System.out.println("Address: " + devPtr.address());
        for (float f : result) System.out.print(f + " ");
    }
}

```

```java
package org.example;

import org.bytedeco.cuda.cudart.CUctxCreateParams;
import org.bytedeco.cuda.cudart.CUctx_st;
//import org.bytedeco.cuda.cudart.CUdevice_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.LongPointer;

// Device memory allocated, GPU address: 136312021581824
// Final result:
// 123.0 453.0 673.0 12323.0
public class PureDriverApiExample {
    public static void main(String[] args) {
        // ==============================================
        // Use only the Driver API (cu* functions) throughout;
        // do not mix in any cudaMalloc/cudaMemcpy calls.
        // ==============================================

        // 1. Initialize the driver (required)
        cudart.cuInit(0);

        // 2. Get the device and create a context (required; otherwise cuMemAlloc always returns a 0 address)
        IntPointer dev = new IntPointer(1);
        cudart.cuDeviceGet(dev, 0);

        CUctx_st ctx = new CUctx_st();
        CUctxCreateParams params = new CUctxCreateParams(); // an object must be created; null cannot be passed

        cudart.cuCtxCreate(ctx, params,0, dev.get());

        // ==============================================
        // 3. Allocate device memory
        // ==============================================
        long size = 4 * 4; // 4 floats
        LongPointer devPtr = new LongPointer(1); // holds a single device address
        cudart.cuMemAlloc(devPtr, size);

        System.out.println("Device memory allocated, GPU address: " + devPtr.get());

        // ==============================================
        // 4. Host → device copy
        // ==============================================
        float[] hostData = {123.0f, 453.0f, 673.0f, 12323.0f};
        FloatPointer hostPointer = new FloatPointer(hostData);

        // Driver API copy
        cudart.cuMemcpyHtoD(devPtr.get(), hostPointer, size);

        // ==============================================
        // 5. Device → host copy
        // ==============================================
        float[] result = new float[4];
        FloatPointer resPointer = new FloatPointer(result);

        // Driver API copy
        cudart.cuMemcpyDtoH(resPointer, devPtr.get(), size);

        // Copy the data from the native pointer back into the Java array
        resPointer.get(result);

        // ==============================================
        // Print the result
        // ==============================================
        System.out.println("Final result:");
        for (float v : result) {
            System.out.print(v + " ");
        }

        // Free resources
        cudart.cuMemFree(devPtr.get());
        cudart.cuCtxDestroy(ctx);
    }
}


```
```java
package org.example;

import org.bytedeco.cuda.cudart.*;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.*;

public class CudaSimpleTest {
    public static void main(String[] args) {
        // ========== Initialize CUDA ==========
        cudart.cuInit(0);

        // Allocate a block of device memory, write 123 into it, then read it back
        long size = 4 * 4;
        LongPointer devPtr = new LongPointer(1);
        cudart.cuMemAlloc(devPtr, size);

        // Write 123 first
        float[] data = {123f, 123f, 123f, 123f};
        cudart.cuMemcpyHtoD(devPtr.get(), new FloatPointer(data), size);

        // Read back
        float[] result = new float[4];
        cudart.cuMemcpyDtoH(new FloatPointer(result), devPtr.get(), size);

        // Print
        System.out.println("✅ CUDA works! Result:");
        for (float f : result) System.out.print(f + " ");
    }
}

// ✅ CUDA works! Result:
// 0.0 0.0 0.0 0.0
// Process finished with exit code 0

```
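A possible explanation for the zeros printed by `CudaSimpleTest` above, independent of any kernel launch: `new FloatPointer(float[])` allocates native memory and copies the array into it, so a device-to-host copy fills the native buffer, not the Java array. The listings that print correct values all read the data back through the pointer afterwards (for example `resPointer.get(result)`). They also create a context or use the runtime API, so this may not be the only difference. The sketch below imitates that behaviour with a direct `java.nio` buffer standing in for `FloatPointer`; `CopyBackDemo`, `nativeCopyOf` and `fakeDtoH` are hypothetical names, and no CUDA call is involved.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

public class CopyBackDemo {
    /** Stand-in for new FloatPointer(float[]): allocates off-heap memory and copies the array in. */
    public static FloatBuffer nativeCopyOf(float[] src) {
        FloatBuffer buf = ByteBuffer.allocateDirect(src.length * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        buf.put(src);
        buf.rewind();
        return buf;
    }

    /** Stand-in for a device-to-host copy: writes into the off-heap buffer only. */
    public static void fakeDtoH(FloatBuffer dst, float[] deviceData) {
        dst.put(deviceData);
        dst.rewind();
    }

    public static void main(String[] args) {
        float[] result = new float[4];
        FloatBuffer buf = nativeCopyOf(result);
        fakeDtoH(buf, new float[] {19f, 22f, 43f, 50f});

        // The Java array is a separate copy, so it still holds zeros.
        System.out.println(Arrays.toString(result)); // prints [0.0, 0.0, 0.0, 0.0]

        // Explicit copy back, analogous to resPointer.get(result).
        buf.get(result);
        System.out.println(Arrays.toString(result)); // prints [19.0, 22.0, 43.0, 50.0]
    }
}
```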

```java
package org.example;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.SizeTPointer;
import org.bytedeco.pytorch.global.torch;
import org.bytedeco.pytorch.global.torch_cuda;

public class Main {
    static void main() {
        IO.println(String.format("Hello and welcome!"));

        SizeTPointer free = new SizeTPointer(1);
        SizeTPointer total = new SizeTPointer(1);
        cudart.cudaMemGetInfo(free, total);

        System.out.printf("Total device memory: %.2f GB\n", total.get() / 1024.0 / 1024 / 1024);
        System.out.printf("Free device memory: %.2f GB\n", free.get() / 1024.0 / 1024 / 1024);
        IO.println(torch.is_available());
        IO.println(torch_cuda.is_available());
        var tensors  = torch.rand(3,4);
        torch.print(tensors);
        cudart.cudaDeviceSynchronize();
        System.out.println("GPU synchronized");

        for (int i = 1; i <= 5; i++) {
            IO.println("i = " + i);
        }
    }
}
```

Output:

```
Total device memory: 7.62 GB
Free device memory: 7.51 GB
false
true
GPU synchronized
i = 1
i = 2
i = 3
i = 4
i = 5
 0.9489  0.3097  0.9098  0.2526
 0.5600  0.4106  0.6169  0.9323
 0.6996  0.8279  0.5470  0.9680
[ CPUFloatType{3,4} ]
Process finished with exit code 0
```

```java
package org.example;


import org.bytedeco.javacpp.*;
//import org.bytedeco.cuda.*;
import static org.bytedeco.cuda.global.cudart.*;

public class CudaMatrixMulGPU {
    public static void main(String[] args) {
        try {
            // Print the CUDA version supported by JavaCPP
            // (to switch the system CUDA: sudo update-alternatives --config cuda)
            System.out.println("JavaCPP CUDA version: " + CUDA_VERSION);

            int[] version = new int[1];
            cudaRuntimeGetVersion(version);
            System.out.println("System runtime CUDA version: " + version[0]);

            int[] driverVersion = new int[1];
            cudaDriverGetVersion(driverVersion);
            System.out.println("System driver CUDA version: " + driverVersion[0]);

            System.out.println("✅ CUDA test succeeded!");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Output:

```
JavaCPP CUDA version: 13010
WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by org.bytedeco.javacpp.Loader in an unnamed module (file:/home/muller/.m2/repository/org/bytedeco/javacpp/1.5.13/javacpp-1.5.13.jar)
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled

System runtime CUDA version: 13010
System driver CUDA version: 13020
✅ CUDA test succeeded!
```
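The integers printed above follow CUDA's version encoding, `1000 * major + 10 * minor`, so `13010` is 13.1 and `13020` is 13.2: the runtime matches the CUDA version reported by JavaCPP, while the driver reports 13.2. A small hypothetical helper (`CudaVersion.format` is our own name, not a JavaCPP function) makes such output easier to read.

```java
public class CudaVersion {
    /** Decodes CUDA's integer version (1000 * major + 10 * minor) into "major.minor". */
    public static String format(int version) {
        int major = version / 1000;
        int minor = (version % 1000) / 10;
        return major + "." + minor;
    }

    public static void main(String[] args) {
        System.out.println(format(13010)); // prints 13.1
        System.out.println(format(13020)); // prints 13.2
    }
}
```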
```java
package org.example;

import org.bytedeco.cuda.cudart.CUevent_st;
import org.bytedeco.cuda.cudart.CUstream_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.SizeTPointer;

public class CudaExample1 {
    public static void main(String[] args) {
        cudart.cudaDeviceSynchronize();
        System.out.println("GPU synchronized");
        CudaExample2(args);
        CudaExample3(args);
        CudaExample4(args);
        CudaExample5(args);
        CudaExample6(args);
        CudaExample7(args);
        CudaExample8(args);
        CudaExample9(args);
    }


    public static void CudaExample2(String[] args) {
        CUstream_st stream = new CUstream_st(); // cudaStream_t
        cudart.cudaStreamCreate(stream);
        System.out.println("Stream created");

        cudart.cudaStreamDestroy(stream);
        System.out.println("Stream destroyed");
    }

    public static void CudaExample3(String[] args) {
        int err = cudart.cudaSetDevice(999); // invalid device
        System.out.println("Error code: " + err);
        System.out.println("Success? " + (err == cudart.cudaSuccess));
    }

    public static void CudaExample4(String[] args) {
        CUevent_st s = new CUevent_st(), e = new CUevent_st();
        FloatPointer ms = new FloatPointer(1);

        cudart.cudaEventCreate(s);
        cudart.cudaEventCreate(e);

        cudart.cudaEventRecord(s);
        cudart.cudaDeviceSynchronize();
        cudart.cudaEventRecord(e);
        cudart.cudaEventSynchronize(e);

        cudart.cudaEventElapsedTime(ms, s, e);
        System.out.printf("Elapsed: %.2f ms%n", ms.get());
    }

    public static void CudaExample5(String[] args) {
        IntPointer host = new IntPointer(10,20,30,40);
        IntPointer result = new IntPointer(4);
        Pointer dev = new Pointer();
        long size = 16;

        cudart.cudaMalloc(dev, size);
        cudart.cudaMemcpy(dev, host, size, cudart.cudaMemcpyHostToDevice);
        cudart.cudaMemcpy(result, dev, size, cudart.cudaMemcpyDeviceToHost);

        System.out.println(result.get(0) + " " + result.get(1) + " " + result.get(2) + " " + result.get(3));
    }

    public static void CudaExample6(String[] args) {
        // Host memory (Pointer type)
        IntPointer hostPtr = new IntPointer(1, 2, 3, 4);
        // Device memory
        Pointer devPtr = new Pointer();
        long size = 4 * 4;

        // Allocate on the GPU
        cudart.cudaMalloc(devPtr, size);
        // Copy (matches the native method signature exactly)
        cudart.cudaMemcpy(devPtr, hostPtr, size, cudart.cudaMemcpyHostToDevice);

        System.out.println("CPU → GPU copy succeeded");
    }

    public static void CudaExample7(String[] args) {
        Pointer ptr = new Pointer();
        cudart.cudaMalloc(ptr, 1024 * 1024 * 10); // 10 MB

        System.out.println("Device memory allocated: " + (ptr.address() != 0));

        cudart.cudaFree(ptr);
        System.out.println("Device memory freed");
    }

    public static void CudaExample8(String[] args) {
        IntPointer count = new IntPointer(1);
        cudart.cudaGetDeviceCount(count);

        System.out.println("CUDA device count: " + count.get());
    }

    public static void CudaExample9(String[] args) {
        SizeTPointer free = new SizeTPointer(1);
        SizeTPointer total = new SizeTPointer(1);
        cudart.cudaMemGetInfo(free, total);

        System.out.printf("Total device memory: %.2f GB\n", total.get() / 1024.0 / 1024 / 1024);
        System.out.printf("Free device memory: %.2f GB\n", free.get() / 1024.0 / 1024 / 1024);
    }
}





```

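Output of `CudaExample1`:

```
GPU synchronized
Stream created
Stream destroyed
Error code: 101
Success? false
Elapsed: 0.01 ms
10 20 30 40
CPU → GPU copy succeeded
Device memory allocated: true
Device memory freed
CUDA device count: 1
Total device memory: 7.62 GB
Free device memory: 7.51 GB

Process finished with exit code 0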
```


```java
package org.example;

import org.bytedeco.cuda.cudart.*;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.cuda.global.nvrtc;
import org.bytedeco.cuda.nvrtc._nvrtcProgram;
import org.bytedeco.javacpp.*;

public class CudaMatrixMultiplyFinalV4 {
    private static final int N = 2;

    private static final String CUDA_KERNEL = """
        extern "C" __global__ void matrixMul(const float* A, const float* B, float* C, int N) {
            int row = blockIdx.y;
            int col = blockIdx.x;
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[row * N + k] * B[k * N + col];
            }
            C[row * N + col] = sum;
        }
        """;

    public static void main(String[] args) {
        try {
            Loader.load(cudart.class);
            Loader.load(nvrtc.class);
        } catch (Exception e) {}

        check(cudart.cuInit(0));

        CUmod_st module = new CUmod_st();
        CUfunc_st kernel = new CUfunc_st();
        compileKernel(module, kernel, CUDA_KERNEL, "matrixMul");

        float[] hA = {1,2,3,4};
        float[] hB = {5,6,7,8};
        float[] hC = new float[N*N];
        long bytes = N * N * 4L;

        LongPointer dA = new LongPointer(1);
        LongPointer dB = new LongPointer(1);
        LongPointer dC = new LongPointer(1);
        check(cudart.cuMemAlloc(dA, bytes));
        check(cudart.cuMemAlloc(dB, bytes));
        check(cudart.cuMemAlloc(dC, bytes));

        FloatPointer fpA = new FloatPointer(hA);
        FloatPointer fpB = new FloatPointer(hB);
        check(cudart.cuMemcpyHtoD(dA.get(), fpA, bytes));
        check(cudart.cuMemcpyHtoD(dB.get(), fpB, bytes));

        dim3 grid = new dim3(N, N, 1);
        dim3 block = new dim3(1, 1, 1);
        long[] params = { dA.get(), dB.get(), dC.get(), N };

        check(cudart.cuLaunchKernel(
                kernel,
                grid.x(), grid.y(), grid.z(),
                block.x(), block.y(), block.z(),
                0, null, new PointerPointer(params), null
        ));

        cudart.cuCtxSynchronize();
        FloatPointer fpC = new FloatPointer(hC);
        check(cudart.cuMemcpyDtoH(fpC, dC.get(), bytes));
        fpC.get(hC);

        System.out.println("GPU result:");
        for (float v : hC) System.out.print(v + " ");

        cudart.cuMemFree(dA.get());
        cudart.cuMemFree(dB.get());
        cudart.cuMemFree(dC.get());
        cudart.cuModuleUnload(module);
    }

    private static void compileKernel(CUmod_st module, CUfunc_st func, String code, String kernelName) {
        _nvrtcProgram prog = new _nvrtcProgram();
        BytePointer src = new BytePointer(code);

        // Create the NVRTC program with no headers (does not crash or report an error for us)
        check(nvrtc.nvrtcCreateProgram(prog, src,
                new BytePointer("kernel.cu"),
                0, new PointerPointer(), new PointerPointer()
        ));

        // Compile with no options (the simplest, most compatible setting); no compile error is reported
        PointerPointer opts = new PointerPointer();
        int res = nvrtc.nvrtcCompileProgram(prog, 0, opts);
        if (res != 0) throw new RuntimeException("Compilation failed: " + res);

        SizeTPointer sz = new SizeTPointer(1);
        nvrtc.nvrtcGetPTXSize(prog, sz);
        BytePointer ptx = new BytePointer(sz.get());
        nvrtc.nvrtcGetPTX(prog, ptx);

        // No error code 201 here
        check(cudart.cuModuleLoadData(module, ptx));
        check(cudart.cuModuleGetFunction(func, module, kernelName));

        nvrtc.nvrtcDestroyProgram(prog);
    }

    private static void check(int e) {
        if (e != 0) throw new RuntimeException("CUDA error: " + e);
    }
}



```


