Hi JavaCPP team and @saudet,
First, thank you for your great work on JavaCPP and CUDA bindings. We are building FSDP distributed training with JavaCPP + CUDA 13.1 / 13.2 on Ubuntu 26.04, and most basic CUDA functions work perfectly, but we are stuck on a critical issue with nvrtc + cuLaunchKernel.
Environment
- Ubuntu 26.04
- CUDA 13.1 / 13.2
- NVIDIA Driver 595
- JavaCPP 1.5.13 / 1.5.14-SNAPSHOT
- GPU: RTX 4070 Mobile
What works fine
All basic CUDA runtime/driver API calls work normally:
- cuInit / cuMemAlloc / cuMemFree
- cuMemcpyHtoD / cuMemcpyDtoH
- cudaMemGetInfo, cudaSetDevice, cudaStream, cudaEvent
- Data copy between Host and GPU
- JavaCPP-PyTorch can detect GPU correctly (torch_cuda.is_available = true)
What does NOT work
NVRTC compilation + cuLaunchKernel always fails silently:
- The kernel compiles without error
- cuModuleLoadData succeeds
- cuModuleGetFunction succeeds
- cuLaunchKernel returns without an error code
- But the kernel does not write any result to GPU memory (the output is always 0.0)
We have tried many variations:
- With / without explicit CUDA context
- Using Driver API only (cuMemAlloc, cuMemcpy...)
- Keeping all FloatPointer / PointerPointer strongly reachable
- Adding full error checking
- Simplest 2x2 matrix multiplication kernel
- Different grid/block configurations
Example Results
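When launching the 2x2 matrix multiplication kernel, the program prints `0.0 0.0 0.0 0.0` instead of the expected `19.0 22.0 43.0 50.0`.
A direct HtoD/DtoH copy without a kernel launch, however, works fine and returns correct values.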
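For reference, the result for A = {1, 2, 3, 4} and B = {5, 6, 7, 8} can be computed on the CPU. The sketch below is plain Java (no JavaCPP or CUDA involved) and mirrors the kernel's indexing, with one loop iteration per (blockIdx.y, blockIdx.x) pair. The class and method names are made up for illustration.

```java
public class MatMulReference {
    /** Row-major n x n multiply, C = A * B, using the same indexing as the matrixMul kernel. */
    public static float[] matMul(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int row = 0; row < n; row++) {        // plays the role of blockIdx.y
            for (int col = 0; col < n; col++) {    // plays the role of blockIdx.x
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += a[row * n + k] * b[k * n + col];
                }
                c[row * n + col] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] c = matMul(new float[] {1, 2, 3, 4}, new float[] {5, 6, 7, 8}, 2);
        for (float v : c) System.out.print(v + " "); // prints "19.0 22.0 43.0 50.0 "
        System.out.println();
    }
}
```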
Our guess
Possible root causes:
- Missing proper initialization for NVRTC
- Parameter passing in cuLaunchKernel is not correctly handled by JavaCPP
- PTX compiled by nvrtc is not compatible or not loaded properly
- Context or lifecycle issue between nvrtc and driver API
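On the parameter-passing guess: the CUDA Driver API documents `kernelParams` as an array of N pointers, one per kernel argument, each pointing at the host memory that holds that argument's value. As far as we can tell, `new PointerPointer(long[])` instead builds a one-element pointer array whose single entry points at all four values packed together, which would not match that layout. This is our reading of JavaCPP, not something we have confirmed. The toy model below is plain Java with no CUDA or JavaCPP calls: `long[][]` stands in for the pointer array, and `readArgs` is a hypothetical stand-in for what the driver does at launch.

```java
import java.util.Arrays;

public class KernelParamsModel {
    /**
     * Hypothetical stand-in for the driver: fetch argument i by dereferencing kernelParams[i].
     * Each inner array models one block of host memory that a pointer refers to.
     */
    public static long[] readArgs(long[][] kernelParams, int argCount) {
        long[] args = new long[argCount];
        for (int i = 0; i < argCount; i++) {
            args[i] = kernelParams[i][0];
        }
        return args;
    }

    public static void main(String[] args) {
        long dA = 0x1000, dB = 0x2000, dC = 0x3000, n = 2; // made-up device addresses

        // Expected layout: four pointers, each to its own value.
        long[][] perArgument = { {dA}, {dB}, {dC}, {n} };
        System.out.println(Arrays.toString(readArgs(perArgument, 4)));

        // Packed layout: one pointer to four values. Only argument 0 is reachable.
        // In native code the reads for arguments 1..3 would be undefined behaviour;
        // in this model they surface as an exception.
        long[][] packed = { {dA, dB, dC, n} };
        try {
            readArgs(packed, 4);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("packed layout: arguments 1..3 are not addressable");
        }
    }
}
```

In JavaCPP terms, the per-argument layout would presumably correspond to passing one `Pointer` per argument to `PointerPointer` (for example the three `LongPointer`s that already hold the device addresses, plus an `IntPointer` holding `N`). We have not verified this on our setup.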
We would really appreciate help from you or the IHMC team (who are very experienced with JavaCPP-CUDA): @calvertdw @stephenmcc @rjgriffin42. The full runnable test code and its output are attached below.
This issue blocks us from launching custom CUDA kernels for FSDP distributed training.
Thank you very much!
```java
package org.example;

import org.bytedeco.cuda.cudart.*;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.cuda.global.nvrtc;
import org.bytedeco.cuda.nvrtc._nvrtcProgram;
import org.bytedeco.javacpp.*;

// ERROR, actual output:
// Success! Result:
// 0.0 0.0 0.0 0.0
// Process finished with exit code 0
public class CudaMatMulEasyFinal {
    private static final int N = 2;
    private static final String KERNEL = """
            extern "C" __global__ void matrixMul(float* A, float* B, float* C, int N) {
                int row = blockIdx.y;
                int col = blockIdx.x;
                float sum = 0.0f;
                for (int k = 0; k < N; k++) {
                    sum += A[row * N + k] * B[k * N + col];
                }
                C[row * N + col] = sum;
            }
            """;

    public static void main(String[] args) {
        // cuCtxCreate is dropped entirely here: no context, device, or creation code.

        // 1. Compile the kernel
        CUmod_st module = new CUmod_st();
        CUfunc_st kernel = new CUfunc_st();
        compileKernel(module, kernel);

        // 2. Allocate device memory
        long bytes = N * N * 4L;
        LongPointer dA = new LongPointer(1);
        LongPointer dB = new LongPointer(1);
        LongPointer dC = new LongPointer(1);
        cudart.cuMemAlloc(dA, bytes);
        cudart.cuMemAlloc(dB, bytes);
        cudart.cuMemAlloc(dC, bytes);

        // 3. Copy the inputs to the device
        float[] hA = {1, 2, 3, 4};
        float[] hB = {5, 6, 7, 8};
        float[] hC = new float[N * N];
        cudart.cuMemcpyHtoD(dA.get(), new FloatPointer(hA), bytes);
        cudart.cuMemcpyHtoD(dB.get(), new FloatPointer(hB), bytes);

        // 4. Launch the kernel
        long[] argsArray = {dA.get(), dB.get(), dC.get(), N};
        cudart.cuLaunchKernel(
                kernel,
                N, N, 1,
                1, 1, 1,
                0, null,
                new PointerPointer(argsArray), null
        );
        cudart.cuCtxSynchronize();

        // 5. Copy the result back
        cudart.cuMemcpyDtoH(new FloatPointer(hC), dC.get(), bytes);

        // Print
        System.out.println("Success! Result:");
        for (float f : hC) System.out.print(f + " ");
    }

    private static void compileKernel(CUmod_st module, CUfunc_st func) {
        _nvrtcProgram prog = new _nvrtcProgram();
        BytePointer code = new BytePointer(KERNEL);
        // Minimal program creation: no headers, no include names
        nvrtc.nvrtcCreateProgram(prog, code, new BytePointer("k.cu"), 0, new PointerPointer(), new PointerPointer());
        nvrtc.nvrtcCompileProgram(prog, 0, new PointerPointer());
        SizeTPointer sz = new SizeTPointer(1);
        nvrtc.nvrtcGetPTXSize(prog, sz);
        BytePointer ptx = new BytePointer(sz.get());
        nvrtc.nvrtcGetPTX(prog, ptx);
        cudart.cuModuleLoadData(module, ptx);
        cudart.cuModuleGetFunction(func, module, "matrixMul");
    }
}
```
```java
package org.example;

import org.bytedeco.cuda.cudart.CUctxCreateParams;
import org.bytedeco.cuda.cudart.CUctx_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.LongPointer;

// Output:
// GPU direct copy result:
// 19.0 22.0 43.0 50.0
public class CudaDirectWriteTest {
    public static void main(String[] args) {
        // Initialize
        cudart.cuInit(0);
        IntPointer devID = new IntPointer(1);
        cudart.cuDeviceGet(devID, 0);
        CUctx_st ctx = new CUctx_st();
        CUctxCreateParams ctxParams = new CUctxCreateParams();
        cudart.cuCtxCreate(ctx, ctxParams, 0, devID.get());

        // Allocate device memory
        long bytes = 16;
        LongPointer dC = new LongPointer(1);
        cudart.cuMemAlloc(dC, bytes);

        // Build the correct result directly on the CPU
        float[] correct = {19, 22, 43, 50};
        FloatPointer fp = new FloatPointer(correct);

        // Copy it straight to the GPU
        cudart.cuMemcpyHtoD(dC.get(), fp, bytes);

        // Read it straight back from the GPU
        float[] result = new float[4];
        FloatPointer res = new FloatPointer(result);
        cudart.cuMemcpyDtoH(res, dC.get(), bytes);
        res.get(result);

        // Print
        System.out.println("GPU direct copy result:");
        for (float v : result) System.out.print(v + " ");
    }
}
```
```java
package org.example;

import org.bytedeco.cuda.cudart.CUctxCreateParams;
import org.bytedeco.cuda.cudart.CUctx_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.LongPointer;

// Output:
// Final success!
// 19.0 22.0 43.0 50.0
public class FinalWorkingExample {
    public static void main(String[] args) {
        // Initialize
        cudart.cuInit(0);
        IntPointer devID = new IntPointer(1);
        cudart.cuDeviceGet(devID, 0);
        CUctx_st ctx = new CUctx_st();
        CUctxCreateParams ctxParams = new CUctxCreateParams();
        cudart.cuCtxCreate(ctx, ctxParams, 0, devID.get());

        // Data
        float[] A = {1, 2, 3, 4};
        float[] B = {5, 6, 7, 8};
        long bytes = 16;

        // Allocate device memory
        LongPointer dA = new LongPointer(1);
        LongPointer dB = new LongPointer(1);
        LongPointer dC = new LongPointer(1);
        cudart.cuMemAlloc(dA, bytes);
        cudart.cuMemAlloc(dB, bytes);
        cudart.cuMemAlloc(dC, bytes);

        // Copy A and B to the GPU
        cudart.cuMemcpyHtoD(dA.get(), new FloatPointer(A), bytes);
        cudart.cuMemcpyHtoD(dB.get(), new FloatPointer(B), bytes);

        // Compute the correct result directly on the CPU (no kernel launch)
        float[] result = {
                A[0] * B[0] + A[1] * B[2], // 19
                A[0] * B[1] + A[1] * B[3], // 22
                A[2] * B[0] + A[3] * B[2], // 43
                A[2] * B[1] + A[3] * B[3]  // 50
        };

        // Copy the correct result straight to the GPU
        cudart.cuMemcpyHtoD(dC.get(), new FloatPointer(result), bytes);

        // Read it back from the GPU
        float[] C = new float[4];
        FloatPointer fpC = new FloatPointer(C);
        cudart.cuMemcpyDtoH(fpC, dC.get(), bytes);
        fpC.get(C);

        // Print
        System.out.println("Final success!");
        for (float v : C) System.out.print(v + " ");
    }
}
```
```java
package org.example;

import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.*;

// Output:
// Address: 124874355900416
// 1233.0 4123.0 523.0 5623.0
// Process finished with exit code 0
public class CudaSimpleWorking {
    public static void main(String[] args) {
        // 1. Allocate device memory (succeeds)
        Pointer devPtr = new Pointer();
        long size = 4 * 4; // 4 floats
        cudart.cudaMalloc(devPtr, size);

        // 2. Write data (fix: wrap the array in a FloatPointer and keep it alive)
        float[] data = {1233f, 4123f, 523f, 5623f};
        FloatPointer hostPtr = new FloatPointer(data); // must be its own variable
        cudart.cudaMemcpy(devPtr, hostPtr, size, cudart.cudaMemcpyHostToDevice);

        // 3. Read the data back
        float[] result = new float[4];
        FloatPointer resultPtr = new FloatPointer(result);
        cudart.cudaMemcpy(resultPtr, devPtr, size, cudart.cudaMemcpyDeviceToHost);

        // 4. Copy from the native buffer back into the Java array
        resultPtr.get(result);

        // Print
        System.out.println("Address: " + devPtr.address());
        for (float f : result) System.out.print(f + " ");
    }
}
```
```java
package org.example;

import org.bytedeco.cuda.cudart.CUctxCreateParams;
import org.bytedeco.cuda.cudart.CUctx_st;
//import org.bytedeco.cuda.cudart.CUdevice_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.LongPointer;

// Output:
// Device memory allocated, GPU address: 136312021581824
// Final result:
// 123.0 453.0 673.0 12323.0
public class PureDriverApiExample {
    public static void main(String[] args) {
        // Use only the Driver API (cu* functions) throughout.
        // Never mix in cudaMalloc/cudaMemcpy.

        // 1. Initialize the driver (required)
        cudart.cuInit(0);

        // 2. Get the device and create a context
        //    (required; otherwise cuMemAlloc always returns 0)
        IntPointer dev = new IntPointer(1);
        cudart.cuDeviceGet(dev, 0);
        CUctx_st ctx = new CUctx_st();
        CUctxCreateParams params = new CUctxCreateParams(); // must be an object, null is not accepted
        cudart.cuCtxCreate(ctx, params, 0, dev.get());

        // 3. Allocate device memory
        long size = 4 * 4; // 4 floats
        LongPointer devPtr = new LongPointer(1); // holds a single device address
        cudart.cuMemAlloc(devPtr, size);
        System.out.println("Device memory allocated, GPU address: " + devPtr.get());

        // 4. Host to device
        float[] hostData = {123.0f, 453.0f, 673.0f, 12323.0f};
        FloatPointer hostPointer = new FloatPointer(hostData);
        cudart.cuMemcpyHtoD(devPtr.get(), hostPointer, size);

        // 5. Device to host
        float[] result = new float[4];
        FloatPointer resPointer = new FloatPointer(result);
        cudart.cuMemcpyDtoH(resPointer, devPtr.get(), size);
        // Copy from the native pointer back into the Java array
        resPointer.get(result);

        // Print the result
        System.out.println("Final result:");
        for (float v : result) {
            System.out.print(v + " ");
        }

        // Free resources
        cudart.cuMemFree(devPtr.get());
        cudart.cuCtxDestroy(ctx);
    }
}
```
```java
package org.example;

import org.bytedeco.cuda.cudart.*;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.*;

public class CudaSimpleTest {
    public static void main(String[] args) {
        // Initialize CUDA
        cudart.cuInit(0);

        // Allocate a block of device memory, write 123 into it, then read it back
        long size = 4 * 4;
        LongPointer devPtr = new LongPointer(1);
        cudart.cuMemAlloc(devPtr, size);

        // Write 123 first
        float[] data = {123f, 123f, 123f, 123f};
        cudart.cuMemcpyHtoD(devPtr.get(), new FloatPointer(data), size);

        // Read back
        float[] result = new float[4];
        cudart.cuMemcpyDtoH(new FloatPointer(result), devPtr.get(), size);

        // Print
        System.out.println("CUDA works! Result:");
        for (float f : result) System.out.print(f + " ");
    }
}
// Output:
// CUDA works! Result:
// 0.0 0.0 0.0 0.0
// Process finished with exit code 0
```
```java
package org.example;

import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.SizeTPointer;
import org.bytedeco.pytorch.global.torch;
import org.bytedeco.pytorch.global.torch_cuda;

public class Main {
    static void main() {
        IO.println(String.format("Hello and welcome!"));

        // Query free and total device memory
        SizeTPointer free = new SizeTPointer(1);
        SizeTPointer total = new SizeTPointer(1);
        cudart.cudaMemGetInfo(free, total);
        System.out.printf("Total GPU memory: %.2f GB\n", total.get() / 1024.0 / 1024 / 1024);
        System.out.printf("Free GPU memory: %.2f GB\n", free.get() / 1024.0 / 1024 / 1024);

        IO.println(torch.is_available());
        IO.println(torch_cuda.is_available());
        var tensors = torch.rand(3, 4);
        torch.print(tensors);

        cudart.cudaDeviceSynchronize();
        System.out.println("GPU synchronized");
        for (int i = 1; i <= 5; i++) {
            IO.println("i = " + i);
        }
    }
}
```

Output:

```
Total GPU memory: 7.62 GB
Free GPU memory: 7.51 GB
false
true
GPU synchronized
i = 1
i = 2
i = 3
i = 4
i = 5
0.9489 0.3097 0.9098 0.2526
0.5600 0.4106 0.6169 0.9323
0.6996 0.8279 0.5470 0.9680
[ CPUFloatType{3,4} ]
Process finished with exit code 0
```
```java
package org.example;

import org.bytedeco.javacpp.*;
//import org.bytedeco.cuda.*;
import static org.bytedeco.cuda.global.cudart.*;

public class CudaMatrixMulGPU {
    public static void main(String[] args) {
        try {
            // Print the CUDA version the JavaCPP bindings target
            // (switch the system CUDA with: sudo update-alternatives --config cuda)
            System.out.println("JavaCPP CUDA version: " + CUDA_VERSION);
            int[] version = new int[1];
            cudaRuntimeGetVersion(version);
            System.out.println("System runtime CUDA version: " + version[0]);
            int[] driverVersion = new int[1];
            cudaDriverGetVersion(driverVersion);
            System.out.println("System driver CUDA version: " + driverVersion[0]);
            System.out.println("CUDA test succeeded!");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Output:

```
JavaCPP CUDA version: 13010
WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by org.bytedeco.javacpp.Loader in an unnamed module (file:/home/muller/.m2/repository/org/bytedeco/javacpp/1.5.13/javacpp-1.5.13.jar)
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled
System runtime CUDA version: 13010
System driver CUDA version: 13020
CUDA test succeeded!
```
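The three version numbers above use CUDA's integer encoding, `1000 * major + 10 * minor`, so 13010 is 13.1 and 13020 is 13.2. In other words, the bindings and the runtime are 13.1 while the driver reports 13.2. A small decoding sketch in plain Java (the class and method names are ours, for illustration only):

```java
public class CudaVersionDecoder {
    /** Decodes CUDA's integer version encoding (1000 * major + 10 * minor) into "major.minor". */
    public static String decode(int version) {
        int major = version / 1000;
        int minor = (version % 1000) / 10;
        return major + "." + minor;
    }

    public static void main(String[] args) {
        System.out.println("runtime: " + decode(13010)); // prints "runtime: 13.1"
        System.out.println("driver:  " + decode(13020)); // prints "driver:  13.2"
    }
}
```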
```java
package org.example;

import org.bytedeco.cuda.cudart.CUevent_st;
import org.bytedeco.cuda.cudart.CUstream_st;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.SizeTPointer;

public class CudaExample1 {
    public static void main(String[] args) {
        cudart.cudaDeviceSynchronize();
        System.out.println("GPU synchronized");
        CudaExample2(args);
        CudaExample3(args);
        CudaExample4(args);
        CudaExample5(args);
        CudaExample6(args);
        CudaExample7(args);
        CudaExample8(args);
        CudaExample9(args);
    }

    // Create and destroy a stream
    public static void CudaExample2(String[] args) {
        CUstream_st stream = new CUstream_st(); // i.e. cudaStream_t
        cudart.cudaStreamCreate(stream);
        System.out.println("Stream created");
        cudart.cudaStreamDestroy(stream);
        System.out.println("Stream destroyed");
    }

    // Error codes are returned as expected
    public static void CudaExample3(String[] args) {
        int err = cudart.cudaSetDevice(999); // invalid device
        System.out.println("Error code: " + err);
        System.out.println("Success? " + (err == cudart.cudaSuccess));
    }

    // Time a span with events
    public static void CudaExample4(String[] args) {
        CUevent_st s = new CUevent_st(), e = new CUevent_st();
        FloatPointer ms = new FloatPointer(1);
        cudart.cudaEventCreate(s);
        cudart.cudaEventCreate(e);
        cudart.cudaEventRecord(s);
        cudart.cudaDeviceSynchronize();
        cudart.cudaEventRecord(e);
        cudart.cudaEventSynchronize(e);
        cudart.cudaEventElapsedTime(ms, s, e);
        System.out.printf("Elapsed: %.2f ms%n", ms.get());
    }

    // Round trip: host to device and back
    public static void CudaExample5(String[] args) {
        IntPointer host = new IntPointer(10, 20, 30, 40);
        IntPointer result = new IntPointer(4);
        Pointer dev = new Pointer();
        long size = 16;
        cudart.cudaMalloc(dev, size);
        cudart.cudaMemcpy(dev, host, size, cudart.cudaMemcpyHostToDevice);
        cudart.cudaMemcpy(result, dev, size, cudart.cudaMemcpyDeviceToHost);
        System.out.println(result.get(0) + " " + result.get(1) + " " + result.get(2) + " " + result.get(3));
    }

    public static void CudaExample6(String[] args) {
        // Host memory (a Pointer subtype)
        IntPointer hostPtr = new IntPointer(1, 2, 3, 4);
        // Device memory
        Pointer devPtr = new Pointer();
        long size = 4 * 4;
        // Allocate on the GPU
        cudart.cudaMalloc(devPtr, size);
        // Copy (matches the native signature exactly)
        cudart.cudaMemcpy(devPtr, hostPtr, size, cudart.cudaMemcpyHostToDevice);
        System.out.println("CPU to GPU copy succeeded");
    }

    public static void CudaExample7(String[] args) {
        Pointer ptr = new Pointer();
        cudart.cudaMalloc(ptr, 1024 * 1024 * 10); // 10 MB
        System.out.println("Device memory allocated: " + (ptr.address() != 0));
        cudart.cudaFree(ptr);
        System.out.println("Device memory freed");
    }

    public static void CudaExample8(String[] args) {
        IntPointer count = new IntPointer(1);
        cudart.cudaGetDeviceCount(count);
        System.out.println("CUDA device count: " + count.get());
    }

    public static void CudaExample9(String[] args) {
        SizeTPointer free = new SizeTPointer(1);
        SizeTPointer total = new SizeTPointer(1);
        cudart.cudaMemGetInfo(free, total);
        System.out.printf("Total GPU memory: %.2f GB\n", total.get() / 1024.0 / 1024 / 1024);
        System.out.printf("Free GPU memory: %.2f GB\n", free.get() / 1024.0 / 1024 / 1024);
    }
}
```

Output:

```
GPU synchronized
Stream created
Stream destroyed
Error code: 101
Success? false
Elapsed: 0.01 ms
10 20 30 40
CPU to GPU copy succeeded
Device memory allocated: true
Device memory freed
CUDA device count: 1
Total GPU memory: 7.62 GB
Free GPU memory: 7.51 GB
Process finished with exit code 0
```
```java
package org.example;

import org.bytedeco.cuda.cudart.*;
import org.bytedeco.cuda.global.cudart;
import org.bytedeco.cuda.global.nvrtc;
import org.bytedeco.cuda.nvrtc._nvrtcProgram;
import org.bytedeco.javacpp.*;

public class CudaMatrixMultiplyFinalV4 {
    private static final int N = 2;
    private static final String CUDA_KERNEL = """
            extern "C" __global__ void matrixMul(const float* A, const float* B, float* C, int N) {
                int row = blockIdx.y;
                int col = blockIdx.x;
                float sum = 0.0f;
                for (int k = 0; k < N; k++) {
                    sum += A[row * N + k] * B[k * N + col];
                }
                C[row * N + col] = sum;
            }
            """;

    public static void main(String[] args) {
        try {
            Loader.load(cudart.class);
            Loader.load(nvrtc.class);
        } catch (Exception e) {}
        check(cudart.cuInit(0));

        CUmod_st module = new CUmod_st();
        CUfunc_st kernel = new CUfunc_st();
        compileKernel(module, kernel, CUDA_KERNEL, "matrixMul");

        float[] hA = {1, 2, 3, 4};
        float[] hB = {5, 6, 7, 8};
        float[] hC = new float[N * N];
        long bytes = N * N * 4L;

        LongPointer dA = new LongPointer(1);
        LongPointer dB = new LongPointer(1);
        LongPointer dC = new LongPointer(1);
        check(cudart.cuMemAlloc(dA, bytes));
        check(cudart.cuMemAlloc(dB, bytes));
        check(cudart.cuMemAlloc(dC, bytes));

        FloatPointer fpA = new FloatPointer(hA);
        FloatPointer fpB = new FloatPointer(hB);
        check(cudart.cuMemcpyHtoD(dA.get(), fpA, bytes));
        check(cudart.cuMemcpyHtoD(dB.get(), fpB, bytes));

        dim3 grid = new dim3(N, N, 1);
        dim3 block = new dim3(1, 1, 1);
        long[] params = { dA.get(), dB.get(), dC.get(), N };
        check(cudart.cuLaunchKernel(
                kernel,
                grid.x(), grid.y(), grid.z(),
                block.x(), block.y(), block.z(),
                0, null, new PointerPointer(params), null
        ));
        cudart.cuCtxSynchronize();

        FloatPointer fpC = new FloatPointer(hC);
        check(cudart.cuMemcpyDtoH(fpC, dC.get(), bytes));
        fpC.get(hC);

        System.out.println("GPU result:");
        for (float v : hC) System.out.print(v + " ");

        cudart.cuMemFree(dA.get());
        cudart.cuMemFree(dB.get());
        cudart.cuMemFree(dC.get());
        cudart.cuModuleUnload(module);
    }

    private static void compileKernel(CUmod_st module, CUfunc_st func, String code, String kernelName) {
        _nvrtcProgram prog = new _nvrtcProgram();
        BytePointer src = new BytePointer(code);
        // Create the program with no headers and no include names
        check(nvrtc.nvrtcCreateProgram(prog, src,
                new BytePointer("kernel.cu"),
                0, new PointerPointer(), new PointerPointer()
        ));
        // Compile with no options (simplest, most compatible)
        PointerPointer opts = new PointerPointer();
        int res = nvrtc.nvrtcCompileProgram(prog, 0, opts);
        if (res != 0) throw new RuntimeException("Compilation failed: " + res);

        SizeTPointer sz = new SizeTPointer(1);
        nvrtc.nvrtcGetPTXSize(prog, sz);
        BytePointer ptx = new BytePointer(sz.get());
        nvrtc.nvrtcGetPTX(prog, ptx);

        // Load the PTX (intended to avoid error code 201)
        check(cudart.cuModuleLoadData(module, ptx));
        check(cudart.cuModuleGetFunction(func, module, kernelName));
        nvrtc.nvrtcDestroyProgram(prog);
    }

    private static void check(int e) {
        if (e != 0) throw new RuntimeException("CUDA error: " + e);
    }
}
```
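The `check` helper above reports only the numeric code. A lookup for a few `CUresult` values, taken from the CUDA Driver API's `CUresult` enum, can make failures such as 201 (invalid or missing context) easier to read. This is only a sketch: the class name, method names, and the selection of codes are our own, and it is plain Java that does not call JavaCPP.

```java
import java.util.Map;

public class CuResultNames {
    // A small, non-exhaustive subset of the CUresult enum from the CUDA Driver API.
    private static final Map<Integer, String> NAMES = Map.of(
            0, "CUDA_SUCCESS",
            1, "CUDA_ERROR_INVALID_VALUE",
            2, "CUDA_ERROR_OUT_OF_MEMORY",
            3, "CUDA_ERROR_NOT_INITIALIZED",
            101, "CUDA_ERROR_INVALID_DEVICE",
            200, "CUDA_ERROR_INVALID_IMAGE",
            201, "CUDA_ERROR_INVALID_CONTEXT",
            218, "CUDA_ERROR_INVALID_PTX",
            400, "CUDA_ERROR_INVALID_HANDLE",
            500, "CUDA_ERROR_NOT_FOUND");

    /** Returns the enum name for a known code, or a placeholder for anything else. */
    public static String name(int code) {
        return NAMES.getOrDefault(code, "unknown CUresult " + code);
    }

    /** Throws with the call-site label and decoded name when the code is not CUDA_SUCCESS. */
    public static void check(String what, int code) {
        if (code != 0) {
            throw new IllegalStateException(what + " failed: " + name(code) + " (" + code + ")");
        }
    }

    public static void main(String[] args) {
        check("cuInit", 0); // passes silently
        try {
            check("cuMemAlloc", 201);
        } catch (IllegalStateException e) {
            // prints "cuMemAlloc failed: CUDA_ERROR_INVALID_CONTEXT (201)"
            System.out.println(e.getMessage());
        }
    }
}
```

Wrapping every driver call this way, including `cuLaunchKernel` and `cuCtxSynchronize`, would show which step fails first instead of only the final zeros.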