Commit 80afe2a

add chinese doc for cuda
1 parent f933bd4 commit 80afe2a

6 files changed: +375 -570 lines changed

README.md

Lines changed: 5 additions & 0 deletions
@@ -78,6 +78,11 @@ Scheduler:
 - [lesson 44-scx-simple](src/44-scx-simple/README.md) Introduction to the BPF Scheduler
 - [lesson 45-scx-nest](src/45-scx-nest/README.md) Implementing the `scx_nest` Scheduler
 
+GPU:
+
+- [lesson 47](src/47-cuda-events/README.md) Using eBPF to trace CUDA operations for GPU
+
+
 Other:
 
 - [lesson 35-user-ringbuf](src/35-user-ringbuf/README.md) Asynchronously Send to Kernel with User Ring Buffer

README.zh.md

Lines changed: 5 additions & 0 deletions
@@ -72,6 +72,11 @@ Android:
 Scheduler:
 
 - [lesson 44-scx-simple](src/44-scx-simple/README.zh.md) None
+
+GPU:
+
+- [lesson 47-cuda-events](src/47-cuda-events/README.zh.md) Using eBPF to trace CUDA operations
+
 Other:
 
 - [lesson 35-user-ringbuf](src/35-user-ringbuf/README.zh.md) eBPF Development Practice: Asynchronously Sending Information to the Kernel with the User Ring Buffer

src/47-cuda-events/README.md

Lines changed: 3 additions & 89 deletions
@@ -481,16 +481,6 @@ The `cuda_events` tool supports these options:
 - `-p PATH`: Specify the path to the CUDA runtime library or application
 - `-d PID`: Trace only the specified process ID
 
-## Learning Objectives
-
-Through this tutorial, you'll learn:
-
-1. How CUDA applications interact with GPUs through the CUDA runtime API
-2. How to use eBPF uprobes to trace user-space libraries
-3. How to design efficient data structures for kernel-to-user communication
-4. How to process and display traced events in a user-friendly format
-5. How to filter events by process ID for focused debugging
-
 ## Next Steps
 
 Once you're comfortable with this basic CUDA tracing tool, you could extend it to:
@@ -501,90 +491,14 @@ Once you're comfortable with this basic CUDA tracing tool, you could extend it t
 4. Create visualizations of CUDA operations for easier analysis
 5. Add support for other GPU frameworks like OpenCL or ROCm
 
+For more details about the CUDA example and tutorial, check out our repo and the code at <https://github.com/eunomia-bpf/basic-cuda-tutorial>
+
+
 ## References
 
 - CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
 - NVIDIA CUDA Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/
 - libbpf Documentation: https://libbpf.readthedocs.io/
 - Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
 
-## Benchmarking Tracing Overhead
-
-While tracing is an invaluable tool for debugging and understanding CUDA applications, it does introduce some overhead. We've included a benchmarking tool to help you measure this overhead.
-
-### The Benchmark Tool
-
-The `bench.cu` program performs several CUDA operations repeatedly and measures their execution time:
-
-1. Memory allocation (`cudaMalloc`)
-2. Memory transfers (host to device and device to host)
-3. Kernel launches
-4. Memory deallocation (`cudaFree`)
-5. Full operations (the complete sequence)
-
-Each operation is executed many times to get statistically significant results, and the average time per operation is reported in microseconds.
-
-### Running the Benchmark
-
-To build the benchmark tool:
-
-```bash
-make bench
-```
-
-To run a complete benchmark that compares performance with and without tracing:
-
-```bash
-make benchmark
-```
-
-This will run the benchmark twice:
-1. First without any tracing
-2. Then with the CUDA events tracer attached
-
-You can also run individual benchmarks:
-
-```bash
-# Without tracing
-make benchmark-no-trace
-
-# With tracing
-make benchmark-with-trace
-```
-
-### Interpreting the Results
-
-The benchmark output shows the average time for each CUDA operation in microseconds. Compare the times with and without tracing to understand the overhead.
-
-For example:
-
-```
-# Without tracing
-cudaMalloc : 23.45 µs per operation
-cudaMemcpyH2D : 42.67 µs per operation
-cudaLaunchKernel : 15.89 µs per operation
-cudaMemcpyD2H : 38.12 µs per operation
-cudaFree : 10.34 µs per operation
-Full Operation : 130.47 µs per operation
-
-# With tracing
-cudaMalloc : 25.12 µs per operation
-cudaMemcpyH2D : 45.89 µs per operation
-cudaLaunchKernel : 17.23 µs per operation
-cudaMemcpyD2H : 41.56 µs per operation
-cudaFree : 11.78 µs per operation
-Full Operation : 141.58 µs per operation
-```
-
-In this example, tracing adds about 7-9% overhead to most CUDA operations (closer to 14% for `cudaFree`) and about 8.5% to the full sequence. This is typically acceptable for debugging and profiling purposes, but it's important to be aware of this impact when interpreting the results.
-
-### Optimization Opportunities
-
-If you find the tracing overhead too high for your use case, there are several ways to reduce it:
-
-1. Trace only specific CUDA functions that are relevant to your investigation
-2. Filter by specific process IDs to minimize the number of events captured
-3. Disable return probes using the `-r` flag if you don't need return values
-4. Consider running eBPF in user-space with tools like [bpftime](https://github.com/eunomia-bpf/bpftime) to reduce context-switching overhead
-
 If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.
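
As a rough check on the sample figures in the removed "Interpreting the Results" text, the relative overhead of each operation can be recomputed from the two timing columns. The sketch below is illustrative only: the timings are copied from that example output, while the `overhead_percent` helper and the dictionary names are assumptions introduced here and are not part of the tutorial's code.

```python
# Sample timings in microseconds per operation, copied from the removed
# README example. They are illustrative values, not fresh measurements.
baseline = {
    "cudaMalloc": 23.45,
    "cudaMemcpyH2D": 42.67,
    "cudaLaunchKernel": 15.89,
    "cudaMemcpyD2H": 38.12,
    "cudaFree": 10.34,
    "Full Operation": 130.47,
}
traced = {
    "cudaMalloc": 25.12,
    "cudaMemcpyH2D": 45.89,
    "cudaLaunchKernel": 17.23,
    "cudaMemcpyD2H": 41.56,
    "cudaFree": 11.78,
    "Full Operation": 141.58,
}


def overhead_percent(before: float, after: float) -> float:
    """Relative slowdown of `after` compared with `before`, in percent."""
    return (after - before) / before * 100.0


# Hypothetical helper output: one overhead figure per traced operation.
overheads = {
    name: overhead_percent(baseline[name], traced[name]) for name in baseline
}

for name, pct in overheads.items():
    print(f"{name:<17}: {pct:5.1f}% overhead")
```

On these sample numbers, `cudaFree` shows the largest relative overhead and `cudaMalloc` the smallest. Real measurements will vary with hardware, driver version, and workload.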
