- Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
## Benchmarking Tracing Overhead
While tracing is an invaluable tool for debugging and understanding CUDA applications, it does introduce some overhead. We've included a benchmarking tool to help you measure this overhead.
### The Benchmark Tool
The `bench.cu` program performs several CUDA operations repeatedly and measures their execution time:
1. Memory allocation (`cudaMalloc`)
2. Memory transfers (host to device and device to host)
3. Kernel launches
4. Memory deallocation (`cudaFree`)
5. Full operations (the complete sequence)
Each operation is executed many times to get statistically significant results, and the average time per operation is reported in microseconds.
### Running the Benchmark
To build the benchmark tool:
```bash
make bench
```
To run a complete benchmark that compares performance with and without tracing:
```bash
make benchmark
```
This will run the benchmark twice:
1. First without any tracing
2. Then with the CUDA events tracer attached
You can also run individual benchmarks:
```bash
# Without tracing
make benchmark-no-trace

# With tracing
make benchmark-with-trace
```
### Interpreting the Results
The benchmark output shows the average time for each CUDA operation in microseconds. Compare the times with and without tracing to understand the overhead.
For example:
```
# Without tracing
cudaMalloc       : 23.45 µs per operation
cudaMemcpyH2D    : 42.67 µs per operation
cudaLaunchKernel : 15.89 µs per operation
cudaMemcpyD2H    : 38.12 µs per operation
cudaFree         : 10.34 µs per operation
Full Operation   : 130.47 µs per operation

# With tracing
cudaMalloc       : 25.12 µs per operation
cudaMemcpyH2D    : 45.89 µs per operation
cudaLaunchKernel : 17.23 µs per operation
cudaMemcpyD2H    : 41.56 µs per operation
cudaFree         : 11.78 µs per operation
Full Operation   : 141.58 µs per operation
```
In this example, tracing adds roughly 7-9% overhead to most CUDA operations (about 8.5% for the full sequence), with the largest relative impact on cheap calls such as `cudaFree` (around 14%). This is typically acceptable for debugging and profiling purposes, but it's important to keep this impact in mind when interpreting results.
### Optimization Opportunities
If you find the tracing overhead too high for your use case, there are several ways to reduce it:
1. Trace only specific CUDA functions that are relevant to your investigation
2. Filter by specific process IDs to minimize the number of events captured
3. Disable return probes using the `-r` flag if you don't need return values
4. Consider running eBPF in user-space with tools like [bpftime](https://github.com/eunomia-bpf/bpftime) to reduce context-switching overhead
If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.