Energy profiling tools: Core infrastructure with timing tool and export capabilities#299
Conversation
|
@dalg24 @masterleinad Hello! I would need some review on this PR, being the baseline blocks needed the daemon system and energy measurement tools (that would also need review #300) |
JBludau
left a comment
There was a problem hiding this comment.
I think we are getting there
| // Stack-based timing for robust region/kernel tracking | ||
| void start_region(const std::string& name, RegionType type, uint64_t id = 0); | ||
| void end_region(); |
There was a problem hiding this comment.
Fine but noting that if you can't correlate start and end of a region via some identifier then it won't be threadsafe.
a9c54ce to
5d9082f
Compare
| # - Tool interface definitions | ||
| # - Basic kernel timer tool | ||
|
|
||
| add_subdirectory(kokkos-tools) No newline at end of file |
There was a problem hiding this comment.
This red circle means "no newline at end of file".
Please fix it by adding a newline character.
Same comment everywhere.
| namespace KokkosTools { | ||
| namespace EnergyProfiler { | ||
|
|
||
| std::string generate_prefix() { |
There was a problem hiding this comment.
| namespace KokkosTools { | |
| namespace EnergyProfiler { | |
| std::string generate_prefix() { | |
| std::string KokkosTools::EnergyProfiler::generate_prefix() { |
| namespace KokkosTools { | ||
| namespace EnergyProfiler { |
There was a problem hiding this comment.
| namespace KokkosTools { | |
| namespace EnergyProfiler { | |
| namespace KokkosTools::EnergyProfiler { |
6fb522b to
f2f0606
Compare
5fbccdf to
2629fad
Compare
| std::lock_guard<std::mutex> lock(state.get_mutex()); | ||
| state.get_active_regions().push_back(region); | ||
| } catch (const std::exception& e) { | ||
| std::cerr << "Error in start_region: " << e.what() << std::endl; |
There was a problem hiding this comment.
hmm ... afaik that is not how exceptions are supposed to be used ... especially since you are not handling the exception but just printing and not rethrowing. So this would not lead to an abort.
if any of the functions in the try block throw, you are silencing that but also not restoring a valid state so that the program can continue. I would remove the try-catch
| TimingInfo region; | ||
| region.name = name; | ||
| region.type = type; | ||
| region.start_time = std::chrono::high_resolution_clock::now(); |
There was a problem hiding this comment.
when starting, you should take the time at the end and then use it to update the last region in the dqueue. This way you don't measure the construction or the lock
There was a problem hiding this comment.
As I am writing this: It of course does not hold for nested regions ... but that is too much for now
d1bae2d to
049a99f
Compare
…ling in start and end region functions
049a99f to
6fcf78e
Compare
|
A recent reference with related work on rocm: |
Thanks! |
This PR introduces the foundational infrastructure for energy profiling tools in Kokkos:
Features:
TimingInfostructure and region trackingEnergyProfilerStatesingleton for managing active/completed timing regionstiming_exportmodule for exporting timing data to CSV files with summary statisticsImplementation:
kp_energy_profiler- Core profiling functionalitytiming_utils.hpp/cpp- State management and helper functionstiming_export.hpp/cpp- CSV export and summary generationArchitecture:
This infrastructure provides a clean separation between timing collection, state management, and data export, making it easy for future energy monitoring providers to integrate their functionality. The design supports both synchronous and asynchronous profiling patterns (if combined with #300).