Energy profiling tools: NVML-based measurement tool #301
Open
ethan-puyaubreau
wants to merge
15
commits into
kokkos:develop
from
ethan-puyaubreau:feature/energy-profiler-nvml
7ec6d88
Add energy profiler module with timing utilities and export functiona…
DaKerboul 22429bf
Remove unused chrono include from timing_export.hpp
DaKerboul f9b227b
Refactor energy profiler timing functions
DaKerboul e30b3fd
clang-format
DaKerboul e0de668
Add Daemon class for managing periodic task execution
DaKerboul 74f435a
Refactor Daemon::tick
DaKerboul 9b52c58
Rename variable Daemon::run method
DaKerboul 49073bc
Move files to upper folder
DaKerboul b96c9b1
Add NVML power profiling support to energy profiler module
DaKerboul cac7e07
Update CMakeLists.txt to enforce minimum CUDA version for NVML support
DaKerboul 9fd72b0
Enhance energy profiler with NVML support checks in CMake configuration
DaKerboul 058d9e3
Refactor error handling in energy profiler to use std::cerr instead o…
DaKerboul 8c29325
Refactor logging functions and reduce includes
DaKerboul d9a6342
Remove warning for ending region with no active regions in energy pro…
DaKerboul 9b9e6f5
Format error message for power sampling initialization in energy prof…
DaKerboul
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| kp_add_library(kp_energy_profiler | ||
| kp_energy_profiler.cpp | ||
| timing_utils.cpp | ||
| timing_export.cpp | ||
| nvml_provider.cpp | ||
| power_sampler.cpp | ||
| daemon.cpp | ||
| ) | ||
|
|
||
| target_link_libraries(kp_energy_profiler PRIVATE CUDA::nvml) | ||
| target_compile_definitions(kp_energy_profiler PRIVATE KOKKOS_ENERGY_PROFILER_HAS_NVML) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| //@HEADER | ||
| // ************************************************************************ | ||
| // | ||
| // Kokkos v. 4.0 | ||
| // Copyright (2022) National Technology & Engineering | ||
| // Solutions of Sandia, LLC (NTESS). | ||
| // | ||
| // Under the terms of Contract DE-NA0003525 with NTESS, | ||
| // the U.S. Government retains certain rights in this software. | ||
| // | ||
| // Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. | ||
| // See https://kokkos.org/LICENSE for license information. | ||
| // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
| // | ||
| //@HEADER | ||
|
|
||
| #include "daemon.hpp" | ||
| #include <stdexcept> | ||
| #include <thread> | ||
|
|
||
| void Daemon::start() { | ||
| if (!running_) { | ||
| running_ = true; | ||
| thread_ = std::thread(&Daemon::run, this); | ||
| } else { | ||
| throw std::runtime_error("Daemon already started"); | ||
| } | ||
| } | ||
|
|
||
| void Daemon::run() { | ||
| while (running_) { | ||
| auto next_run = std::chrono::high_resolution_clock::now() + interval_; | ||
| func_(); | ||
| std::this_thread::sleep_until(next_run); | ||
| } | ||
| } | ||
|
|
||
| void Daemon::stop() { | ||
| if (running_) { | ||
| running_ = false; | ||
| thread_.join(); | ||
| } else { | ||
| throw std::runtime_error("Daemon not started"); | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| //@HEADER | ||
| // ************************************************************************ | ||
| // | ||
| // Kokkos v. 4.0 | ||
| // Copyright (2022) National Technology & Engineering | ||
| // Solutions of Sandia, LLC (NTESS). | ||
| // | ||
| // Under the terms of Contract DE-NA0003525 with NTESS, | ||
| // the U.S. Government retains certain rights in this software. | ||
| // | ||
| // Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. | ||
| // See https://kokkos.org/LICENSE for license information. | ||
| // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
| // | ||
| //@HEADER | ||
|
|
||
| #pragma once | ||
|
|
||
| #include <atomic> | ||
| #include <functional> | ||
| #include <thread> | ||
| #include <chrono> | ||
|
|
||
| class Daemon { | ||
| public: | ||
| Daemon(std::function<void()> func, int interval_ms) | ||
| : interval_(interval_ms), func_(func){}; | ||
|
|
||
| void start(); | ||
| void run(); | ||
| void stop(); | ||
| bool is_running() const { return running_; } | ||
| std::thread& get_thread() { return thread_; } | ||
|
|
||
| private: | ||
| std::chrono::milliseconds interval_; | ||
| // Written by start()/stop() and read by the worker thread, hence atomic. | ||
| std::atomic<bool> running_{false}; | ||
| std::function<void()> func_; | ||
| std::thread thread_; | ||
| }; |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| //@HEADER | ||
| // ************************************************************************ | ||
| // | ||
| // Kokkos v. 4.0 | ||
| // Copyright (2022) National Technology & Engineering | ||
| // Solutions of Sandia, LLC (NTESS). | ||
| // | ||
| // Under the terms of Contract DE-NA0003525 with NTESS, | ||
| // the U.S. Government retains certain rights in this software. | ||
| // | ||
| // Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. | ||
| // See https://kokkos.org/LICENSE for license information. | ||
| // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
| // | ||
| //@HEADER | ||
|
|
||
| #pragma once | ||
|
|
||
| #include <cstddef> | ||
|
|
||
| namespace KokkosTools { | ||
| namespace EnergyProfiler { | ||
|
|
||
| // Sampling interval in milliseconds | ||
| constexpr int SAMPLING_INTERVAL_MS = 20; | ||
|
|
||
| // Buffer size for hostname | ||
| constexpr std::size_t HOSTNAME_BUFFER_SIZE = 256; | ||
|
|
||
| // Table formatting constants for timing export | ||
| const int COLUMN_WIDTH_CATEGORY = 10; | ||
| const int COLUMN_WIDTH_NAME = 32; | ||
| const int COLUMN_WIDTH_TYPE = 14; | ||
| const int COLUMN_WIDTH_TIME = 17; | ||
| const int COLUMN_WIDTH_DURATION = 13; | ||
|
|
||
| } // namespace EnergyProfiler | ||
| } // namespace KokkosTools |
Have you seen https://docs.nvidia.com/nsight-systems/UserGuide/index.html#nvml-power-and-temperature-metrics-preview ?
It would be great to document this tool better and compare it to what `nsight-systems` may provide. Does your tool do anything special to help correlate `Kokkos` regions with consumption, or trigger anything special? Or is it "just" launching a thread that samples the energy consumption?
Also, what about asynchronicity? I mean:

    {
      Kokkos::Profiling::ScopedRegion region("my region");
      Kokkos::parallel_for(Kokkos::RangePolicy(exec, 0, N), ...); // async
    }

If the tool reports the consumption of the `my region` region and you are not calling `Kokkos::fence`, what is the meaning of the measurements reported by `Kokkos Tools`, especially when the kernel is actually still running after the scoped region ends?
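To make the concern concrete, here is a minimal sketch in plain C++ rather than Kokkos. It assumes `std::async` as a stand-in for an asynchronous kernel launch and `future.wait()` as a stand-in for a fence; `timed_region_ms` and its parameters are hypothetical names used only for illustration.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a profiled region around an asynchronous launch.
// Returns the wall-clock time in milliseconds between region begin and end.
double timed_region_ms(std::chrono::milliseconds kernel_time, bool fence) {
  using clock = std::chrono::steady_clock;
  const auto begin = clock::now();  // "region begins"

  // Asynchronous "kernel": the launch call returns immediately.
  std::future<void> kernel = std::async(std::launch::async, [kernel_time] {
    std::this_thread::sleep_for(kernel_time);
  });

  if (fence) kernel.wait();       // analogous to fencing inside the region
  const auto end = clock::now();  // "region ends"

  kernel.wait();  // the work still completes, possibly after the region ended
  return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

Without the fence, the region closes almost immediately while the work is still in flight, so samples attributed to that region would mostly describe something else.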
Hello! I have seen this page once, but since I was focusing on Variorum support at the time, I have not had the chance to read it further. As of now, the tool only launches a thread that samples the energy consumption.
For now, the system does not do anything special to correlate power consumption with a specific region or kernel. The timing system is meant to help visualize what is happening at a given moment, so there is room for improvement (for example, adding more correlation using multiple metrics), especially since the daemon system would still allow for method/data propagation.
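One simple form of correlation would be to match sample timestamps against region begin and end times in post-processing. The sketch below assumes both are recorded on the same clock; `PowerSample`, `Region`, and `mean_power_in_region` are hypothetical names, and the PR's actual data structures may differ.

```cpp
#include <string>
#include <vector>

// Hypothetical record types for illustration only.
struct PowerSample {
  double t_ms;   // timestamp of the sample
  double watts;  // sampled power
};
struct Region {
  std::string name;
  double begin_ms;
  double end_ms;
};

// Mean power over the samples whose timestamps fall inside the region.
// Returns a negative value if no sample landed in the region.
double mean_power_in_region(const std::vector<PowerSample>& samples,
                            const Region& region) {
  double sum = 0.0;
  int count = 0;
  for (const auto& s : samples) {
    if (s.t_ms >= region.begin_ms && s.t_ms <= region.end_ms) {
      sum += s.watts;
      ++count;
    }
  }
  return count > 0 ? sum / count : -1.0;
}
```

With a 20 ms sampling interval, short regions may contain no samples at all, which is why the empty case is reported explicitly.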
@romintomasetti we will check if Nsight does actually do something useful. It at least sounds like the exact same thing we are doing.
About the fencing: at the moment Nvidia allows power measurements every 100 ms, but it seems to return only the average power over the last 25 ms of that window; see http://arxiv.org/abs/2312.02741. Given the current state of Nvidia's tools, the tool will therefore only be useful for measuring entire regions, and even then it will need repetitions with shifts and post-processing to get anything that can be related to an algorithm. Because of these problems the tool currently does not enforce fencing; that is left to the user.
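As an illustration of the post-processing step, energy for a whole region can be estimated by integrating the power samples over time. The sketch below uses the trapezoidal rule; `TimedPower` and `energy_joules` are hypothetical names, and it assumes the samples are already sorted by time. It does not correct for the 25 ms averaging window described above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sample record: time in seconds, power in watts.
struct TimedPower {
  double t_s;
  double watts;
};

// Trapezoidal estimate of energy in joules from time-ordered power samples.
double energy_joules(const std::vector<TimedPower>& samples) {
  double energy = 0.0;
  for (std::size_t i = 1; i < samples.size(); ++i) {
    const double dt = samples[i].t_s - samples[i - 1].t_s;
    energy += 0.5 * (samples[i].watts + samples[i - 1].watts) * dt;
  }
  return energy;
}
```

For example, samples of 100 W, 200 W, and 100 W taken 100 ms apart integrate to roughly 30 J.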
Could be good to add the AMD tool to the mix:
on our ToDo list :-)
@JBludau Do you still have plans to look at the AMD Profiling metrics, as mentioned above? Do you have updates you can share - particularly those that are pertinent to this PR - from your end?
Thanks!
This PR has been split into smaller ones to make it easier to review. Once those are merged, we can think about adding something for AMD.