WIP: Pydarshan log_based sorting #949
Conversation
tylerjereddy left a comment
I tried playing with it a bit this morning and thought I'd make a few suggestions:
I tried this command: python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/mpi_io_test_with_dxt/treddy_mpi-io-test_id4373053_6-2-60198-9815401321915095332_1.darshan STDIO total_bytes 5
And got:
Statistical data of jobs: agg_perf_by_slowest agg_time_by_slowest category_counters ... unique_io_total_time_by_slowest unique_md_only_time_by_slowest unique_rw_only_time_by_slowest
0 16.47327 0.000094 [<cdata 'struct darshan_file_category_counters... ... 0.000094 0.0 0.000094
The columns look a bit misaligned with the data. That's an easy enough fix, I think:
--- a/darshan-util/pydarshan/darshan/cli/job_stats.py
+++ b/darshan-util/pydarshan/darshan/cli/job_stats.py
@@ -162,7 +162,7 @@ def main(args: Union[Any, None] = None):
com_dfs = combined_dfs(list_dfs)
combined_dfs_sort = sort_data_desc(com_dfs, order_by_colname)
combined_dfs_selected = first_n_recs(combined_dfs_sort, n)
- print("Statistical data of jobs:", combined_dfs_selected)
+ print("Statistical data of jobs:\n", combined_dfs_selected)

Statistical data of jobs:
agg_perf_by_slowest agg_time_by_slowest category_counters ... unique_io_total_time_by_slowest unique_md_only_time_by_slowest unique_rw_only_time_by_slowest
0 16.47327 0.000094 [<cdata 'struct darshan_file_category_counters... ... 0.000094 0.0 0.000094
That looks better!
I also tried running on all of the logs in the logs repo, just to cause trouble:
python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/**/*.darshan STDIO total_bytes 5
And ran into an error:
usage: darshan <command> job_stats [-h] log_path module order_by_colname number_of_rows
darshan <command> job_stats: error: argument number_of_rows: invalid int value: '/Users/treddy/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_and_dxt.darshan'
Curiously, if I run on that file directly it seems to work ok, so it may be related to the ** glob pattern vs. command line handling?
Although I don't know Shane/your full vision for the use cases just yet, here's a suggestion to get the regression testing started:
--- /dev/null
+++ b/darshan-util/pydarshan/darshan/tests/test_job_stats.py
@@ -0,0 +1,32 @@
+import argparse
+from unittest import mock
+
+from darshan.log_utils import get_log_path
+from darshan.cli import job_stats
+import pytest
+
+
+@pytest.mark.parametrize(
+ "argv", [
+ [get_log_path("e3sm_io_heatmap_only.darshan"),
+ "STDIO",
+ "total_bytes",
+ "5"],
+ ]
+)
+def test_job_stats(argv, capsys):
+ with mock.patch("sys.argv", argv):
+ # initialize the parser
+ parser = argparse.ArgumentParser(description="")
+ # run through setup_parser()
+ job_stats.setup_parser(parser=parser)
+ # parse the input arguments
+ args = parser.parse_args(argv)
+ job_stats.main(args=args)
+ captured = capsys.readouterr()
+ assert "3.258853" in captured.out
+ # NOTE: probably much better to use i.e.,
+ # pd.read_csv or similar on the string data
+ # and reconstitute a DataFrame where you could
+ # use i.e., assert_frame_equal for floating point
+ # comparisons, etc.

Also, if printing to the terminal is the "final medium" for output, we could look at libraries like https://github.com/Textualize/rich or pandas-related things built on top of that for some nice formatting, I suppose.
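As a rough illustration of that idea with Rich (a sketch only, not wired into job_stats; the column names and values are copied from the sample output above):

from rich.console import Console
from rich.table import Table

# build a formatted table for terminal output
table = Table(title="Statistical data of jobs")
table.add_column("agg_perf_by_slowest", justify="right")
table.add_column("agg_time_by_slowest", justify="right")
table.add_row("16.47327", "0.000094")

Console().print(table)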
I agree with Tyler's suggestions above, so if we could incorporate those I think that'd be great.
Same here. I think this is being caused by shell wildcard expansion -- your shell will automatically expand the pattern above into all of the possible Darshan log files. To this script, then, there will be a ton of log files followed by the remainder of the command line arguments, which causes the rest of the command line parsing to blow up, obviously. If you enclose your pattern above in quotes (and use your actual home directory name rather than

Thinking about it more, maybe we should just lean totally on the shell wildcard expansion and not worry about anything glob related in the python script. I think users will be tricked by this too, figuring out the right way to express a pattern to get through the shell all the way to python. This means the script would instead just need to accept a list of log files on the command line, with the user responsible for creating that list (using a shell glob or by manually specifying them). Any objections to that? (Sorry @Yanlilyu, I know this is what you originally started with, but this may have been a bad suggestion on my part...) If we go that direction, I'd suggest switching the remainder of the command line arguments from positional arguments to named arguments (e.g.,
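For instance, a minimal argparse sketch of what that could look like; the option names (--module, --order-by, --limit) and their defaults are placeholders here, not the actual arguments in job_stats.py:

import argparse

def setup_parser(parser: argparse.ArgumentParser) -> None:
    # accept one or more log files; the shell glob (or the user) supplies the list
    parser.add_argument("log_paths", nargs="+",
                        help="Darshan log file(s) to summarize")
    # remaining options as named arguments rather than positionals
    parser.add_argument("--module", default="POSIX",
                        help="Darshan module to aggregate (e.g. POSIX, STDIO)")
    parser.add_argument("--order-by", default="total_bytes",
                        help="column to sort by, in descending order")
    parser.add_argument("--limit", type=int, default=5,
                        help="number of rows to display")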
I agree, this would be nice, but we can always explore this later. I think the ultimate output format for this tool remains just shell output (it's really just a quick way to scan a bunch of log files for interesting jobs, it doesn't really need a super formal output format), so maybe that would be something like Rich rather than Pandas styling?
Oh, and BTW, after figuring out how to properly format the glob to account for all logs in the Darshan logs repo, I do run into an error that should be properly guarded against: not every log is guaranteed to have a given module included in the log file. If a module isn't found in a log file, we should assign it "N/A" values or something like that and put it at the end of the list. Does that seem reasonable?
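A rough sketch of that behavior, assuming each log contributes one row of derived stats; the rows and values below are purely illustrative placeholders:

import pandas as pd

# logs that don't contain the requested module contribute placeholder values
# instead of being dropped entirely
rows = [
    {"log": "a.darshan", "total_bytes": 1024, "agg_perf_by_slowest": 16.47327},
    {"log": "b.darshan", "total_bytes": None, "agg_perf_by_slowest": None},  # module absent
]
df = pd.DataFrame(rows)

# descending sort, with the "N/A" rows pushed to the end of the list
df = df.sort_values("total_bytes", ascending=False, na_position="last")
print(df)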
One final comment after looking this over: can we drop all columns from the derived statistics dataframe other than the ones we are concerned with (total_bytes, agg_perf_by_slowest, agg_time_by_slowest)? I don't think the others are useful to include, and dropping them will help make the output more digestible.
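That could be as simple as a column selection right before printing, assuming the combined DataFrame is the combined_dfs_selected variable from the diff above:

# keep only the columns of interest before printing
cols = ["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"]
combined_dfs_selected = combined_dfs_selected[cols]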
We moved this all into one PR: #954
Description:
This branch adds a log_based routine to help users find the most I/O-intensive jobs. The file is job_stats.py in the PyDarshan CLI tools.
It combines the aggregated data from multiple log files into a DataFrame, sorts the data in descending order by the column name the user specifies, and then keeps only the first n records (number_of_rows from user input). It returns a DataFrame with the n most I/O-intensive jobs (see the sketch after the usage example below).
User input includes log_path, module, order_by_colname, number_of_rows.
log_path should be a path glob pattern.
order_by_colname should be "agg_perf_by_slowest", "agg_time_by_slowest", or "total_bytes".
Example usage: $ python -m darshan job_stats darshan_logs/nonmpi_workflow/"worker*.darshan" STDIO total_bytes 5
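A minimal sketch of the combine/sort/filter flow described above, assuming each log has already been reduced to a one-row DataFrame of derived statistics; the function and variable names here are illustrative, not necessarily those in job_stats.py:

import pandas as pd

def top_n_jobs(per_log_dfs, order_by_colname, n):
    """Combine per-log stats, sort descending, and keep the n most I/O-intensive jobs."""
    combined = pd.concat(per_log_dfs, ignore_index=True)
    combined = combined.sort_values(order_by_colname, ascending=False)
    return combined.head(n)

# e.g., top_n_jobs(list_of_dataframes, "total_bytes", 5)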