"# Align GPU timestamps across profiles collected by different Nsight Systems processes\n",
@@ -82,16 +95,14 @@
 "source": [
 "This data frame has a three-level index:\n",
 "- `ProgramId` is an integer ID that uniquely identifies the XLA module\n",
-"- This is the `ProgramExecution`-th execution of the module within the profiles. You may see this starting from 1, not 0, because of the `warmup_removal_heuristics` option passed to `load_profiler_data`.\n",
+"- This is the `ProgramExecution`-th execution of the module within the profiles. You may see this starting from 2, not 0, because of the `warmup_removal_heuristics` option passed to `load_profiler_data`.\n",
 "- `Device` is the global (across multiple nodes and processes) index of the GPU on which the module execution took place\n",
 "\n",
 "The columns are as follows:\n",
 "- `Name`: the name of the XLA module; this should always be the same for a given `ProgramId`\n",
 "- `NumThunks`: the number of thunks executed inside this module execution\n",
 "- `ProjStartMs`: the timestamp of the start of the module execution on the GPU, in milliseconds\n",
 "- `ProjDurMs`: the duration of the module execution on the GPU, in milliseconds\n",
-"- `OrigStartMs`: the timestamp of the start of the module launch **on the host**, in milliseconds. *i.e.* `ProjStartMs-OrigStartMs` is something like the launch latency of the first kernel\n",
-"- `OrigDurMs`: the duration of the module launch **on the host**, in milliseconds\n",
 "- `LocalDevice`: the index within the node/slice of the GPU on which the module execution took place\n",
 "- `Process`: the global (across multiple nodes) index of the process\n",
 "- `Slice`: the global index of the node/slice; devices within the same node/slice should have faster interconnects than to devices in different slices\n",
@@ -117,13 +128,13 @@
 "id": "7727d800-13d3-4505-89e8-80a5fed63512",
 "metadata": {},
 "source": [
-"Here the index has four levels. `ProgramId`, `ProgramExecution` and `Device` have the same meanings as in `module_df`.\n",
+"Here the index has four levels. `ProgramId`, `ProgramExecution` and `Device` have the same meanings as in `steady_state.module`.\n",
 "The fourth level (in the 3rd position) shows that this row is the `ThunkIndex`-th thunk within the `ProgramExecution`-th execution of XLA module `ProgramId`.\n",
 "Note that a given thunk can be executed multiple times within the same module, so indexing on the thunk name would not be unique.\n",
 "\n",
 "The columns are as follows:\n",
 "- `Name`: the name of the thunk; this should be unique within a given `ProgramId` and can be used as a key to look up XLA metadata\n",
-"- `ProjStartMs`, `OrigStartMs`, `OrigDurMs`: see above, same meaning as in `module_df`.\n",
+"- `ProjStartMs`: see above, same meaning as in `steady_state.module`.\n",
 "- `Communication`: does this thunk represent communication between GPUs (*i.e.* a NCCL collective)? XLA overlaps communication and computation kernels, and `load_profiler_data` triggers an overlap calculation. `ProjDurMs` for a communication kernel shows only the duration that was **not** overlapped with computation kernels, while `ProjDurHiddenMs` shows the duration that **was** overlapped.\n",
 "- This is the `ThunkExecution`-th execution of this thunk for this `(ProgramId, ProgramExecution, Device)`\n",
 "\n",
@@ -299,7 +310,7 @@
 "# Print out the largest entries adding up to at least this fraction of the total\n",