Commit 9ffcca7

sraikund16 authored and pytorchmergebot committed
[Profiler] Handle Tensor Sizes/Strides Parsing Error (pytorch#134862)
Summary: Some jobs are currently hitting the trace in P1539415198. It suggests that tensor parsing in the profiler can reach an invalid address. A possible cause is that the sizes() and strides() of a Tensor do not always have the same number of dimensions, which is what we assume when iterating through the shapes in the IValue generator. While browsing some of the tensor implementations, I found that some of the size and stride paths differ, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such cases without bringing down the main thread. If the crashes persist, this change still gives us a data point on where they occur and lets us rule out strides/sizes as the culprit.

Test Plan: This change doesn't break anything in the happy path; it only makes sure the bad path does not exit abruptly. We should use it to debug which events have mismatching dimensions between sizes and strides.

Differential Revision: D62008788

Pull Request resolved: pytorch#134862

Approved by: https://github.com/aaronenyeshi
1 parent f05b716 commit 9ffcca7

3 files changed, +29 -6 lines changed

torch/csrc/profiler/collection.cpp

+26 -4
@@ -33,11 +33,18 @@ RawTensorMetadataBase::RawTensorMetadataBase(const at::Tensor& t)
     : data_{t.has_storage() ? t.storage().data() : nullptr},
       dtype_{t.scalar_type()},
       layout_{t.layout()},
-      dim_{static_cast<uint32_t>(t.sizes().size())} {
+      size_dim_{static_cast<uint32_t>(t.sizes().size())},
+      stride_dim_{static_cast<uint32_t>(t.strides().size())} {
   TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
       t.sizes().size() <= std::numeric_limits<uint32_t>::max(),
       "Cannot profile Tensors of size > uint32 max. Got dim: ",
       t.sizes().size());
+  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
+      t.sizes().size() != t.strides().size(),
+      "Tensor has mismatching sizes and strides. Sizes: ",
+      t.sizes().size(),
+      " Strides: ",
+      t.strides().size());
 }
 
 RawTensorMetadata::RawTensorMetadata(const at::Tensor& t)
@@ -181,14 +188,29 @@ auto InputOutputEncoder::getIValueGenerator(const IOType& io_type) {
           ivals_it = ivalues_.begin(),
           io_type]() mutable {
     auto decode_tensor = [&]() -> TensorMetadata {
-      const auto& raw_metadata = *tensor_metadata_it++;
       std::vector<int64_t> sizes;
       std::vector<int64_t> strides;
-      for (C10_UNUSED const auto _ : c10::irange(raw_metadata.dim_)) {
+      if (tensor_metadata_it.exhausted()) {
+        LOG(WARNING)
+            << "Tensor metadata exhausted prematurely. Reported shapes may be inaccurate!";
+        return {RawTensorMetadata(), sizes, strides};
+      }
+      const auto& raw_metadata = *tensor_metadata_it++;
+      for (C10_UNUSED const auto _ : c10::irange(raw_metadata.size_dim_)) {
+        if (tensor_size_strides_it.exhausted()) {
+          LOG(WARNING)
+              << "Expected Tensor Size mismatch with raw Tensor metadata. Reported shapes may be inaccurate!";
+          return {raw_metadata, sizes, strides};
+        }
         sizes.push_back(*tensor_size_strides_it++);
       }
       if (raw_metadata.layout_ == at::kStrided) {
-        for (C10_UNUSED const auto _ : c10::irange(raw_metadata.dim_)) {
+        for (C10_UNUSED const auto _ : c10::irange(raw_metadata.stride_dim_)) {
+          if (tensor_size_strides_it.exhausted()) {
+            LOG(WARNING)
+                << "Expected Tensor Strides mismatch with raw Tensor metadata. Reported shapes may be inaccurate!";
+            return {raw_metadata, sizes, strides};
+          }
           strides.push_back(*tensor_size_strides_it++);
         }
       }
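
The guarded decode in this hunk can be sketched outside of PyTorch as follows. This is only an illustration under stated assumptions: `BoundedCursor`, `DecodedShape`, and `decodeShape` are hypothetical names that do not exist in the profiler, and a plain `std::vector` stands in for the profiler's internal buffer. The idea shown is the same one the diff applies: check for exhaustion before every read and return whatever was decoded so far instead of reading past the end.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the profiler's buffer iterator: a cursor over a
// flat vector that can report when it has run out of values.
class BoundedCursor {
 public:
  explicit BoundedCursor(const std::vector<int64_t>& data) : data_(data) {}

  bool exhausted() const { return pos_ >= data_.size(); }

  // Caller must check exhausted() first.
  int64_t next() { return data_[pos_++]; }

 private:
  const std::vector<int64_t>& data_;
  std::size_t pos_ = 0;
};

struct DecodedShape {
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
  bool truncated = false;  // true when the buffer ended before the recorded counts
};

// Reads `size_dim` sizes followed by `stride_dim` strides. If the buffer holds
// fewer values than recorded, decoding stops early and the partial result is
// returned with `truncated` set.
inline DecodedShape decodeShape(
    BoundedCursor& cursor, uint32_t size_dim, uint32_t stride_dim) {
  DecodedShape out;
  for (uint32_t i = 0; i < size_dim; ++i) {
    if (cursor.exhausted()) {
      out.truncated = true;
      return out;
    }
    out.sizes.push_back(cursor.next());
  }
  for (uint32_t i = 0; i < stride_dim; ++i) {
    if (cursor.exhausted()) {
      out.truncated = true;
      return out;
    }
    out.strides.push_back(cursor.next());
  }
  return out;
}
```

With a buffer of `{2, 3, 3, 1}` and both counts equal to 2, the sketch yields two sizes and two strides. With a buffer one value short, it returns the values it could read and sets `truncated`, which plays the role of the `LOG(WARNING)` calls in the diff.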

torch/csrc/profiler/collection.h

+2 -1
@@ -47,7 +47,8 @@ struct TORCH_API RawTensorMetadataBase {
   StorageImplData data_;
   c10::ScalarType dtype_{c10::ScalarType::Undefined};
   c10::Layout layout_{c10::Layout::Strided};
-  uint32_t dim_{0};
+  uint32_t size_dim_{0};
+  uint32_t stride_dim_{0};
 };
 
 // Collected during profiling.
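
Splitting `dim_` into two fields lets the two ranks be recorded and compared independently. A minimal sketch of that idea, assuming plain `std::vector` inputs in place of `at::Tensor` and using the hypothetical names `ShapeRanks`, `recordRanks`, and `ranksAgree` (none of which are part of PyTorch):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified analogue of the two new fields: the rank seen
// through sizes and the rank seen through strides are stored separately.
struct ShapeRanks {
  uint32_t size_dim{0};
  uint32_t stride_dim{0};
};

// Records each rank from its own container rather than assuming they match.
inline ShapeRanks recordRanks(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  ShapeRanks ranks;
  ranks.size_dim = static_cast<uint32_t>(sizes.size());
  ranks.stride_dim = static_cast<uint32_t>(strides.size());
  return ranks;
}

// The invariant the decode path used to rely on implicitly.
inline bool ranksAgree(const ShapeRanks& ranks) {
  return ranks.size_dim == ranks.stride_dim;
}
```

Because assertion macros fire when their condition is false, a debug check built on this sketch would assert `ranksAgree(ranks)`, the invariant that is expected to hold, and report both ranks when it does not.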

torch/csrc/profiler/python/init.cpp

+1 -1
@@ -441,7 +441,7 @@ void initPythonBindings(PyObject* module) {
           return py::reinterpret_borrow<py::object>(
               torch::autograd::utils::wrap(metadata.dtype_));
         })
-      .def_readonly("dim", &TensorMetadata::dim_)
+      .def_readonly("dim", &TensorMetadata::size_dim_)
       .def_readonly("sizes", &TensorMetadata::sizes_)
       .def_readonly("strides", &TensorMetadata::strides_);
 