-
Notifications
You must be signed in to change notification settings - Fork 13.3k
[Format] Do not crash on non-null terminated strings #131299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The `format` API receives a StringRef, but crashes whenever it is non-null-terminated with the corresponding assertion: ``` FormatTests: llvm/lib/Support/MemoryBuffer.cpp:53: void llvm::MemoryBuffer::init(const char *, const char *, bool): Assertion `(!RequiresNullTerminator || BufEnd[0] == 0) && "Buffer is not null terminated!"' failed. ``` Ensure this does not happen by storing a copy of the inputs in `std::string` that does have a null terminator. This changes requires an extra copy of the `Content` in the `SourceManagerForFile` APIs, but the costs of that copy should be negligible in practice as the API is designed for convenience rather than performance in the first place. E.g. running clang-format over `Content` is much more expensive than the copy of the Content itself. This copy could be avoided in most cases if we provide a constructor that accepts `std::string` or null-terminated strings directly, but it does not seem worth the effort. An alternative fix would be to teach `SourceManager` to work with non-null-terminated buffers, but given how much it is used, this would be very complicated and is likely to incur some performance cost.
@llvm/pr-subscribers-clang @llvm/pr-subscribers-clang-format Author: Ilya Biryukov (ilya-biryukov) ChangesThe
Ensure this does not happen by storing a copy of the inputs in This changes requires an extra copy of the This copy could be avoided in most cases if we provide a constructor that accepts An alternative fix would be to teach Full diff: https://github.com/llvm/llvm-project/pull/131299.diff 3 Files Affected:
diff --git a/clang/include/clang/Basic/SourceManager.h b/clang/include/clang/Basic/SourceManager.h
index e0f1ea435d54e..ec5803ff46290 100644
--- a/clang/include/clang/Basic/SourceManager.h
+++ b/clang/include/clang/Basic/SourceManager.h
@@ -2031,6 +2031,7 @@ class SourceManagerForFile {
// The order of these fields are important - they should be in the same order
// as they are created in `createSourceManagerForFile` so that they can be
// deleted in the reverse order as they are created.
+ std::string ContentBuffer;
std::unique_ptr<FileManager> FileMgr;
std::unique_ptr<DiagnosticsEngine> Diagnostics;
std::unique_ptr<SourceManager> SourceMgr;
diff --git a/clang/lib/Basic/SourceManager.cpp b/clang/lib/Basic/SourceManager.cpp
index b1f2180c1d462..4e351ec9089a9 100644
--- a/clang/lib/Basic/SourceManager.cpp
+++ b/clang/lib/Basic/SourceManager.cpp
@@ -2382,14 +2382,20 @@ size_t SourceManager::getDataStructureSizes() const {
SourceManagerForFile::SourceManagerForFile(StringRef FileName,
StringRef Content) {
+ // We copy to `std::string` for Context instead of StringRef because the
+ // SourceManager::getBufferData() works only with null-terminated buffers.
+ // And we still want to keep the API convenient.
+ ContentBuffer = Content.str();
+
// This is referenced by `FileMgr` and will be released by `FileMgr` when it
// is deleted.
IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem> InMemoryFileSystem(
new llvm::vfs::InMemoryFileSystem);
+
InMemoryFileSystem->addFile(
FileName, 0,
- llvm::MemoryBuffer::getMemBuffer(Content, FileName,
- /*RequiresNullTerminator=*/false));
+ llvm::MemoryBuffer::getMemBuffer(ContentBuffer, FileName,
+ /*RequiresNullTerminator=*/true));
// This is passed to `SM` as reference, so the pointer has to be referenced
// in `Environment` so that `FileMgr` can out-live this function scope.
FileMgr =
diff --git a/clang/unittests/Format/FormatTest.cpp b/clang/unittests/Format/FormatTest.cpp
index 9864e7ec1b2ec..54d0b13ab35c0 100644
--- a/clang/unittests/Format/FormatTest.cpp
+++ b/clang/unittests/Format/FormatTest.cpp
@@ -29096,6 +29096,17 @@ TEST_F(FormatTest, BreakBeforeClassName) {
" ArenaSafeUniquePtr {};");
}
+TEST_F(FormatTest, DoesNotCrashOnNonNullTerminatedStringRefs) {
+ llvm::StringRef TwoLines = "namespace foo {}\n"
+ "namespace bar {}";
+ llvm::StringRef FirstLine =
+ TwoLines.take_until([](char c) { return c == '\n'; });
+
+ // The internal API used to crash when passed a non-null-terminated StringRef.
+ // Check this does not happen anymore.
+ verifyFormat(FirstLine);
+}
+
} // namespace
} // namespace test
} // namespace format
|
I am going with the path of least resistance in this change and I would welcome any concerns or alternative suggestions. |
It would be useful to have a repro or a stack trace here. |
The repro is attached to the commit. Here is the full stack trace: Stacktrace
It is not PS I am not sure why we need |
// We copy to `std::string` for Context instead of StringRef because the | ||
// SourceManager::getBufferData() works only with null-terminated buffers. | ||
// And we still want to keep the API convenient. | ||
ContentBuffer = Content.str(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potentially huge copy (consider use cases like unity builds), which makes me pretty uncomfortable. Have you measured the impact of this change on large source files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the uses of SourceManagerForFile
, we use it for:
- clang-format,
- some functions in Clangd reading the current file,
- tests.
I would not worry about tests for obvious reasons.
Re Clangd I can also say with high confidence it would never make a difference. If someone opens a large file there, we'll see a ton of copies when LSP passes us this file and anything will bleak in comparison with the memory and compute requirements of Clang itself.
Wrt to clang-format, I am also fairly confident that any large file will take a lot more time and memory to process and this copy will not be noticeable. I will get some large mock file and run some benchmarks for sure and get back to you.
I am not sure in which context unity builds would be a useful use-case to support for either clang-format or Clangd, do you have any use-cases in mind that I am missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried benchmarking clang-format on it and realized the clang-format binary is also using a different function here and this change makes no difference to it at all.
So we're left only with clangd and various other source tools that use clang-format as the final step of source code transformations, as well as downstream uses of the reformat
API. I can't come up with an easy way to benchmark those, but also don't expect this would make a difference.
If you could give an example use-case that you are worried about, I can spend some effort benchmarking it. But I personally feel changing this particular API to something that does an extra copy is a very low risk change and fits with the purpose of the API (i.e. convenience costs some performance, so being a little slow is preferable to crashing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My IDE (QtCreator) does uses libformat regularly while typing. I would also prefer not to perform copies all over the place.
How about storing a StringRef and assign Content
if it ends with a zero, and only perform the copy if it doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about storing a StringRef and assign Content if it ends with a zero, and only perform the copy if it doesn't?
But we cannot check if StringRef
is null-terminated because given StringRef S
, accessing S[S.length()]
is UB in the general case. If there's a way to do this that's not UB, we could do that.
I think the easiest way to avoid copies would be to switch to const char *
or std::string
in the APIs to make it explicit we need null-terminated strings. That requires a refactoring of the callers, but ends up being as efficient as it is now.
**My IDE (QtCreator) does uses libformat regularly while typing. I would also prefer not to perform copies all over the place.
But what are the actual costs of this particular copy for the overall performance of the IDE?
I expect it to be negligible, even though I fully sympathize with the idea that we want to avoid unnecessary copies. I just don't think we should do this at the expense of exposing API that have UB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about storing a StringRef and assign Content if it ends with a zero, and only perform the copy if it doesn't?
But we cannot check if
StringRef
is null-terminated because givenStringRef S
, accessingS[S.length()]
is UB in the general case. If there's a way to do this that's not UB, we could do that.I think the easiest way to avoid copies would be to switch to
const char *
orstd::string
in the APIs to make it explicit we need null-terminated strings. That requires a refactoring of the callers, but ends up being as efficient as it is now.
Noted.
**My IDE (QtCreator) does uses libformat regularly while typing. I would also prefer not to perform copies all over the place.
But what are the actual costs of this particular copy for the overall performance of the IDE? I expect it to be negligible, even though I fully sympathize with the idea that we want to avoid unnecessary copies. I just don't think we should do this at the expense of exposing API that have UB.
I mainly think about my battery, when I code while being mobile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly think about my battery, when I code while being mobile.
But this extra copy is likely a really-really small portion of the reformatting cost. Performance and battery go together here.
Is there anything I'm missing why it would not be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any particular direction folks want to take this?
Having an API that breaks on non-null-terminated StringRef
looks bad, but and I don't see easy options to fix this other than a copy.
I don't expect this to give any noticeable performance regressions, and I don't have a clear sense of whether the folks in the comment threads agree or disagree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like it, but it's better than potentially crashing software.
But I'm not comfortable enough to approve something outside libFormat or clang-format.
@@ -2031,6 +2031,7 @@ class SourceManagerForFile { | |||
// The order of these fields are important - they should be in the same order | |||
// as they are created in `createSourceManagerForFile` so that they can be | |||
// deleted in the reverse order as they are created. | |||
std::string ContentBuffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not go after the comment.
Or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FileManager and SourceManager will be referencing the contents of this buffer, so I think it's best to destroy them before this string
.
Hence, the order is important, but maybe I should clarify the comment and mention the Buffer
explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be useful.
// We copy to `std::string` for Context instead of StringRef because the | ||
// SourceManager::getBufferData() works only with null-terminated buffers. | ||
// And we still want to keep the API convenient. | ||
ContentBuffer = Content.str(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My IDE (QtCreator) does uses libformat regularly while typing. I would also prefer not to perform copies all over the place.
How about storing a StringRef and assign Content
if it ends with a zero, and only perform the copy if it doesn't?
Use `verifyNoCrash`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a reproducer?
I'm not sure I can do better than the test I've added. |
The
format
API receives a StringRef, but crashes whenever it is non-null-terminated with the corresponding assertion:Ensure this does not happen by storing a copy of the inputs in
std::string
that does have a null terminator.This changes requires an extra copy of the
Content
in theSourceManagerForFile
APIs, but the costs of that copy should be negligible in practice as the API is designed for convenience rather than performance in the first place. E.g. running clang-format overContent
is much more expensive than the copy of the Content itself.This copy could be avoided in most cases if we provide a constructor that accepts
std::string
or null-terminated strings directly, but it does not seem worth the effort.An alternative fix would be to teach
SourceManager
to work with non-null-terminated buffers, but given how much it is used, this would be very complicated and is likely to incur some performance cost.