
[Format] Do not crash on non-null terminated strings #131299

Open
wants to merge 3 commits into
base: main

ilya-biryukov
Contributor

The format API receives a StringRef, but crashes with the following assertion whenever that StringRef is not null-terminated:

FormatTests: llvm/lib/Support/MemoryBuffer.cpp:53:
void llvm::MemoryBuffer::init(const char *, const char *, bool):
Assertion `(!RequiresNullTerminator || BufEnd[0] == 0) && "Buffer is not null terminated!"' failed.

Ensure this does not happen by storing a copy of the input in a std::string, which always has a null terminator.

This change requires an extra copy of the Content in the SourceManagerForFile APIs, but the cost of that copy should be negligible in practice, as the API is designed for convenience rather than performance in the first place. For example, running clang-format over Content is much more expensive than copying Content itself.

This copy could be avoided in most cases if we provide a constructor that accepts std::string or null-terminated strings directly, but it does not seem worth the effort.

An alternative fix would be to teach SourceManager to work with non-null-terminated buffers, but given how much it is used, this would be very complicated and is likely to incur some performance cost.
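
The problem and the chosen fix can be illustrated with a minimal standalone sketch. It assumes `std::string_view` as a stand-in for `llvm::StringRef`, and the helper names below are hypothetical, not LLVM APIs. The check only mirrors the idea of the assertion in `MemoryBuffer::init`; it is not the real implementation.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for the null-terminator check: a buffer counts as
// null-terminated only if the byte one past its end is '\0'.
// Precondition: the byte at Buf.data()[Buf.size()] must be readable, which
// is not guaranteed for an arbitrary view.
inline bool endsWithNullTerminator(std::string_view Buf) {
  return Buf.data()[Buf.size()] == '\0';
}

// Returns a view of the first line of Text. When Text has more lines, the
// byte right after the view is '\n', not '\0'.
inline std::string_view firstLineOf(std::string_view Text) {
  return Text.substr(0, Text.find('\n'));
}

// The approach taken in this PR, in miniature: copy the view into a
// std::string, whose storage is always followed by a '\0'.
inline std::string makeNullTerminatedCopy(std::string_view Content) {
  return std::string(Content);
}
```

Taking the first line of a two-line literal yields a view that fails the check, while its `std::string` copy passes it.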

@ilya-biryukov ilya-biryukov requested a review from owenca March 14, 2025 10:17
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Mar 14, 2025
@llvmbot
Member

llvmbot commented Mar 14, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-format

Author: Ilya Biryukov (ilya-biryukov)


Full diff: https://github.com/llvm/llvm-project/pull/131299.diff

3 Files Affected:

  • (modified) clang/include/clang/Basic/SourceManager.h (+1)
  • (modified) clang/lib/Basic/SourceManager.cpp (+8-2)
  • (modified) clang/unittests/Format/FormatTest.cpp (+11)
diff --git a/clang/include/clang/Basic/SourceManager.h b/clang/include/clang/Basic/SourceManager.h
index e0f1ea435d54e..ec5803ff46290 100644
--- a/clang/include/clang/Basic/SourceManager.h
+++ b/clang/include/clang/Basic/SourceManager.h
@@ -2031,6 +2031,7 @@ class SourceManagerForFile {
   // The order of these fields are important - they should be in the same order
   // as they are created in `createSourceManagerForFile` so that they can be
   // deleted in the reverse order as they are created.
+  std::string ContentBuffer;
   std::unique_ptr<FileManager> FileMgr;
   std::unique_ptr<DiagnosticsEngine> Diagnostics;
   std::unique_ptr<SourceManager> SourceMgr;
diff --git a/clang/lib/Basic/SourceManager.cpp b/clang/lib/Basic/SourceManager.cpp
index b1f2180c1d462..4e351ec9089a9 100644
--- a/clang/lib/Basic/SourceManager.cpp
+++ b/clang/lib/Basic/SourceManager.cpp
@@ -2382,14 +2382,20 @@ size_t SourceManager::getDataStructureSizes() const {
 
 SourceManagerForFile::SourceManagerForFile(StringRef FileName,
                                            StringRef Content) {
+  // We copy to `std::string` for Content instead of StringRef because the
+  // SourceManager::getBufferData() works only with null-terminated buffers.
+  // And we still want to keep the API convenient.
+  ContentBuffer = Content.str();
+
   // This is referenced by `FileMgr` and will be released by `FileMgr` when it
   // is deleted.
   IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem> InMemoryFileSystem(
       new llvm::vfs::InMemoryFileSystem);
+
   InMemoryFileSystem->addFile(
       FileName, 0,
-      llvm::MemoryBuffer::getMemBuffer(Content, FileName,
-                                       /*RequiresNullTerminator=*/false));
+      llvm::MemoryBuffer::getMemBuffer(ContentBuffer, FileName,
+                                       /*RequiresNullTerminator=*/true));
   // This is passed to `SM` as reference, so the pointer has to be referenced
   // in `Environment` so that `FileMgr` can out-live this function scope.
   FileMgr =
diff --git a/clang/unittests/Format/FormatTest.cpp b/clang/unittests/Format/FormatTest.cpp
index 9864e7ec1b2ec..54d0b13ab35c0 100644
--- a/clang/unittests/Format/FormatTest.cpp
+++ b/clang/unittests/Format/FormatTest.cpp
@@ -29096,6 +29096,17 @@ TEST_F(FormatTest, BreakBeforeClassName) {
                "    ArenaSafeUniquePtr {};");
 }
 
+TEST_F(FormatTest, DoesNotCrashOnNonNullTerminatedStringRefs) {
+  llvm::StringRef TwoLines = "namespace foo {}\n"
+                             "namespace bar {}";
+  llvm::StringRef FirstLine =
+      TwoLines.take_until([](char c) { return c == '\n'; });
+
+  // The internal API used to crash when passed a non-null-terminated StringRef.
+  // Check this does not happen anymore.
+  verifyFormat(FirstLine);
+}
+
 } // namespace
 } // namespace test
 } // namespace format

@ilya-biryukov
Contributor Author

I am going with the path of least resistance in this change and I would welcome any concerns or alternative suggestions.
It definitely has trade-offs as we now incur an extra copy, but this seems acceptable for a convenience API that SourceManagerForFile aims to be.

@cor3ntin
Contributor

It would be useful to have a repro or a stack trace here.
In particular, in SourceManagerForFile, RequiresNullTerminator is false, so the assert should not fire on a non-null-terminated file.

@ilya-biryukov
Contributor Author

> It would be useful to have a repro or a stack trace here. In particular, in SourceManagerForFile, RequiresNullTerminator is false, so the assert should not fire on a non-null-terminated file.

The repro is attached to the commit. Here is the full stack trace:

Stacktrace
FormatTests: /usr/local/google/home/ibiryukov/code/llvm-project/llvm/lib/Support/MemoryBuffer.cpp:53: void llvm::MemoryBuffer::init(const char *, const char *, bool): Assertion `(!RequiresNullTerminator || BufEnd[0] == 0) && "Buffer is not null terminated!"' failed.
 #0 0x00005650ca150618 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /usr/local/google/home/ibiryukov/code/llvm-project/llvm/lib/Support/Unix/Signals.inc:804:13
 #1 0x00005650ca14e5ac llvm::sys::RunSignalHandlers() /usr/local/google/home/ibiryukov/code/llvm-project/llvm/lib/Support/Signals.cpp:106:18
 #2 0x00005650ca150de1 SignalHandler(int, siginfo_t*, void*) /usr/local/google/home/ibiryukov/code/llvm-project/llvm/lib/Support/Unix/Signals.inc:0:3
 #3 0x00007f7861e49e20 (/lib/x86_64-linux-gnu/libc.so.6+0x3fe20)
 #4 0x00007f7861e9de5c __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007f7861e49d82 raise ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x00007f7861e324f0 abort ./stdlib/abort.c:81:7
 #7 0x00007f7861e32418 _nl_load_domain ./intl/loadmsgcat.c:1177:9
 #8 0x00007f7861e42692 (/lib/x86_64-linux-gnu/libc.so.6+0x38692)
 #9 0x00005650ca10b341 (tools/clang/unittests/Format/FormatTests+0x904341)
#10 0x00005650ca1264d1 ErrorOr<std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> > > /usr/local/google/home/ibiryukov/code/llvm-project/llvm/include/llvm/Support/ErrorOr.h:89:9
#11 0x00005650ca1264d1 llvm::vfs::detail::(anonymous namespace)::InMemoryFileAdaptor::getBuffer(llvm::Twine const&, long, bool, bool) /usr/local/google/home/ibiryukov/code/llvm-project/llvm/lib/Support/VirtualFileSystem.cpp:753:12
#12 0x00005650ca120166 ~ErrorOr /usr/local/google/home/ibiryukov/code/llvm-project/llvm/include/llvm/Support/ErrorOr.h:140:10
#13 0x00005650ca120166 llvm::vfs::FileSystem::getBufferForFile(llvm::Twine const&, long, bool, bool, bool) /usr/local/google/home/ibiryukov/code/llvm-project/llvm/lib/Support/VirtualFileSystem.cpp:126:1
#14 0x00005650ca18cbff clang::FileManager::getBufferForFileImpl(llvm::StringRef, long, bool, bool, bool) const /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Basic/FileManager.cpp:578:1
#15 0x00005650ca18caa8 clang::FileManager::getBufferForFile(clang::FileEntryRef, bool, bool, std::optional<long>, bool) /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Basic/FileManager.cpp:564:1
#16 0x00005650ca198fb6 operator bool /usr/local/google/home/ibiryukov/code/llvm-project/llvm/include/llvm/Support/ErrorOr.h:146:13
#17 0x00005650ca198fb6 clang::SrcMgr::ContentCache::getBufferOrNone(clang::DiagnosticsEngine&, clang::FileManager&, clang::SourceLocation) const /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Basic/SourceManager.cpp:133:8
#18 0x00005650ca19cd87 _M_is_engaged /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/optional:469:58
#19 0x00005650ca19cd87 operator bool /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/optional:983:22
#20 0x00005650ca19cd87 clang::SourceManager::getBufferDataOrNone(clang::FileID) const /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Basic/SourceManager.cpp:783:14
#21 0x00005650ca19ccb7 clang::SourceManager::getBufferData(clang::FileID, bool*) const /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Basic/SourceManager.cpp:0:0
#22 0x00005650ca204612 fatalError /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Format/TokenAnalyzer.cpp:52:36
#23 0x00005650ca204612 clang::format::Environment::make(llvm::StringRef, llvm::StringRef, llvm::ArrayRef<clang::tooling::Range>, unsigned int, unsigned int, unsigned int) /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Format/TokenAnalyzer.cpp:74:13
#24 0x00005650ca1adf03 operator bool /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/unique_ptr.h:481:22
#25 0x00005650ca1adf03 clang::format::internal::reformat(clang::format::FormatStyle const&, llvm::StringRef, llvm::ArrayRef<clang::tooling::Range>, unsigned int, unsigned int, unsigned int, llvm::StringRef, clang::format::FormattingAttemptStatus*) /usr/local/google/home/ibiryukov/code/llvm-project/clang/lib/Format/Format.cpp:3756:8

The problem here is not SourceManagerForFile; rather, SourceManager::getBufferDataOrNone always passes RequiresNullTerminator = true in this call.
Changing that seems hard, as I've noted in the PR description.

PS I am not sure why we need RequiresNullTerminator in the FileManager interface in the first place, but it is quite contagious and complicating other APIs with this parameter is something that I would really aim to avoid (including the SourceManager API).

// We copy to `std::string` for Content instead of StringRef because the
// SourceManager::getBufferData() works only with null-terminated buffers.
// And we still want to keep the API convenient.
ContentBuffer = Content.str();
Collaborator

This is a potentially huge copy (consider use cases like unity builds), which makes me pretty uncomfortable. Have you measured the impact of this change on large source files?

Contributor Author

Looking through the uses of SourceManagerForFile, we use it for:

  • clang-format,
  • some functions in Clangd reading the current file,
  • tests.

I would not worry about tests for obvious reasons.
Regarding Clangd, I can also say with high confidence that it would never make a difference. If someone opens a large file there, we'll see a ton of copies when LSP passes us this file, and anything will pale in comparison with the memory and compute requirements of Clang itself.

With respect to clang-format, I am also fairly confident that any large file will take a lot more time and memory to process and that this copy will not be noticeable. I will get a large mock file, run some benchmarks, and get back to you.

I am not sure in which context unity builds would be a useful use-case to support for either clang-format or Clangd, do you have any use-cases in mind that I am missing?

Contributor Author

I tried benchmarking clang-format on it and realized the clang-format binary is also using a different function here and this change makes no difference to it at all.

So we're left only with clangd and various other source tools that use clang-format as the final step of source code transformations, as well as downstream uses of the reformat API. I can't come up with an easy way to benchmark those, but also don't expect this would make a difference.

If you could give an example use-case that you are worried about, I can spend some effort benchmarking it. But I personally feel that changing this particular API to something that does an extra copy is a very low-risk change and fits the purpose of the API (i.e. convenience costs some performance, so being a little slow is preferable to crashing).

Contributor

My IDE (QtCreator) uses libformat regularly while typing. I would also prefer not to perform copies all over the place.

How about storing a StringRef and assigning Content to it if it ends with a zero, and only performing the copy if it doesn't?

Contributor Author

> How about storing a StringRef and assigning Content to it if it ends with a zero, and only performing the copy if it doesn't?

But we cannot check if StringRef is null-terminated because given StringRef S, accessing S[S.length()] is UB in the general case. If there's a way to do this that's not UB, we could do that.

I think the easiest way to avoid copies would be to switch to const char * or std::string in the APIs to make it explicit we need null-terminated strings. That requires a refactoring of the callers, but ends up being as efficient as it is now.

> My IDE (QtCreator) uses libformat regularly while typing. I would also prefer not to perform copies all over the place.

But what are the actual costs of this particular copy for the overall performance of the IDE?
I expect it to be negligible, even though I fully sympathize with the idea that we want to avoid unnecessary copies. I just don't think we should do this at the expense of exposing an API that has UB.
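
As a rough sketch of the "switch to std::string in the API" idea, assuming a hypothetical `OwnedSource` class rather than the real `SourceManagerForFile`: taking `std::string` by value makes the null-termination requirement part of the signature. Callers that already own a string can move it in, and callers that only have a view must spell out the copy themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical holder used only for illustration; not an LLVM or Clang class.
class OwnedSource {
public:
  // Taking std::string by value: an rvalue argument is moved in without
  // copying the characters, while an lvalue argument is copied once.
  explicit OwnedSource(std::string Content) : Buffer(std::move(Content)) {}

  // Always points at a null-terminated buffer, because std::string
  // guarantees a '\0' after its last character.
  const char *c_str() const { return Buffer.c_str(); }
  std::size_t size() const { return Buffer.size(); }

private:
  std::string Buffer;
};
```

With this shape, a call such as `OwnedSource S{std::string(View)};` shows the copy at the call site, and `OwnedSource S(std::move(Text));` avoids it. Whether refactoring the real callers is worth it is the open question in this thread.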

Contributor

> > How about storing a StringRef and assigning Content to it if it ends with a zero, and only performing the copy if it doesn't?
>
> But we cannot check if StringRef is null-terminated because given StringRef S, accessing S[S.length()] is UB in the general case. If there's a way to do this that's not UB, we could do that.
>
> I think the easiest way to avoid copies would be to switch to const char * or std::string in the APIs to make it explicit we need null-terminated strings. That requires a refactoring of the callers, but ends up being as efficient as it is now.

Noted.

> > My IDE (QtCreator) uses libformat regularly while typing. I would also prefer not to perform copies all over the place.
>
> But what are the actual costs of this particular copy for the overall performance of the IDE? I expect it to be negligible, even though I fully sympathize with the idea that we want to avoid unnecessary copies. I just don't think we should do this at the expense of exposing an API that has UB.

I mainly think about my battery, when I code while being mobile.

Contributor Author

> I mainly think about my battery, when I code while being mobile.

But this extra copy is likely a very small portion of the reformatting cost. Performance and battery go together here.
Am I missing a reason why that would not be the case?

Contributor Author

@ilya-biryukov ilya-biryukov Mar 18, 2025

Is there any particular direction folks want to take this?

Having an API that breaks on non-null-terminated StringRef looks bad, but I don't see easy options to fix this other than a copy.

I don't expect this to give any noticeable performance regressions, and I don't have a clear sense of whether the folks in the comment threads agree or disagree.

Contributor

I don't like it, but it's better than potentially crashing software.

But I'm not comfortable enough to approve something outside libFormat or clang-format.

@@ -2031,6 +2031,7 @@ class SourceManagerForFile {
// The order of these fields are important - they should be in the same order
// as they are created in `createSourceManagerForFile` so that they can be
// deleted in the reverse order as they are created.
std::string ContentBuffer;
Contributor

This should not go after the comment, should it?

Contributor Author

FileManager and SourceManager will be referencing the contents of this buffer, so I think it's best to destroy them before this string.

Hence, the order is important, but maybe I should clarify the comment and mention the Buffer explicitly?
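
The ordering argument relies on the C++ rule that non-static data members are destroyed in reverse order of declaration. A minimal sketch, using hypothetical `Tracked` and `Holder` types that only borrow the member names from this PR:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Appends its name to a caller-supplied log when destroyed, so the
// destruction order can be observed.
struct Tracked {
  std::string Name;
  std::vector<std::string> *Log;
  ~Tracked() { Log->push_back(Name); }
};

// Hypothetical layout mirroring the idea above: the buffer is declared
// first, so it is destroyed last, after the members that refer to it.
struct Holder {
  Tracked ContentBuffer;
  Tracked FileMgr;
  Tracked SourceMgr;

  explicit Holder(std::vector<std::string> *Log)
      : ContentBuffer{"ContentBuffer", Log}, FileMgr{"FileMgr", Log},
        SourceMgr{"SourceMgr", Log} {}
};
```

Destroying a `Holder` logs `SourceMgr`, then `FileMgr`, then `ContentBuffer`, which is why declaring the string before the managers keeps it alive for as long as they may reference it.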

Contributor

I think that would be useful.

Contributor

@owenca owenca left a comment

Can you add a reproducer?

@ilya-biryukov
Contributor Author

> Can you add a reproducer?

I'm not sure I can do better than the test I've added.
My colleague caught this by accidentally getting an assertion failure when using this API downstream.
The test illustrates how it got used, and I'm not sure whether the clang-format binary or any other upstream tool currently crashes with this. (clang-format itself uses a separate code path entirely; the other tools I checked always pass null-terminated strings.)

Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category clang-format
6 participants