[Format] Do not crash on non-null terminated strings #131299

Open
wants to merge 3 commits into
base: main
1 change: 1 addition & 0 deletions clang/include/clang/Basic/SourceManager.h
@@ -2031,6 +2031,7 @@ class SourceManagerForFile {
// The order of these fields are important - they should be in the same order
// as they are created in `createSourceManagerForFile` so that they can be
// deleted in the reverse order as they are created.
std::string ContentBuffer;
Contributor

This should not go after the comment.
Or?

Contributor Author

FileManager and SourceManager will be referencing the contents of this buffer, so I think it's best to destroy them before this string.

Hence, the order is important, but maybe I should clarify the comment and mention the Buffer explicitly?

Contributor

I think that would be useful.
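
The ordering argument in this thread rests on the C++ rule that non-static data members are destroyed in reverse order of declaration. Below is a minimal sketch of that rule, not the actual Clang classes: `OwnedBuffer`, `BufferUser`, `Holder`, and `observeDestructionOrder` are hypothetical stand-ins for `ContentBuffer`, the managers that reference it, and `SourceManagerForFile`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Records the order in which destructors run.
static std::vector<std::string> &destructionLog() {
  static std::vector<std::string> Log;
  return Log;
}

// Hypothetical stand-in for the owned buffer (`ContentBuffer` in the PR).
struct OwnedBuffer {
  std::string Text;
  explicit OwnedBuffer(std::string T) : Text(std::move(T)) {}
  ~OwnedBuffer() { destructionLog().push_back("buffer"); }
};

// Hypothetical stand-in for FileManager/SourceManager: it only holds a view.
struct BufferUser {
  std::string_view View;
  explicit BufferUser(std::string_view V) : View(V) {}
  ~BufferUser() { destructionLog().push_back("user"); }
};

struct Holder {
  // Declared first, so it is constructed first and destroyed last.
  OwnedBuffer Buffer;
  // Declared later, so it is destroyed before `Buffer` and never dangles.
  std::unique_ptr<BufferUser> User;

  explicit Holder(std::string_view Content)
      : Buffer(std::string(Content)),
        User(std::make_unique<BufferUser>(Buffer.Text)) {}
};

// Builds and destroys a Holder, returning the observed destruction order.
std::vector<std::string> observeDestructionOrder() {
  destructionLog().clear();
  { Holder H("int x;"); }
  return destructionLog();
}
```

Calling `observeDestructionOrder()` shows the user of the buffer being torn down before the buffer itself, which is the property the new field placement is meant to preserve.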

std::unique_ptr<FileManager> FileMgr;
std::unique_ptr<DiagnosticsEngine> Diagnostics;
std::unique_ptr<SourceManager> SourceMgr;
10 changes: 8 additions & 2 deletions clang/lib/Basic/SourceManager.cpp
@@ -2382,14 +2382,20 @@ size_t SourceManager::getDataStructureSizes() const {

SourceManagerForFile::SourceManagerForFile(StringRef FileName,
StringRef Content) {
// We copy to `std::string` for Context instead of StringRef because the
// SourceManager::getBufferData() works only with null-terminated buffers.
// And we still want to keep the API convenient.
ContentBuffer = Content.str();
Collaborator

This is a potentially huge copy (consider use cases like unity builds), which makes me pretty uncomfortable. Have you measured the impact of this change on large source files?

Contributor Author

Looking through the uses of SourceManagerForFile, we use it for:

  • clang-format,
  • some functions in Clangd reading the current file,
  • tests.

I would not worry about tests for obvious reasons.
Re Clangd I can also say with high confidence it would never make a difference. If someone opens a large file there, we'll see a ton of copies when LSP passes us this file, and this copy will pale in comparison with the memory and compute requirements of Clang itself.

With respect to clang-format, I am also fairly confident that any large file will take a lot more time and memory to process and this copy will not be noticeable. I will get some large mock file and run some benchmarks for sure and get back to you.

I am not sure in which context unity builds would be a useful use-case to support for either clang-format or Clangd, do you have any use-cases in mind that I am missing?

Contributor Author

I tried benchmarking clang-format on it and realized the clang-format binary is also using a different function here and this change makes no difference to it at all.

So we're left only with clangd and various other source tools that use clang-format as the final step of source code transformations, as well as downstream uses of the reformat API. I can't come up with an easy way to benchmark those, but also don't expect this would make a difference.

If you could give an example use-case that you are worried about, I can spend some effort benchmarking it. But I personally feel changing this particular API to something that does an extra copy is a very low risk change and fits with the purpose of the API (i.e. convenience costs some performance, so being a little slow is preferable to crashing)
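
For anyone who wants a rough number for the copy itself, here is a minimal timing sketch. It is not a benchmark of `SourceManagerForFile` or of any Clang tool: `CopyTiming` and `timeCopy` are hypothetical helpers, `std::string_view` stands in for `llvm::StringRef`, and a single unwarmed measurement like this is only indicative.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: how many bytes were copied and how long it took.
struct CopyTiming {
  std::size_t Bytes;
  std::chrono::nanoseconds Elapsed;
};

// Times one copy of `Content` into an owned std::string, which is the same
// kind of work as `Content.str()` in the patch.
CopyTiming timeCopy(std::string_view Content) {
  auto Start = std::chrono::steady_clock::now();
  std::string Copy(Content);
  auto End = std::chrono::steady_clock::now();
  // Use the copy so that it stays observable.
  assert(Copy.size() == Content.size());
  return {Copy.size(),
          std::chrono::duration_cast<std::chrono::nanoseconds>(End - Start)};
}
```

Running `timeCopy` on a buffer the size of a large translation unit and comparing `Elapsed` with the time of a full reformat of the same input would give a first impression of how much the copy contributes.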

Contributor

My IDE (QtCreator) uses libformat regularly while typing. I would also prefer not to perform copies all over the place.

How about storing a StringRef, assigning Content to it if it ends with a zero, and only performing the copy if it doesn't?

Contributor Author

> How about storing a StringRef, assigning Content to it if it ends with a zero, and only performing the copy if it doesn't?

But we cannot check if StringRef is null-terminated because given StringRef S, accessing S[S.length()] is UB in the general case. If there's a way to do this that's not UB, we could do that.

I think the easiest way to avoid copies would be to switch to const char * or std::string in the APIs to make it explicit we need null-terminated strings. That requires a refactoring of the callers, but ends up being as efficient as it is now.

> My IDE (QtCreator) uses libformat regularly while typing. I would also prefer not to perform copies all over the place.

But what are the actual costs of this particular copy for the overall performance of the IDE?
I expect it to be negligible, even though I fully sympathize with the idea that we want to avoid unnecessary copies. I just don't think we should do this at the expense of exposing an API that has UB.
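
The two options discussed in this comment can be sketched with standard-library types. This is an illustration under stated assumptions, not the Clang API: `std::string_view` stands in for `llvm::StringRef`, and `makeNullTerminatedCopy` and `lengthViaCString` are hypothetical names. The first function mirrors what the patch does (copy into an owning string, which always carries a terminator). The second shows an interface that states the requirement in its parameter type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Option taken by the patch: pay for a copy so that a terminator is
// guaranteed. `std::string::c_str()` always points at a null-terminated array.
std::string makeNullTerminatedCopy(std::string_view Content) {
  return std::string(Content);
}

// Alternative mentioned in the comment: accept a type that already guarantees
// null termination, so the callee never has to inspect bytes past the end.
std::size_t lengthViaCString(const std::string &Content) {
  return std::strlen(Content.c_str());
}
```

A view gives no way to ask whether a terminator follows it without reading one element past its end, which is why the check proposed earlier in the thread is not available in general. The copy avoids the question entirely.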

Contributor

> > How about storing a StringRef, assigning Content to it if it ends with a zero, and only performing the copy if it doesn't?
>
> But we cannot check if StringRef is null-terminated because given StringRef S, accessing S[S.length()] is UB in the general case. If there's a way to do this that's not UB, we could do that.
>
> I think the easiest way to avoid copies would be to switch to const char * or std::string in the APIs to make it explicit we need null-terminated strings. That requires a refactoring of the callers, but ends up being as efficient as it is now.

Noted.

> > My IDE (QtCreator) uses libformat regularly while typing. I would also prefer not to perform copies all over the place.
>
> But what are the actual costs of this particular copy for the overall performance of the IDE? I expect it to be negligible, even though I fully sympathize with the idea that we want to avoid unnecessary copies. I just don't think we should do this at the expense of exposing an API that has UB.

I mainly think about my battery, when I code while being mobile.

Contributor Author

> I mainly think about my battery, when I code while being mobile.

But this extra copy is likely a really-really small portion of the reformatting cost. Performance and battery go together here.
Is there anything I'm missing why it would not be the case?

Contributor Author

@ilya-biryukov, Mar 18, 2025

Is there any particular direction folks want to take this?

Having an API that breaks on non-null-terminated StringRef looks bad, but I don't see easy options to fix this other than a copy.

I don't expect this to give any noticeable performance regressions, and I don't have a clear sense of whether the folks in the comment threads agree or disagree.

Contributor

I don't like it, but it's better than potentially crashing software.

But I'm not comfortable enough to approve something outside libFormat or clang-format.


// This is referenced by `FileMgr` and will be released by `FileMgr` when it
// is deleted.
IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem> InMemoryFileSystem(
new llvm::vfs::InMemoryFileSystem);

InMemoryFileSystem->addFile(
FileName, 0,
- llvm::MemoryBuffer::getMemBuffer(Content, FileName,
- /*RequiresNullTerminator=*/false));
+ llvm::MemoryBuffer::getMemBuffer(ContentBuffer, FileName,
+ /*RequiresNullTerminator=*/true));
// This is passed to `SM` as reference, so the pointer has to be referenced
// in `Environment` so that `FileMgr` can out-live this function scope.
FileMgr =
11 changes: 11 additions & 0 deletions clang/unittests/Format/FormatTest.cpp
@@ -29096,6 +29096,17 @@ TEST_F(FormatTest, BreakBeforeClassName) {
" ArenaSafeUniquePtr {};");
}

TEST_F(FormatTest, DoesNotCrashOnNonNullTerminatedStringRefs) {
llvm::StringRef TwoLines = "namespace foo {}\n"
"namespace bar {}";
llvm::StringRef FirstLine =
TwoLines.take_until([](char c) { return c == '\n'; });

// The internal API used to crash when passed a non-null-terminated StringRef.
// Check this does not happen anymore.
verifyNoCrash(FirstLine);
}

} // namespace
} // namespace test
} // namespace format
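
The new test gets a non-null-terminated view by taking a prefix of a longer literal. The sketch below shows the same situation with `std::string_view` as a stand-in for `llvm::StringRef`; `firstLine` and `charAfter` are hypothetical helpers. It reads the character after the prefix through the enclosing view, which is valid here only because the caller knows the prefix lies strictly inside a larger buffer.

```cpp
#include <cassert>
#include <string_view>

// Returns the prefix of `Text` up to (not including) the first '\n'.
// If there is no '\n', the whole text is returned.
std::string_view firstLine(std::string_view Text) {
  return Text.substr(0, Text.find('\n'));
}

// Returns the character stored directly after `Line` inside `Whole`.
// Precondition: `Line` is a strict prefix view of `Whole`, so the index is in
// bounds. Without that knowledge, the same read would be out of bounds.
char charAfter(std::string_view Whole, std::string_view Line) {
  assert(Line.data() == Whole.data() && Line.size() < Whole.size());
  return Whole[Line.size()];
}
```

For the input used in the test, the character following the first line is `'\n'` rather than `'\0'`, so any code that assumes a terminator directly after the view would run on into the second line.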