Skip to content

Protect patching with an open handle #1264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 77 commits into
base: master
Choose a base branch
from

Conversation

fandreuz
Copy link
Contributor

Description

In this PR I introduce additional protection during patches. The main change is a new function isSafeToPatch, which combines the checks introduced in #1261, #1263, #1243, and can be used in other parts of the code like MallocTracer.

Thanks @Baraa-Hasheesh for the discussion and feedback on the code.

Related issues

#1250, #1256

Motivation and context

Keeping an open handle makes sure we the library cannot be unloaded while the patch is being made, thus providing additional protection against the failures in the linked issues.

How has this been tested?

make test


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@fandreuz
Copy link
Contributor Author

Tested that after this PR the test in asprof-reproducers does not fail anymore, and fails in master.

@fandreuz
Copy link
Contributor Author

Also tested the crash in the stress test by @Baraa-Hasheesh, and could not reproduce the issue anymore after this PR

Co-authored-by: Bara' Hasheesh <[email protected]>
Signed-off-by: Francesco Andreuzzi <[email protected]>
@krk
Copy link
Contributor

krk commented Apr 25, 2025

Let's add minimal reproducers as cpp unit tests.

@fandreuz
Copy link
Contributor Author

fandreuz commented Apr 25, 2025

Let's add minimal reproducers as cpp unit tests.

@krk Adapted test from @Baraa-Hasheesh's repo

@fandreuz fandreuz force-pushed the sim-294-keep-handle-while-patching branch from 6877914 to 5cff23d Compare May 2, 2025 09:09
@fandreuz fandreuz force-pushed the sim-294-keep-handle-while-patching branch 4 times, most recently from a29f8ab to 28d9f78 Compare May 2, 2025 11:09
@fandreuz fandreuz force-pushed the sim-294-keep-handle-while-patching branch from 28d9f78 to 0eacfab Compare May 2, 2025 11:19
@fandreuz fandreuz force-pushed the sim-294-keep-handle-while-patching branch from cac4c90 to b310dba Compare May 2, 2025 11:25
@fandreuz
Copy link
Contributor Author

fandreuz commented May 2, 2025

All comments addressed @apangin

struct link_map* map;
// validate that the current loaded library is the same library that was observed during the /proc/self/maps processing
if (handle != NULL && dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0) {
return cc->imageBase() == (const char*)map->l_addr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply to shared objects whose virtual load address is non-zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, I think @Baraa-Hasheesh looked into this as part of #1261. Can you weigh in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apangin yes.
Is there a case where the virtual load address is zero?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in a situation where a shared object has PT_LOAD program header with p_offset == 0 and p_vaddr != 0. Can you provide me with an evidence that this case is covered or that it never happens?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apangin I will investigate this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we check if this situation applies, and if it does we simply skip the check and assume the handle is valid? The situation highlighted by @Baraa-Hasheesh seems to be an edge case, and I think it's still valuable to have this check until we find a proper fix.

Having this will be give us more protection than what we currently have, and won't make things worse than they are now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dlinfo protection is only really needed if library is re-opened using a different thread after close

No, the problem can happen even in a single-threaded cases, for instance, if we attempt to do native memory profiling long after a library was first parsed.

I do think we need some level of protection against this situation, even if it is not 100% accurate. If comparing l_addr does not work, maybe we can check l_ld instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apangin technically that's true but it will require that the library is reloaded into a different location in memory (could happen in JIT from what I'm reading)

But this raises the concern that the library will not be patched (skipped) or loose it's patching during these long profile sessions (profiling accuracy degrades with time) while being a valid behavior of the profiled process (Different issue that can be tracked separately)

Copy link
Contributor

@Baraa-Hasheesh Baraa-Hasheesh May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fandreuz @apangin

How about we drop the dlinfo & switch to dladdr
https://man7.org/linux/man-pages/man3/dladdr.3.html

void *dli_fbase;  /* Base address at which shared object is loaded */

We can use the _imports inside the CodeCache to find an address to be used in the dladdr call
I.E. when we save imports during symbol parsing we set a pointer to the first symbol we find (Avoid double looping)

We use that pointer to call dladdr

Dl_info dl_info;
if (dladdr(entry, &dl_info)) {
 // Check that dl_info. dli_fbase == image_base
}

This solution should be also applicable to MacOs

Note: I initially didn't consider dladdr as you need an address in the lib to call it & missed the fact we already have the address inside _imports.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note that this is only valid on the patching side, while the mentioned case here will still be possible if lib was relocated after it was detected but before it was parsed

I think it's best if we have the dladdr side on the patching side as it should also protect MacOs

For the parsing side, for now I think it should be an acceptable risk & we can create a different thread to track & reproduce the possible crash

src/codeCache.h Outdated
private:
char* _name;
// E.g. remove (deleted) so that removed library can be reopened
char* _clean_name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to be stored here - it can be materialized where needed.
Having two name fields is confusing. Furthermore, " (deleted)" is Linux-specific thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be materialized when needed, but then somebody has to free it. If we return a simple char* that's going to make things more complicated. How about returning std::string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it's better for CodeCache not to deal with cleanName(). Having both name() and cleanName() in the same interface is not really clean (pun intended).
AFAICS, there is only one place in Linux-specific code where you need a name stripped off "(deleted)" suffix, when re-opening a library in UnloadProtection. Why not process the name there? It can be even stored in a stack allocated buffer, so no free is required. Though dynamic allocation is also fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(deleted)" is Linux-specific thing.

This shouldn't be part of CodeCache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8974647

Do we ever need to have the unstripped version? If not, we could just store the stripped version in CodeCache, and skip the additional copy in UnloadProtection ctor @apangin @krk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will be helpful to let users know that the orignal file has been deleted and therefore symbol infromation can be incomplete.

struct link_map* map;
// validate that the current loaded library is the same library that was observed during the /proc/self/maps processing
if (handle != NULL && dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0) {
return cc->imageBase() == (const char*)map->l_addr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in a situation where a shared object has PT_LOAD program header with p_offset == 0 and p_vaddr != 0. Can you provide me with an evidence that this case is covered or that it never happens?

src/codeCache.h Outdated
private:
char* _name;
// E.g. remove (deleted) so that removed library can be reopened
char* _clean_name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it's better for CodeCache not to deal with cleanName(). Having both name() and cleanName() in the same interface is not really clean (pun intended).
AFAICS, there is only one place in Linux-specific code where you need a name stripped off "(deleted)" suffix, when re-opening a library in UnloadProtection. Why not process the name there? It can be even stored in a stack allocated buffer, so no free is required. Though dynamic allocation is also fine.

_name = NativeFunc::create(name, -1);

// Strip " (deleted)" suffix so that removed library can be reopened
size_t len = strlen(name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is it used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct link_map* map;
// validate that the current loaded library is the same library that was observed during the /proc/self/maps processing
if (handle != NULL && dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0) {
return cc->imageBase() == (const char*)map->l_addr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dlinfo protection is only really needed if library is re-opened using a different thread after close

No, the problem can happen even in a single-threaded cases, for instance, if we attempt to do native memory profiling long after a library was first parsed.

I do think we need some level of protection against this situation, even if it is not 100% accurate. If comparing l_addr does not work, maybe we can check l_ld instead.

return;
}

char* nameCopy = strdup(cc->name());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local variables follow snake_case naming convention.
Can we avoid strdup if there is no deleted suffix? Alternatively, we can use a stack-allocated buffer.

Copy link
Contributor Author

@fandreuz fandreuz May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib.file[len - 10] = 0;
}
CodeCache* cc = new CodeCache(lib.file, array->count(), false, lib.map_start, lib.map_end, lib.image_base);
stripDeletedSuffix(lib.file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the original file has been deleted, I don't think we should attempt to parse a file with the similar name, because it can be a different file with different symbol layout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand, this seems to be replicating the original logic. Do you propose to skip parsing the library if the original file was deleted?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants