Copy callstack API #4033

Merged on Mar 11, 2025 (32 commits)

g0djan (Contributor) commented Jan 17, 2025

New WAMR public API to copy runtime call stack frames.

CAUTION: this API is not thread-safe, which is why it is hidden behind a feature flag for now. If you need to call it from another thread, ensure that the passed exec_env is suspended.
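
To illustrate the general shape of copying call stack frames, here is a minimal sketch that walks a chain of frames and writes them into a caller-provided buffer. Every name in it (`demo_frame`, `demo_frame_record`, `demo_copy_callstack`) is a hypothetical stand-in for illustration and is not WAMR's actual type or function; the real signature is defined by the PR's diff, not by this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime frame; not WAMR's real layout. */
typedef struct demo_frame {
    const struct demo_frame *prev; /* caller's frame, NULL at the bottom */
    uint32_t func_index;
    uint32_t func_offset;
} demo_frame;

/* Hypothetical flat record that the caller receives. */
typedef struct demo_frame_record {
    uint32_t func_index;
    uint32_t func_offset;
} demo_frame_record;

/* Copy at most `length` frames, innermost first, into `buffer`.
 * The function neither allocates nor locks: the caller owns the buffer.
 * Returns the number of records written. */
uint32_t
demo_copy_callstack(const demo_frame *top, demo_frame_record *buffer,
                    uint32_t length)
{
    uint32_t written = 0;

    if (buffer == NULL)
        return 0;

    for (const demo_frame *f = top; f != NULL && written < length;
         f = f->prev) {
        buffer[written].func_index = f->func_index;
        buffer[written].func_offset = f->func_offset;
        written++;
    }
    return written;
}
```

Writing into a preallocated buffer keeps the copy free of heap allocation, which matters if the function may run in a restricted context such as a signal handler.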

Our use case

Sometimes the WAMR runtime gets stuck in production, and we have no data on where in the code compiled to WASM this happens. Currently we only detect such situations from a separate native thread. To gain more visibility into the problem, we developed an internal solution that requires this API to be present in WAMR. If the separate thread finds that the WASM VM thread is stuck, it interrupts that thread with a user-defined signal and calls this API to collect the call stack. The main complexity lies in maintaining async-signal-safety and avoiding segfaults. To that end, we maintain atomic copies of exec_env, exec_env->module_inst, and exec_env->module_inst->module. These copies are always set to NULL before the referenced memory is freed, and they are always checked for validity before this API is called. In our use case we guarantee only the absence of crashes; we realize that the frame data we collect might be invalidated by the signal interruption. However, that is highly unlikely and is not a concern for us.
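
The guarded-pointer pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not our internal code: `demo_exec_env`, `demo_publish_env`, `demo_retire_env`, and `demo_collect` are hypothetical names, and the sketch assumes the collecting code runs in a signal handler on the interrupted VM thread, so it cannot interleave with that thread's own free.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the execution environment. */
typedef struct demo_exec_env {
    int depth; /* placeholder for whatever the collector reads */
} demo_exec_env;

/* Atomic copy of the pointer; static storage, so it starts as NULL. */
static _Atomic(demo_exec_env *) g_env_copy;

/* Publish the pointer once the environment is fully constructed. */
void
demo_publish_env(demo_exec_env *env)
{
    atomic_store(&g_env_copy, env);
}

/* Clear the atomic copy BEFORE freeing, so a handler that fires
 * afterwards observes NULL instead of a dangling pointer. */
void
demo_retire_env(demo_exec_env *env)
{
    atomic_store(&g_env_copy, NULL);
    free(env);
}

/* Intended to be called from the signal handler: check the copy for
 * validity and only then touch the environment.
 * Returns 1 if data was collected, 0 if there was nothing to inspect. */
int
demo_collect(int *out_depth)
{
    demo_exec_env *env = atomic_load(&g_env_copy);

    if (env == NULL)
        return 0;

    *out_depth = env->depth;
    return 1;
}
```

The order in `demo_retire_env` is the important part: a signal arriving between the store and the `free` sees NULL and skips collection, while a signal arriving before the store still sees live memory.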

Have we tried existing WAMR APIs for our use case?

Yes, we tried wasm_cluster_suspend_thread and wasm_runtime_terminate, as suggested by the maintainers.

  1. In production, our runtime often recovers from being stuck, so wasm_runtime_terminate is not a good option for reporting the call stack.
  2. wasm_cluster_suspend_thread doesn't suit us either. Even if it did, we would still need an API to iterate over stack frames.

g0djan (Contributor, Author) commented Jan 28, 2025

@loganek I addressed all your comments and rebased to fix the checks. The last failing check is a CI issue, and there is another PR that will fix it.
Let me know if you have more comments.

loganek (Collaborator) left a comment

LGTM. If possible, consider adding tests.

lum1n0us (Collaborator) left a comment

There are a few questions regarding the design of the APIs

lum1n0us (Collaborator) left a comment

just a few comments.

g0djan (Contributor, Author) commented Feb 27, 2025

> just a few comments.

addressed all

@g0djan g0djan requested a review from lum1n0us February 27, 2025 15:31
@g0djan g0djan requested a review from lum1n0us March 3, 2025 15:52
g0djan (Contributor, Author) commented Mar 3, 2025

Addressed

lum1n0us (Collaborator) left a comment

Last round, keep fighting 💪

@g0djan g0djan requested a review from lum1n0us March 5, 2025 09:12
g0djan (Contributor, Author) commented Mar 5, 2025

@lum1n0us addressed last comments

g0djan (Contributor, Author) commented Mar 9, 2025

@lum1n0us can you please take a look?

lum1n0us (Collaborator) left a comment

LGTM

lum1n0us (Collaborator) commented

@xujuntwt95329 @TianlongLiang @wenyongh Please take a look and feel free to leave your comments.

TianlongLiang (Collaborator) left a comment

LGTM

@lum1n0us lum1n0us merged commit 766f378 into bytecodealliance:main Mar 11, 2025
386 checks passed
g0djan (Contributor, Author) commented Mar 11, 2025

Thank you!

6 participants