Skip to content

fix(python): handle cold interpreter func ranges #414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

korniltsev
Copy link
Contributor

@korniltsev korniltsev commented Mar 21, 2025

alpine:3.20 python3.12 has a separate _PyEval_EvalFrameDefault _PyEval_EvalFrameDefault.cold

readelf -s -W /usr/lib/debug/usr/lib/libpython3.12.so.1.0.debug | grep EvalFrameDef
  2143: 0000000000171780   362 FUNC    LOCAL  DEFAULT    9 _PyEval_EvalFrameDefault.localalias
  2234: 0000000000088a9a 56543 FUNC    LOCAL  DEFAULT    9 _PyEval_EvalFrameDefault.cold
 43791: 0000000000171780   362 FUNC    GLOBAL DEFAULT    9 _PyEval_EvalFrameDefault

Therefore ebpf code only assumes the interpreter function is only 362 bytes long and fails to select python unwinder as the meat of the interpreter is in the cold one.

In this PR I fix this issue in two ways.

  1. A band-aid table for known binaries with a separate cold function.
  2. Debug file lookup (at least gives someone an opportunity to install a debug package and allow profiling). The downside is that we run this check even for interpreters who has no cold functions. Maybe we can come up with some heuristic to not waste resources. Edit: I've reverted the change to keep the PR smaller and not introduce unnecesary resource consumption. I've created a tracking issue python: handle cold interpreter func ranges #416
  3. Update the kernel code to handle cases when an interpreter has two chunks
  4. Add coredumps for both cases (the table, the debug file lookup EDIT: the second test alpine320-nobuildid.json is skipped for now)

- handle cold interpreter func chunks
- add coredump alpine320, alpine320-nobuildid
@korniltsev
Copy link
Contributor Author

korniltsev commented Mar 21, 2025

I don't really like the debug-file symbol lookup. I'm thinking to rollback this change (and keep only the hardcoded table for now) until we find a better fix.
Curious what others think.

@korniltsev korniltsev marked this pull request as ready for review March 21, 2025 15:56
@korniltsev korniltsev requested review from a team as code owners March 21, 2025 15:56
@gnurizen
Copy link
Contributor

gnurizen commented Mar 21, 2025

Hacky but I solved a similar problem for luajit by using the size of the function by looking at the stack deltas.

https://github.com/parca-dev/opentelemetry-ebpf-profiler/blob/main/interpreter/luajit/luajit.go#L173

So maybe if the size of the function is less than some cut off (ie 1000) decide we have a .cold situation and walk the stack deltas and assume the largest function in the binary is the rest of the interpreter.

The 3 largest functions in python:

  1024: 0000000000353e30  9730 FUNC    GLOBAL DEFAULT   11 PyUnicode_Format
  1974: 000000000013a640 12000 FUNC    GLOBAL DEFAULT   11 _PyUnicode_ToNumeric
  1799: 0000000000108f90 66171 FUNC    GLOBAL DEFAULT   11 _PyEval_EvalFrameDefault

So seems like there's a large margin for error.

@korniltsev
Copy link
Contributor Author

I don't think we can use the size of the function unfortunately. This is from alpine 3.21

  2270: 0000000000089228 13785 FUNC    LOCAL  DEFAULT    9 _PyEval_EvalFrameDefault.cold
 43735: 0000000000167f60 45799 FUNC    GLOBAL DEFAULT    9 _PyEval_EvalFrameDefault

Hacky but I solved a similar problem for luajit by using the size of the function by looking at the stack deltas.

Nice one, I will take a look.

@korniltsev
Copy link
Contributor Author

korniltsev commented Mar 22, 2025

dumping another idea for validation:
maybe we can inspect contents of sysconfig.get_config_vars() to decide if we need searching for the cold range (not exactly sure what flags to search for, maybe -fprofile-use ?)

"OPT": "-DNDEBUG -g -O2 -Wall",
  "CONFIG_ARGS": "'--enable-shared' '--prefix=/usr' '--libdir=/usr/lib/x86_64-linux-gnu' '--enable-ipv6' '--enable-loadable-sqlite-extensions' '--with-dbmliborder=bdb:gdbm' '--with-computed-gotos' '--without-ensurepip' '--with-system-expat' '--with-dtrace' '--with-ssl-default-suites=openssl' '--with-wheel-pkg-dir=/usr/share/python-wheels/' 'MKDIR_P=/bin/mkdir -p' 'CC=x86_64-linux-gnu-gcc'",
"LDSHARED": "x86_64-linux-gnu-gcc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions  -Wl,-z,relro -g -fwrapv -O2   ",
  "LDVERSION": "3.12",
...

@gnurizen
Copy link
Contributor

  2270: 0000000000089228 13785 FUNC    LOCAL  DEFAULT    9 _PyEval_EvalFrameDefault.cold
 43735: 0000000000167f60 45799 FUNC    GLOBAL DEFAULT    9 _PyEval_EvalFrameDefault

Interesting. Does the hot function call the cold function as a normal function call or does it just jmp between the two? I guess it must be a jmp or this wouldn't be an issue?

@korniltsev
Copy link
Contributor Author

  2270: 0000000000089228 13785 FUNC    LOCAL  DEFAULT    9 _PyEval_EvalFrameDefault.cold
 43735: 0000000000167f60 45799 FUNC    GLOBAL DEFAULT    9 _PyEval_EvalFrameDefault

Interesting. Does the hot function call the cold function as a normal function call or does it just jmp between the two? I guess it must be a jmp or this wouldn't be an issue?

It jumps. Either direct/conditional jump or indirect through switch/case jump table.

korniltsev added a commit to grafana/opentelemetry-ebpf-profiler that referenced this pull request Mar 24, 2025
merging open-telemetry#414

Squashed commit of the following:

commit 43b26dc
Author: Tolya Korniltsev <[email protected]>
Date:   Mon Mar 24 11:18:47 2025 +0700

    revert debug-file-lookup changes

commit 71ce508
Author: Tolya Korniltsev <[email protected]>
Date:   Fri Mar 21 18:41:36 2025 +0700

    update arm kernel blobs

commit 1346f70
Author: Tolya Korniltsev <[email protected]>
Date:   Fri Mar 21 18:36:21 2025 +0700

    Lint

commit 85861b1
Author: Tolya Korniltsev <[email protected]>
Date:   Fri Mar 21 16:16:16 2025 +0700

    fix(python): handle cold interpreter func chunks

    - handle cold interpreter func chunks
    - add coredump alpine320, alpine320-nobuildid
@korniltsev
Copy link
Contributor Author

korniltsev commented Mar 24, 2025

Well, that escalated quickly. I found the following images with cold function chunks

python:3.12.1-alpine3.19
python:3.12.2-alpine3.19
python:3.12.3-alpine3.19
python:3.12.3-alpine3.20
python:3.12.4-alpine3.19
python:3.12.4-alpine3.20
python:3.12.5-alpine3.19
python:3.12.5-alpine3.20
python:3.12.6-alpine3.19
python:3.12.6-alpine3.20
python:3.12.7-alpine3.19
python:3.12.7-alpine3.20
python:3.12.8-alpine3.19
python:3.12.8-alpine3.20
python:3.12.8-alpine3.21
python:3.12.9-alpine3.20
python:3.12.9-alpine3.21
python:3.11.5-alpine3.18
python:3.11.6-alpine3.18
python:3.11.7-alpine3.18
python:3.11.8-alpine3.18
python:3.11.9-alpine3.18
python:3.12.0-alpine3.18
python:3.12.1-alpine3.18
python:3.12.2-alpine3.18
python:3.12.3-alpine3.18
python:3.11.3-alpine3.18

python:3.12.6
... did not search for other debian based images

I will mark the PR as a draft as I don't believe this should be merged in this form.

@korniltsev
Copy link
Contributor Author

I propose to split this PR in two pieces.

  1. Add support for handling interpreter functions with two ranges. Both in userspace and kernel. Without actually providing second range just yet.
  2. Actually find and provide second range for interpreter function.

I will close this and submit a separate PR's for the above.

@korniltsev korniltsev closed this Apr 17, 2025
korniltsev added a commit to grafana/opentelemetry-ebpf-profiler that referenced this pull request Apr 17, 2025
Squashed commit of the following:

commit 69d4ab0e79a6073b7e67ef3e0fa732647f928c1f
Author: Tolya Korniltsev <[email protected]>
Date:   Tue Mar 25 13:13:01 2025 +0700

    add more alpines

commit 43b26dc
Author: Tolya Korniltsev <[email protected]>
Date:   Mon Mar 24 11:18:47 2025 +0700

    revert debug-file-lookup changes

commit 71ce508
Author: Tolya Korniltsev <[email protected]>
Date:   Fri Mar 21 18:41:36 2025 +0700

    update arm kernel blobs

commit 1346f70
Author: Tolya Korniltsev <[email protected]>
Date:   Fri Mar 21 18:36:21 2025 +0700

    Lint

commit 85861b1
Author: Tolya Korniltsev <[email protected]>
Date:   Fri Mar 21 16:16:16 2025 +0700

    fix(python): handle cold interpreter func chunks

    - handle cold interpreter func chunks
    - add coredump alpine320, alpine320-nobuildid

# Conflicts:
#	interpreter/loaderinfo.go
korniltsev added a commit to grafana/opentelemetry-ebpf-profiler that referenced this pull request Apr 17, 2025
cherry-pick open-telemetry#414 but without hardcoded cold ranges
korniltsev added a commit to grafana/opentelemetry-ebpf-profiler that referenced this pull request Apr 17, 2025
cherry-pick open-telemetry#414 but without hardcoded cold ranges
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants