[SPARK-51566][PYTHON] Python UDF traceback improvement #50313
    tb = Traceback.from_string(python_exception_string)
    tb.populate_linecache()
    return e.with_traceback(tb.as_traceback())
except Exception:
I think we need a configuration, or at least an environment variable if it's difficult to add a configuration here (for the cases when exceptions are thrown without the JVM). Parsing exceptions can potentially cause a lot of performance overhead, e.g., if users are relying on a lot of exceptions.
In addition, it would have to be `BaseException` if you absolutely want to catch all the exceptions.
> Parsing exceptions can potentially cause a lot of performance overhead

Parsing should take time linear in the size of the error message, so I don't worry much about performance. Let me do a quick benchmark to validate this.

> In addition, it would have to be BaseException if you absolutely want to catch all the exceptions.

Good point 👍 (although I don't think it could possibly throw stuff like `KeyboardInterrupt` or `SystemExit`, but let's be safe)
Here's a quick benchmark with different traceback sizes (1, 10, 100, 500, 1000)
https://perfpy.com/986
| Benchmark | Runtime |
|---|---|
| 1 frame | 297 µs |
| 10 frames | 335 µs |
| 100 frames | 2.41 ms |
| 500 frames | 12.4 ms |
| 1000 frames | 28.7 ms |
| 1000 frames, failed to parse | 3.13 ms |
The default recursion limit is 1000 so tracebacks with more than 1000 frames should be unlikely.
Also, the results I see when I benchmark locally are around 1/5 of the above. Therefore performance shouldn't be a concern.
Looks fine to me otherwise
Motivation
Currently, when a Python UDF raises an error, the traceback only includes the file name and the line number, but doesn't include the content of that specific line. This behavior is different from local code tracebacks, which show the line content. See the following example.
Error inside UDF.

Local error. Notice that IPython additionally includes more lines around the line where the error happens, with links to the notebook cell, making it even easier to understand.

What changes were proposed in this pull request?
This PR changes `convert_exception` to detect Python tracebacks in the JVM error message, parse them back to a traceback object, and include them in the converted exception. This way, these frames will be included in the exception traceback as if they were part of the call stack.
If we fail to parse the traceback, we will silently ignore the failure, keeping the original behavior. In either case, the original error message, with the traceback in string form, will always be included in the exception so that we don't lose any information.
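As a rough illustration of the detection and parsing step only, here is a minimal standard-library sketch. It is not the code in this PR, which relies on `tblib` to build a real traceback object; the regular expression, the `parse_frames` and `format_frames` helpers, and the sample message are all hypothetical and assume the standard CPython frame layout.

```python
import re
import traceback

# Hypothetical pattern for a CPython frame header line:
#   File "<path>", line <n>, in <name>
FRAME_RE = re.compile(r'^\s*File "(?P<file>.+)", line (?P<line>\d+), in (?P<name>.+)$')


def parse_frames(message):
    """Extract frame summaries from a traceback embedded in an error message."""
    frames = []
    lines = message.splitlines()
    i = 0
    while i < len(lines):
        match = FRAME_RE.match(lines[i])
        if match:
            text = None
            # The source line, when present, is the next indented line
            # that is not itself a frame header.
            if (
                i + 1 < len(lines)
                and lines[i + 1].startswith("    ")
                and not FRAME_RE.match(lines[i + 1])
            ):
                text = lines[i + 1].strip()
                i += 1
            frames.append(
                traceback.FrameSummary(
                    match["file"],
                    int(match["line"]),
                    match["name"],
                    lookup_line=False,  # do not read the file now
                    line=text,          # keep the line content from the message
                )
            )
        i += 1
    return frames


def format_frames(frames):
    """Render parsed frames in the usual traceback layout."""
    return "".join(traceback.StackSummary.from_list(frames).format())


# Made-up JVM-style error message carrying a Python traceback.
sample_message = (
    "org.apache.spark.api.python.PythonException: "
    "Traceback (most recent call last):\n"
    '  File "/tmp/udf.py", line 3, in my_udf\n'
    "    return 1 / x\n"
    '  File "<string>", line 7, in helper\n'
    "ZeroDivisionError: division by zero\n"
)
parsed = parse_frames(sample_message)
```

If the message contains no recognizable frames, `parse_frames` returns an empty list, which mirrors the fallback of keeping the original behavior when parsing fails.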
This PR also introduces `tblib`, a lightweight library that allows parsing a traceback from a string. The library is included as a source file, with modifications to make it preserve the original line content when the file is no longer available.
Example of improved traceback as displayed in IPython. Notice that the Python worker frames are concatenated to the exception so that IPython ultratb shows the code around the error line.
Why are the changes needed?
To improve debuggability of Python UDFs (and UDTFs, Data Sources, etc.) by recovering the traceback when calling the UDF using PySpark.
Does this PR introduce any user-facing change?
Yes. Exceptions converted from JVM will include additional traceback frames if the JVM error message includes a Python traceback.
How was this patch tested?
Unit tests and end-to-end tests in `python/pyspark/errors/tests/test_traceback.py`
Was this patch authored or co-authored using generative AI tooling?
No
What are the risks?
Why do the worker tracebacks not include the line content?
When Python turns a traceback into a string, it looks up the line content from the file name and the line number using the `linecache` module. This module contains an in-memory cache from file name to lines, and when there's a miss, it reads the file from module globals or from the file system. IPython adds the cell content to the cache when the cell is executed.
When we run a UDF, the function is pickled and sent to the driver JVM. The pickle includes neither the source code nor the line cache, so `linecache` will generally not be able to find the source code when it generates the traceback on the Python worker, unless the client code is in a `.py` file on the same host as the worker.
What alternative solutions did we consider?
This solution may be brittle since it depends on the specific traceback string format, and it also introduces a new third-party dependency. Here are some alternatives and why we didn't choose them.
Pass linecache to Python worker
If we can pass the linecache content to the worker, then the traceback on the worker will include the line content. To do this, we need to find which files are part of the pickle, and pass the content of these files to the worker. I think this is not feasible because it can potentially cause a lot of overhead when calling UDFs. And the user would still not be able to benefit from rich tracebacks like the one provided by IPython.
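To make the role of `linecache` concrete, the sketch below formats the same error twice: once when the source is unknown to `linecache`, and once after the cache has been populated by hand. It uses only the standard library; the file name `hypothetical_cell.py` and the `udf` function are made up for illustration, and writing to `linecache.cache` directly is a trick (also used by interactive shells), not a documented API.

```python
import linecache
import traceback

# Source that exists only in memory, compiled under a file name that is
# assumed not to exist on disk (similar to a notebook cell on a remote worker).
SOURCE = "def udf(x):\n    return 1 / x\n"
FILENAME = "hypothetical_cell.py"


def format_udf_error():
    """Run the in-memory function and return the formatted traceback."""
    namespace = {}
    exec(compile(SOURCE, FILENAME, "exec"), namespace)
    try:
        namespace["udf"](0)
    except ZeroDivisionError:
        return traceback.format_exc()


# 1. linecache knows nothing about the file, so the frame has no line content.
linecache.cache.pop(FILENAME, None)
without_source = format_udf_error()

# 2. Populate the cache manually. Entries are (size, mtime, lines, fullname);
#    an mtime of None keeps linecache.checkcache() from evicting the entry.
linecache.cache[FILENAME] = (
    len(SOURCE),
    None,
    SOURCE.splitlines(keepends=True),
    FILENAME,
)
with_source = format_udf_error()

# Clean up so the example leaves no global state behind.
linecache.cache.pop(FILENAME, None)
```

Printing `without_source` and `with_source` side by side shows that only the second one contains the `return 1 / x` line under the `udf` frame.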
Send serialized traceback from Python worker to JVM
To avoid the brittle string parsing, we could send the traceback object from the worker to the JVM, then deserialize it in the Python client. Unfortunately, Python tracebacks are not pickleable, so we would still need to use a third-party library (`tblib` again) to serialize the traceback. Also, this would require a lot more changes across the codebase.
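The pickling limitation is easy to reproduce with the standard library alone. The snippet below is a small illustration, with a made-up `capture_traceback` helper, of what happens when a traceback object is passed to `pickle.dumps` in a plain interpreter that has no pickling support installed.

```python
import pickle
import sys


def capture_traceback():
    """Raise and catch an error, returning its traceback object."""
    try:
        raise ValueError("boom")
    except ValueError:
        return sys.exc_info()[2]


tb = capture_traceback()

try:
    pickle.dumps(tb)
    pickled = True
    reason = None
except TypeError as exc:
    # Traceback objects reference live frames, which pickle cannot serialize.
    pickled = False
    reason = str(exc)
```

Here `pickled` ends up `False` and `reason` holds the error message from `pickle`.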