Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51566][PYTHON] Python UDF traceback improvement #50313

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

wengh
Copy link
Contributor

@wengh wengh commented Mar 18, 2025

Motivation

Currently, when a Python UDF raises an error, the traceback only includes the file name and the line number, but doesn't include the content of that specific line. This behavior is different from local code tracebacks that show the line content. See following example.

Error inside UDF.
image

Local error. Notice that IPython additionally includes more more lines around the line where error happens, with links to the notebook cell, making it even easier to understand.
image

What changes were proposed in this pull request?

This PR changes convert_exception to detect Python tracebacks in the JVM error message, parse them back to a traceback object, and include them in the converted exception. This way, these frames will be included in the exception traceback as if they were part of the call stack.

If we fail to parse the traceback, we will silently ignore the failure, keeping the original behavior. In either case, the original error message, with the traceback in string form, will always be included in the exception so that we don't lose any information.

This PR also introduces tblib, a lightweight library that allows parsing traceback from string. The library is included as a source file, with modifications to make it preserve the original line content when the file is not available anymore.

Example of improved traceback as displayed in IPython. Notice that the Python worker frames are concatenated to the exception so that IPython ultratb shows the code around the error line.

---------------------------------------------------------------------------
PythonException                           Traceback (most recent call last)
Cell In[3], line 8
      4 @udf(returnType=StringType())
      5 def foo(value):
      6     1 / 0
----> 8 spark.range(1).toDF("value").select(foo("value")).show()

File ~/personal/test-spark/.venv/lib/python3.9/site-packages/pyspark/sql/classic/dataframe.py:285, in DataFrame.show(self, n, truncate, vertical)
    284 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
--> 285     print(self._show_string(n, truncate, vertical))

...

File ~/personal/test-spark/.venv/lib/python3.9/site-packages/pyspark/errors/exceptions/captured.py:271, in capture_sql_exception.<locals>.deco(*a, **kw)
    267 converted = convert_exception(e.java_exception)
    268 if not isinstance(converted, UnknownException):
    269     # Hide where the exception came from that shows a non-Pythonic
    270     # JVM exception message.
--> 271     raise converted from None
    272 else:
    273     raise

Cell In[3], line 6, in foo()
      4 @udf(returnType=StringType())
      5 def foo(value):
----> 6     1 / 0

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/var/folders/jm/2m7c_fjj75v7gqlr5px0j8780000gp/T/ipykernel_6171/1200677507.py", line 6, in foo
ZeroDivisionError: division by zero

Why are the changes needed?

To improve debuggability of Python UDFs (and UDTFs, Data Sources, etc.) by recovering the traceback when calling the UDF using PySpark.

Does this PR introduce any user-facing change?

Yes. Exceptions converted from JVM will include additional traceback frames if the JVM error message includes a Python traceback.

How was this patch tested?

Unit tests and end to end tests in python/pyspark/errors/tests/test_traceback.py

Was this patch authored or co-authored using generative AI tooling?

No

What are the risks?

  1. This change will break anything that assume that the frames of converted exception don't include UDF frames. This sounds unlikely though.
  2. If there's a Python source file that is present on both the worker and the client, but with different content, then the traceback may show wrong line content for frames in that file.
  3. If the traceback string format changes in a future Python version, then the parsing will break. This seems unlikely, but if it happens, the unit tests should catch the change.
  4. If the Python source file name contains a line break then parsing will return an incorrect traceback.

Why do the worker tracebacks not include the line content?

When Python turns a traceback into string, it looks up the line content from the file and the line number using the linecache module. This module contains a in memory cache from file name to lines, and when there's a miss, it reads the file from module globals or from the file system. IPython adds the cell content to the cache when the cell is executed.

When we run a UDF the function is pickled and sent to the driver JVM. The pickle doesn't include the source code nor the line cache so linecache will generally not be able to find the source code when it generates the traceback on the Python worker, unless the client code is in a .py file on the same host as the worker.

What alternative solutions did we consider?

This solution may be brittle since it depends on the specific traceback string format. And it also introduces a new 3rd party dependency. Here's some alternatives and why we didn't choose them.

Pass linecache to Python worker

If we can pass the linecache content to the worker, then the traceback on the worker will include the line content. To do this, we need to find which files are part of the pickle, and pass the content of these files to the worker. I think this is not feasible because it can potentially cause a lot of overhead when calling UDFs. And the user would still not be able to benefit from rich tracebacks like the one provided by IPython.

Send serialized traceback from Python worker to JVM

To avoid the brittle string parsing, we could send the traceback object from the worker to the JVM, then deserialize in Python client. Unfortunately Python tracebacks are not pickleable, so we would still need to use a 3rd party library (tblib again) to serialize the traceback. Also, this would require a lot more changes across the codebase.

@wengh wengh changed the title [WIP] Python UDF traceback improvement [SPARK-51566] Python UDF traceback improvement Mar 20, 2025
@wengh wengh marked this pull request as ready for review March 20, 2025 00:26
@wengh wengh force-pushed the python-udf-traceback branch from 721f428 to 463f703 Compare March 20, 2025 15:59
@wengh wengh changed the title [SPARK-51566] Python UDF traceback improvement [SPARK-51566][PYTHON] Python UDF traceback improvement Mar 20, 2025
tb = Traceback.from_string(python_exception_string)
tb.populate_linecache()
return e.with_traceback(tb.as_traceback())
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a configuration, or at least environment variable if it's difficult to add a configuration here (for the cases when exceptions are thrown without JVM). Parsing exceptions can potentially cause a lot of performance overhead, e.g., if users are relaying on a lot of exceptions.

In addition, it would have to be BaseException if you absolutely want to catch all the exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing exceptions can potentially cause a lot of performance overhead

Parsing should take time linear to the size of the error message so I don't worry much about performance. Let me do a quick benchmark to validate this.

In addition, it would have to be BaseException if you absolutely want to catch all the exceptions.

Good point 👍 (although I don't think it could possibly throw stuff like KeyboardInterrupt or SystemExit, but let's be safe)

Copy link
Contributor Author

@wengh wengh Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a quick benchmark with different traceback sizes (1, 10, 100, 500, 1000)
https://perfpy.com/986

Benchmark Runtime
1 frame 297 us
10 frames 335 us
100 frames 2.41 ms
500 frames 12.4 ms
1000 frames 28.7 ms
1000 frames, failed to parse 3.13 ms

The default recursion limit is 1000 so tracebacks with more than 1000 frames should be unlikely.

Also, the results I see when I benchmark locally is around 1/5 of the above. Therefore performance shouldn't be a concern.

@wengh wengh requested a review from HyukjinKwon March 24, 2025 22:56
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me otherwise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants