[SPARK-51566][PYTHON] Python UDF traceback improvement #50313

wengh · 2025-03-18T23:50:39Z

Motivation

Currently, when a Python UDF raises an error, the traceback only includes the file name and the line number, but doesn't include the content of that specific line. This behavior is different from local code tracebacks that show the line content. See following example.

Error inside UDF.

Local error. Notice that IPython additionally includes more more lines around the line where error happens, with links to the notebook cell, making it even easier to understand.

What changes were proposed in this pull request?

This PR changes convert_exception to detect Python tracebacks in the JVM error message, parse them back to a traceback object, and include them in the converted exception. This way, these frames will be included in the exception traceback as if they were part of the call stack.

If we fail to parse the traceback, we will silently ignore the failure, keeping the original behavior. In either case, the original error message, with the traceback in string form, will always be included in the exception so that we don't lose any information.

This PR also introduces tblib, a lightweight library that allows parsing traceback from string. The library is included as a source file, with modifications to make it preserve the original line content when the file is not available anymore.

Example of improved traceback as displayed in IPython. Notice that the Python worker frames are concatenated to the exception so that IPython ultratb shows the code around the error line.

---------------------------------------------------------------------------
PythonException                           Traceback (most recent call last)
Cell In[3], line 8
      4 @udf(returnType=StringType())
      5 def foo(value):
      6     1 / 0
----> 8 spark.range(1).toDF("value").select(foo("value")).show()

File ~/personal/test-spark/.venv/lib/python3.9/site-packages/pyspark/sql/classic/dataframe.py:285, in DataFrame.show(self, n, truncate, vertical)
    284 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
--> 285     print(self._show_string(n, truncate, vertical))

...

File ~/personal/test-spark/.venv/lib/python3.9/site-packages/pyspark/errors/exceptions/captured.py:271, in capture_sql_exception.<locals>.deco(*a, **kw)
    267 converted = convert_exception(e.java_exception)
    268 if not isinstance(converted, UnknownException):
    269     # Hide where the exception came from that shows a non-Pythonic
    270     # JVM exception message.
--> 271     raise converted from None
    272 else:
    273     raise

Cell In[3], line 6, in foo()
      4 @udf(returnType=StringType())
      5 def foo(value):
----> 6     1 / 0

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/var/folders/jm/2m7c_fjj75v7gqlr5px0j8780000gp/T/ipykernel_6171/1200677507.py", line 6, in foo
ZeroDivisionError: division by zero

Why are the changes needed?

To improve debuggability of Python UDFs (and UDTFs, Data Sources, etc.) by recovering the traceback when calling the UDF using PySpark.

Does this PR introduce any user-facing change?

Yes. Exceptions converted from JVM will include additional traceback frames if the JVM error message includes a Python traceback.

How was this patch tested?

Unit tests and end to end tests in python/pyspark/errors/tests/test_traceback.py

Was this patch authored or co-authored using generative AI tooling?

No

What are the risks?

This change will break anything that assume that the frames of converted exception don't include UDF frames. This sounds unlikely though.
If there's a Python source file that is present on both the worker and the client, but with different content, then the traceback may show wrong line content for frames in that file.
If the traceback string format changes in a future Python version, then the parsing will break. A recent example is PEP 657 in Python 3.11 which introduced fine-grained error locations in traceback. If similar changes happen in the future, the unit tests should catch the change. In the worst case the parsing will fail and we will fall back to the original traceback string.
If the Python source file name contains a line break then parsing will return an incorrect traceback.

Why do the worker tracebacks not include the line content?

When Python turns a traceback into string, it looks up the line content from the file and the line number using the linecache module. This module contains a in memory cache from file name to lines, and when there's a miss, it reads the file from module globals or from the file system. IPython adds the cell content to the cache when the cell is executed.

When we run a UDF the function is pickled and sent to the driver JVM. The pickle doesn't include the source code nor the line cache so linecache will generally not be able to find the source code when it generates the traceback on the Python worker, unless the client code is in a .py file on the same host as the worker.

What alternative solutions did we consider?

This solution may be brittle since it depends on the specific traceback string format. And it also introduces a new 3rd party dependency. Here's some alternatives and why we didn't choose them.

Pass linecache to Python worker

If we can pass the linecache content to the worker, then the traceback on the worker will include the line content. To do this, we need to find which files are part of the pickle, and pass the content of these files to the worker. I think this is not feasible because it can potentially cause a lot of overhead when calling UDFs. And the user would still not be able to benefit from rich tracebacks like the one provided by IPython.

Send serialized traceback from Python worker to JVM

To avoid the brittle string parsing, we could send the traceback object from the worker to the JVM, then deserialize in Python client. Unfortunately Python tracebacks are not pickleable, so we would still need to use a 3rd party library (tblib again) to serialize the traceback. Also, this would require a lot more changes across the codebase.

python/pyspark/errors/tests/test_traceback.py

HyukjinKwon · 2025-03-24T03:29:15Z

python/pyspark/errors/exceptions/base.py

+        tb = Traceback.from_string(python_exception_string)
+        tb.populate_linecache()
+        return e.with_traceback(tb.as_traceback())
+    except Exception:


I think we need a configuration, or at least environment variable if it's difficult to add a configuration here (for the cases when exceptions are thrown without JVM). Parsing exceptions can potentially cause a lot of performance overhead, e.g., if users are relaying on a lot of exceptions.

In addition, it would have to be BaseException if you absolutely want to catch all the exceptions.

Parsing exceptions can potentially cause a lot of performance overhead

Parsing should take time linear to the size of the error message so I don't worry much about performance. Let me do a quick benchmark to validate this.

In addition, it would have to be BaseException if you absolutely want to catch all the exceptions.

Good point 👍 (although I don't think it could possibly throw stuff like KeyboardInterrupt or SystemExit, but let's be safe)

Here's a quick benchmark with different traceback sizes (1, 10, 100, 500, 1000)
https://perfpy.com/986

Benchmark Runtime

1 frame 297 us

10 frames 335 us

100 frames 2.41 ms

500 frames 12.4 ms

1000 frames 28.7 ms

1000 frames, failed to parse 3.13 ms

The default recursion limit is 1000 so tracebacks with more than 1000 frames should be unlikely.

Also, the results I see when I benchmark locally is around 1/5 of the above. Therefore performance shouldn't be a concern.

python/pyspark/errors/tests/test_traceback.py

LICENSE-binary

python/pyspark/errors/exceptions/base.py

python/pyspark/errors/tests/test_traceback.py

python/pyspark/errors/tests/connect/__init__.py

python/pyspark/errors/tests/test_traceback.py

HyukjinKwon

Looks fine to me otherwise

ueshin

What happens if spark.sql.pyspark.jvmStacktrace.enabled is enabled?

python/pyspark/errors/exceptions/base.py

wengh · 2025-03-26T23:48:00Z

What happens if spark.sql.pyspark.jvmStacktrace.enabled is enabled?

@ueshin
When that conf is enabled, the error message from JVM will contain the same Python traceback multiple times. But this parsing still works because it only parses the first traceback and stops when the traceback ends.

Added a test with conf on/off

allisonwang-db · 2025-03-27T20:55:12Z

Test failure is unrelated. Thanks, merging to master

github-actions bot added PYTHON BUILD labels Mar 18, 2025

wengh changed the title ~~[WIP] Python UDF traceback improvement~~ [SPARK-51566] Python UDF traceback improvement Mar 20, 2025

wengh marked this pull request as ready for review March 20, 2025 00:26

wengh force-pushed the python-udf-traceback branch from 721f428 to 463f703 Compare March 20, 2025 15:59

wengh changed the title ~~[SPARK-51566] Python UDF traceback improvement~~ [SPARK-51566][PYTHON] Python UDF traceback improvement Mar 20, 2025

HyukjinKwon reviewed Mar 24, 2025

View reviewed changes

python/pyspark/errors/tests/test_traceback.py Outdated Show resolved Hide resolved

HyukjinKwon reviewed Mar 24, 2025

View reviewed changes

python/pyspark/errors/tests/test_traceback.py Show resolved Hide resolved

HyukjinKwon reviewed Mar 24, 2025

View reviewed changes

python/pyspark/errors/tests/test_traceback.py Outdated Show resolved Hide resolved

HyukjinKwon reviewed Mar 24, 2025

View reviewed changes

LICENSE-binary Show resolved Hide resolved

github-actions bot added the CONNECT label Mar 24, 2025

wengh requested a review from HyukjinKwon March 24, 2025 22:56