Description
Current Behaviour
When using ydata-profiling with Spark, an exception is raised if the dataset is empty after filtering to numeric columns, because Correlation.corr() does not handle empty DataFrames. The failure occurs in _compute_spark_corr_natively, which converts the DataFrame into a feature vector and then computes the correlation.
The code does not check whether df_vector is empty before calling Correlation.corr(), which leads to a RuntimeException from Spark.
As a result, the process crashes instead of handling the empty case.
Expected Behaviour
If the DataFrame is empty after filtering, Correlation.corr() should be skipped gracefully instead of raising an exception.
The function _compute_spark_corr_natively should check if df_vector is empty before calling Correlation.corr().
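A minimal sketch of the kind of guard described above. The helper name corr_if_not_empty and its compute_corr parameter are hypothetical and not part of ydata-profiling; the sketch only assumes that df_vector exposes take(n), as PySpark DataFrames do.

```python
from typing import Any, Callable, Optional


def corr_if_not_empty(
    df_vector: Any, compute_corr: Callable[[Any], Any]
) -> Optional[Any]:
    """Call compute_corr(df_vector) only when df_vector has at least one row.

    Hypothetical helper, not part of ydata-profiling. df_vector only needs a
    take(n) method that returns a list of rows, which PySpark DataFrames provide.
    """
    # take(1) fetches at most one row, so this avoids a full count().
    if len(df_vector.take(1)) == 0:
        # Nothing to correlate; the caller decides how to report that.
        return None
    return compute_corr(df_vector)
```

With PySpark, the callable could be something like `lambda d: Correlation.corr(d, vector_col, method=corr_type)`, where vector_col and corr_type are assumed to be the names already in scope inside _compute_spark_corr_natively. Returning None is one option; the caller would still need to decide how an absent correlation matrix is shown in the report.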
Data Description
The Spark DataFrame contains columns whose cells are all empty or are 98~99% missing.
(I did not see this error when I converted the data to a pandas DataFrame.)
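One possible explanation for the difference, offered as an assumption rather than a confirmed diagnosis: if the feature-vector step drops every row that has a null in any numeric column (which is what Spark's VectorAssembler does with handleInvalid="skip"), then a few mostly-missing columns can leave no rows at all, whereas pandas' DataFrame.corr excludes missing values pair by pair. The plain-Python sketch below uses made-up data to show how row-wise skipping can empty a dataset.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[float], ...]


def drop_incomplete_rows(rows: List[Row]) -> List[Row]:
    """Keep only rows in which every value is present (not None)."""
    return [row for row in rows if all(value is not None for value in row)]


# Made-up sample: each column is mostly missing and no row is fully populated.
sample = [
    (1.0, None, None),
    (2.0, 3.0, None),
    (None, None, 4.0),
]

remaining = drop_incomplete_rows(sample)
print(len(remaining))  # 0
```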
Code that reproduces the bug
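# Imports needed by the snippet; `spark` (a SparkSession) and `app_name` are assumed to be defined already
from pyspark.sql.functions import col, to_json
from pyspark.sql.types import ArrayType, DateType, MapType, StructType, TimestampType
from ydata_profiling import ProfileReport

# Sample 10%
df = spark.sql(
    "select * from @@@@.@@@@ where rand() < 0.1"
).cache()

# type casting 1: cast date and timestamp columns to string
df_casted = df.select(
    [
        (
            col(field.name).cast("string").alias(field.name)
            if isinstance(field.dataType, (DateType, TimestampType))
            else col(field.name)
        )
        for field in df.schema
    ]
)

# type casting 2: serialize array, map and struct columns to JSON strings
complex_columns = [
    field.name
    for field in df.schema.fields
    if isinstance(field.dataType, (ArrayType, MapType, StructType))
]
for col_name in complex_columns:
    df_casted = df_casted.withColumn(col_name, to_json(col(col_name)))

profile = ProfileReport(df_casted, title=app_name, explorative=True)
profile.to_file("/tmp/ydata.html")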
pandas-profiling version
v2.2.3
Dependencies
dependencies:
  - bzip2=1.0.8
  - ca-certificates=2025.1.31
  - conda-pack=0.8.1
  - libffi=3.4.2
  - liblzma=5.6.4
  - libsqlite=3.49.1
  - libzlib=1.3.1
  - ncurses=6.5
  - openssl=3.4.1
  - pip=25.0.1
  - pyspark=3.5.3
  - python=3.9.21
  - readline=8.2
  - setuptools=75.8.2
  - tk=8.6.13
  - wheel=0.45.1
  - pip:
    - executing==2.2.0
    - fastjsonschema==2.21.1
    - great-expectations==0.18.22
    - jupyter-events==0.12.0
    - notebook-shim==0.2.4
    - pandocfilters==1.5.1
    - phik==0.12.4
    - pydantic-core==2.27.2
    - python-json-logger==3.2.1
    - ruamel-yaml-clib==0.2.12
    - soupsieve==2.6
    - stack-data==0.6.3
    - tzdata==2025.1
    - ydata-profiling==4.12.2
OS
macOS
Checklist
- There is not yet another bug report for this issue in the issue tracker
- The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- The issue has not been resolved by the entries listed under Common Issues.