Bug Report - Correlation.corr() fails when input DataFrame is empty in Spark #1722

Open
@minseokim12

Description

Current Behaviour

When profiling a Spark DataFrame with ydata-profiling, an exception is raised if the dataset is empty after filtering down to numeric columns, because Correlation.corr() does not handle empty DataFrames. The failure occurs in _compute_spark_corr_natively, which converts the DataFrame into a feature vector and then computes the correlation on it.

The code does not check if df_vector is empty before calling Correlation.corr(), leading to a RuntimeException from Spark.
The process crashes instead of handling the case properly.

Expected Behaviour

If the DataFrame is empty after filtering, Correlation.corr() should be skipped gracefully instead of raising an exception.
The function _compute_spark_corr_natively should check if df_vector is empty before calling Correlation.corr().
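
The check described above could look roughly like the following. This is only a sketch, not the library's actual code: the helper name corr_or_none and its corr_fn parameter are hypothetical, introduced here so the guard can be shown (and exercised) in isolation. It assumes df_vector is a Spark DataFrame holding a single vector column.

```python
from typing import Any, Callable, Optional


def corr_or_none(
    df_vector: Any,
    vector_col: str,
    method: str = "pearson",
    corr_fn: Optional[Callable[..., Any]] = None,
) -> Optional[Any]:
    """Return the correlation matrix, or None when df_vector has no rows.

    Hypothetical helper: `corr_or_none` and `corr_fn` are not part of
    ydata-profiling or PySpark.
    """
    # take(1) fetches at most one row, so the emptiness check does not
    # require a full count() over the dataset.
    if len(df_vector.take(1)) == 0:
        return None

    if corr_fn is None:
        # Imported lazily so the guard itself has no hard Spark dependency.
        from pyspark.ml.stat import Correlation

        corr_fn = Correlation.corr

    # Correlation.corr returns a one-row DataFrame whose first cell
    # holds the correlation matrix.
    return corr_fn(df_vector, vector_col, method).head()[0]
```

A caller would then treat a None result as "no correlation available" and skip that section of the report instead of crashing.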

Data Description

The error occurs on a Spark DataFrame containing columns whose cells are all empty, or in which 98~99% of the values are missing.
(I did not see this error after converting the same data to a Pandas DataFrame.)

Code that reproduces the bug

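from pyspark.sql.functions import col, to_json
from pyspark.sql.types import (
    ArrayType,
    DateType,
    MapType,
    StructType,
    TimestampType,
)
from ydata_profiling import ProfileReport

# Assumes an active SparkSession `spark` and a string `app_name` are already defined.

# Sample 10% of the table
df = spark.sql(
    "select * from @@@@.@@@@ where rand() < 0.1"
).cache()

# Type casting 1: cast date/timestamp columns to string
df_casted = df.select(
    [
        (
            col(field.name).cast("string").alias(field.name)
            if isinstance(field.dataType, (DateType, TimestampType))
            else col(field.name)
        )
        for field in df.schema.fields
    ]
)

# Type casting 2: serialize array/map/struct columns to JSON strings
complex_columns = [
    field.name
    for field in df.schema.fields
    if isinstance(field.dataType, (ArrayType, MapType, StructType))
]
for col_name in complex_columns:
    df_casted = df_casted.withColumn(col_name, to_json(col(col_name)))

profile = ProfileReport(df_casted, title=app_name, explorative=True)
profile.to_file("/tmp/ydata.html")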

pandas-profiling version

v2.2.3

Dependencies

dependencies:
  - bzip2=1.0.8
  - ca-certificates=2025.1.31
  - conda-pack=0.8.1
  - libffi=3.4.2
  - liblzma=5.6.4
  - libsqlite=3.49.1
  - libzlib=1.3.1
  - ncurses=6.5
  - openssl=3.4.1
  - pip=25.0.1
  - pyspark=3.5.3
  - python=3.9.21
  - readline=8.2
  - setuptools=75.8.2
  - tk=8.6.13
  - wheel=0.45.1
  - pip:
      - executing==2.2.0
      - fastjsonschema==2.21.1
      - great-expectations==0.18.22
      - jupyter-events==0.12.0
      - notebook-shim==0.2.4
      - pandocfilters==1.5.1
      - phik==0.12.4
      - pydantic-core==2.27.2
      - python-json-logger==3.2.1
      - ruamel-yaml-clib==0.2.12
      - soupsieve==2.6
      - stack-data==0.6.3
      - tzdata==2025.1
      - ydata-profiling==4.12.2

OS

macos

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
