
@MCBoarder289 commented Dec 12, 2025

This is a single PR that addresses multiple issues at once:

I'm closing my other PRs (see list below) because once I got to resolving each issue individually, other rendering issues cropped up, so it's easier to test all of the fixes at once.

Main Goal
Combining all of the previous Spark issues, it all boils down to the fact that the Spark reports were inaccurate when compared with pandas. So my main goal here was to bring the Spark output closer to parity with the pandas output.

Example of pandas output on toy dataset:
[screenshot: Pandas_example_stats]
[screenshot: Pandas_example_common_values]

Example of broken/pre-fix Spark output on toy dataset:
[screenshot: Spark_wrong_stats]
[screenshot: Spark_wrong_common_values]

Fixed Spark Output
[screenshot: Spark_fixed_stats_new]
[screenshot: Spark_fixed_common_values_new]

Issues and Root Causes
There are a couple of commits in here that address specific root causes of these discrepancies. Here are the summarized issues with their solutions:

  • Issue 1: pandas by default counts NaN values as null in summary stats, but Spark SQL does not, so we explicitly address that in one of the commits.

    • Resolution: This was resolved by ensuring the numeric_stats_spark() method explicitly filters out nulls and NaNs to match pandas' default behavior (see the PySpark sketch after this list).
  • Issue 2: Missing values were not being calculated correctly because NaN in Spark is not null, so NaN values weren't counted as missing when they should have been.

    • Resolution: Add NaN filters to the n_missing computation in the describe_spark_counts() method.
  • Issue 3: Histogram counts and Common Values counts built from the summary["value_counts_without_nan"] Series were not summing counts correctly.

    • Resolution: Summing the count column (instead of counting rows) and removing the limit(200) brings everything to parity with the pandas output, as shown in the sketch below.

    • NOTE: Since we're pre-aggregating the data for the value counts, I don't think the limit(200) is necessary even with Spark. We're pulling this down into a pandas Series anyway, so if the data were too big the process would fail explicitly instead of producing a misleading report. If you're running this in Spark, you're likely using a machine with a good bit of memory anyway.
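
As a rough illustration of the Spark-side fixes above, here is a minimal, self-contained PySpark sketch on toy data (illustrative names only, not the actual ydata-profiling code): nulls and NaNs are filtered and counted explicitly, and totals come from summing the pre-aggregated count column rather than counting rows of the aggregated frame.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0,), (1.0,), (2.0,), (float("nan"),), (None,)], ["x"]
)

# Issues 1 & 2: Spark keeps NaN and NULL distinct, while pandas folds NaN into
# "missing", so both must be handled explicitly.
clean = df.filter(F.col("x").isNotNull() & ~F.isnan("x"))
clean.agg(F.min("x"), F.max("x"), F.stddev("x")).show()

n_missing = df.filter(F.col("x").isNull() | F.isnan("x")).count()  # 2

# Issue 3: the value counts are already aggregated, so the total number of
# observed values is the SUM of the count column; counting rows of the
# aggregated frame only gives the number of distinct values.
value_counts = clean.groupBy("x").count()
n_distinct = value_counts.count()                       # 2 distinct values
n_values = value_counts.agg(F.sum("count")).first()[0]  # 3 observations
```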

Concluding Thoughts
While there is still some very slight variation in the computed stats because Spark handles nulls/NaNs differently than pandas, I think this new output is acceptably close to the pandas version, and any remaining differences are ultimately negligible. That's especially true when compared with the initial output, where the differences are misleading without these fixes, and where some edge cases (e.g. all-null numeric columns) would prevent the report from completing/rendering at all.

@fabclmnt - Apologies for all of the tags, and I'm still open to all feedback on this approach! I'm happy to discuss further, and hope this is helpful to anyone using the Spark backend.

Misc. Notes:

@fabclmnt (Collaborator) commented:

Hi @MCBoarder289,

Thank you for the tags and for the contribution, it’s great to see these issues addressed 🙂

From my side, everything looks good. I just have a few small requests before we move forward:

  • Some test environments intended to validate only the pandas version are failing due to pyspark not being installed. Please ensure that Spark is imported only in the test files that explicitly require it, so that pandas and Spark tests remain properly isolated.
  • Please address the linter-suggested improvements.
  • The commit message should follow the project’s commit convention and be written in lowercase.

Thanks again for the contribution, and let me know if I can guide you through any of these points.

In the pandas implementation, numeric stats like min/max/stddev by default ignore null values. This commit updates the Spark implementation to more closely match that. The isnan() check is needed because pandas' isnull check counts NaN as null, but Spark does not.
The previous calculation of counts was actually counting rows of an already summarized dataframe, so it wasn't capturing the correct count for each instance of a value. This is fixed by summing the count column instead of performing a row count.
Discovered this edge case with real data; the rendering of an empty histogram still needs to be fixed. This change addresses issue ydataai#1602.

Computations in the summarize process produce floats when computed against decimal columns. To address this, we simply convert those types to a DoubleType when performing the numeric operations.
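
A minimal sketch of that cast on a toy DecimalType column (illustrative names, not the library's actual code):

```python
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
# Schema inference maps decimal.Decimal to a DecimalType column.
df = spark.createDataFrame([(Decimal("1.10"),), (Decimal("2.25"),)], ["amount"])

# Cast DecimalType to DoubleType before the numeric aggregations so the results
# come back as plain doubles rather than a mix of Decimal and float.
stats = (
    df.withColumn("amount", F.col("amount").cast(DoubleType()))
      .agg(F.mean("amount").alias("mean"), F.stddev("amount").alias("std"))
)
stats.show()
```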
Assembling a vector column in Spark with no numeric columns results in features with
a NULL size, NULL indices, and an empty list of values.
This causes an exception to be raised when computing correlations.

The solution here is to avoid computing the correlation matrix when there are no
interval columns (numeric).

This change addresses issue ydataai#1722.
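
Roughly, the guard described above looks like this sketch (hypothetical function name, not the project's actual API):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation


def correlation_matrix(df, interval_columns):
    """Return the correlation matrix, or None when there is nothing to correlate."""
    if not interval_columns:
        # No interval (numeric) columns: VectorAssembler would emit empty feature
        # vectors and Correlation.corr would raise, so skip the computation.
        return None

    vectors = (
        VectorAssembler(
            inputCols=interval_columns, outputCol="features", handleInvalid="skip"
        )
        .transform(df)
        .select("features")
    )
    return Correlation.corr(vectors, "features").head()[0]
```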
It implements a "N/A" string as the default when formatting NoneType values. This change addresses issue ydataai#1723.
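
A minimal sketch of that fallback (a hypothetical helper, not the library's actual formatter):

```python
def fmt_numeric(value, fmt="{:.5g}", na_rep="N/A"):
    """Format a value for the report, rendering None as "N/A" instead of failing."""
    if value is None:
        return na_rep
    return fmt.format(value)


assert fmt_numeric(3.14159) == "3.1416"
assert fmt_numeric(None) == "N/A"
```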
@MCBoarder289 force-pushed the fix/multiple-spark-enhancements branch 3 times, most recently from 3de47f3 to 8caefa8, on December 15, 2025 at 19:32.
@MCBoarder289 (Author) commented:

Hi @fabclmnt, thank you for all of your feedback here!

I've force-pushed all of my changes to this branch, which I believe should address all of those issues. A brief summary of what I updated:

  • Updated all of the commit messages to match the project’s commit convention.
  • Many of the mypy issues stemmed from me making the histogram functions return None when their input is invalid. I've wrapped those calls in a new utility function that renders an empty container in place of an image in those scenarios (see the sketch after this list). This required updating a few more report-rendering files, but I believe those changes are safe and are covered by the unit tests.
  • Moved all of the new issue tests that I added into the tests/backends/spark_backend directory.
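
For reference, the wrapper idea looks roughly like this (hypothetical names and return structure; the real code plugs into ydata-profiling's own report containers):

```python
from typing import Callable, Optional


def image_or_empty(plot_fn: Callable[[], Optional[str]], caption: str) -> dict:
    """Call a plotting function that may return None and degrade gracefully."""
    image = plot_fn()  # None when there is nothing to plot, e.g. an all-null column
    if image is None:
        # Render an empty placeholder so the report still builds instead of crashing.
        return {"type": "container", "items": [], "caption": caption}
    return {"type": "image", "image": image, "caption": caption}
```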

Can you please take a look and let me know if there's anything else that needs to be updated?

Thanks again for your time!

Addresses handling of completely null numeric columns, and gracefully handles empty correlation sets and plots.
@MCBoarder289 force-pushed the fix/multiple-spark-enhancements branch from 8caefa8 to aed14e3 on December 18, 2025 at 00:50.
@MCBoarder289 (Author) commented:

@fabclmnt - One additional minor force-push to remove some now-unused imports. Apologies for not catching that earlier, but with those other tests/actions you ran, I think this is ready for review now. Let me know if you have any other feedback!
