Allow more column types to be interpolated #421

Merged (7 commits) on Jun 30, 2025

BrianDeacon
Contributor

@BrianDeacon BrianDeacon commented Mar 17, 2025

linear and zero interpolation methods continue to require numeric column types, but ffill, bfill, and null will work on any column type.

Changes

Rather than a blanket rejection of all non-numeric column types, the requirements are applied on a per-column basis depending on the interpolation method requested. A ValueError is still raised when the column type doesn't work with that method, but the check now happens at the time the column is interpolated.
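
To make the per-column check concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes plain strings stand in for Spark data types, and the names `is_numeric`, `is_valid_method_for_column`, and `check_columns` are hypothetical helpers for illustration only.

```python
# Hypothetical sketch: type names are plain strings standing in for Spark data types.
NUMERIC_ONLY_METHODS = {"linear", "zero"}
ANY_TYPE_METHODS = {"ffill", "bfill", "null"}
NUMERIC_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(type_name: str) -> bool:
    """Return True for numeric type names, including parameterized decimals."""
    return type_name in NUMERIC_TYPE_NAMES or type_name.startswith("decimal")


def is_valid_method_for_column(schema: dict, method: str, target_col: str) -> bool:
    """Decide per column whether the requested method can be applied."""
    if method in ANY_TYPE_METHODS:
        return True  # fill-style methods never do arithmetic on the values
    if method in NUMERIC_ONLY_METHODS:
        return is_numeric(schema[target_col])
    raise ValueError(f"Unknown interpolation method '{method}'")


def check_columns(schema: dict, method: str, target_cols: list) -> list:
    """Validate each column at the point where it would be interpolated."""
    for target_col in target_cols:
        if not is_valid_method_for_column(schema, method, target_col):
            raise ValueError(
                f"Interpolation method '{method}' is not supported for "
                f"column '{target_col}' of type '{schema[target_col]}'"
            )
    return list(target_cols)
```

With a schema such as `{"price": "double", "venue": "string"}`, `ffill` passes for both columns, while `linear` passes for `price` and raises only when `venue` is reached.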

Linked issues

Resolves #420

Functionality

  • added relevant user documentation
  • added a new Class method
  • modified existing Class method: ...
  • added a new function
  • modified existing function: ...
  • added a new test
  • modified existing test: ...
  • added a new example
  • modified existing example: ...
  • added a new utility
  • modified existing utility: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

@BrianDeacon BrianDeacon changed the title Allow resampling to only reject column types that cannot work for the… Allow interpolation to only reject column types that cannot work for the… Mar 20, 2025
@BrianDeacon BrianDeacon changed the title Allow interpolation to only reject column types that cannot work for the… Allow more column types to be interpolated Mar 20, 2025
@BrianDeacon BrianDeacon marked this pull request as ready for review March 21, 2025 20:09
@R7L208
Contributor

R7L208 commented Mar 25, 2025

Hey @BrianDeacon - Given the Databricks License, and that our Databricks Labs workflow for accepting external contributions broke, I'm not sure if this can be merged. I am checking on options for you.

cc: @pohlposition @gueniai

@BrianDeacon
Contributor Author

> Hey @BrianDeacon - Given the Databricks License, and that our Databricks Labs workflow for accepting external contributions broke, I'm not sure if this can be merged. I am checking on options for you.
>
> cc: @pohlposition @gueniai

I mean, I'm pretty sure this just fell off a truck, and you wrote it @R7L208

;)

@R7L208
Contributor

R7L208 commented May 27, 2025

@BrianDeacon, if you're still interested in authoring this, could you email me the below info to [email protected] to get you whitelisted?

  • Contributor’s name
  • Contributor’s email address
  • Contributor’s GitHub handle
  • Contributor’s static IP address

@BrianDeacon
Contributor Author

BrianDeacon commented Jun 8, 2025

> • Contributor’s static IP address

I've emailed Lorin. Is the static ip address a requirement?

Lol, you're Lorin. :)

@R7L208
Contributor

R7L208 commented Jun 12, 2025

#425 should fix the mypy errors. Once that's merged then you can pull the latest from master into this branch

@kwang-databricks fyi

@R7L208
Contributor

R7L208 commented Jun 16, 2025

@BrianDeacon - #425 was merged into master. Can you update your fork & branch? That PR should resolve the lint issues that are unrelated to your changes

@BrianDeacon
Contributor Author

> @BrianDeacon - #425 was merged into master. Can you update your fork & branch? That PR should resolve the lint issues that are unrelated to your changes

Done!

@BrianDeacon
Contributor Author

@R7L208 Anything else?

@BrianDeacon
Contributor Author

Looks like the pipeline uncovered that I had some unit test code that wasn't compatible with older versions of pyspark. That's fixed now.

Contributor

@R7L208 R7L208 left a comment

Found a few small things but otherwise looks great!


if not self._is_valid_method_for_column(series, method, target_col):
    raise ValueError(
        f"Interpolation method '{method}' is not supported for column '{target_col}' of type '{series.schema[target_col].dataType}'"

Contributor

@BrianDeacon - can you clarify the error message to indicate the column must be of NumericType but instead received a non-numeric type of {}
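
One possible shape for the message requested here, as a sketch only: the thread does not show the wording that was finally merged, so both the helper name `numeric_type_error_message` and the exact text are hypothetical.

```python
def numeric_type_error_message(method: str, target_col: str, actual_type: str) -> str:
    """Build an error message that names the required and the received type.

    Hypothetical helper: the wording is illustrative, not the merged message.
    """
    return (
        f"Interpolation method '{method}' requires column '{target_col}' "
        f"to be of NumericType, but received non-numeric type '{actual_type}'"
    )
```

Stating both the expected type family and the type actually received lets the caller see at a glance whether to cast the column or switch to a fill-style method.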

ValueError,
self.interpolate_helper.interpolate,
simple_input_tsdf,
freq="30 seconds", func="ceil", method="linear", ts_col="event_ts",

Contributor

This should be method="zero" IIUC. Can you update and make sure test still passes?

Comment on lines 176 to 185
if "decimal_convert" in self.df:
    for decimal_col in self.df["decimal_convert"]:
        if "." in date_col:
            col, field = date_col.split(".")
            convert_field_expr = sfn.col(col).getField(field).cast("decimal")
            df = df.withColumn(
                col, sfn.col(col).withField(field, convert_field_expr)
            )
        else:
            df = df.withColumn(decimal_col, sfn.col(decimal_col).cast("decimal"))

Contributor

I believe this is a bug here and you need decimal_col inside of the for loop instead of date_col for the if expression and call of split(). Can you double check this?
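
A slip like this can go unnoticed because a Python loop variable stays bound after its loop ends, so a leftover name from an earlier loop still resolves instead of raising an error. The following is a minimal sketch, assuming an earlier loop over "date_convert" is what binds `date_col` (the thread does not show that code); the function names and the column names "metrics.price" and "volume" are hypothetical.

```python
def dotted_columns_buggy(config: dict) -> list:
    """Collect dotted decimal columns, but test the wrong (stale) variable."""
    for date_col in config.get("date_convert", []):
        pass  # date conversion elided; date_col stays bound after this loop
    dotted = []
    for decimal_col in config.get("decimal_convert", []):
        if "." in date_col:  # bug: checks the last date column, not decimal_col
            dotted.append(decimal_col)
    return dotted


def dotted_columns_fixed(config: dict) -> list:
    """Collect dotted decimal columns by testing the current loop variable."""
    dotted = []
    for decimal_col in config.get("decimal_convert", []):
        if "." in decimal_col:
            dotted.append(decimal_col)
    return dotted


config = {
    "date_convert": ["date_col"],
    "decimal_convert": ["metrics.price", "volume"],
}
print(dotted_columns_buggy(config))  # the nested column is missed
print(dotted_columns_fixed(config))  # the nested column is found
```

In the buggy version the nested-field branch is only taken when the last date column happens to contain a dot, which is consistent with tests passing as long as no configured column name contains one.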

Contributor Author

Live by the clipboard, die by the clipboard. :) Fixed.

Can you sanity check me? As far as I can tell, none of the test configs pass in data that would hit these if "." branches. I'm guessing these got pulled in from some other code base? I'm not suggesting to yank it out, but I just wanted to make sure I wasn't misunderstanding how this works.

Contributor

It's used in the interpol tests to manage decimal accuracy when converting data to DecimalType

Contributor Author

But do the other variations ever run into a column with a "." in the name? These _convert variations are all to process these entries, right? So I just got "lucky" that none of these bits in the config had a . in the name?

"date_convert": ["date_col"],

@BrianDeacon
Contributor Author

@R7L208 Ready for another look. 👍

@@ -176,7 +176,7 @@ def as_sdf(self) -> DataFrame:
        if "decimal_convert" in self.df:
            for decimal_col in self.df["decimal_convert"]:
                if "." in date_col:

Contributor

Suggested change
- if "." in date_col:
+ if "." in decimal_col:

Contributor Author

:face_palm:
Fixed

Contributor

@R7L208 R7L208 left a comment

One final change, I think, as long as the tests pass.

Contributor

@R7L208 R7L208 left a comment

LGTM! Thanks @BrianDeacon!

@R7L208 R7L208 merged commit 74c2b07 into databrickslabs:master Jun 30, 2025
7 checks passed
Successfully merging this pull request may close these issues.

[FEATURE]: Support more column types for resampling and interpolating
3 participants