
Conversation

@BrianDeacon (Contributor) commented Mar 17, 2025

The linear and zero interpolation methods continue to require numeric column types, but ffill, bfill, and null now work on any column type.

Changes

Rather than rejecting all non-numeric column types outright, the type requirements are now applied per column, depending on the interpolation method requested. A ValueError is still raised when a column's type is incompatible, but the check happens at the point where that column is actually interpolated.
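
For illustration, here's a minimal sketch of that per-column check, written as a free function (the real code uses a method named _is_valid_method_for_column, visible in the review below; the exact schema-inspection logic here is an assumption, not the merged implementation):

from pyspark.sql import DataFrame
from pyspark.sql.types import NumericType

# Methods that compute new values arithmetically, so they need numeric columns.
NUMERIC_ONLY_METHODS = {"linear", "zero"}

def is_valid_method_for_column(series: DataFrame, method: str, target_col: str) -> bool:
    # ffill, bfill, and null only copy or blank out existing values, so they
    # work for any column type; linear and zero require a NumericType column.
    if method not in NUMERIC_ONLY_METHODS:
        return True
    return isinstance(series.schema[target_col].dataType, NumericType)

With a check like this, a string column interpolated with ffill passes, while the same column with method="linear" raises the ValueError at interpolation time.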

Linked issues

Resolves #420

Functionality

  • added relevant user documentation
  • added a new Class method
  • modified existing Class method: ...
  • added a new function
  • modified existing function: ...
  • added a new test
  • modified existing test: ...
  • added a new example
  • modified existing example: ...
  • added a new utility
  • modified existing utility: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

@BrianDeacon BrianDeacon changed the title Allow resampling to only reject column types that cannot work for the… Allow interpolation to only reject column types that cannot work for the… Mar 20, 2025
@BrianDeacon BrianDeacon changed the title Allow interpolation to only reject column types that cannot work for the… Allow more column types to be interpolated Mar 20, 2025
@BrianDeacon BrianDeacon marked this pull request as ready for review March 21, 2025 20:09
@R7L208 (Collaborator) commented Mar 25, 2025

Hey @BrianDeacon - Given the Databricks License, and that our Databricks Labs workflow for accepting external contributions broke, I'm not sure if this can be merged. I'm checking on options for you.

cc: @pohlposition @gueniai

@BrianDeacon (Contributor, Author) replied:
> Hey @BrianDeacon - Given the Databricks License, and that our Databricks Labs workflow for accepting external contributions broke, I'm not sure if this can be merged. I'm checking on options for you.
>
> cc: @pohlposition @gueniai

I mean, I'm pretty sure this just fell off a truck, and you wrote it @R7L208

;)

@R7L208 (Collaborator) commented May 27, 2025

@BrianDeacon, if you're still interested in authoring this, could you email the info below to [email protected] so I can get you whitelisted?

  • Contributor’s name
  • Contributor’s email address
  • Contributor’s GitHub handle
  • Contributor’s static IP address

@BrianDeacon (Contributor, Author) commented Jun 8, 2025

> • Contributor’s static IP address

I've emailed Lorin. Is the static IP address a requirement?

Lol, you're Lorin. :)

@R7L208 (Collaborator) commented Jun 12, 2025

#425 should fix the mypy errors. Once that's merged, you can pull the latest from master into this branch.

@kwang-databricks fyi

@R7L208 (Collaborator) commented Jun 16, 2025

@BrianDeacon - #425 was merged into master. Can you update your fork & branch? That PR should resolve the lint issues that are unrelated to your changes.

@BrianDeacon (Contributor, Author) replied:
> @BrianDeacon - #425 was merged into master. Can you update your fork & branch? That PR should resolve the lint issues that are unrelated to your changes.

Done!

@BrianDeacon (Contributor, Author):

@R7L208 Anything else?

@BrianDeacon (Contributor, Author):

Looks like the pipeline uncovered some unit test code of mine that wasn't compatible with older versions of PySpark. That's fixed now.

@R7L208 (Collaborator) left a review:

Found a few small things but otherwise looks great!


if not self._is_valid_method_for_column(series, method, target_col):
    raise ValueError(
        f"Interpolation method '{method}' is not supported for column "
        f"'{target_col}' of type '{series.schema[target_col].dataType}'"
    )
@R7L208 (Collaborator):

@BrianDeacon - can you clarify the error message to indicate the column must be of NumericType but instead received a non-numeric type of {}
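
Something along these lines, for example (a sketch of the requested wording, not necessarily the exact message that was merged):

raise ValueError(
    f"Interpolation method '{method}' requires column '{target_col}' to be "
    f"of NumericType, but instead received a non-numeric type of "
    f"{series.schema[target_col].dataType}"
)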

ValueError,
self.interpolate_helper.interpolate,
simple_input_tsdf,
freq="30 seconds", func="ceil", method="linear", ts_col="event_ts",
@R7L208 (Collaborator):

This should be method="zero", IIUC. Can you update it and make sure the test still passes?
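
That is, presumably only the method kwarg changes:

freq="30 seconds", func="ceil", method="zero", ts_col="event_ts",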

Comment on lines 176 to 185
if "decimal_convert" in self.df:
for decimal_col in self.df["decimal_convert"]:
if "." in date_col:
col, field = date_col.split(".")
convert_field_expr = sfn.col(col).getField(field).cast("decimal")
df = df.withColumn(
col, sfn.col(col).withField(field, convert_field_expr)
)
else:
df = df.withColumn(decimal_col, sfn.col(decimal_col).cast("decimal"))
@R7L208 (Collaborator):

I believe there's a bug here: you need decimal_col inside the for loop instead of date_col, both in the if expression and in the call to split(). Can you double-check this?

@BrianDeacon (Contributor, Author):

Live by the clipboard, die by the clipboard. :) Fixed.

Can you sanity-check me? As far as I can tell, none of the test configs pass in data that would hit these if "." branches. I'm guessing these got pulled in from some other codebase? I'm not suggesting we yank it out; I just wanted to make sure I wasn't misunderstanding how this works.

@R7L208 (Collaborator):

It's used in the interpolation tests to manage decimal accuracy when converting data to DecimalType.

@BrianDeacon (Contributor, Author):

But do the other variations ever run into a column with a "." in the name? These _convert variations all exist to process these entries, right? So I just got "lucky" that none of the entries in the config had a "." in the name?

"date_convert": ["date_col"],

@BrianDeacon (Contributor, Author):

@R7L208 Ready for another look. 👍


if "decimal_convert" in self.df:
for decimal_col in self.df["decimal_convert"]:
if "." in date_col:
@R7L208 (Collaborator):

Suggested change
-        if "." in date_col:
+        if "." in decimal_col:

@BrianDeacon (Contributor, Author):

:facepalm:
Fixed.

@R7L208 (Collaborator) left a review:

One final change, I think, as long as tests pass.

@R7L208 (Collaborator) left a review:

LGTM! Thanks @BrianDeacon!

@R7L208 R7L208 merged commit 74c2b07 into databrickslabs:master Jun 30, 2025
7 checks passed
R7L208 added a commit that referenced this pull request Jul 24, 2025
* making contributing.md slightly more clear (#422)

* making contributing.md slightly more clear

* Remove `Analyze` job from test.yml (#423)

- This job called CodeQL which is broken due to new firewall rules

---------

Co-authored-by: Lorin Dawson <[email protected]>

* created prelim makefile with tox commands, updated contributing.md (#424)

* created prelim makefile with tox commands, updated contributing.md

* adding 3.11 to docs, and updating create-env in makefile to install all necessary python versions

* removing 3.8

* Update Makefile to improve test and environment management

Added commands for creating virtual environments, running tests, and generating coverage reports. Enhanced documentation for supported DBR versions and added new command options for better usability.

* Update Makefile to conditionally install Python versions

Modified the `venv` target to check for existing Python versions before installing them. This prevents redundant installations and ensures only missing versions are installed via `pyenv`.

* Update Makefile to use dynamic virtualenv variable

Replaced hardcoded `.venv` with the `$(VENV)` variable in the Makefile. This allows for greater flexibility and customization of the virtual environment directory name.

* Fix typo in Makefile target comment

Corrected a misspelling in the comment for the `test` target. Changed "testtests" to "tests" for improved clarity and professionalism.

---------

Co-authored-by: Lorin <[email protected]>

* [Chore] Cicd updates 02 (#425)

* created prelim makefile with tox commands, updated contributing.md

* adding 3.11 to docs, and updating create-env in makefile to install all necessary python versions

* removing 3.8

* updating git workflows to use make

* create coverage report in test

* remove commented tox command

* update check_for_nan, force bool returns

* updates to io.py to pass tests

* testing out change to tsdf to account for new mypy restrictions

* updating mypy error handling

* revert

* updating mypy.ini to ignore scipy

* remove edits to tox.ini to show versions

* - Fix `bool` conversions in `tempo/intervals.py` to ensure consistent type handling.
- Correct `tox.ini` basepython to `py311` and manage additional dependencies.
- Clarify complex number handling in `fourier_transform` and ensure float usage in `fftfreq`.
- Expand `Makefile` functionality with environment checks for Python and Java setups.

* Add `_cleanup_delta_warehouse` method for Delta test environment cleanup.

Integrate pre- and post-test cleanup of Delta warehouse directories and metastore files in `setUpClass` and `tearDownClass` to ensure a consistent test environment.

* Expand Makefile with enhanced Python environment checks and management.

- Add `.claude` to `.gitignore`.
- Replace `check-pyenv` with `check-python` to support system Python and automate `pyenv` installation.
- Introduce `setup-python-versions` and `setup-all-python-versions` targets for flexible Python version setups.
- Update `venv`, `test`, and `test-all` targets to utilize new utilities.

---------

Co-authored-by: Lorin <[email protected]>

* chore: code formatting and linting updates

* Upgrading Tox to Hatch, and Updating respective commands in Makefile  (#426)

* created prelim makefile with tox commands, updated contributing.md

* adding 3.11 to docs, and updating create-env in makefile to install all necessary python versions

* removing 3.8

* updating git workflows to use make

* create coverage report in test

* initial hatch commit

* updates to makefile

* updating github workflow dependencies to install hatch instead of tox

* fixing posargs issue in lint

* fixing type checker

* adding version call so hatch knows what to pick up

* using correct method in version.py

* adding get_version to version.py for hatch environment creation

* adding semver as dep in git

* using vcs as hatch version

* updating version.py to dynamically pull version, and semver as dep in all testenvs

* checking semver install

* updating semver

* fixing var ref before assignment in version.py

* fixing correct error type

* getting around coverage circ dep

* forgot to update makefile

* remove commented tox command

* update check_for_nan, force bool returns

* updates to io.py to pass tests

* testing out change to tsdf to account for new mypy restrictions

* updating mypy error handling

* revert

* updating mypy.ini to ignore scipy

* remove edits to tox.ini to show versions

* linting fix

* refactor: simplify resampling and interpolation logic

- Removed `_ResampledTSDF` class in favor of integrating resampling metadata (`_resample_freq`, `_resample_func`) into `TSDF`.
- Improved error handling for `freq` and `func` parameters in resample and interpolation methods.
- Updated interpolation logic to utilize a resampled TSDF object and mapped predefined fill methods.
- Adjusted references to `partitionCols` by transitioning to `series_ids`.
- Added `parameterized` dependency to `pyproject.toml` for enhanced test capabilities.
- Reduced circular imports and restructured imports for maintainability.
- Updated tests to align with changes in resample and interpolation workflows.

* refactor: extract resample helper functions to `resample_utils` and update tests

- Moved reusable functions and constants (e.g., `validateFuncExists`, `checkAllowableFreq`) to `resample_utils` for better modularity.
- Updated `resample`, `tsdf`, and associated tests to use the refactored helper functions.
- Simplified test data construction by introducing `get_test_function_df_builder`.

* core refactor done: enhance resampling logic and consolidate column handling behavior

- Introduced `AGG_KEY` constant for consistent grouping across resample functions.
- Refined column handling logic in `aggregate` and `resample` to align with best practices, emphasizing explicit configurations and preserving observational columns by default.
- Updated `calc_bars` to seamlessly integrate OHLC bar calculations with enhanced column handling.
- Adjusted tests to use more descriptive data keys (`input_data`, `expected_data`) for better clarity.
- Updated tests to align with modified column handling and aggregation behavior.

* refactor: introduce `resample_utils` module for shared resampling utilities

- Added `resample_utils.py` to encapsulate utility functions (`checkAllowableFreq`, `validateFuncExists`) and constants for resampling.
- Defined global frequency and aggregation options for modularity.
- Improved frequency validation logic with `freq_dict` and `ALLOWED_FREQ_KEYS`.

* refactor: optimize time bounds calculation in resample logic. All resample tests pass

- Replaced window function-based approach with `groupBy` for calculating time bounds, reducing duplicate computations.
- Improved handling of `series_ids` for partitioning, ensuring accuracy in sequence generation.
- Simplified logic for timestamp sequence generation using `time_bounds`.

* refactor: update test data construction and skip conditions for enhanced clarity and precision

- Consolidated shared test data definitions and references using `$ref`.
- Adjusted timestamp handling and sequence column logic in various tests.
- Added skip conditions for tests involving composite timestamp indexes or complex timestamp structures.
- Refactored utility tests to improve consistency in `get_test_function_df_builder` usage.

* refactor: replace `get_test_df_builder` with `get_test_function_df_builder` in tests for consistent data building

* refactor: add PySpark 3.5+ compatibility and provide fallbacks for older versions

- Introduced version checks for PySpark features like `count_if` and `bool_or`.
- Added compatibility wrapper `_bool_or_compat` for `bool_or` functionality.
- Updated segment handling logic with fallbacks for PySpark < 3.5, ensuring robust interpolation behavior.

* Allow more column types to be interpolated (#421)

* Allow interpolation to only reject column types that cannot work for the specific method used

* Fix unit test incompatibility with dbr113

* Fix tests and clarify error message

* Fix incorrect column reference in test

---------

Co-authored-by: Brian Deacon <[email protected]>

* update: extend .gitignore to include `.venv-*` directory pattern

* chore linting: improve type annotations, imports, and formatting for code consistency

- Replaced redundant `Any` in type hints with more precise alternatives (e.g., `object`, `TSDF`, `Collection`).
- Converted `TimeUnitsType` from `NamedTuple` creation to a class for improved readability.
- Consolidated and reorganized imports across modules for better clarity.
- Removed unused imports and redundant `pass` statements in abstract methods.
- Standardized and fixed minor formatting issues (consistency in blank lines, indentation, and trailing spaces).

* refactor: improve type annotations, reorganize imports, and enhance join logic across modules

- Added precise type annotations (e.g., `Optional`, `Tuple`) to methods for better clarity and static analysis support.
- Refactored imports to include `# type: ignore` directives where necessary for untyped packages or compatibility.
- Enhanced `_toleranceFilter` and `_join` methods with placeholders and TODOs for future logic implementations.
- Introduced stricter validation for parameters (e.g., `freq` checks) to avoid runtime errors.
- Updated join constructors to handle default prefix values and improve readability.

* style: fix indentation to 4 spaces per PEP 8

* adding imports for IntervalsDF + pytest in dev.txt

* test: remove `intervals_tests.py` to clean up unused and redundant test cases

* test: remove `intervals_df_tests.json` to clean up deprecated and redundant test cases

* test: update exception type in `test_appendAggKey_freq_is_none`

- Replaced `TypeError` with `ValueError` in assertion to align with updated `_appendAggKey` behavior.

---------

Co-authored-by: kwang-databricks <[email protected]>
Co-authored-by: Brian Deacon <[email protected]>
Co-authored-by: Brian Deacon <[email protected]>


Development

Successfully merging this pull request may close these issues.

[FEATURE]: Support more column types for resampling and interpolating
