Add xarray SPI API, modern tooling, and numba acceleration #596
base: master
Conversation
- Add numba for JIT-compiled performance kernels (see the sketch below)
- Configure mypy with strict typing overrides
- Introduce pre-commit hooks for code quality
- Update lockfile with new dependencies
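The numba pattern referenced here is typically a plain NumPy loop compiled with `numba.njit`. A minimal sketch of the idea, using a hypothetical `_rolling_sum` kernel rather than the PR's actual `_pearson_fit()`/`_minimum_possible()` functions:

```python
# Illustrative only: the kernel name and body are assumptions, not the PR's code.
import numba
import numpy as np

@numba.njit(cache=True)
def _rolling_sum(values: np.ndarray, scale: int) -> np.ndarray:
    """JIT-compiled rolling sum over a 1-D series, NaN-padded at the start."""
    out = np.full(values.shape[0], np.nan)
    for i in range(scale - 1, values.shape[0]):
        total = 0.0
        for j in range(i - scale + 1, i + 1):
            total += values[j]
        out[i] = total
    return out
```

The explicit loops look naive in NumPy terms, but numba compiles them to machine code, which is why such kernels can outperform vectorized equivalents.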
- Implement spi_xarray() using apply_ufunc for chunked computation
- Add ClimateIndicesAccessor for DataArray.climate_indices.spi() (registration pattern sketched below)
- Register accessor in package __init__.py
- Include comprehensive accessor test coverage
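For orientation, the accessor mechanism is xarray's standard registration decorator. A minimal sketch, assuming a thin wrapper that delegates to the xarray SPI entry point (the class body is an assumption, not the PR's actual code; note that the tests later in this thread expose the accessor as `.indices`):

```python
import xarray as xr

@xr.register_dataarray_accessor("climate_indices")
class ClimateIndicesAccessor:
    """Expose climate-indices methods as DataArray.climate_indices.*"""

    def __init__(self, da: xr.DataArray):
        self._da = da

    def spi(self, scale: int, **kwargs) -> xr.DataArray:
        # Delegate to the package-level xarray SPI function (assumed signature).
        from climate_indices import indices
        return indices.spi_xarray(self._da, scale=scale, **kwargs)
```

Registering this in the package `__init__.py` is what makes the method available on any `DataArray` after a bare `import climate_indices`.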
- Replace shared-memory multiprocessing with apply_ufunc pipeline
- Support optional dask parallelism via --multiprocessing flag
- Handle daily 366-day calendar transforms for leap year alignment (see the sketch below)
- Maintain backward compatibility with saved fitting parameters
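The 366-day transform mentioned above pads every year to a common length so leap and non-leap years share one axis. A sketch of the reshaping idea only (helper name and signature are assumptions; the PR's real transform runs behind `apply_ufunc`):

```python
import numpy as np

def to_366day(per_year: list[np.ndarray]) -> np.ndarray:
    """Stack variable-length years (365 or 366 days) into a flat 366-day calendar."""
    out = np.full((len(per_year), 366), np.nan)
    for i, year_values in enumerate(per_year):
        # Non-leap years fill 365 slots and leave day 366 as NaN padding.
        out[i, : year_values.size] = year_values
    return out.ravel()
```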
Reviewer's Guide

Refactors SPI computation to an xarray/NumPy/Numba-based pipeline with an xarray SPI API and accessor, replaces the legacy shared-memory multiprocessing SPI CLI with dask-friendly apply_ufunc flows, and hardens core compute/Palmer kernels with typing and JIT acceleration, while adding modern tooling (pre-commit, mypy) and documentation/examples for contributors.

Sequence diagram for the new SPI CLI xarray/dask pipeline

```mermaid
sequenceDiagram
actor User
participant CLI as spi_cli_main
participant Dask as dask_Client
participant SPI as _compute_write_index
participant XR as xarray
participant IDX as indices_spi_xarray
participant FS as netcdf_filesystem
User->>CLI: invoke spi CLI with args
CLI->>CLI: parse_args and validate
CLI->>CLI: determine multiprocessing mode
alt multiprocessing all
CLI->>Dask: create Client(n_workers)
Dask-->>CLI: client ready
end
CLI->>SPI: _compute_write_index(keyword_arguments)
SPI->>XR: open_dataset(netcdf_precip, chunks time -1)
SPI->>SPI: select precip variable and normalize units
SPI->>SPI: transpose dims based on InputType
alt periodicity daily
SPI->>XR: apply_ufunc(transform_to_366day)
XR-->>SPI: precip_366day DataArray
end
SPI->>SPI: load_or_build_fitting_dataset
loop for each scale
SPI->>SPI: build fitting_var_names for scale
alt load_params is not None
SPI->>XR: read fitting parameters from ds_fitting
else save_params is not None
SPI->>XR: apply_ufunc(sum_to_scale)
SPI->>XR: apply_ufunc(gamma_parameters or pearson_parameters)
XR-->>SPI: alphas betas prob_zero loc scale skew
SPI->>XR: write parameters into ds_fitting
end
loop for distribution in gamma pearson
SPI->>IDX: spi_xarray(precip_da, scale, distribution, fitting_params)
IDX->>XR: apply_ufunc(spi_1d, dask parallelized)
XR-->>IDX: spi_values DataArray
IDX-->>SPI: spi_values DataArray
alt periodicity daily
SPI->>XR: apply_ufunc(transform_to_gregorian)
XR-->>SPI: spi_gregorian
end
SPI->>XR: build_dataset_spi_grid_or_divisions_or_timeseries
XR-->>SPI: ds_spi
SPI->>FS: ds_spi.to_netcdf(output_path)
end
end
alt save_params is not None and ds_fitting not None
SPI->>FS: ds_fitting.to_netcdf(save_params)
end
SPI-->>CLI: success
alt multiprocessing all
CLI->>Dask: client.close()
end
CLI-->>User: SPI NetCDF files written
```
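The `apply_ufunc(spi_1d, dask parallelized)` step in the diagram is the heart of the pipeline. A minimal sketch of what such a call looks like, assuming a 1-D `spi_1d(values) -> values` kernel (names and signature are assumptions, not the PR's exact code):

```python
import numpy as np
import xarray as xr

def spi_over_grid(precip: xr.DataArray, spi_1d) -> xr.DataArray:
    # Each task receives one cell's full time series, so time must be a
    # single chunk; spatial dimensions can remain chunked for parallelism.
    return xr.apply_ufunc(
        spi_1d,
        precip.chunk({"time": -1}),
        input_core_dims=[["time"]],
        output_core_dims=[["time"]],
        vectorize=True,
        dask="parallelized",
        output_dtypes=[np.float64],
    )
```

With `dask="parallelized"`, the kernel runs lazily per chunk under any active dask `Client`, which is what the `--multiprocessing all` mode exploits.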
File-Level Changes
Possibly linked issues
Hey - I've found 6 issues, and left some high level feedback:
- In `_compute_write_index`, the `ds_fitting` logic will break when neither `load_params` nor `save_params` is provided (it remains `None` but is later accessed for `.coords` in the per-scale loop); consider either always initializing a temporary fitting dataset or fully guarding all `ds_fitting[...]` accesses behind `ds_fitting is not None`.
- The new `spi_xarray` helper in `indices.py` ends with an unreachable `return xr.apply_ufunc(...)` block after raising `ValueError`, and references `_spi_1d` that may be out of scope; this dead code should be removed or refactored to avoid confusion.
- The SPI CLI now opens all inputs with `chunks={"time": -1}` regardless of `InputType`, which disables spatial chunking that previously existed for grid/division inputs; consider preserving chunking along spatial dimensions while still enforcing a single time chunk to keep memory usage manageable on large grids (see the sketch below).
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_compute_write_index`, the `ds_fitting` logic will break when neither `load_params` nor `save_params` is provided (it remains `None` but is later accessed for `.coords` in the per-scale loop); consider either always initializing a temporary fitting dataset or fully guarding all `ds_fitting[...]` accesses behind `ds_fitting is not None`.
- The new `spi_xarray` helper in `indices.py` ends with an unreachable `return xr.apply_ufunc(...)` block after raising `ValueError`, and references `_spi_1d` that may be out of scope; this dead code should be removed or refactored to avoid confusion.
- The SPI CLI now opens all inputs with `chunks={"time": -1}` regardless of `InputType`, which disables spatial chunking that previously existed for grid/division inputs; consider preserving chunking along spatial dimensions while still enforcing a single time chunk to keep memory usage manageable on large grids.
## Individual Comments
### Comment 1
<location> `src/climate_indices/palmer.py:787-790` </location>
<code_context>
for param in ["alpha", "beta", "gamma", "delta"]:
if (
param in fitting_params
- and isinstance(fitting_params[param], (list, tuple, np.ndarray))
+ and isinstance(fitting_params[param], list | tuple | np.ndarray)
and len(fitting_params[param]) == 12
):
</code_context>
<issue_to_address>
**issue (bug_risk):** Runtime use of `list | tuple | np.ndarray` in `isinstance` is invalid and will raise `TypeError`.
`isinstance` doesn’t accept PEP 604 unions; it requires a type or a tuple of types. This call will raise `TypeError: isinstance() argument 2 cannot be a union`. Use `isinstance(fitting_params[param], (list, tuple, np.ndarray))` so the check executes correctly.
</issue_to_address>
### Comment 2
<location> `src/climate_indices/indices.py:230-239` </location>
<code_context>
+def spi_xarray(
</code_context>
<issue_to_address>
**issue:** Unreachable duplicate `xr.apply_ufunc` call at end of `spi_xarray` should be removed.
Because the `ValueError` for unsupported distributions is raised just before the final `return xr.apply_ufunc(...)`, that block is dead code and duplicates the earlier `apply_ufunc` call for the `fitting_params is None` case. Please remove the trailing `return xr.apply_ufunc(...)` after the `raise ValueError` to simplify control flow and avoid confusion.
</issue_to_address>
### Comment 3
<location> `src/climate_indices/__spi__.py:319-327` </location>
<code_context>
- raise ValueError(message)
-
- # convert daily values into 366-day years
- if periodicity == compute.Periodicity.daily:
- initial_year = int(dataset_climatology["time"][0].dt.year)
- final_year = int(dataset_climatology["time"][-1].dt.year)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Daily reindexing to a synthetic `np.arange` time coordinate may desynchronize data and time metadata.
Here you reshape to a 366‑day calendar and set `time = np.arange(total_years * period_length)`, but `ds_precip['time']` remains the original Gregorian coordinate and is later used to build outputs. This desynchronizes the DataArray’s `time` from the dataset‑level `time`. Please either (a) update `ds_precip[time_dim]` to the new coordinate, or (b) keep the 366‑day calendar as a separate auxiliary time coordinate to avoid inconsistent state.
Suggested implementation:
```python
time = np.arange(total_years * period_length)
# keep dataset- and dataarray-level time coordinates in sync with the
# synthetic 366-day calendar used for daily reindexing
dataset_climatology = dataset_climatology.assign_coords(time=time)
if "time" in ds_precip.coords:
ds_precip = ds_precip.assign_coords(time=time)
```
This change assumes:
1. `time` is the correct name of the time dimension across `dataset_climatology` and `ds_precip`.
2. `np` is already imported (it must be, since `np.arange` is used).
If your code uses a different `time_dim` name, adjust `"time"` in `assign_coords` calls to that variable/name. Also ensure that all subsequent operations that rely on the original Gregorian `time` coordinate are compatible with the synthetic 366-day index; if they require the original calendar, consider storing it in a separate auxiliary coordinate (e.g., `original_time`) before overwriting `time`.
</issue_to_address>
### Comment 4
<location> `tests/test_zero_precipitation_fix.py:534` </location>
<code_context>
compute.calculate_time_step_params(insufficient_data)
except compute.DistributionFittingError as e:
# Should catch both InsufficientDataError and PearsonFittingError
- assert isinstance(e, (compute.InsufficientDataError, compute.PearsonFittingError))
+ assert isinstance(e, compute.InsufficientDataError | compute.PearsonFittingError)
assert str(e) # Should have meaningful message
</code_context>
<issue_to_address>
**issue (bug_risk):** The isinstance check using `|` will fail at runtime; it must receive a type or tuple of types.
`isinstance` doesn’t support PEP 604 unions (`A | B`) as its second argument; it only accepts a type or a tuple of types. This will raise `TypeError` when the test runs. Please revert to the tuple form so the assertion actually checks for both exception types:
```python
assert isinstance(e, (compute.InsufficientDataError, compute.PearsonFittingError))
```
</issue_to_address>
### Comment 5
<location> `tests/test_accessors.py:8-17` </location>
<code_context>
+def test_spi_accessor_round_trip(tmp_path):
</code_context>
<issue_to_address>
**suggestion (testing):** Accessor test only covers 1-D gamma/monthly; consider adding tests for other distributions and periodicities.
The current round-trip test is a solid start, but it only exercises a 1-D `DataArray` with a gamma distribution and monthly periodicity. To better validate the accessor, please add tests for:
1. Pearson distribution, to exercise the more complex fitting path.
2. Daily periodicity with appropriate time coordinates, to align with the core `spi` daily handling.
3. An invalid 2-D `DataArray` case, asserting the expected `ValueError`.
This will bring test coverage closer to the combinations supported by the core SPI code.
</issue_to_address>
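A sketch of the third case, assuming the accessor (registered as `.indices` in this test file) rejects non-1-D input with a `ValueError` (the exact validation behaviour is an assumption):

```python
import numpy as np
import pytest
import xarray as xr

from climate_indices import compute  # importing the package registers the accessor

def test_spi_accessor_rejects_2d():
    # A 2-D DataArray should be refused by the 1-D-only accessor path.
    da = xr.DataArray(np.ones((4, 3)), dims=("time", "lat"))
    with pytest.raises(ValueError):
        da.indices.spi(
            scale=3,
            distribution="gamma",
            data_start_year=2000,
            calibration_year_initial=2000,
            calibration_year_final=2001,
            periodicity=compute.Periodicity.monthly,
        )
```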
### Comment 6
<location> `tests/test_accessors.py:45-54` </location>
<code_context>
+ nc_path = tmp_path / "precip.nc"
+ ds.to_netcdf(nc_path, engine="h5netcdf")
+
+ with xr.open_dataset(nc_path) as opened:
+ precip = opened["precip"]
+ spi_da = precip.indices.spi(
+ scale=3,
+ distribution="gamma",
+ data_start_year=2000,
+ calibration_year_initial=2000,
+ calibration_year_final=2001,
+ periodicity=compute.Periodicity.monthly,
+ )
+
+ expected = indices.spi(
+ values,
+ 3,
+ indices.Distribution.gamma,
+ 2000,
+ 2000,
+ 2001,
+ compute.Periodicity.monthly,
+ )
+
+ assert isinstance(spi_da, xr.DataArray)
+ assert spi_da.dims == ("time",)
+ assert np.array_equal(spi_da["time"].values, time)
+ assert spi_da.attrs["long_name"] == "Standardized Precipitation Index"
+ assert spi_da.attrs["units"] == "unitless"
+ np.testing.assert_allclose(spi_da.values, expected, equal_nan=True)
</code_context>
<issue_to_address>
**suggestion (testing):** No tests currently cover Dask-chunked `DataArray`s or time-chunking behaviour for the SPI API.
This test only exercises the in-memory accessor path (which calls `indices.spi` directly) and never touches the `xarray.apply_ufunc`/Dask implementation used by `indices.spi_xarray` and the CLI. To cover the new API, please add tests that:
- Use `DataArray`s with explicit Dask chunking (e.g. `chunk({'time': -1})` and `chunk({'time': 10})`).
- Call `indices.spi_xarray` for both gamma and Pearson.
- Compare results to `indices.spi` on the same data, and verify that unsupported time chunking either behaves correctly or fails with a clear error.
This will exercise the parallel execution path and guard against regressions as xarray/Dask evolve.
Suggested implementation:
```python
assert isinstance(spi_da, xr.DataArray)
assert spi_da.dims == ("time",)
assert np.array_equal(spi_da["time"].values, time)
assert spi_da.attrs["long_name"] == "Standardized Precipitation Index"
assert spi_da.attrs["units"] == "unitless"
np.testing.assert_allclose(spi_da.values, expected, equal_nan=True)
@pytest.mark.parametrize("distribution_name", ["gamma", "pearson"])
@pytest.mark.parametrize("chunk_spec", [{"time": -1}, {"time": 10}])
def test_spi_xarray_with_dask_chunked_dataarray(distribution_name, chunk_spec):
"""Ensure spi_xarray works correctly with Dask-chunked DataArrays.
This exercises the xarray.apply_ufunc/Dask path used by the public API and CLI.
"""
values = np.array(
[
np.nan,
0.553276,
0.650286,
0.810409,
0.722108,
1.071896,
0.792567,
1.175593,
1.200544,
0.973729,
1.024978,
1.035279,
1.059984,
1.117864,
1.151453,
1.084299,
0.866829,
0.910031,
0.876611,
0.676108,
0.704111,
0.667759,
0.746614,
0.574251,
],
dtype=float,
)
time = np.arange(values.size)
da = xr.DataArray(values, dims=("time",), coords={"time": time}, name="precip")
# Explicitly chunk along time to trigger the Dask execution path.
da_chunked = da.chunk(chunk_spec)
expected = indices.spi(
values,
3,
getattr(indices.Distribution, distribution_name),
2000,
2000,
2001,
compute.Periodicity.monthly,
)
spi_da = indices.spi_xarray(
da_chunked,
scale=3,
distribution=distribution_name,
data_start_year=2000,
calibration_year_initial=2000,
calibration_year_final=2001,
periodicity=compute.Periodicity.monthly,
)
# spi_xarray may return a lazy Dask-backed DataArray; compute before comparison.
spi_da_computed = spi_da.compute()
assert isinstance(spi_da, xr.DataArray)
assert spi_da_computed.dims == ("time",)
assert np.array_equal(spi_da_computed["time"].values, time)
assert spi_da_computed.attrs["long_name"] == "Standardized Precipitation Index"
assert spi_da_computed.attrs["units"] == "unitless"
np.testing.assert_allclose(spi_da_computed.values, expected, equal_nan=True)
```
1. Ensure `pytest` is imported in this file if it is not already, e.g. `import pytest` near the top of `tests/test_accessors.py`.
2. If `indices.spi_xarray` requires a different argument name/signature (e.g. `dist` instead of `distribution`), adjust the call accordingly:
- Update `indices.spi_xarray(da_chunked, scale=3, distribution=distribution_name, ...)` to use the correct parameter names.
3. If the actual test suite uses a shared fixture for `values`/`time`, you may want to refactor the duplicated array literal into that fixture and reuse it in both tests to keep things DRY.
</issue_to_address>
```diff
 for param in ["alpha", "beta", "gamma", "delta"]:
     if (
         param in fitting_params
-        and isinstance(fitting_params[param], (list, tuple, np.ndarray))
+        and isinstance(fitting_params[param], list | tuple | np.ndarray)
```
issue (bug_risk): Runtime use of `list | tuple | np.ndarray` in `isinstance` is invalid and will raise `TypeError`.

`isinstance` doesn't accept PEP 604 unions; it requires a type or a tuple of types. This call will raise `TypeError: isinstance() argument 2 cannot be a union`. Use `isinstance(fitting_params[param], (list, tuple, np.ndarray))` so the check executes correctly.
```python
compute.calculate_time_step_params(insufficient_data)
except compute.DistributionFittingError as e:
    # Should catch both InsufficientDataError and PearsonFittingError
    assert isinstance(e, (compute.InsufficientDataError, compute.PearsonFittingError))
```
issue (bug_risk): The `isinstance` check using `|` will fail at runtime; it must receive a type or tuple of types.

`isinstance` doesn't support PEP 604 unions (`A | B`) as its second argument; it only accepts a type or a tuple of types. This will raise `TypeError` when the test runs. Please revert to the tuple form so the assertion actually checks for both exception types:
```python
assert isinstance(e, (compute.InsufficientDataError, compute.PearsonFittingError))
```

- Project Overview: Added xarray-native API with optional Dask parallelism
- Module Structure: Added accessors.py, __spi__.py, and updated module descriptions
- Key Design Patterns:
  - Replaced multiprocessing architecture with Xarray/Dask processing pipeline
  - Added Xarray-Native SPI API section with code examples for both function and accessor APIs
  - Added Numba Acceleration section (_pearson_fit(), _minimum_possible())
  - Added distribution fallback strategy for robust fitting
- Development Commands: Added pre-commit hook command, updated uv sync syntax
- CLI Usage:
  - Added --multiprocessing all example for Dask parallelism
  - Added parallelization options and fitting parameter persistence sections
- Development Notes:
  - Updated dependencies to include numba, mypy, pre-commit
  - Added Pre-commit Hooks section with setup instructions
  - Added Mypy Configuration section with strict settings
  - Updated performance considerations with numba JIT and Dask distributed
Removes unreachable dead code in spi_xarray after ValueError and improves memory efficiency for large gridded datasets.

Changes:
- indices.py: Remove unreachable xr.apply_ufunc block after raise
- __spi__.py: Preserve spatial chunking (lat/lon/division) while keeping time as a single chunk for rolling window correctness
- Add clarifying comment about daily time coordinate handling
- Add comprehensive accessor tests for Pearson distribution, 2-D validation, and Dask-chunked DataArrays

The spatial chunking improvement maintains correctness (single time chunk for rolling windows) while reducing memory usage on large gridded datasets by allowing Dask to chunk spatial dimensions.
Adds notebooks/00_quickstart_indices_demo.ipynb demonstrating core climate indices (SPI, SPEI, PET, PNP) using both NumPy and xarray APIs. Uses bundled test fixtures for reproducibility without external data dependencies.

Notebook includes:
- Environment setup with version tracking
- Programmatic data loading from tests/fixture/
- SPI/SPEI computation with multiple time scales
- PET calculations via Thornthwaite and Hargreaves methods
- PNP (percent-of-normal precipitation) computation
- Validation plots for visual verification

Intended as an entry point for new users to explore library capabilities interactively.
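For a flavour of what those notebook cells reduce to, a sketch using the NumPy API (the fixture path and year arguments are illustrative, not the notebook's actual values):

```python
import numpy as np
from climate_indices import compute, indices

# Monthly precipitation for one location, e.g. loaded from a bundled fixture.
precip_mm = np.load("tests/fixture/precip_monthly.npy")  # hypothetical path

spi_3month = indices.spi(
    precip_mm,
    3,                            # scale: 3-month SPI
    indices.Distribution.gamma,
    1998,                         # data_start_year
    1998,                         # calibration_year_initial
    2016,                         # calibration_year_final
    compute.Periodicity.monthly,
)
```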
…oml:
- numpy (explicit version for stability)
- matplotlib, ipykernel, notebook (for Jupyter workflows)
Add a unified context system for AI coding tools (Claude Code, Codex, Gemini CLI) following the pattern established in the cdip project.
New files:
- AGENTS.md: canonical instructions for all AI tools
- GEMINI.md: thin pointer for Gemini CLI
- context/: on-demand reference library
  - INDEX.md, architecture.md, project_brief.md, tech_stack.md
  - python_conventions.md, scientific_computing.md, uv_rules.md
  - climate_indices_reference.md, dev_workflow.md
  - plan_validation_checklist.md
- .claude/context/brief.md: brief for Claude's context window
- .codex/config.toml, INSTRUCTIONS.md: Codex sandbox config
- .gemini/GEMINI.md: Gemini-specific pointer
Updated:
- CLAUDE.md: add note pointing to AGENTS.md and context/INDEX.md
This enables consistent AI-assisted development across different tools while keeping context loading token-economical (load only what's needed).
Code review

Found 2 issues:
The following functions lack docstrings:
Both files use modern type hints but are missing the future annotations import:
🤖 Generated with Claude Code
Review findings
Description
Testing
Summary by Sourcery
Modernize SPI computation and tooling by introducing an xarray/NumPy-first API with optional Dask parallelism, adding numba-accelerated kernels and stricter typing, and simplifying the CLI away from custom multiprocessing while preserving scientific behavior.
New Features:
Enhancements:
Build:
CI:
Documentation:
Tests: