Skip to content

Refactor forecast #39

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 35 commits into
base: main
Choose a base branch
from
Open

Refactor forecast #39

wants to merge 35 commits into from

Conversation

isaacaka
Copy link
Collaborator

@isaacaka isaacaka commented Mar 4, 2025

The aim is to refactor the forecast.py code to make it easier to swap between different dimensionality reduction and forecasting techniques.

  • Refactored jumper.ipynb, main_forecast.py and main to reflect changes to the refactored libraries. The jumper notebook is specific to using normal PCA (not Kernal PCA)

  • Separated code into forecasting technique classes and dimensionality reduction classes

  • forecast.py calls the methods defined in the above two classes

  • The DR and forecasting techniques created are imported and stored in a dictionary

  • The DR and forecasting techniques the user wants to use can be specified by setting the name within techniques_config.yaml

  • ocean_terms.yaml defines the names of the terms stored in the NEMO grid files, e.g. different configurations of nemo use different names for temperature, ssh and salinity. You can specify which names are used in the yaml file. and restart.py will then be able to read the correct keys.

Tested that the refactored code runs end-to-end

Test written against the non-refactored code need to be written.

Closes #37 #52 #9 #44 #48

@ma595 ma595 marked this pull request as draft April 9, 2025 19:39
@ma595 ma595 changed the title Refactor forecast [DRAFT] Refactor forecast Apr 9, 2025
@ma595 ma595 changed the title [DRAFT] Refactor forecast [WIP] Refactor forecast Apr 9, 2025
@ma595 ma595 mentioned this pull request Apr 17, 2025
…gnature for get_component

Creates a setter method to pass neccessary variables from simulation class to the decompostion class
Corrects method signature for get_component in forecast.py and dimensionality_reduction.py
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@isaacaka isaacaka requested review from ma595 and surbhigoel77 April 23, 2025 15:11
@isaacaka isaacaka marked this pull request as ready for review April 24, 2025 09:06
@Copilot Copilot AI review requested due to automatic review settings April 24, 2025 09:06
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the forecasting code by separating dimensionality reduction and forecasting techniques into dedicated classes, updating configuration files, and modifying related scripts for consistency. Key changes include updating function signatures (e.g. adding a filename parameter to prepare), refactoring PCA/decomposition routines to leverage strategy classes, and adjusting forecast prediction reconstruction and error evaluation.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
techniques_config.yaml Added configuration entries to specify DR and forecast techniques.
ocean_terms.yaml Introduced a YAML file to map ocean term names for flexible usage.
main_forecast.py Updated function signatures, logging, and file handling logic.
lib/restart.py Added YAML loading for ocean terms in restart functions.
lib/forecast_technique.py Refactored forecast techniques including direct and recursive approaches.
lib/forecast.py Modified simulation loading, decomposition, reconstruction and forecasting logic.
lib/dimensionality_reduction.py Updated PCA and KernelPCA implementations with new methods for reconstruction.
Comments suppressed due to low confidence (3)

main_forecast.py:557

  • The revised logic for generating x_pred may not cover the full intended time range; please verify that the prediction array handles the simulation length and forecast steps as expected.
x_pred = np.arange(1 + pas, 1 + steps * pas, pas).reshape(-1, 1)

lib/forecast.py:156

  • The comparison 'if self.len < self.end' may lead to errors if self.end is None; ensure that self.end is always set to a valid integer value or introduce a conditional branch to handle the None case.
array = [self.loadFile(file) for file in self.files if self.len < self.end]

main_forecast.py:85

  • [nitpick] Minor typo: 'forcasted' should be corrected to 'forecasted'.
print(f"{term} time series forcasted")

@ma595 ma595 added the iccs label Apr 29, 2025
@ma595 ma595 changed the title [WIP] Refactor forecast Refactor forecast Apr 29, 2025
Copy link
Member

@ma595 ma595 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running through Jumper.ipynb. Some small nitpicking issues (will look over main structural changes next).

I deployed with python==3.11 and ran the Jumper.ipynb:

  • I modified the path to point to Grid_T files.
  • Error is returned with KPCA due to the setting of 0.9 for components
  • I modified the DR_technique in techniques_config.yaml to "DimensionalityReductionPCA"
    • Could we alias these names to make them a bit more readable?
    • I modified "DimensionalityReductionPCA" to "DimensionalityReductionABC" and I get a KeyError 'DimensionalityReductionABC' - could a more helpful error message be provided?
  • Word "Compilated" is not correct (remnant of Maud's notebook).
  • Error bars look a little weird (using jupyter notebook defaults) - I loaded the data in the google drive.
image
  • simu_predicted needs to be manually created otherwise predictions will fail (again a remnant from Maud's code). Consider adding to cell 12 - where it was previously.
  • When plotting evaluation AttributeError: 'Simulation' object has no attribute 'getPC' - does the getPC function exist anymore? Perhaps this isn't important?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few initial comments.

Copy link
Member

@ma595 ma595 May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that ruff check <file.py passes. We need to add this as a check in the CI as well. I can add the linting rules I've used for Spinup-Evaluation as a new PR.

@ma595 ma595 self-requested a review May 8, 2025 14:12
@isaacaka
Copy link
Collaborator Author

isaacaka commented May 9, 2025

Running through Jumper.ipynb. Some small nitpicking issues (will look over main structural changes next).

I deployed with python==3.11 and ran the Jumper.ipynb:

  • I modified the path to point to Grid_T files.

  • Error is returned with KPCA due to the setting of 0.9 for components

  • I modified the DR_technique in techniques_config.yaml to "DimensionalityReductionPCA"

    • Could we alias these names to make them a bit more readable?
    • I modified "DimensionalityReductionPCA" to "DimensionalityReductionABC" and I get a KeyError 'DimensionalityReductionABC' - could a more helpful error message be provided?
  • Word "Compilated" is not correct (remnant of Maud's notebook).

  • Error bars look a little weird (using jupyter notebook defaults) - I loaded the data in the google drive.

image * `simu_predicted` needs to be manually created otherwise predictions will fail (again a remnant from Maud's code). Consider adding to cell 12 - where it was previously. * When plotting evaluation `AttributeError: 'Simulation' object has no attribute 'getPC'` - does the `getPC` function exist anymore? Perhaps this isn't important?

Thanks, I renamed them to just KernelPCA and PCA and raised an exception with a message when an invalid technique is specified.
The error bars are a bit off set. They should start at t=20 rather than t=0
getPC is now get_component

Copy link
Member

@ma595 ma595 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces skforecast Forecasters, sklearn regressors and a dimensionality reduction class.

Within forecast_technique.py we create instances of the combined regressor and forecaster. So we essentially create all available options here.

These are then imported in the forecast.py

At present, all forecasters, regressors and dimensionality reduction algorithms are imported to the user in the forecast.py

Added new files:

  • dimensionality_reduction.py
  • forecast_technique.py

Configuration files

  • ocean_terms.yaml
  • techniques_config.yaml

ocean_terms.yaml now generalises the inputs quite nicely. Used in restart.py.

General Issues

  • Add a quick sketch of an updated README.md (already mentioned). Generally we need more comments and docs.
  • Testing - We need this project wide. I'll help with this.
    • Testing (separate PR coming). Avoid assertions in favour of Exceptions.
  • Formatting / linting issues need addressing - To fix some automatically: Run ruff check --fix . Others will need manually fixed. Make sure local install of Ruff is up to date ruff check
  • STYLE: Inconsistent formatting in docstrings. Consider adopting numpy style https://numpydoc.readthedocs.io/en/latest/format.html#parameters. I've added a PR that imposes this style project wide using github actions. create_gp_regressor in lib/forecast_technique.py is a good example of how to do it.
  • STYLE: Could you add some static type hinting where you think it'd be useful? This is to get us ready for mypy in future.
  • STYLE: Consider renaming forecast_technique.py to forecast_method.py?
  • FUNCTIONALITY: I don't think the parallel jobs in Forecast works. (MINOR)
  • STYLE: snake_case for function and method names - class names have CamelCase / PascalCase. Again, these are remnants from Maud's code.
  • Remove optimisation from lib/forecast.py move to a different class? I'm not sure what this is currently used for?
  • STYLE: Needs unifying (Run ruff check with or without --fix.
  • Possibly need to refactor forecast_technique again (and rename).
  • Could we add a simple regressor like linear or tree based?
  • We need to think about how to generalise this further so we do not hardcode the regressor and Forecaster?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I reviewed this previously. Will run through again and check if it runs without error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely modularised now. getPC() -> get_component().

General comments

  • Need some tests
    • Avoid asserts - replace with ValueError or whatever's more relevant.
    • Probably better as a PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Comment on lines +676 to +679
# TODO: Change to DR_technique
self.int_mask, ts_array = DimensionalityReductionPCA.reconstruct_predictions(
predictions, n, self.info, begin
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to generalise this now to DimensionalityReductionMethod or similar

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Comment on lines -550 to +569
x_pred = np.arange(0, (len(self) + steps) * pas, pas).reshape(-1, 1)
# x_pred = np.arange(0, (len(self) + steps) * pas, pas).reshape(-1, 1) #TODO: Check this len(self) + steps logic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this defintely been checked? I recall discussing this - but I can't remember what we decided on. I suspect the old implementation was correct?

Comment on lines +27 to +71
from lib.forecast_technique import (
GaussianProcessForecaster,
GaussianProcessRecursiveForecaster,
)
from lib.dimensionality_reduction import (
DimensionalityReductionPCA,
DimensionalityReductionKernelPCA,
)


sys.path.insert(0, "../")

path_to_nemo_directory = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
path_to_nemo_directory = Path(path_to_nemo_directory)

warnings.filterwarnings("ignore")

Forecast_techniques = {
"GaussianProcessForecaster": GaussianProcessForecaster,
"GaussianProcessRecursiveForecaster": GaussianProcessRecursiveForecaster,
}
Dimensionality_reduction_techniques = {
"PCA": DimensionalityReductionPCA,
"KernelPCA": DimensionalityReductionKernelPCA,
}


with open(path_to_nemo_directory / "techniques_config.yaml", "r") as f:
config = yaml.safe_load(f)


if config["DR_technique"]["name"] not in Dimensionality_reduction_techniques:
raise KeyError(
f"DR_technique {config['DR_technique']['name']} not found. Have you specified a valid dimensionality reduction technique in the config file?"
)
else:
DR_technique = Dimensionality_reduction_techniques[config["DR_technique"]["name"]]

if config["Forecast_technique"]["name"] not in Forecast_techniques:
raise KeyError(
f"Forecast_technique {config['Forecast_technique']['name']} not found. Have you specified a valid forecasting technique in the config file?"
)
else:
Forecast_technique = Forecast_techniques[config["Forecast_technique"]["name"]]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be best to move all of this to somewhere else like a main_forecast.py? See file comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • STYLE: Lower case Forecast function
  • STYLE: There's a prepare method in Simulation class as well as main_forecast and Predictions class. It's quite confusing (I realise this is a remnant of Maud's code)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are similarly named functions in Simulation and Prediction too, which makes the whole code quite hard to navigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This defines the forecasting strategy i.e. recursive, direct and the regressor i.e. GP or regression.

  • Regressor is defined here.
    • GP, regression etc.
  • Do all regressors have predict(return_std=True)?
    • suggests that all regressors have uncertainty outputs
  • Lower case these i.e.: GaussianProcessRecursiveForecaster to gp_recursive_forecaster?
    • They are instances of a class (i.e a Forecaster that takes a regressor as parameter).
    • But they are treated in much the same way as dimensionality reduction classes - this is potentially confusing.
  • Rename ForcastTechnique to something like ForecastMethod or ForecastStrategy.
  • Could this code be generalised further?
    • Drawbacks: Hardcoded forecast steps / lags in GP and quite tight coupling to skforecast
    • I could imagine a situation where instead of hardcoding combinations like you have done. You could 'register' different regressors and 'forecasters'. And then use a function to combine them.
    • And use a configuration yaml file which specifies the regressor and forecast method you want a long with their parameters - so they aren't hardcoded.
    • We can leave this to a future PR if you want.
    • Happy to help with a PR here or provide an example.
    • OR we could just define all of the available options a little more cleanly - this is better if we only expect to have a few regressor + forecaster combos.
  • We could move instantiation of regressors and forecasters into a separate module.
  • Testing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This is nicely abstracted now with the get_ocean_term function. The idea, I suppose, is to generalise this "toolbox" to work for other restart files i.e. Oceananigans?
    • This get_ocean_term() function needs a docstring. I understand it gets the name of the generated [so,toce,soce].npy file.
  • STYLE: Think about the naming of the variables in this file. I would avoid thetao, so and zos / ssh. The restart convention of NEMO / DINO is heavily embedded in here. (restart["tn"], restart["sshn"] etc.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Would be nice to have the config file that you're hinting at in main_forecast.py. I.e. abstract out the DINO and ssh/soce/toce mapping. You are already using this in the restart.py - could it also be used here as well?

@ma595
Copy link
Member

ma595 commented May 20, 2025

ruff check gives the following:

Found 111 errors.
[*] 43 fixable with the `--fix` option (16 hidden fixes can be enabled with the `--unsafe-fixes` option).

I can open a PR to start addressing some of these, if that helps?

@ma595
Copy link
Member

ma595 commented May 20, 2025

Have issues #9, #49, #48 and #44 been addressed with this PR @isaacaka? If so, can you update the PR description accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor out dimensionality reduction and forecast technique code in forecast.py
2 participants