Running multiple sites at the same time on a GPU #1723

Description

@kmdeck

Problem Statement
We would like to run multiple sites (e.g., 200 random locations on Earth, not near each other or in any particular order) at the same time, for O(20) years. To force these simulations, we would like to have the option to (1) use site-level data or (2) look up the nearest ERA5 forcing data. We would also like the option to calibrate these models using a single ensemble Kalman process (i.e., fit globally constant parameters using site-level data).

On a CPU, a single-column run achieves around 200-700 SYPD (simulated years per day), depending on the timestep. Each 20-year simulation therefore takes between 20 / 200 days ≈ 2.4 hours and 20 / 700 days ≈ 0.7 hours (not including setup). Let's use an hour as a guideline to run one site on CPU for 20 years. Therefore, 200 sites take ~200 CPU-hours to simulate for 20 years.
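
For reference, the conversion from SYPD to wall-clock time can be checked with a few lines of Python. This is only the arithmetic from the paragraph above; the function and variable names are made up for illustration.

```python
HOURS_PER_DAY = 24.0


def wall_clock_hours(simulated_years, sypd):
    """Wall-clock hours needed to simulate `simulated_years` at `sypd` simulated years per day."""
    return simulated_years / sypd * HOURS_PER_DAY


# Bounds quoted in the issue: 200-700 SYPD for a 20-year single-column run.
slow = wall_clock_hours(20, 200)  # about 2.4 hours
fast = wall_clock_hours(20, 700)  # about 0.69 hours

# Guideline from the issue: ~1 hour per site, 200 sites, run one after another.
n_sites = 200
guideline_hours_per_site = 1.0
total_cpu_hours = n_sites * guideline_hours_per_site

print(f"per site: {fast:.2f} to {slow:.2f} hours; all sites: {total_cpu_hours:.0f} CPU-hours")
```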

What is the best way to parallelize this?

N.B. If someone wants to calibrate site-specific parameters (an EKP for each site individually), they will need to parallelize in an embarrassingly parallel way using the single-site calibration script we already have (submit lots of jobs to Slurm). No software solution is needed (although higher CPU SYPD would be useful!).

Possible solutions

  • Request N_ensemble x ? CPUs, where each CPU runs multiple sites for 20 years each. If each CPU does ~10 hours of work, it can simulate 10 sites. Therefore we would need 20 CPUs per ensemble member. This could be prohibitive: if N_ensemble is O(10), this is 200 CPUs for 10 hours for 1 iteration. Check this math! This makes this path feasible only if our SYPD is increased.
  • Run all the sites on a single GPU. If the SYPD is unchanged, we would need 1 hour on the GPU to run them all. Then we only need N_ensemble GPUs for 1 hour to complete 1 iteration.
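
A quick check of the resource estimates for the two options, as a sketch. It assumes the 1 hour per site guideline, N_ensemble = 10, and (for the GPU case) that per-column throughput does not drop when all columns run concurrently, which has not been measured. All names are illustrative.

```python
import math


def cpus_per_member(n_sites, hours_per_site, hours_budget_per_cpu):
    """CPUs needed per ensemble member if each CPU runs sites back to back within its time budget."""
    sites_per_cpu = math.floor(hours_budget_per_cpu / hours_per_site)
    return math.ceil(n_sites / sites_per_cpu)


n_sites = 200
hours_per_site = 1.0  # guideline from the issue
n_ensemble = 10       # O(10) ensemble members
cpu_budget_hours = 10.0

# CPU option: each CPU runs 10 sites sequentially.
cpus = cpus_per_member(n_sites, hours_per_site, cpu_budget_hours)
total_cpus = n_ensemble * cpus
cpu_hours_per_iteration = total_cpus * cpu_budget_hours

# GPU option: one GPU per ensemble member runs all sites at once.
# ASSUMPTION: per-column SYPD is unchanged on the GPU.
gpus = n_ensemble
gpu_hours_per_iteration = gpus * hours_per_site

print(f"CPU: {total_cpus} CPUs for {cpu_budget_hours:.0f} h = {cpu_hours_per_iteration:.0f} CPU-hours")
print(f"GPU: {gpus} GPUs for {hours_per_site:.0f} h = {gpu_hours_per_iteration:.0f} GPU-hours")
```

Under these assumptions the numbers in the first bullet hold: 20 CPUs per ensemble member and 200 CPUs for 10 hours per iteration.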

Challenges with running on GPU currently (and plan to make it possible)
(@ph-kev to fill out plan here)
The sites would not be near each other in space (e.g. pick 100 random locations on Earth).
Can we set up a lat-lon space corresponding to a list of locations [(lat1, lon1), (lat2, lon2), ...], and can we read in the forcing data in a similar way (a matrix of location vs. time)? How would we change the writing of diagnostics?
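
To make the "matrix of location vs. time" idea concrete, here is a minimal pure-Python sketch of a nearest-grid-point lookup for a list of sites. It is not ClimaUtilities or ClimaCore code; the coarse 10-degree grid, the synthetic data, and all function names are assumptions for illustration (a real ERA5 grid is much finer).

```python
def nearest_index(values, target):
    """Index of the entry in `values` closest to `target`."""
    return min(range(len(values)), key=lambda i: abs(values[i] - target))


def nearest_lon_index(lons, target):
    """Like nearest_index, but treats longitude as periodic (wraps at 360 degrees)."""
    def dist(i):
        d = abs(lons[i] - target) % 360.0
        return min(d, 360.0 - d)
    return min(range(len(lons)), key=dist)


def site_forcing(gridded, grid_lats, grid_lons, sites):
    """Extract a [site][time] matrix from gridded[time][lat][lon] by nearest grid point."""
    idx = [
        (nearest_index(grid_lats, lat), nearest_lon_index(grid_lons, lon))
        for lat, lon in sites
    ]
    return [[step[i][j] for step in gridded] for i, j in idx]


# Hypothetical coarse grid: latitudes -90..90 and longitudes 0..350 in 10-degree steps.
grid_lats = [-90.0 + 10.0 * i for i in range(19)]
grid_lons = [10.0 * j for j in range(36)]

# Synthetic field over two time steps; each value encodes its (time, lat, lon) indices.
gridded = [
    [[10000 * t + 100 * i + j for j in range(36)] for i in range(19)]
    for t in range(2)
]

# Sites are an unordered list of (lat, lon) pairs; longitudes may be negative.
sites = [(34.1, -118.1), (-3.0, 358.0), (51.5, 4.0)]
forcing = site_forcing(gridded, grid_lats, grid_lons, sites)
print(forcing)
```

The same site ordering could index the diagnostics output, so that row `s` of any output matrix always refers to `sites[s]`.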

Running multiple sites on GPU would involve

  1. Creating a new space in ClimaCore that would support multiple columns with varying z
  2. Supporting this new space in ClimaDiagnostics to output diagnostics on this space
  3. Supporting this new space in ClimaUtilities for interpolation from forcing data to ClimaCore fields
  4. Updating ClimaLand to use all the new functionality
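
As an illustration of what "multiple columns with varying z" could mean for data layout, the sketch below packs columns of different depths into one flat array with an offsets table, so a single kernel could iterate over all columns. This is a generic ragged-array layout, not ClimaCore's actual or planned design; the level values and names are made up.

```python
# Hypothetical vertical levels (in meters) for three sites with different depths.
column_z = [
    [-0.1, -0.5, -1.0],
    [-0.1, -0.5, -1.0, -2.0, -5.0],
    [-0.2, -1.0],
]

# Pack all columns into one flat list; offsets[c]:offsets[c + 1] bounds column c.
flat_z = []
offsets = [0]
for z in column_z:
    flat_z.extend(z)
    offsets.append(len(flat_z))


def column_slice(flat, offsets, c):
    """Return the entries of `flat` that belong to column `c`."""
    return flat[offsets[c]:offsets[c + 1]]


for c in range(len(column_z)):
    print(c, column_slice(flat_z, offsets, c))
```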

Questions for Land Team
(@EvaMarieMetz @braghiere @AlexisRenchon to provide input here)

  1. How is the forcing data stored?
    The forcing data can be modified to fit whatever format that ClimaUtilities would expect.

  2. How should the output be written? (per site or all together? CSV or NetCDF?) Is one format more helpful for processing the data for calibration?
    It doesn't matter. Currently, the DictWriter is used since it is the fastest of the writers, and the data is written to a CSV file.

  3. How many sites will be used?
    O(100) sites

  4. How long is the simulation run for per iteration of EKP? How many iterations of EKP?
    This depends on the specific configuration. For a full calibration, some runs are 1 hour (a single year of data for each iteration) and other runs are 9 hours (from using the full dataset for each iteration).

Additional Considerations
Is there something we are doing that makes the code very slow on CPU that would be easy to fix?
Is it possible that the way we save half-hourly diagnostics in memory makes this very slow, and that writing to disk less frequently would be much faster?
Is running O(100) sites on a GPU much faster than running every location on land (~20,000) if we are not saturating the GPU in either case? If so, this could also open up the option for us to calibrate the model using a subset of columns more quickly.
