Commit c2b9ca9

Merge pull request #102 from CliMA/ne/derecho
Add PBS controller, DerechoBackend
2 parents 25dab9d + db5cb81 commit c2b9ca9

16 files changed: 699 additions, 241 deletions
.buildkite/clima_server_test/pipeline.yml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ steps:

   - wait
   - label: "SurfaceFluxes perfect model calibration"
-    command: julia --project=experiments/surface_fluxes_perfect_model test/slurm_backend_e2e.jl
+    command: julia --project=experiments/surface_fluxes_perfect_model test/hpc_backend_e2e.jl
     artifact_paths: output/surface_fluxes_perfect_model/*

   - label: "Slurm job controller unit tests"

.buildkite/pipeline.yml

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ steps:

   - wait
   - label: "SurfaceFluxes perfect model calibration"
-    command: julia --project=experiments/surface_fluxes_perfect_model test/slurm_backend_e2e.jl
+    command: julia --project=experiments/surface_fluxes_perfect_model test/hpc_backend_e2e.jl
     artifact_paths: output/surface_fluxes_perfect_model/*

   - label: "Slurm job controller unit tests"

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "ClimaCalibrate"
 uuid = "4347a170-ebd6-470c-89d3-5c705c0cacc2"
 authors = ["Climate Modeling Alliance"]
-version = "0.0.1"
+version = "0.0.2"

 [deps]
 Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"

README.md

Lines changed: 5 additions & 11 deletions
@@ -9,27 +9,21 @@
 calibration pipelines using with minimal boilerplate.</strong>
 </p>

-[![docsbuild][docs-bld-img]][docs-bld-url]
 [![dev][docs-dev-img]][docs-dev-url]
 [![ghaci][gha-ci-img]][gha-ci-url]
-[![codecov][codecov-img]][codecov-url]
-
-[docs-bld-img]: https://github.com/CliMA/ClimaCalibrate.jl/workflows/Documentation/badge.svg
-[docs-bld-url]: https://github.com/CliMA/ClimaCalibrate.jl/actions?query=workflow%3ADocumentation

 [docs-dev-img]: https://img.shields.io/badge/docs-dev-blue.svg
 [docs-dev-url]: https://CliMA.github.io/ClimaCalibrate.jl/dev/

 [gha-ci-img]: https://github.com/CliMA/ClimaCalibrate.jl/actions/workflows/ci.yml/badge.svg
 [gha-ci-url]: https://github.com/CliMA/ClimaCalibrate.jl/actions/workflows/ci.yml

-[codecov-img]: https://codecov.io/gh/CliMA/ClimaCalibrate.jl/branch/main/graph/badge.svg
-[codecov-url]: https://codecov.io/gh/CliMA/ClimaCalibrate.jl
-
-The recommended Julia version is: Stable release v1.10.0
+The recommended Julia version is: Stable release v1.10.4

-This pipeline currently runs on the Resnick High Performance Computing Center.
-We strive to support flexible and clearly documented calibration experiments.
+Currently supported backends:
+- [Resnick High Performance Computing Center](https://www.hpc.caltech.edu/)
+- [NSF NCAR Supercomputer Derecho](https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/)
+- CliMA's private GPU server

 ## Contributing

docs/make.jl

Lines changed: 0 additions & 1 deletion
@@ -26,7 +26,6 @@ makedocs(
         "Getting Started" => "quickstart.md",
         "ClimaAtmos Setup Guide" => "atmos_setup_guide.md",
         "Emulate and Sample" => "emulate_sample.md",
-        "Precompilation" => "precompilation.md",
         "API" => "api.md",
     ],
 )

docs/src/api.md

Lines changed: 18 additions & 1 deletion
@@ -13,7 +13,24 @@ ClimaCalibrate.observation_map
 ```@docs
 ClimaCalibrate.get_backend
 ClimaCalibrate.calibrate
-ClimaCalibrate.sbatch_model_run
+ClimaCalibrate.model_run
+ClimaCalibrate.module_load_string
+```
+
+## Job Scheduler
+```@docs
+ClimaCalibrate.wait_for_jobs
+ClimaCalibrate.log_member_error
+ClimaCalibrate.kill_job
+ClimaCalibrate.job_status
+ClimaCalibrate.kwargs
+ClimaCalibrate.slurm_model_run
+ClimaCalibrate.generate_sbatch_script
+ClimaCalibrate.generate_sbatch_directives
+ClimaCalibrate.submit_slurm_job
+ClimaCalibrate.pbs_model_run
+ClimaCalibrate.generate_pbs_script
+ClimaCalibrate.submit_pbs_job
 ```

 ## EnsembleKalmanProcesses Interface

docs/src/index.md

Lines changed: 2 additions & 3 deletions
@@ -3,8 +3,7 @@
 ClimaCalibrate.jl is a toolkit for developing scalable and reproducible model
 calibration pipelines using CalibrateEmulateSample.jl with minimal boilerplate.

-To use this framework, component models (and the coupler) define their own versions of the functions provided in the interface (`get_config`, `get_forward_model`, and `run_forward_model`).
-
-Calibrations can either be run using pure Julia, the Caltech central cluster, or CliMA's GPU server.
+To use this framework, component models (and the coupler) define their own versions of the functions provided in the interface.
+Calibrations can either be run using just Julia, the Caltech central cluster, NCAR Derecho, or CliMA's GPU server.

 For more information, see our Getting Started page.

src/ClimaCalibrate.jl

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ module ClimaCalibrate
 include("ekp_interface.jl")
 include("model_interface.jl")
 include("slurm.jl")
+include("pbs.jl")
 include("backends.jl")
 include("emulate_sample.jl")

src/backends.jl

Lines changed: 98 additions & 50 deletions
@@ -1,12 +1,17 @@
-export get_backend, calibrate
+export get_backend, calibrate, model_run

 abstract type AbstractBackend end

 struct JuliaBackend <: AbstractBackend end
-abstract type SlurmBackend <: AbstractBackend end
+
+abstract type HPCBackend <: AbstractBackend end
+abstract type SlurmBackend <: HPCBackend end
+
 struct CaltechHPCBackend <: SlurmBackend end
 struct ClimaGPUBackend <: SlurmBackend end

+struct DerechoBackend <: HPCBackend end
+
 """
     get_backend()

@@ -18,6 +23,8 @@ function get_backend()
         (r"^clima.gps.caltech.edu$", ClimaGPUBackend),
         (r"^login[1-4].cm.cluster$", CaltechHPCBackend),
         (r"^hpc-(\d\d)-(\d\d).cm.cluster$", CaltechHPCBackend),
+        (r"derecho([1-8])$", DerechoBackend),
+        (r"dec(\d\d\d\d)$", DerechoBackend), # This should be more specific
     ]

     for (pattern, backend) in HOSTNAMES
@@ -28,12 +35,12 @@ function get_backend()
 end

 """
-    module_load_string(T) where {T<:Type{SlurmBackend}}
+    module_load_string(backend)

 Return a string that loads the correct modules for a given backend when executed via bash.
 """
 function module_load_string(::Type{CaltechHPCBackend})
-    return """export MODULEPATH=/groups/esm/modules:\$MODULEPATH
+    return """export MODULEPATH="/groups/esm/modules:\$MODULEPATH"
     module purge
     module load climacommon/2024_05_27"""
 end
@@ -43,32 +50,14 @@ function module_load_string(::Type{ClimaGPUBackend})
     module load julia/1.10.0 cuda/julia-pref openmpi/4.1.5-mpitrampoline"""
 end

-"""
-    calibrate(::Type{JuliaBackend}, config::ExperimentConfig)
-    calibrate(::Type{JuliaBackend}, experiment_dir::AbstractString)
-
-Run a calibration in Julia.
-
-Takes an ExperimentConfig or an experiment folder.
-If no backend is passed, one is chosen via `get_backend`.
-This function is intended for use in a larger workflow, assuming that all needed
-model interface and observation map functions are set up for the calibration.
-
-# Example
-Run: `julia --project=experiments/surface_fluxes_perfect_model`
-```julia
-import ClimaCalibrate
-
-# Generate observational data and load interface
-experiment_dir = dirname(Base.active_project())
-include(joinpath(experiment_dir, "generate_data.jl"))
-include(joinpath(experiment_dir, "observation_map.jl"))
-include(joinpath(experiment_dir, "model_interface.jl"))
+function module_load_string(::Type{DerechoBackend})
+    return """export MODULEPATH="/glade/campaign/univ/ucit0011/ClimaModules-Derecho:\$MODULEPATH"
+    module purge
+    module load climacommon
+    module list
+    """
+end

-# Initialize and run the calibration
-eki = ClimaCalibrate.calibrate(experiment_dir)
-```
-"""
 calibrate(config::ExperimentConfig; ekp_kwargs...) =
     calibrate(get_backend(), config; ekp_kwargs...)

@@ -86,9 +75,8 @@ function calibrate(
     config::ExperimentConfig;
     ekp_kwargs...,
 )
-    initialize(config; ekp_kwargs...)
     (; n_iterations, ensemble_size) = config
-    eki = nothing
+    eki = initialize(config; ekp_kwargs...)
     for i in 0:(n_iterations - 1)
         @info "Running iteration $i"
         for m in 1:ensemble_size
@@ -103,75 +91,80 @@ function calibrate(
 end

 """
-    calibrate(::Type{SlurmBackend}, config::ExperimentConfig; kwargs...)
-    calibrate(::Type{SlurmBackend}, experiment_dir; kwargs...)
+    calibrate(::Type{AbstractBackend}, config::ExperimentConfig; kwargs...)
+    calibrate(::Type{AbstractBackend}, experiment_dir; kwargs...)

 Run a full calibration, scheduling the forward model runs on Caltech's HPC cluster.

 Takes either an ExperimentConfig or an experiment folder.

+Available Backends: CaltechHPCBackend, ClimaGPUBackend, DerechoBackend, JuliaBackend
+
+
 # Keyword Arguments
 - `experiment_dir: Directory containing experiment configurations.
 - `model_interface: Path to the model interface file.
-- `slurm_kwargs`: Dictionary of slurm arguments, passed through to `sbatch`.
-- `verbose::Bool`: Enable verbose output for debugging.
+- `hpc_kwargs`: Dictionary of resource arguments, passed to the job scheduler.
+- `verbose::Bool`: Enable verbose logging.

 # Usage
 Open julia: `julia --project=experiments/surface_fluxes_perfect_model`
 ```julia
-import ClimaCalibrate: CaltechHPCBackend, calibrate
+using ClimaCalibrate

-experiment_dir = dirname(Base.active_project())
+experiment_dir = joinpath(pkgdir(ClimaCalibrate), "experiments", "surface_fluxes_perfect_model")
 model_interface = joinpath(experiment_dir, "model_interface.jl")

 # Generate observational data and load interface
 include(joinpath(experiment_dir, "generate_data.jl"))
 include(joinpath(experiment_dir, "observation_map.jl"))
 include(model_interface)

-slurm_kwargs = kwargs(time = 3)
-eki = calibrate(CaltechHPCBackend, experiment_dir; model_interface, slurm_kwargs);
+hpc_kwargs = kwargs(time = 3)
+backend = get_backend()
+eki = calibrate(backend, experiment_dir; model_interface, hpc_kwargs);
 ```
 """
 function calibrate(
-    b::Type{<:SlurmBackend},
+    b::Type{<:HPCBackend},
     experiment_dir::AbstractString;
-    slurm_kwargs,
+    hpc_kwargs,
     ekp_kwargs...,
 )
-    calibrate(b, ExperimentConfig(experiment_dir); slurm_kwargs, ekp_kwargs...)
+    calibrate(b, ExperimentConfig(experiment_dir); hpc_kwargs, ekp_kwargs...)
 end

 function calibrate(
-    b::Type{<:SlurmBackend},
+    b::Type{<:HPCBackend},
     config::ExperimentConfig;
     experiment_dir = dirname(Base.active_project()),
     model_interface = abspath(
         joinpath(experiment_dir, "..", "..", "model_interface.jl"),
     ),
     verbose = false,
-    slurm_kwargs = Dict(:time_limit => 45, :ntasks => 1),
+    reruns = 1,
+    hpc_kwargs,
     ekp_kwargs...,
 )
     # ExperimentConfig is created from a YAML file within the experiment_dir
     (; n_iterations, output_dir, ensemble_size) = config
     @info "Initializing calibration" n_iterations ensemble_size output_dir
-    initialize(config; ekp_kwargs...)

-    eki = nothing
+    eki = initialize(config; ekp_kwargs...)
     module_load_str = module_load_string(b)
     for iter in 0:(n_iterations - 1)
         @info "Iteration $iter"
         jobids = map(1:ensemble_size) do member
             @info "Running ensemble member $member"
-            sbatch_model_run(
+            model_run(
+                b,
                 iter,
                 member,
                 output_dir,
                 experiment_dir,
                 model_interface,
                 module_load_str;
-                slurm_kwargs,
+                hpc_kwargs,
             )
         end

@@ -182,14 +175,69 @@ function calibrate(
             experiment_dir,
             model_interface,
             module_load_str;
-            slurm_kwargs,
+            hpc_kwargs,
             verbose,
+            reruns,
         )
-        report_iteration_status(statuses, output_dir, iter)
         @info "Completed iteration $iter, updating ensemble"
         G_ensemble = observation_map(iter)
         save_G_ensemble(config, iter, G_ensemble)
         eki = update_ensemble(config, iter)
     end
     return eki
 end
+
+# Dispatch on backend type to unify `calibrate` for all HPCBackends
+# Scheduler interfaces should not depend on backend struct
+"""
+    model_run(backend, iter, member, output_dir, experiment_dir; model_interface, verbose, hpc_kwargs)
+
+Construct and execute a command to run a single forward model on a given job scheduler.
+
+Dispatches on `backend` to run [`slurm_model_run`](@ref) or [`pbs_model_run`](@ref).
+
+Arguments:
+- iter: Iteration number
+- member: Member number
+- output_dir: Calibration experiment output directory
+- experiment_dir: Directory containing the experiment's Project.toml
+- model_interface: File containing the model interface
+- module_load_str: Commands which load the necessary modules
+- hpc_kwargs: Dictionary containing the resources for the job. Easily generated using [`kwargs`](@ref).
+"""
+model_run(
+    b::Type{<:SlurmBackend},
+    iter,
+    member,
+    output_dir,
+    experiment_dir,
+    model_interface,
+    module_load_str;
+    hpc_kwargs,
+) = slurm_model_run(
+    iter,
+    member,
+    output_dir,
+    experiment_dir,
+    model_interface,
+    module_load_str;
+    hpc_kwargs,
+)
+model_run(
+    b::Type{DerechoBackend},
+    iter,
+    member,
+    output_dir,
+    experiment_dir,
+    model_interface,
+    module_load_str;
+    hpc_kwargs,
+) = pbs_model_run(
+    iter,
+    member,
+    output_dir,
+    experiment_dir,
+    model_interface,
+    module_load_str;
+    hpc_kwargs,
+)
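
The backend hierarchy and hostname table in this file's diff can be tried out in isolation. The following is a minimal, self-contained sketch, not the package's implementation: `select_backend` and `scheduler_name` are hypothetical names introduced for illustration, and the `JuliaBackend` fallback is an assumption, since the diff does not show how `get_backend` ends. The type definitions and patterns are copied from the diff.

```julia
# Sketch only: types and patterns are copied from the diff; `select_backend`
# and `scheduler_name` are hypothetical helpers, not part of ClimaCalibrate.
abstract type AbstractBackend end
struct JuliaBackend <: AbstractBackend end
abstract type HPCBackend <: AbstractBackend end
abstract type SlurmBackend <: HPCBackend end
struct CaltechHPCBackend <: SlurmBackend end
struct ClimaGPUBackend <: SlurmBackend end
struct DerechoBackend <: HPCBackend end

const HOSTNAMES = [
    (r"^clima.gps.caltech.edu$", ClimaGPUBackend),
    (r"^login[1-4].cm.cluster$", CaltechHPCBackend),
    (r"^hpc-(\d\d)-(\d\d).cm.cluster$", CaltechHPCBackend),
    (r"derecho([1-8])$", DerechoBackend),
    (r"dec(\d\d\d\d)$", DerechoBackend),
]

# Return the backend of the first matching pattern.
# Falling back to JuliaBackend is an assumption made for this sketch.
function select_backend(hostname::AbstractString)
    for (pattern, backend) in HOSTNAMES
        occursin(pattern, hostname) && return backend
    end
    return JuliaBackend
end

# One method per scheduler family, mirroring how `model_run` dispatches:
# every SlurmBackend subtype shares a method, DerechoBackend gets its own.
scheduler_name(::Type{<:SlurmBackend}) = "slurm"
scheduler_name(::Type{DerechoBackend}) = "pbs"

for host in ("login2.cm.cluster", "derecho5", "dec1234", "my-laptop")
    println(host, " => ", select_backend(host))
end
```

Because the two Derecho patterns have no leading `^`, any hostname that merely ends in `derecho5` or `dec1234` also selects `DerechoBackend`, which is presumably what the "This should be more specific" comment in the diff refers to.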

src/ekp_interface.jl

Lines changed: 3 additions & 3 deletions
@@ -171,10 +171,10 @@ function env_model_interface(env = ENV)
     return string(env[key])
 end

-function env_iter_number(env = ENV)
-    key = "CALIBRATION_ITER_NUMBER"
+function env_iteration(env = ENV)
+    key = "CALIBRATION_ITERATION"
     haskey(env, key) || error(
-        "Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITER_NUMBER\" is set.",
+        "Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITERATION\" is set.",
     )
     return parse(Int, env[key])
 end
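
Since `env_iteration` takes the environment as an argument, the rename can be checked without touching the real `ENV`. The sketch below copies the function body from the diff and passes a plain `Dict`; the `fake_env` and `stale_env` names are illustrative only.

```julia
# Copy of the renamed helper from the diff, reproduced so the sketch is self-contained.
function env_iteration(env = ENV)
    key = "CALIBRATION_ITERATION"
    haskey(env, key) || error(
        "Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITERATION\" is set.",
    )
    return parse(Int, env[key])
end

# Any key-value container with string keys works in place of ENV.
fake_env = Dict("CALIBRATION_ITERATION" => "3")
iteration = env_iteration(fake_env)

# A job script that still exports the old variable name now fails loudly.
stale_env = Dict("CALIBRATION_ITER_NUMBER" => "3")
stale_failed = try
    env_iteration(stale_env)
    false
catch
    true
end
```

Any job script that exports the old `CALIBRATION_ITER_NUMBER` variable would need updating to the new name.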
