Orchestration spec
we want to run 1000 simulations
- e.g., centralized server + n workers + f Byzantine workers
- while varying:
- the value of f
- or the aggregator
- or the aggregator's parameters
- collect the metrics from each simulation
- loss at each step/epoch
- gradient norm, curvature
- custom, research-specific metrics
- etc.
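
The sweep described above can be sketched as a Cartesian product over the varied values. This is a minimal sketch, not a fixed design: `SimulationConfig`, `build_grid`, and the aggregator names `mean` and `trimmed_mean` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SimulationConfig:
    """One point of the parameter sweep (all field names are hypothetical)."""
    n_workers: int
    f_byzantine: int
    aggregator: str
    aggregator_params: tuple = ()  # sorted (key, value) pairs, so the config stays hashable


def build_grid(n_workers, f_values, aggregators):
    """Enumerate every (f, aggregator, aggregator parameters) combination.

    `aggregators` maps an aggregator name to a list of parameter dicts.
    """
    configs = []
    for f, (name, param_sets) in product(f_values, aggregators.items()):
        for params in param_sets:
            frozen_params = tuple(sorted(params.items()))
            configs.append(SimulationConfig(n_workers, f, name, frozen_params))
    return configs


# Illustrative values only: 4 values of f and 3 aggregator settings.
grid = build_grid(
    n_workers=10,
    f_values=[0, 1, 2, 3],
    aggregators={
        "mean": [{}],
        "trimmed_mean": [{"trim": 0.1}, {"trim": 0.2}],
    },
)
print(len(grid))
```

Frozen dataclasses are hashable, so a configuration could also serve as a dictionary key or be turned into a stable folder name.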
a simulation
- Python module, fully custom (completely written by the end user)
- runs as an OS process
- so that the OOM killer cannot take down all simulations and the orchestrator at once
- simulation output = collection of metrics
- instances of a base class for metric tracking
- serialized to a folder
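
One possible shape for the metric base class and its on-disk serialization is sketched below. The class names `Metric` and `ScalarPerStep`, the `kind` attribute, and the JSON Lines file layout are assumptions made for illustration; the notes do not fix a format.

```python
import json
import tempfile
from pathlib import Path


class Metric:
    """Base class for metric tracking (hypothetical API)."""

    kind = "generic"

    def __init__(self, name, out_dir):
        self.name = name
        self.path = Path(out_dir) / f"{name}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, step, value):
        """Append one metric event as a JSON line.

        The file is opened and closed per event, so events written
        before a crash remain on disk.
        """
        event = {"metric": self.name, "kind": self.kind, "step": step, "value": value}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")


class ScalarPerStep(Metric):
    """A scalar recorded at each step or epoch, such as the loss."""

    kind = "scalar_per_step"


def load_events(path):
    """Read back all events of one metric file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage inside a simulation: the folder would normally be given by the orchestrator.
out_dir = tempfile.mkdtemp()
loss = ScalarPerStep("loss", out_dir)
for step, value in enumerate([0.9, 0.5, 0.3]):
    loss.record(step, value)
events = load_events(loss.path)
```

An append-only event file fits the best-effort collection mentioned for failed simulations, since a partially written run can still be read.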
orchestration plan
- in python
- requested simulations with parameters
- requested metrics
- each metric kind has an associated visualization
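
A plan written in Python could look like the sketch below, where each metric kind is tied to a visualization through a registry. `Plan`, `VISUALIZATIONS`, the `visualization` decorator, and the module name `my_simulation` are hypothetical. The visualization returns a text summary only to keep the sketch free of plotting dependencies.

```python
from dataclasses import dataclass, field

# Registry: metric kind -> visualization function (hypothetical convention).
VISUALIZATIONS = {}


def visualization(kind):
    """Register a function as the visualization for one metric kind."""
    def decorator(func):
        VISUALIZATIONS[kind] = func
        return func
    return decorator


@visualization("scalar_per_step")
def line_chart(events):
    # Placeholder: a real version would likely draw a curve.
    values = [event["value"] for event in events]
    return f"line chart over {len(values)} steps, final value {values[-1]}"


@dataclass
class Plan:
    simulations: list = field(default_factory=list)  # (module name, params) pairs
    metrics: dict = field(default_factory=dict)      # metric name -> metric kind

    def add_simulation(self, module, **params):
        self.simulations.append((module, params))

    def request_metric(self, name, kind):
        # Reject kinds that have no visualization, so every metric can be rendered.
        if kind not in VISUALIZATIONS:
            raise ValueError(f"no visualization registered for metric kind {kind!r}")
        self.metrics[name] = kind


plan = Plan()
for f in (0, 1, 2):
    plan.add_simulation("my_simulation", f=f, aggregator="mean")
plan.request_metric("loss", "scalar_per_step")
```

Checking the registry when a metric is requested surfaces a missing visualization before any simulation is launched.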
orchestration execution
- launches all simulation subprocesses
- happy path (all simulations exit 0)
- collect all metric events
- generate associated visualizations
- sick path (some simulations exit $\ne$ 0)
- best effort for metric collections and visualizations
- on new plan run, only the failed simulations are re-run
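
The execution step can be sketched with the standard `subprocess` module. The function `run_plan` and the `exit_code` marker file are assumptions made for illustration. The sketch starts every pending simulation at once; a real orchestrator running 1000 simulations would probably cap concurrency.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_plan(commands, root):
    """Run each simulation as its own OS process, skipping earlier successes.

    `commands` maps a simulation id to its argv list.
    Returns a dict {simulation id: exit code}.
    """
    root = Path(root)
    pending = {}
    results = {}
    for sim_id, argv in commands.items():
        sim_dir = root / sim_id
        sim_dir.mkdir(parents=True, exist_ok=True)
        marker = sim_dir / "exit_code"  # hypothetical marker file
        if marker.exists() and marker.read_text().strip() == "0":
            results[sim_id] = 0  # succeeded in a previous run, not re-run
            continue
        pending[sim_id] = subprocess.Popen(argv, cwd=sim_dir)
    for sim_id, proc in pending.items():
        code = proc.wait()
        (root / sim_id / "exit_code").write_text(str(code))
        results[sim_id] = code
    return results


# Demo with two tiny stand-in simulations; each appends one character to ran.log.
root = tempfile.mkdtemp()
ok = [sys.executable, "-c", "open('ran.log', 'a').write('x')"]
bad = [sys.executable, "-c", "import sys; open('ran.log', 'a').write('x'); sys.exit(3)"]

first = run_plan({"sim_ok": ok, "sim_bad": bad}, root)
# Second plan run: only the simulation that failed is launched again.
second = run_plan({"sim_ok": ok, "sim_bad": ok}, root)
```

Because each simulation is a separate process, a crash or an OOM kill in one of them shows up as a non-zero exit code without affecting the others.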
orchestration state
- content: execution graph, etc.
- lock to prevent two orchestrators from running simultaneously
- state inferred from simulation subfolders
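
The lock and the inferred state can be sketched as follows. The file names `orchestrator.lock` and `exit_code`, and the three state labels, are hypothetical. The lock relies on exclusive file creation (`O_CREAT | O_EXCL`), which is one simple option among several; a crashed orchestrator would leave a stale lock file that has to be removed by hand.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def orchestrator_lock(root):
    """Hold an exclusive lock file; fail fast if another orchestrator holds it."""
    lock_path = Path(root) / "orchestrator.lock"
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another orchestrator appears to be running ({lock_path})") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield lock_path
    finally:
        os.close(fd)
        lock_path.unlink()


def infer_state(root):
    """Classify each simulation subfolder from the files it contains."""
    state = {}
    for sim_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        marker = sim_dir / "exit_code"  # hypothetical marker written after each run
        if not marker.exists():
            state[sim_dir.name] = "pending"
        elif marker.read_text().strip() == "0":
            state[sim_dir.name] = "done"
        else:
            state[sim_dir.name] = "failed"
    return state


# Demo: three simulation subfolders in different states.
root = Path(tempfile.mkdtemp())
for name, code in [("sim_a", "0"), ("sim_b", "1"), ("sim_c", None)]:
    (root / name).mkdir()
    if code is not None:
        (root / name / "exit_code").write_text(code)

with orchestrator_lock(root):
    state = infer_state(root)
    try:
        with orchestrator_lock(root):
            second_lock_acquired = True
    except RuntimeError:
        second_lock_acquired = False
```

Inferring the state from the subfolders means there is no separate state file that could drift out of sync with what is actually on disk.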