Vivarium simulation model for the vivarium_profiling project.
You will need conda to install all of this repository's requirements.
We recommend installing Miniforge.
Once you have conda installed, open up your normal shell
(if you're on Linux or macOS) or the Git Bash shell if you're on Windows.
You'll then make an environment, clone this repository, and install
all necessary requirements as follows:
:~$ conda create --name=vivarium_profiling python=3.11 git git-lfs
...conda will download python and base dependencies...
:~$ conda activate vivarium_profiling
(vivarium_profiling) :~$ git clone https://github.com/ihmeuw/vivarium_profiling.git
...git will copy the repository from github and place it in your current directory...
(vivarium_profiling) :~$ cd vivarium_profiling
(vivarium_profiling) :~$ pip install -e .
...pip will install vivarium and other requirements...
Supported Python versions: 3.10, 3.11
Note the -e flag that follows pip install. This will install the python
package in-place, which is important for making the model specifications later.
To install requirements from a provided requirements.txt (e.g. installing an archived repository with the exact same requirements it was run with), replace pip install -e . with the following:
(vivarium_profiling) :~$ pip install -r requirements.txt
Cloning the repository should take a fair bit of time, as git must fetch
the data artifact associated with the demo (several GB of data) from
Git Large File Storage (git-lfs). If your clone finishes quickly,
you have likely only retrieved the checksum files that GitHub holds onto,
and your simulations will fail. In that case, you can explicitly pull
the data by executing git lfs pull.
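For example, from the root of the cloned repository:
(vivarium_profiling) :~$ git lfs pull
...git-lfs will download the full artifact data referenced by the checksum files...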
Vivarium uses the Hierarchical Data Format (HDF) as the backing storage
for the data artifacts that supply data to the simulation. You may not have
the needed libraries on your system to interact with these files, and this is
not something that can be specified and installed with the rest of the package's
dependencies via pip. If you encounter HDF5-related errors, you should
install the HDF5 tooling from within your environment like so:
(vivarium_profiling) :~$ conda install hdf5
The (vivarium_profiling) that precedes your shell prompt will probably show
up by default, though it may not. It's just a visual reminder that you
are installing and running things in an isolated programming environment
so that they don't conflict with other source code and libraries on your
system.
You'll find six directories inside the main
src/vivarium_profiling package directory:
- artifacts: This directory contains all input data used to run the simulations. You can open these files and examine the input data using the vivarium artifact tools. A tutorial can be found at https://vivarium.readthedocs.io/en/latest/tutorials/artifact.html#reading-data
- components: This directory is for Python modules containing custom components for the vivarium_profiling project. You should work with the engineering staff to help scope out what you need and get them built.
- data: If you have small-scale external data for use in your sim or in your results processing, it can live here. This is almost certainly not the right place for data, so make sure there's not a better place to put it first.
- model_specifications: This directory should hold all model specifications and branch files associated with the project.
- results_processing: Any post-processing and analysis code or notebooks you write should be stored in this directory.
- tools: This directory holds Python scripts used to prepare input data or process outputs.
This repository provides tools for profiling and benchmarking Vivarium simulations to analyze their performance characteristics. See the tutorials at https://vivarium.readthedocs.io/en/latest/tutorials/running_a_simulation/index.html and https://vivarium.readthedocs.io/en/latest/tutorials/exploration.html for general instructions on running simulations with Vivarium.
This repository includes a custom MultiComponentParser plugin that allows you to
easily create scaling simulations by defining multiple instances of diseases and risks
using a simplified YAML syntax.
To use the parser, add it to your model specification:
plugins:
    required:
        component_configuration_parser:
            controller: "vivarium_profiling.plugins.parser.MultiComponentParser"
Then use the causes and risks multi-config blocks:
Causes Configuration
Define multiple disease instances with automatic numbering:
components:
    causes:
        lower_respiratory_infections:
            number: 4        # Creates 4 disease instances
            duration: 28     # Disease duration in days
            observers: True  # Auto-create DiseaseObserver components
This creates components named lower_respiratory_infections_1,
lower_respiratory_infections_2, etc., each with its own observer if enabled.
Risks Configuration
Define multiple risk instances and their effects on causes:
components:
    risks:
        high_systolic_blood_pressure:
            number: 2
            observers: False  # Set False for continuous risks
            affected_causes:
                lower_respiratory_infections:
                    effect_type: nonloglinear
                    measure: incidence_rate
                    number: 2  # Affects first 2 LRI instances
        unsafe_water_source:
            number: 2
            observers: True  # Set True for categorical risks
            affected_causes:
                lower_respiratory_infections:
                    effect_type: loglinear
                    number: 2
See model_specifications/model_spec_scaling.yaml for a complete working example
of a scaling simulation configuration.
The profile_sim command profiles runtime and memory usage for a single simulation of a vivarium model
given a model specification file. The underlying simulation model can
be any vivarium-based model, including the aforementioned scaling simulations as well as models in a
separate repository. In addition to the standard simulation outputs, it will generate profiling data
whose form depends on the profiler backend provided. By default, runtime profiling is performed with cProfile, but
you can also use scalene for more detailed call stack analysis.
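A minimal sketch of an invocation is shown below; the positional model specification argument is an assumption, so check profile_sim --help for the actual interface and for the option that selects the scalene backend:
(vivarium_profiling) :~$ profile_sim model_specifications/model_spec_scaling.yaml
...profiling data is written in addition to the standard simulation outputs...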
The run_benchmark command runs multiple iterations of one or more model specifications in order to compare
the results. It requires at least one baseline model for comparison
and any number of 'experiment' models to benchmark against the baseline, which can be passed via glob patterns.
You can separately configure the sample size of runs for the baseline and experiment models. The command aggregates
the profiling results and generates summary statistics and visualizations for a default set of important function calls
to help identify performance bottlenecks.
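A hypothetical invocation is sketched below; the --baseline and --experiment option names and the baseline file name are assumptions, so check run_benchmark --help for the real interface. The glob pattern lets you benchmark several experiment specifications at once:
(vivarium_profiling) :~$ run_benchmark --baseline model_specifications/model_spec_baseline.yaml --experiment "model_specifications/model_spec_scaling*.yaml"
...each model specification is run repeatedly and its profiling results are aggregated...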
The command creates a timestamped directory containing:
- benchmark_results.csv: Raw profiling data for each run
- summary.csv: Aggregated statistics (automatically generated)
- performance_analysis.png: Performance charts (automatically generated)
- Additional analysis plots for runtime phases and bottlenecks
The summarize command processes benchmark results and creates visualizations.
This runs automatically after run_benchmark, but can also be run manually
for custom analysis after the fact.
By default, this creates the following files in the specified output directory:
- summary.csv: Aggregated statistics with mean, median, std, min, max for all metrics, plus percent differences from baseline
- performance_analysis.png: Runtime and memory usage comparison charts
- runtime_analysis_*.png: Individual phase runtime charts (setup, run, etc.)
- bottleneck_fraction_*.png: Bottleneck fraction scaling analysis
You can also generate an interactive Jupyter notebook containing the same default plots
and summary dataframe by passing the --nb flag, in which case the command also creates an
analysis.ipynb file in the output directory.
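For example, assuming the command takes the benchmark results directory as a positional argument (an assumption; check summarize --help for the actual interface):
(vivarium_profiling) :~$ summarize <benchmark_results_dir> --nb
...summary.csv, the analysis plots, and analysis.ipynb are written to the output directory...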
By default, the benchmarking tools extract standard profiling metrics:
- Simulation phases: setup, initialize_simulants, run, finalize, report
- Common bottlenecks: gather_results, pipeline calls, population views
- Memory usage and total runtime
You can customize which metrics to extract by creating an extraction config YAML file.
See extraction_config_example.yaml for a complete annotated example.
Basic Pattern Structure:
patterns:
    - name: my_function            # Logical name for the metric
      filename: my_module.py       # Source file containing the function
      function_name: my_function   # Function name to match
      extract_cumtime: true        # Extract cumulative time (default: true)
      extract_percall: false       # Extract time per call (default: false)
      extract_ncalls: false        # Extract number of calls (default: false)
This YAML file can then be passed to the run_benchmark and summarize commands
using the --extraction_config flag. summarize will automatically create runtime
analysis plots for the specified functions.
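For example, with a custom config saved under the hypothetical name my_extraction_config.yaml (the commands' other arguments are elided here):
(vivarium_profiling) :~$ run_benchmark ... --extraction_config my_extraction_config.yaml
(vivarium_profiling) :~$ summarize ... --extraction_config my_extraction_config.yaml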