Running MESSAGE_ix in HPCs #967

criscfer · 2025-07-23T06:05:54Z

criscfer
Jul 23, 2025

What is this about?

My team at the University of Victoria is looking to execute a series of many runs of our version of the MESSAGEix model, MESSAGEix for Canada, to perform a sensitivity analysis on our inputs to the model. We have been finding it quite difficult, however, to perform many runs in parallel, or even in series, using some HPC and cloud computing resources we have access to.

The bottlenecks we have experienced so far have mostly been due to issues around how GAMS interacts with the model. Some examples of failure cases are:

Failing to execute model runs in a series of slurm jobs (the first few runs would work as expected, but the chain of jobs would start to fail after a certain number of model runs was reached, for example after the fourth model run finished executing);
Failing to execute model runs in parallel in a controlled cloud virtual machine with GAMS Engine installed, because GAMS Engine does not seem to work with Python-based models and can only execute .gms model files.

Has anyone else in the wider community ever tried to run MESSAGEix on HPC clusters or cloud environments and experienced similar issues? Is there a correct way to carry out model executions on distributed computing resources?

I have browsed through the documentation, but have not found a clear method for running the model in these remote environments. The only pages I could find related to this topic coming out of the main documentation are listed below.

Any help is greatly appreciated.

glatterf42 · 2025-07-23T06:41:47Z

glatterf42
Jul 23, 2025
Maintainer

Hi @criscfer, thanks for reaching out :)
For completeness, there's also this guide on how to run message_ix on the IIASA-internal HPC cluster in our docs (the cluster is also using slurm). However, this explicitly mentions that Scenarios were not run in parallel.
I'm not aware of anyone in our team running Scenarios in parallel, or using cloud environments, for that matter, but I can ask in tomorrow's team meeting. We certainly don't test it in our CI systems, so there's no guarantee it works, but generally speaking, I would expect it to work.
The same holds true for running many Scenarios in series: I don't see why this shouldn't work. Could you please elaborate on your errors, i.e. do you have an actual traceback for us to study? Or the slurm job script you were trying to use?

1 reply

khaeru Jul 23, 2025
Maintainer

We also have the message_ix_models.util.slurm utilities, documented here and code visible here.

Apologies that these are not well-linked from the above HOWTO and other pages.

glatterf42 · 2025-07-23T06:43:44Z

glatterf42
Jul 23, 2025
Maintainer

(Just moving this to a discussion because that seems more appropriate; if we discover an underlying issue with message_ix, we can still create a dedicated issue for that :)

1 reply

khaeru Jul 23, 2025
Maintainer

@criscfer I agree here with Fridolin that this is more a 'discussion', and one that branches out into potentially multiple threads. Just to hint at what those are:

Has anyone else in the wider community ever tried to run MESSAGEix on HPC clusters or cloud environments and experienced similar issues?

I can't speak for the 'wider community', but at least in the IIASA/ECE team we have a number of people using our internal Slurm-based cluster, 'UniCC', on a regular basis. (Because this cluster is only available within IIASA, its documentation is not public, though likely could be shared privately.) These include @yiyi1991 @macflo8 @Wegatriespython @junukitashepard and others.

It may help if we arrange a short show-and-tell where one of those colleagues shows how they use the UniCC system for particular workflows, and then you explain your system (also see below) and the way you are using it. This could reveal some techniques for you to try, or clarify the changes needed to enable workflows that would work on your system.

Is there a correct way to carry out model executions on distributed computing resources?

There is never one correct way, but rather ways that are (a) functional and (b) efficient on given systems (hardware + software). Often the appropriate usage depends very heavily on the given system; for example, a system running HTCondor should be used very differently from one with Slurm, and two systems running either HTCondor or Slurm may be configured very differently.

This is one reason that the docs are structured as collections of tools and (single-system-scoped) HOWTOs, instead of claiming to provide a one-size-fits-all guide. I think it would be misleading for us to claim there could be one.

some HPC and cloud computing resources we have access to

Some examples of failure cases are:

Failing to execute model runs in a series of slurm jobs (the first few runs would work as expected, but the chain of jobs would start to fail after a certain number of model runs was reached, for example after the fourth model run finished executing);

Failing to execute model runs in parallel in a controlled cloud virtual machine with GAMS Engine installed, because GAMS Engine does not seem to work with Python-based models and can only execute .gms model files.

Just like debugging issues running code on a local laptop or desktop, it helps to have a minimum reproducible example and a precise, complete description of the system the code is running on and the error message(s) or symptoms seen.

I would encourage you to include that system description in a fresh discussion thread, and to open one such discussion thread for each distinct issue (you can cross-link the common info between threads).
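As a starting point for such a system description, here is a minimal sketch that uses only the Python standard library. The function name `describe_system` and the default package list are illustrative assumptions, not part of message_ix; adjust the list to whatever your model actually installs.

```python
import json
import platform
import sys
from importlib import metadata


def describe_system(packages=("ixmp", "message_ix", "message-ix-models")):
    """Collect basic facts about the interpreter, OS, and installed packages.

    `packages` is an illustrative default; pass the distributions you care about.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Print as JSON so the output can be pasted into a discussion thread.
    print(json.dumps(describe_system(), indent=2))
```

Running the same script on the local machine and inside a job gives two snippets that can be pasted next to each other.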

criscfer · 2025-08-12T20:01:02Z

criscfer
Aug 12, 2025
Author

Hi all. I have been trying to recreate the issues I initially highlighted. However, it now seems that the cluster where we run our SLURM jobs is working as I originally expected and is executing our model runs.

This led me to other questions which you may be able to address, concerning how the MESSAGEix framework behaves in SLURM job environments.

I have noticed that my test scenario runs as expected when executed locally, solving to optimality and creating our output files. When executing the same scenario inside a SLURM job, however, the model solves but is considered infeasible, and therefore does not produce our own model report outputs. This is a very strange and rare behavior, which I have seen before in local runs of the model.

Have you ever seen this behavior before from remote jobs? Do you have any insight that can help me address this new unexpected behavior?

Additionally, do you have any tips for running MESSAGEix in a job array that are not already covered by the resources mentioned earlier in this thread?

Specs and Additional Information

The table below contains the specs of the jobs I have been running, and how I run the model locally.

Parameter	Remote Cluster (Compute Canada)	Local Computer
Solver	Coin-OR CBC	Coin-OR CBC
Solution Time	~ 5 minutes	~ 30 minutes
CPU count	8	8
RAM	128 GB	32 GB
Total run time	~ 12 minutes	~ 40 minutes

See below for the CBC solver report from the cloud run of the model.

--- MESSAGE_run.gms(4570) 140 Mb
    +++ Solve the perfect-foresight version of MESSAGEix +++
--- Generating LP model MESSAGE_LP
--- MESSAGE_run.gms(4588) 388 Mb
---   384,808 rows  338,637 columns  1,795,274 non-zeroes
--- Range statistics (absolute non-zero finite values)
--- RHS       [min, max] : [ 8.550E-07, 2.283E+16] - Zero values observed as well
--- Bound     [min, max] : [        NA,        NA] - Zero values observed as well
--- Matrix    [min, max] : [ 6.579E-06, 1.214E+06]
--- Executing CBC (Solvelink=2): elapsed 0:00:19.271

COIN-OR CBC      48.6.1 67fbb04b Jan 23, 2025          LEG x86 64bit/Linux

COIN-OR Branch and Cut (CBC Library 2.10.11)
written by J. Forrest
Space for names approximately 42 MB.
Use statement '<modelname>.dictfile=0;' to turn dictionary off.
*** Error Cannot open parameter file "/localscratch/criscfer.65761409.0/env/lib/python3.12/site-packages/message_ix/model/cbc.opt"
*** Error Error code = 2; No such file or directory

Parallel mode: none, using 48 threads in linear algebra

Calling CBC main solution routine...
Analysis indicates model infeasible or unbounded
1 infeasibilities
Analysis indicates model infeasible or unbounded
Perturbing problem by 0.001% of 128.10319 - largest nonzero change 0 ( 0%) - largest zero change 9.9997853e-05

(...)

Primal infeasible - objective value 85343.768

(...)

As requested, I am also attaching the batch file I used to run the jobs on the cloud here:

#!/bin/bash
#SBATCH --account=user_group_account
#SBATCH --cpus-per-task=8
#SBATCH --mem=125G
#SBATCH --time=2:00:00
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=ALL
#SBATCH --array=1
#SBATCH --output=slurm-YOUR_JOB_NAME.out


echo "Current working directory: $PWD"
echo "Starting run at: `date`"

# Load Compute Canada modules
module load StdEnv/2023 gcc/12.3 python/3.12 arrow/18.1.0 postgresql/16.0 rust java/17.0.6

# Create virtual environment
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate

# Install packages in the environment
pip install --no-index --upgrade pip
pip install -e .

# Change the solver for MESSAGE from CPLEX to CBC
sed -i 's/option LP = CPLEX ;/option LP = CBC ;/' $SLURM_TMPDIR/env/lib/python3.12/site-packages/message_ix/model/MESSAGE/auxiliary_settings.gms

# Export the path to the GAMS installation - SESIT team's project folder
export PATH=$PATH:/project/rrg-mcpher16/SHARED/software/gams/gams48.6_linux_x64_64_sfx/

# Delete bloat from the folder
rm -rf docs/ Documentation/ tests/
rm -rf *.yml LICENSE *.pdf


# Begin running script
srun python MESSAGE_CA.py

The setup.py files are also slightly different between the remote and the local versions. Here is a copy of setup.py for the remote environment, with comments on how the lines differ from the local version of the file.

from setuptools import setup, find_packages
setup(
    name='messageix_canada',
    version='0.1.0',
    description='An Open Source Integrated Assessment Model for Canada',
    author='Your Name',
    author_email='[email protected]',
    packages=find_packages(include=['MESSAGE_CA', 'MESSAGE_CA.*']),  # Adjust if needed
    install_requires=[
        'setuptools >= 64',
        'pandas',
        'numpy<2.0.0', # local version runs most recent version of numpy
        'plotly',
        'pycountry',
        'ixmp',
        'message_ix >= 3.4.0',
        'pyam-iamc >= 0.6',
        'scipy',
        'toml',
        'ipykernel',
        'pytest',
        'click',
        'message-ix-models[report]',
        'typing_extensions', # Dependency of IXMP. This line is not included in the local copy of setup.py
    ],
    entry_points={
        'console_scripts': [
            'message-canada=MESSAGE_CA:messagerun',  
        ],
    },
    extras_require={
        'extras': ['geotext', 'message-ix-models'],
    },
)

3 replies

Wegatriespython Aug 12, 2025
Maintainer

Hey @criscfer,

Thanks for the detailed log and the file. Just dropping my thoughts here. I think the infeasibility is almost certainly a solver issue (we can rule out the Python side). You can additionally verify this, if you haven't already, by comparing the input GDX files, either with the GDX diff tool under the Tools section of GAMS Studio or directly from the CLI: gdxdiff file1 file2 {diffile} {options}
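If it is easier to script the comparison, a rough sketch of wrapping that CLI call from Python could look like the following. The helper names and the default output name `diffile.gdx` are my own placeholders, and the GDX file names in the usage comment are hypothetical; only the positional `gdxdiff file1 file2 diffile` form above is assumed.

```python
import shutil
import subprocess
from pathlib import Path


def gdxdiff_command(gdx_a, gdx_b, diff_file="diffile.gdx"):
    """Build the argument list for `gdxdiff file1 file2 diffile`."""
    return ["gdxdiff", str(Path(gdx_a)), str(Path(gdx_b)), str(diff_file)]


def run_gdxdiff(gdx_a, gdx_b, diff_file="diffile.gdx"):
    """Run gdxdiff and return (exit code, captured stdout)."""
    if shutil.which("gdxdiff") is None:
        # gdxdiff ships with GAMS, so the GAMS directory must be on PATH.
        raise FileNotFoundError("gdxdiff not found on PATH")
    result = subprocess.run(
        gdxdiff_command(gdx_a, gdx_b, diff_file),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


# Hypothetical usage, with one input GDX copied from each system:
# code, report = run_gdxdiff("input_local.gdx", "input_hpc.gdx")
# print(code, report)
```

Check the gdxdiff documentation for the meaning of the exit code and for tolerance options before relying on the result.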

On the solver itself: the error says the parameter file is not found. Perhaps it's there on the local machine; solver options can lead to different outcomes w.r.t. feasibility. Additionally, as a recommendation, HiGHS is a better option than CBC, and it's open source too, so you could consider swapping to it if CPLEX is not an option. CBC doesn't support interior-point methods, which we default to in message-ix-models to solve most models because of difficulties in scaling, etc.
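To check quickly whether the option file exists on each machine, something like this small sketch may help. `find_option_file` is a hypothetical helper, not a message_ix function; it only relies on the `<solver>.opt` naming seen in the log (`cbc.opt`).

```python
from pathlib import Path


def find_option_file(model_dir, solver="cbc"):
    """Return the path of `<solver>.opt` inside model_dir, or None if it is missing."""
    candidate = Path(model_dir) / f"{solver}.opt"
    return candidate if candidate.is_file() else None


# Example: point this at the message_ix/model directory named in the CBC error,
# once locally and once inside the job, and compare the results.
# print(find_option_file("/path/to/site-packages/message_ix/model"))
```

If the file exists locally but not in the job's virtual environment, the two runs are not using the same solver settings, which would be worth ruling out first.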

criscfer Aug 12, 2025
Author

Hi @Wegatriespython thanks for your insights!

I suspected it could be an issue with the solver, but since I wasn't able to reproduce it locally I thought it could be something else. I will perform a few tests locally and then try to reproduce them in the cloud using the HiGHS solver.

khaeru Aug 13, 2025
Maintainer

Agreed with @Wegatriespython here—I would look first at the hardware, and then exact solver versions, configuration, log output, etc.

The same solver (CPLEX, CBC, HiGHS) may choose different parameters for itself, based on the environment it finds itself in. For instance, CPLEX can choose to use different numbers of threads based on the number of cores it sees on the system. It would be nice if solver outcomes were robust to these differences, but they aren't necessarily so. CPU architecture can also change solver behaviour.
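As one concrete check: the CBC log above reports "using 48 threads in linear algebra", while the batch script requests 8 CPUs per task. A minimal sketch for printing what a process can actually see, assuming Linux and a Slurm job (the function name `visible_cpus` is illustrative):

```python
import os


def visible_cpus():
    """Report CPU counts as seen by this process.

    - os_cpu_count: all CPUs on the node
    - affinity_cpus: CPUs this process may run on (Linux only, else None)
    - slurm_cpus_per_task: value Slurm exports when --cpus-per-task is set
    """
    allocated = os.environ.get("SLURM_CPUS_PER_TASK")
    if hasattr(os, "sched_getaffinity"):
        affinity = len(os.sched_getaffinity(0))
    else:
        affinity = None  # not available on this platform
    return {
        "os_cpu_count": os.cpu_count(),
        "affinity_cpus": affinity,
        "slurm_cpus_per_task": int(allocated) if allocated else None,
    }


if __name__ == "__main__":
    print(visible_cpus())
```

If the node-level count is much larger than the allocation, a solver that sizes its thread pool from the node may behave differently from the same solver on an 8-core laptop.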

With a complete side-by-side of these details on your local vs. HPC system, it would be possible to narrow down any such possible causes.
