
Takeaways from LUMI Pilot program #421

@mdzik

TCLB took part in the LUMI Pilot Program (https://lumi-supercomputer.eu/lumis-second-pilot-phase-in-full-swing/), which is now ending.

Apart from performance results, a few issues came up that may be worth consideration. LUMI is a brand-new Cray/HPE machine with AMD Instinct MI250X 128 GB HBM2e cards.

  1. Performance is not there yet: scalability up to 1024 GPUs works nicely, but per-GPU performance is unsatisfactory. Some profiling data is available, and we could gather more. This deserves a separate issue/talk.
  2. RInside: HPE/Cray ship a statically linked R, which prevents building Rcpp/RInside against it. HPC centers will resist installing non-HPE builds of R, so users will need to build R themselves. This makes life harder, especially on Cray :/
  3. GPUs must be assigned to cores/processes in a specific order. This can be handled with a shell script and a GPU-visibility environment variable, but TCLB then requires `gpu_oversubscribe="True"` in the XML to run. A configure flag to disable that check would be nice: imagine copying an XML file from another system and having the job fail because of it after two days in the queue.
  4. More generic in-situ processing, e.g. via ADIOS, would be nice. I could work on that, but let me know if you are interested.
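The per-rank GPU binding in point 3 can be sketched as a small srun wrapper. This is a minimal sketch, not the exact script used on LUMI: it assumes the standard ROCm selector `ROCR_VISIBLE_DEVICES`, and the identity mapping from node-local rank to GCD is an assumption that may need to be permuted to match the node's CPU-GPU topology.

```shell
#!/bin/bash
# select_gpu.sh -- sketch of a per-rank GPU binding wrapper for srun.
# SLURM_LOCALID is the node-local rank id set by srun (0..7 on a LUMI-G node).
# The identity mapping below (rank i -> GCD i) is an assumption; reorder it
# if the CPU-GPU affinity of the node requires a different permutation.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

Each rank then sees exactly one GCD, e.g. (hypothetical invocation) `srun --ntasks-per-node=8 ./select_gpu.sh ./tclb example.xml`.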

As for results, I ran a 0.8e9-node lattice dissolution simulation for AGU :D That is around half of the 12 cm experimental core at 30 um resolution.

I still have a few days of allocation left - if you want to check something, we can do it.
