TCLB was part of the LUMI Pilot Program (https://lumi-supercomputer.eu/lumis-second-pilot-phase-in-full-swing/), which is now ending.
Apart from performance results, there are some issues that might be worth considering. LUMI is a brand-new HPE/Cray machine with AMD Instinct MI250X 128GB HBM2e cards.
- Performance: not there yet. While scalability up to 1024 GPUs works nicely, per-GPU performance is unsatisfactory. Some profiling data is available and we could gather more; this is a subject for another talk/issue.
- rinside: HPE/Cray ship R statically linked, which prevents building Rcpp/RInside against it. HPC centers will resist installing non-HPE versions of R, so users will need to build it themselves (see the first sketch after this list). This makes life harder, especially on Cray :/
- GPU-to-core/process allocation has to be done in a specific order. This can be handled by a shell wrapper script and the ROCR_VISIBLE_DEVICES variable (see the second sketch after this list), but TCLB then requires gpu_oversubscribe="True" in the XML to run. A configure flag to disable that check would be nice: imagine copying an XML from another system and having the job fail because of this after 2 days in the queue.
- More generic in-situ processing, e.g. via ADIOS, would be nice. I could work on that, but let me know if you are interested.
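For the R issue, a minimal sketch of building a user-local R with a shared libR (which Rcpp/RInside require) could look like the following. The R version, prefix, and any compiler module are placeholders, not LUMI-specific instructions:

```bash
# Sketch: user-local R build with shared libR, needed by Rcpp/RInside.
# Version, prefix and toolchain are assumptions - adjust to the actual environment.
wget https://cran.r-project.org/src/base/R-4/R-4.3.2.tar.gz
tar xf R-4.3.2.tar.gz && cd R-4.3.2
./configure --prefix=$HOME/opt/R --enable-R-shlib   # --enable-R-shlib is the key flag
make -j 16 && make install
export PATH=$HOME/opt/R/bin:$PATH
Rscript -e 'install.packages(c("Rcpp","RInside"), repos="https://cloud.r-project.org")'
```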
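For the GPU binding, a sketch of the kind of per-rank wrapper I mean (the 1:1 local-rank-to-GCD mapping is illustrative and may need reordering for the node topology):

```bash
#!/bin/bash
# select_gpu.sh - sketch of a per-rank GPU binding wrapper for Slurm.
# Each rank sees only the GCD matching its local rank id.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

It would be used as something like `srun --ntasks-per-node=8 ./select_gpu.sh ./CLB/<model>/main case.xml`, which is exactly the setup that currently forces gpu_oversubscribe="True" in the XML.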
As for results, I ran a 0.8e9-node lattice dissolution simulation for AGU :D That covers around half of the 12 cm experimental core at 30 µm resolution.
I still have a few days left - if you want to check something, we can do it.