Best practices
**************

.. contents::
   :depth: 3
   :local:

Other best practices information
================================

MPI
===

Problems that best practices help you avoid
-------------------------------------------

These recommendations are derived from our experience mitigating container
MPI issues. Note that, despite marketing claims, no container implementation
has “solved” MPI or is free of warts; the issues are numerous, multifaceted,
and dynamic.

Key concepts and related issues include:

1. **Workload management.** Running applications on HPC clusters requires
   resource management and job scheduling. Put simply, resource management
   is the act of allocating and restricting compute resources, e.g., CPU and
   memory, whereas job scheduling is the act of prioritizing and enforcing
   resource management. *Both require privileged operations.*

   Some privileged container implementations attempt to provide their own
   workload management, often referred to as “container orchestration”.

   Charliecloud is lightweight and completely unprivileged. We rely on
   existing, reputable, and well-established HPC workload managers such as
   Slurm.

2. **Job launch.** When a multi-node MPI job is launched, each node must
   launch a number of containerized processes, i.e., *ranks*. Doing this
   unprivileged and at scale requires interaction between the application
   and the workload manager; that is, something like the Process Management
   Interface (PMI) is needed to facilitate the job launch.

3. **Shared memory.** Processes in separate sibling containers cannot
   communicate using single-copy *cross-memory attach* (CMA); they are
   limited to double-copy POSIX or SysV shared memory. The solution is to
   put all ranks in the *same* container with :code:`ch-run --join`, as in
   the sketch after this list. (See :ref:`faq_join` for details.)

4. **Network fabric.** Performant MPI jobs must recognize and use a system’s
   high-speed interconnect. Common issues that arise are:

   a. Libraries required to use the interconnect are proprietary or
      otherwise unavailable to the container.

   b. The interconnect is not supported by the container MPI.

   In both cases, the containerized MPI application will either fail or run
   significantly slower.
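
For example, here is a minimal launch sketch showing how items 2 and 3 fit
together under Slurm. It assumes a Slurm installation with PMIx support, an
image unpacked at :code:`/var/tmp/mpi-img`, and an application binary
:code:`/usr/bin/mpi_app` inside the image; all of these names are
illustrative.

.. code-block:: bash

   # Hypothetical launch of a containerized MPI job under Slurm: one
   # containerized rank starts per task, and Slurm supplies process
   # management through its PMIx plugin.
   $ srun --mpi=pmix ch-run --join /var/tmp/mpi-img -- /usr/bin/mpi_app

   # --join places all ranks on a node in the same container, so
   # single-copy cross-memory attach (CMA) works between them.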

These problems can be avoided, and the following sections describe our
recommendations for doing so.

Recommendations TL;DR
---------------------

Generally, we recommend building a flexible MPI container using:

a. **libfabric**, to flexibly manage process communication over a diverse
   set of network fabrics;

b. a parallel **process management interface** (PMI) compatible with the
   host workload manager (e.g., PMI2, PMIx, flux-pmi); and

c. an **MPI** that supports both the libfabric (a) and the PMI (b) selected
   above.

More experienced MPI and unprivileged-container users can find success with
MPI replacement (injection); however, such practices are beyond the scope of
this documentation.

The remaining sections detail the reasoning behind our approach. We recommend
referencing, or directly using, our examples
:code:`examples/Dockerfile.{libfabric,mpich,openmpi}`.

Use libfabric
-------------

`libfabric <https://ofiwg.github.io/libfabric>`_ (a.k.a. Open Fabrics
Interfaces or OFI) is a low-level communication library that abstracts
diverse networking technologies. It defines *providers* that implement the
mapping between application-facing software (e.g., MPI) and network-specific
drivers, protocols, and hardware. These providers have been co-designed with
fabric hardware and application developers with a focus on HPC needs.
libfabric lets us more easily manage MPI communication over diverse
high-speed interconnects (a.k.a. *fabrics*).

From our libfabric example (:code:`examples/Dockerfile.libfabric`):

.. literalinclude:: ../examples/Dockerfile.libfabric
   :language: docker
   :lines: 116-135

The above compiles libfabric with several “built-in” providers, i.e.,
:code:`psm3` (on x86-64), :code:`rxm`, :code:`shm`, :code:`tcp`, and
:code:`verbs`, which enables MPI applications to run efficiently over most
verbs devices using the TCP, IB, OPA, and RoCE protocols.
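
To check which providers actually ended up in an image, one option is to run
libfabric’s :code:`fi_info` utility, which is installed alongside libfabric,
inside the container; a sketch, with an illustrative image path:

.. code-block:: bash

   # List the fabrics and providers visible to the container's libfabric.
   $ ch-run /var/tmp/mpi-img -- fi_info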

Two key advantages of using libfabric are that (1) the container’s libfabric
can make use of “external”, i.e., dynamic-shared-object (DSO), providers, and
(2) libfabric replacement is simpler than MPI replacement and preserves the
original container MPI. That is, managing host/container MPI ABI
compatibility is difficult and error-prone, so we instead manage the more
forgiving libfabric ABI compatibility.

A DSO provider can be used by a libfabric that did not originally compile it;
that is, a provider can be compiled on a target host, injected into the
container along with any missing shared library dependencies, and then used
by the container’s libfabric. To build a libfabric provider as a DSO, add
:code:`=dl` to its :code:`configure` argument, e.g., :code:`--with-cxi=dl`.
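
For example, a hypothetical host-side build of libfabric with the Slingshot
:code:`cxi` provider as a DSO might look like the following; the prefix and
the availability of the CXI development files on the build host are
assumptions.

.. code-block:: bash

   # Build libfabric, compiling the cxi provider as a loadable DSO rather
   # than a built-in provider. Assumes the Cray CXI headers and libraries
   # are present on the build host.
   $ ./configure --prefix=/usr/local --with-cxi=dl
   $ make -j$(nproc) install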

A container’s libfabric can also be replaced by a host libfabric. This is a
brittle but usually effective way to give containers access to the Cray
libfabric Slingshot provider :code:`cxi`.

In Charliecloud, both of these injection operations are currently done with
:code:`ch-fromhost`, though see `issue #1861
<https://github.com/hpc/charliecloud/issues/1861>`_.
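
A hypothetical injection sketch follows. The option names (:code:`--dest` to
set the destination directory and :code:`--path` to name a host file) are our
reading of the :code:`ch-fromhost` man page, and the provider DSO name and
paths are illustrative; verify both against your Charliecloud version and
libfabric build.

.. code-block:: bash

   # Inject a host-built DSO provider into the image's libfabric provider
   # directory. Paths are illustrative; consult ch-fromhost(1) for the
   # options your Charliecloud version supports.
   $ ch-fromhost --dest /usr/local/lib/libfabric \
                 --path /opt/host-libfabric/lib/libfabric/libcxi-fi.so \
                 /var/tmp/mpi-img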

Choose a compatible PMI
-----------------------

Unprivileged processes, including unprivileged containerized processes,
cannot independently launch containerized processes on other nodes, aside
from using SSH, which isn’t scalable. We must either (1) rely on a
host-supported parallel process management interface (PMI) or (2) achieve
host/container MPI ABI compatibility through unsavory practices such as
complete container MPI replacement.

The preferred PMI implementation, e.g., PMI1, PMI2, OpenPMIx, or flux-pmi, is
whichever is best supported by your host workload manager and container MPI.

In :code:`examples/Dockerfile.libfabric`, we selected OpenPMIx because (1) it
is supported by Slurm, OpenMPI, and MPICH; (2) it is required for exascale;
and (3) OpenMPI versions 5 and newer will no longer support PMI2.
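
Under Slurm, for example, you can list the PMI plugins the host supports
before deciding; the command below is standard Slurm, though the plugins
reported vary by site.

.. code-block:: bash

   # Show the process-management (MPI) plugins this Slurm installation
   # supports, e.g., pmi2 and/or pmix.
   $ srun --mpi=list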

Choose an MPI compatible with your libfabric and PMI
----------------------------------------------------

There are various MPI implementations to consider, e.g., OpenMPI, MPICH,
MVAPICH2, and Intel MPI. We generally recommend OpenMPI; however, your MPI
implementation of choice will ultimately be the one that best supports the
libfabric and PMI most compatible with your hardware and workload manager.
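
As a rough sketch, and assuming libfabric and OpenPMIx are already installed
under :code:`/usr/local` in the image (as in our libfabric example), an
OpenMPI build against them might be configured as follows; the prefixes are
illustrative.

.. code-block:: bash

   # Configure OpenMPI to use the image's libfabric (OFI) and an external
   # PMIx rather than bundled copies, then build and install.
   $ ./configure --prefix=/usr/local \
                 --with-ofi=/usr/local \
                 --with-pmix=/usr/local
   $ make -j$(nproc) install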


.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi nist dl
.. LocalWords: ident OCFS MAGICK mpich psm rxm shm DSO pmi MVAPICH