Issue in OSHMEM idle contexts reuse #13060

Open
@zhongchen530

In file ompi/oshmem/mca/spml/ucx/spml_ucx.c, around line 1140:

    /* Check if we have an idle context to reuse */
    SHMEM_MUTEX_LOCK(mca_spml_ucx.internal_mutex);
    for (i = 0; i < idle_array->ctxs_count; i++) {
        if (idle_array->ctxs[i]->options & options) {
            ucx_ctx = idle_array->ctxs[i];
            _ctx_remove(idle_array, ucx_ctx, i);
            break;
        }
    }

The code above attempts to reuse a context from an array of idle contexts. However, the condition (idle_array->ctxs[i]->options & options) at line 1140 means that an idle context with options 0 will never be reused: a bitwise AND with 0 is always 0, so the check can never succeed for it.

If you modify the OSHMEM code to print ctxs_count (the length of idle_array) and run the following simple program,

#include <shmem.h>
int main(void) {
    shmem_init();
    /* Create and destroy a context with options 0, ten times over. */
    for (int i = 0; i < 10; i++) {
        shmem_ctx_t ctx;
        shmem_ctx_create(0, &ctx);
        shmem_ctx_destroy(ctx);
    }
    shmem_finalize();
    return 0;
}

you will observe that ctxs_count grows from 0 to 9, indicating that the idle contexts are not being reused and that the idle array keeps growing. I believe this is not the expected behavior, and the fix would be as simple as changing the condition from idle_array->ctxs[i]->options & options to idle_array->ctxs[i]->options == options.
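The growth can be reproduced without OSHMEM using a toy model of the idle array. This is only a sketch: every name below (demo_idle_pool, demo_create, demo_destroy, demo_run_cycles) is hypothetical, the removal strategy is not meant to mirror _ctx_remove, and it assumes that destroying a context returns it to the idle array, as the observed ctxs_count suggests.

```c
#include <stddef.h>

#define DEMO_POOL_CAP 64  /* hypothetical capacity, chosen for this sketch only */

/* Toy stand-in for the idle array: only the option bits are tracked. */
typedef struct {
    long   opts[DEMO_POOL_CAP];
    size_t count;
} demo_idle_pool;

/* "Create": take a matching idle entry if there is one. When nothing
 * matches, the pool is left untouched (a fresh context would be made).
 * use_exact selects the proposed == check instead of the current & check. */
static void demo_create(demo_idle_pool *pool, long options, int use_exact)
{
    for (size_t i = 0; i < pool->count; i++) {
        int hit = use_exact ? (pool->opts[i] == options)
                            : ((pool->opts[i] & options) != 0);
        if (hit) {
            /* Remove entry i by moving the last entry into its slot. */
            pool->opts[i] = pool->opts[--pool->count];
            return;
        }
    }
}

/* "Destroy": hand the context back to the idle pool (assumed behavior). */
static void demo_destroy(demo_idle_pool *pool, long options)
{
    if (pool->count < DEMO_POOL_CAP) {
        pool->opts[pool->count++] = options;
    }
}

/* Run n create/destroy cycles and return the final idle count. */
static size_t demo_run_cycles(int n, long options, int use_exact)
{
    demo_idle_pool pool = { {0}, 0 };
    for (int i = 0; i < n; i++) {
        demo_create(&pool, options, use_exact);
        demo_destroy(&pool, options);
    }
    return pool.count;
}
```

In this model, demo_run_cycles(10, 0, 0) leaves ten idle entries behind, while demo_run_cycles(10, 0, 1) leaves one, because the exact comparison lets each cycle pick up the entry released by the previous one.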

Moreover, the current code can also lead to correctness issues, because it can hand out an idle context created with more restrictive options to a request that asks for less restrictive ones. Consider the following program.

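#include <shmem.h>
int main(void) {
    shmem_init();
    shmem_ctx_t ctx1, ctx2;
    /* ctx1 goes to the idle array with both options set. */
    shmem_ctx_create(SHMEM_CTX_NOSTORE | SHMEM_CTX_PRIVATE, &ctx1);
    shmem_ctx_destroy(ctx1);
    /* ctx2 requests only NOSTORE, yet the & check matches ctx1's entry. */
    shmem_ctx_create(SHMEM_CTX_NOSTORE, &ctx2);
    shmem_ctx_destroy(ctx2);
    shmem_finalize();
    return 0;
}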

With the current code, a context configured with both SHMEM_CTX_NOSTORE and SHMEM_CTX_PRIVATE will be handed out for a request that specifies only SHMEM_CTX_NOSTORE. This is a correctness issue, because ctx2 may be shared. In fact, the current code crashes when the above program runs. Changing & to == fixes this issue as well.
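The difference between the two conditions can be checked in isolation. The sketch below uses made-up bit values (the DEMO_CTX_* macros are hypothetical; the real constants come from shmem.h and may differ) and two hypothetical helper functions that mirror the current and the proposed comparison.

```c
/* Hypothetical option bits for illustration; not the values from <shmem.h>. */
#define DEMO_CTX_PRIVATE (1L << 0)
#define DEMO_CTX_NOSTORE (1L << 2)

/* Current condition: accepts any idle context sharing at least one bit. */
static int demo_match_overlap(long idle_opts, long requested)
{
    return (idle_opts & requested) != 0;
}

/* Proposed condition: accepts only an idle context with identical options. */
static int demo_match_exact(long idle_opts, long requested)
{
    return idle_opts == requested;
}
```

With these definitions, demo_match_overlap(DEMO_CTX_NOSTORE | DEMO_CTX_PRIVATE, DEMO_CTX_NOSTORE) is true, which is the mismatched reuse described above, while demo_match_exact rejects the same pair. For a request with options 0, the overlap check rejects even an idle context whose options are also 0, whereas the exact check accepts it.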

Section 9.4.1 of the OpenSHMEM 1.4 specification clearly states that multiple options can be combined with a bitwise OR operation, so requests like the ones above are valid usage.
