process_group_test - Enhance fault tolerance collective tests #109

Closed
@allenwang28

Description

process_group_test is the test suite responsible for testing collectives.

Currently it supports two basic correctness tests: (1) running the collective with a single process group to ensure that it passes and that the tensors are sane, and (2) running the collective with two process groups to ensure that it succeeds and that the numerics are correct.

As mentioned in #108, @rohan-varma had two great suggestions:

  1. Split up sequential collective tests into individual tests, and
  2. Test for actual fault tolerance

These are valuable contributions and will require some restructuring / refactoring of the tests.

Collectives as individual tests

One idea is to define an explicit list of the supported collectives and parameterize the tests over that list for each backend. This is explicit, but one issue with this approach is that process groups are expensive to spin up and tear down. In #103, this doubled the execution time of the test for a limited number of collectives.
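A minimal sketch of the explicit-list idea, using only the standard library. The names `COLLECTIVES`, `BACKENDS`, and `run_collective` are hypothetical placeholders, not the suite's actual API; `run_collective` stands in for creating a process group and running one collective on it.

```python
import itertools
import unittest

# Hypothetical lists; the real suite's supported collectives and backends may differ.
COLLECTIVES = ["allreduce", "allgather", "broadcast"]
BACKENDS = ["gloo", "nccl"]


def run_collective(backend: str, collective: str) -> bool:
    """Stand-in for creating a process group and running one collective on it."""
    return backend in BACKENDS and collective in COLLECTIVES


class CollectiveTest(unittest.TestCase):
    """Test methods are attached below, one per (backend, collective) pair."""


def _make_test(backend: str, collective: str):
    def test(self: unittest.TestCase) -> None:
        self.assertTrue(run_collective(backend, collective))

    return test


# Generate one named test per combination so each collective passes or
# fails independently in the test report.
for _backend, _collective in itertools.product(BACKENDS, COLLECTIVES):
    setattr(
        CollectiveTest,
        f"test_{_collective}_{_backend}",
        _make_test(_backend, _collective),
    )
```

Each generated test is reported under its own name (for example `test_allreduce_gloo`), so a failure in one collective no longer hides the results of the others.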

Building on the above approach, this cost could be mitigated by creating the process groups once in a setUpClass method and running all tests on those process groups.
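The shared-setup idea could look roughly like the sketch below. `FakeProcessGroup` is a hypothetical stand-in for a real, expensive-to-create process group; only the `unittest` class-level fixture pattern is the point here.

```python
import unittest


class FakeProcessGroup:
    """Hypothetical stand-in for an expensive-to-create process group."""

    instances_created = 0

    def __init__(self, backend: str) -> None:
        type(self).instances_created += 1
        self.backend = backend
        self.closed = False

    def allreduce(self, values: list[float]) -> float:
        # A sum reduction over the contributions from all ranks.
        return sum(values)

    def shutdown(self) -> None:
        self.closed = True


class SharedProcessGroupTest(unittest.TestCase):
    pg: FakeProcessGroup

    @classmethod
    def setUpClass(cls) -> None:
        # Pay the setup cost once for the whole class, not once per test.
        cls.pg = FakeProcessGroup("gloo")

    @classmethod
    def tearDownClass(cls) -> None:
        cls.pg.shutdown()

    def test_allreduce(self) -> None:
        self.assertEqual(self.pg.allreduce([1.0, 2.0]), 3.0)

    def test_allreduce_single_rank(self) -> None:
        self.assertEqual(self.pg.allreduce([5.0]), 5.0)
```

The trade-off is that tests now share state: a test that leaves the process group in a bad state can affect every test that runs after it in the same class.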

TestDistBackend and MultiProcessTestCase in PT-D may provide some pointers for doing something like this.

From @wconstab:

There is a test class called MultiProcContinuousTest, defined in the same test utils file as MultiProcessTestCase, that shares a PG across test instances. It requires main to be defined differently for that test file and currently isn't compatible with having MultiProcessTestCase instances inside the same file, but it is in use in a number of PT-D tests because it saves a lot of time.

Fault Tolerance

The collective tests currently only check "wrapper correctness"; adding tests for fault-tolerant behavior would provide more confidence.

One idea for testing actual fault tolerance: model failure scenarios, e.g. the sender fails, or the sender succeeds but the receiver fails, and verify that an appropriate exception or timeout is returned to the user process. The collective is then retried and should succeed after the PG is reconfigured.
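The fail-then-recover scenario above can be sketched with a toy model. Everything here is an assumption made for illustration: `FakeFaultyProcessGroup`, `inject_failure`, `send_recv`, and `reconfigure` are hypothetical names, and a real test would inject failures into actual worker processes instead of an in-memory set.

```python
import unittest


class FakeFaultyProcessGroup:
    """Hypothetical process group with fault injection, for illustration only."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.failed_ranks: set[int] = set()

    def inject_failure(self, rank: int) -> None:
        self.failed_ranks.add(rank)

    def send_recv(self, src: int, dst: int, value: int) -> int:
        # A failure on either side surfaces as an error to the caller.
        for rank in (src, dst):
            if rank in self.failed_ranks:
                raise TimeoutError(f"rank {rank} did not respond")
        return value

    def reconfigure(self) -> None:
        # Models replacing or restarting the failed peers.
        self.failed_ranks.clear()


class FaultToleranceTest(unittest.TestCase):
    def _check_fail_then_recover(self, failed_rank: int) -> None:
        pg = FakeFaultyProcessGroup(world_size=2)
        pg.inject_failure(failed_rank)
        # The failure must be reported to the user process.
        with self.assertRaises(TimeoutError):
            pg.send_recv(src=0, dst=1, value=42)
        # After reconfiguration, the retry should succeed.
        pg.reconfigure()
        self.assertEqual(pg.send_recv(src=0, dst=1, value=42), 42)

    def test_sender_fails(self) -> None:
        self._check_fail_then_recover(failed_rank=0)

    def test_receiver_fails(self) -> None:
        self._check_fail_then_recover(failed_rank=1)
```

Keeping the scenario in one helper means each failure mode is a one-line test, and new modes can be added without duplicating the retry logic.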
