Skip to content

New MachineView representation #1458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 40 commits into from
Oct 9, 2024

Conversation

Marsella8
Copy link
Contributor

@Marsella8 Marsella8 commented Aug 7, 2024

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

This change is Reviewable

@Marsella8 Marsella8 requested review from lockshaw and wmdi August 7, 2024 03:03
Copy link

codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 94.33805% with 68 lines in your changes missing coverage. Please review.

Project coverage is 76.83%. Comparing base (a9d10d7) to head (20f6a75).
Report is 1 commits behind head on repo-refactor.

Files with missing lines Patch % Lines
lib/op-attrs/src/op-attrs/parallel_tensor_shape.cc 0.00% 18 Missing ⚠️
lib/pcg/src/pcg/device_id.cc 23.52% 13 Missing ⚠️
lib/pcg/src/pcg/machine_view.cc 87.50% 8 Missing ⚠️
lib/pcg/src/pcg/start_invariant_machine_view.cc 80.48% 8 Missing ⚠️
lib/op-attrs/src/op-attrs/parallel_dim.cc 0.00% 6 Missing ⚠️
lib/pcg/src/pcg/machine_space_offset.cc 63.63% 4 Missing ⚠️
lib/pcg/src/pcg/machine_specification.cc 87.09% 4 Missing ⚠️
lib/compiler/src/compiler/allowed_machine_views.cc 95.23% 3 Missing ⚠️
lib/pcg/src/pcg/operator_task_space.cc 88.23% 2 Missing ⚠️
lib/utils/include/utils/containers/foldl.h 92.85% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@                Coverage Diff                @@
##           repo-refactor    #1458      +/-   ##
=================================================
+ Coverage          76.10%   76.83%   +0.72%     
=================================================
  Files                778      792      +14     
  Lines              25611    26559     +948     
  Branches             720      743      +23     
=================================================
+ Hits               19492    20406     +914     
- Misses              6119     6153      +34     
Flag Coverage Δ
unittests 76.83% <94.33%> (+0.72%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
lib/compiler/test/src/allowed_machine_views.cc 100.00% <100.00%> (ø)
...ler/machine_mapping/get_optimal_machine_mapping.cc 100.00% <100.00%> (ø)
...ne_mapping/get_tensor_set_movement_across_split.cc 99.36% <100.00%> (+0.16%) ⬆️
...st/src/compiler/machine_mapping/machine_mapping.cc 100.00% <100.00%> (ø)
...compiler/machine_mapping/machine_mapping_result.cc 100.00% <100.00%> (ø)
...-execution/include/local-execution/cost_estimate.h 0.00% <ø> (ø)
lib/local-execution/src/local_cost_estimator.cc 0.00% <ø> (ø)
lib/pcg/test/src/pcg/computation_graph_builder.cc 90.90% <ø> (ø)
lib/pcg/test/src/pcg/machine_specification.cc 100.00% <100.00%> (ø)
lib/pcg/test/src/pcg/machine_view.cc 100.00% <100.00%> (ø)
... and 30 more

... and 1 file with indirect coverage changes

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 35 of 41 files at r1, all commit messages.
Reviewable status: 35 of 41 files reviewed, 14 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/pcg/include/pcg/device_coordinates.struct.toml line 1 at r1 (raw file):

# DeviceCoordinates is exclusive to machine_view.cc, must not be used outside of it.

I don't think this requirement is necessary anymore with the new design?


lib/pcg/include/pcg/strided_rectangle.struct.toml line 17 at r1 (raw file):

  "utils/hash/unordered_multiset.h",
  "utils/fmt/unordered_multiset.h",
]

utils/hash and utils/fmt are only needed for the implementation, not the declaration

Suggestion:

includes = [
  "pcg/strided_rectangle_side.dtg.h",
  "<unordered_set>",
]

src_includes = [
  "utils/hash/unordered_multiset.h",
  "utils/fmt/unordered_multiset.h",
]

lib/pcg/src/pcg/device_id.cc line 29 at r1 (raw file):

int get_raw_id(device_id_t device_id) {
  if (get_device_type(device_id) == DeviceType::GPU) {

switch might be cleaner? You can also use overload if you'd prefer

Code quote:

  if (get_device_type(device_id) == DeviceType::GPU) {

lib/pcg/src/pcg/device_id.cc line 35 at r1 (raw file):

  } else {
    assert(false && "Unsupported DeviceType");
    return -1;

asserts will get removed in non-debug builds

Suggestion:

    throw mk_runtime_error(fmt::format("Unsupported DeviceType {} for device_id_t {}", get_device_type(device_id), device_id));

lib/pcg/test/src/pcg/machine_view.cc line 66 at r1 (raw file):

        CHECK(expected == result);
      }
      SUBCASE("get_last_device_id") {

Test functions separately (i.e., group by get_last_device_id rather than by MachineView tested on) except for particularly trivial utility functions


lib/pcg/test/src/pcg/machine_view.cc line 70 at r1 (raw file):

      }
    }
    SUBCASE("MachineView #2") {

Use better test names


lib/pcg/test/src/pcg/machine_view.cc line 93 at r1 (raw file):

  }

  TEST_CASE("MachineView make_1d_machine_view - GPU") {

The return type here isn't really necessary and is a bit confusing

Suggestion:

  TEST_CASE("make_1d_machine_view - GPU") {

lib/utils/include/utils/containers/foldl.h line 27 at r1 (raw file):

  C remaining(it, c.end());
  return foldl(remaining, init, func);
}

Suggestion:

template <typename C, typename F, typename E = typename C::value_type>
auto foldl1(C const &c, F func) -> typename C::value_type {
  if (c.empty()) {
     throw mk_runtime_error(fmt::format("foldl1 received empty container: {}", c)); 
  }
  
  std::optional<E> result;
  
  for (E const &e : c) { 
    if (!result.has_value()) {
      result = e; 
    } else {
      result = f(result.value(), e); 
    }
  }
  
  return result.value();
}

lib/utils/include/utils/containers/range.h line 11 at r1 (raw file):

template <typename T>
std::vector<T> range(T start, T end, T step = 1) {

C++ doesn't do generics super well so I'd lean toward concrete types unless there's a strong need for something more flexible

Suggestion:

std::vector<int> range(int start, int end, int step = 1) {

lib/utils/include/utils/containers/replicate.h line 9 at r1 (raw file):

template <typename T>
std::unordered_multiset<T> replicate(std::size_t n, T const &element) {

Prefer int--size_t intuitively seems like it should be good (it represents non-negative values!) but actually what it does is just remove the failure case by overflowing so you can't even assert that the value is okay. At some point we may add a non-negative integer type with the proper semantics, but we don't have one yet

Code quote:

std::size_t

lib/utils/include/utils/containers/scanl.h line 9 at r1 (raw file):

template <typename C, typename F, typename T>
std::vector<T> scanl(C const &c, T init, F const &op) {

Can you add a docstring with a reference to https://hackage.haskell.org/package/base-4.20.0.1/docs/Prelude.html#v:scanl? Not everyone on the FlexFlow team does a bunch of Haskell 🙂


lib/utils/include/utils/containers/scanl.h line 24 at r1 (raw file):

template <typename C, typename F>
auto scanl1(C const &c, F op) {
  using T = typename C::value_type;

See implementation of fold1 for a more readable version


lib/utils/test/src/utils/containers/foldl.cc line 9 at r1 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("foldl") {

I can see you're enjoying Haskell 😛


lib/utils/test/src/utils/containers/range.cc line 45 at r1 (raw file):

      std::vector<int> correct = {};
      CHECK(result == correct);
    }

What's the expected behavior of range(0, 10, -1)?

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 24 of 46 files reviewed, 14 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/pcg/include/pcg/device_coordinates.struct.toml line 1 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think this requirement is necessary anymore with the new design?

Not a requirement anymore, but I think the user will generally reason about the position of the devices using device_id_t, so it probably won't be used.


lib/pcg/include/pcg/strided_rectangle.struct.toml line 17 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

utils/hash and utils/fmt are only needed for the implementation, not the declaration

Done.


lib/pcg/src/pcg/device_id.cc line 29 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

switch might be cleaner? You can also use overload if you'd prefer

Done.


lib/pcg/src/pcg/device_id.cc line 35 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

asserts will get removed in non-debug builds

Done.


lib/pcg/test/src/pcg/machine_view.cc line 66 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Test functions separately (i.e., group by get_last_device_id rather than by MachineView tested on) except for particularly trivial utility functions

Done.


lib/pcg/test/src/pcg/machine_view.cc line 70 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use better test names

Done.


lib/pcg/test/src/pcg/machine_view.cc line 93 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The return type here isn't really necessary and is a bit confusing

Done.


lib/utils/include/utils/containers/foldl.h line 27 at r1 (raw file):

  C remaining(it, c.end());
  return foldl(remaining, init, func);
}

cool pattern !


lib/utils/include/utils/containers/range.h line 11 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

C++ doesn't do generics super well so I'd lean toward concrete types unless there's a strong need for something more flexible

Done.


lib/utils/include/utils/containers/replicate.h line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer int--size_t intuitively seems like it should be good (it represents non-negative values!) but actually what it does is just remove the failure case by overflowing so you can't even assert that the value is okay. At some point we may add a non-negative integer type with the proper semantics, but we don't have one yet

Ahh makes sense thanks.


lib/utils/include/utils/containers/scanl.h line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can you add a docstring with a reference to https://hackage.haskell.org/package/base-4.20.0.1/docs/Prelude.html#v:scanl? Not everyone on the FlexFlow team does a bunch of Haskell 🙂

Done.


lib/utils/include/utils/containers/scanl.h line 24 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See implementation of fold1 for a more readable version

Done.


lib/utils/test/src/utils/containers/foldl.cc line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I can see you're enjoying Haskell 😛

hahaha yes, I'm starting to see why you're so frustrated with C++. Functional code is so clean, wish the C++ syntax was a bit more accomodating...


lib/utils/test/src/utils/containers/range.cc line 45 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's the expected behavior of range(0, 10, -1)?

empty range (this range is modelled after python's ranges, so I'd like to keep it consistent, but we can also change it to throw an error).

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 41 files at r1, 19 of 24 files at r2, all commit messages.
Reviewable status: 44 of 49 files reviewed, 32 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/compiler/include/compiler/machine_mapping.h line 63 at r2 (raw file):

std::unordered_set<StartInvariantMachineView>
    get_allowed_start_invariant_machine_views(
        MachineSpecification const &machinespec,

Suggestion:

        MachineSpecification const &machine_spec,

lib/compiler/src/machine_mapping.cc line 378 at r2 (raw file):

}

std::vector<int> get_tensor_parallel_degrees(ParallelTensorShape const &shape) {

Should not return a vector as that imposes an order on ParallelTensorShape's dims, should return either an unordered_set or a custom dtgen struct


lib/compiler/src/machine_mapping.cc line 385 at r2 (raw file):

}

bool is_valid_machine_view(MachineView const &mv,

A blank line here or there would also help readability--you can use them to break up flows of steps into logical pieces


lib/compiler/src/machine_mapping.cc line 393 at r2 (raw file):

  tensor_degrees =
      filter(tensor_degrees, [](int degree) { return degree != 1; });
  return without_order(mv_degrees) == without_order(tensor_degrees);

This returns a multiset, right?

Code quote:

without_order(mv_degrees)

lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

                                ParallelTensorShape const &shape) {

  auto candidate_strides = [](std::vector<int> tensor_dims, int total_devices) {

Suggestion:

std::vector<int> const &tensor_dims

lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

                                ParallelTensorShape const &shape) {

  auto candidate_strides = [](std::vector<int> tensor_dims, int total_devices) {

I think a return type annotation here would really improve readability

Code quote:

[](std::vector<int> tensor_dims, int total_devices) {

lib/compiler/src/machine_mapping.cc line 404 at r2 (raw file):

    int max_stride_upper_bound =
        (total_devices + 1) /
        product(transform(tensor_dims, [](int degree) { return degree - 1; }));

This is a bit complicated, I feel like a one-line comment would help me figure out what this dense piece of arithmetic is actually doing (or break up the nested expressions and give them each names. While composition of a bunch of functions can be concise, having intermediate variables can really help make code more readable as you can describe each intermediate value/step along the way. Conciseness is good when it benefits readability, but sometimes it gets in the way more than it helps)

Code quote:

    int max_stride_upper_bound =
        (total_devices + 1) /
        product(transform(tensor_dims, [](int degree) { return degree - 1; }));

lib/compiler/src/machine_mapping.cc line 405 at r2 (raw file):

        (total_devices + 1) /
        product(transform(tensor_dims, [](int degree) { return degree - 1; }));
    std::unordered_multiset<std::vector<int>> strides = cartesian_product(

Should this be a stride_size_t?

Code quote:

int

lib/compiler/src/machine_mapping.cc line 410 at r2 (raw file):

  };

  std::vector<int> tensor_dims = filter(get_tensor_parallel_degrees(shape),

Seems like a common pattern, probably good to pull out a separate function?


lib/compiler/src/machine_mapping.cc line 413 at r2 (raw file):

                                        [](int degree) { return degree != 1; });
  std::unordered_set<MachineView> machine_views;
  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;

Why not just have a get_num_devices or something function for MachineSpecification?

Code quote:

  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;

lib/compiler/src/machine_mapping.cc line 414 at r2 (raw file):

  std::unordered_set<MachineView> machine_views;
  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;
  for (std::vector<int> stride :

Suggestion:

  for (std::vector<int> const &stride :

lib/compiler/src/machine_mapping.cc line 416 at r2 (raw file):

  for (std::vector<int> stride :
       candidate_strides(tensor_dims, total_devices)) {
    for (int start_id = 0; start_id < total_devices; start_id++) {

Seems like this loop is intuitively creating StartInvariantMachineViews and then looping over possible start degrees? If so, could this be written to do that? Using more descriptive data structures instead of things like std::vector<int> or std::vector<std::unordered_set<int>> is super helpful for readability. This would also let you break up this function into more intuitive pieces (generating all start-invariant machine views, and then generating all possible start dims for them)


lib/compiler/src/machine_mapping.cc line 437 at r2 (raw file):

      get_candidate_machine_views(machinespec, shape);
  views = filter(views, [&](MachineView const &view) {
    return is_valid_machine_view(view, shape);

Suggestion:

return is_valid_machine_view(view, shape) && is_valid_machine_view(view, machinespec);

lib/compiler/test/src/machine_mapping.cc line 9 at r2 (raw file):

TEST_SUITE(FF_TEST_SUITE) {

  TEST_CASE("get_allowed_machine_view") {

Suggestion:

  TEST_CASE("get_allowed_machine_views") {

lib/compiler/test/src/machine_mapping.cc line 12 at r2 (raw file):

    SUBCASE("1 degree of parallelism") {
      MachineSpecification ms = MachineSpecification(5, 1, 1, 0, 0);

Suggestion:

      MachineSpecification ms = MachineSpecification{5, 1, 1, 0, 0};

lib/pcg/include/pcg/device_coordinates.struct.toml line 18 at r2 (raw file):

[[fields]]
name = "coords"
type = "::FlexFlow::FFOrdered<int>"

I don't think FFOrdered should be used here, since there's no correlation between MachineView dim ordering and tensor dimension ordering. FFOrdered represents things that are in the same order as the dims in a TensorShape (or the shard dims in a ParallelTensorShape)


lib/pcg/include/pcg/machine_view.h line 20 at r2 (raw file):

size_t num_dims(MachineView const &mv);
size_t num_devices(MachineView const &mv);
size_t get_size(MachineView const &mv);

What is get_size here? Name is a bit unclear, might be good to either clarify the name or just add a quick doxygen docstring explaining what this does (I think I know what it is and I'm not sure I can think of a better name, so that would imply a doxygen docstring. That said, I'm not sure this function is actually useful anywhere?)


lib/pcg/include/pcg/start_invariant_machine_view.h line 9 at r2 (raw file):

namespace FlexFlow {

MachineView to_start_dependent(StartInvariantMachineView const &mv,

Suggestion:

MachineView machine_view_from_start_invariant(StartInvariantMachineView const &mv,

lib/pcg/include/pcg/start_invariant_machine_view.h line 11 at r2 (raw file):

MachineView to_start_dependent(StartInvariantMachineView const &mv,
                               device_id_t const &start_id);
StartInvariantMachineView to_start_invariant(MachineView const &mv);

Suggestion:

StartInvariantMachineView start_invariant_from_machine_view(MachineView const &mv);

lib/pcg/include/pcg/strided_rectangle.h line 15 at r2 (raw file):

private:
  std::vector<StridedRectangleSide> _sides;

I don't remember if there are reserved id issues with underscore prefixes, but convention-wise we don't usually use them

Suggestion:

  std::vector<StridedRectangleSide> sides;

lib/pcg/include/pcg/strided_rectangle.h line 28 at r2 (raw file):

  bool operator>=(StridedRectangle const &) const;

  std::vector<StridedRectangleSide> get_sides() const;

For pure getters (things where the result is actually a field) it's safe to return a const &. That said, if you're ever in doubt just return a value type, as that's always safe 🙂

Suggestion:

  std::vector<StridedRectangleSide> const &get_sides() const;

lib/pcg/include/pcg/strided_rectangle.h line 29 at r2 (raw file):

  std::vector<StridedRectangleSide> get_sides() const;
};

Suggestion:

private:
  std::tuple<decltype(sides) const &> tie() const;
  
  friend struct std::hash<StridedRectangle>;
};

lib/pcg/src/pcg/strided_rectangle.cc line 21 at r2 (raw file):

bool StridedRectangle::operator==(StridedRectangle const &other) const {
  return std::tie(this->_sides) == std::tie(other._sides);

Suggestion:

  return this->tie() == other.tie();

lib/pcg/src/pcg/strided_rectangle.cc line 85 at r2 (raw file):

  result ^= std::hash<std::vector<::FlexFlow::StridedRectangleSide>>{}(
                x.get_sides()) +
            0x9e3779b9 + (result << 6) + (result >> 2);

FYI this is actually a function in hash-utils called hash_combine, we just inline it in dtgen (for not really any good reason, we just do at the moment)


lib/pcg/src/pcg/strided_rectangle.cc line 86 at r2 (raw file):

                x.get_sides()) +
            0x9e3779b9 + (result << 6) + (result >> 2);
  return result;

Suggestion:

  return get_std_hash(x.tie());

lib/utils/include/utils/containers/cartesian_product.h line 13 at r2 (raw file):

template <typename Container>
auto cartesian_product(Container const &containers) {

Seems like this is always going to be called over a std::vector<SomeContainer>, so maybe do that specialization to simplify this code a bit?

Code quote:

auto cartesian_product(Container const &containers) {

lib/utils/include/utils/containers/cartesian_product.h line 26 at r2 (raw file):

    }

    for (const auto &item : ordered[depth]) {

Suggestion:

ordered.at(depth)

lib/utils/include/utils/containers/cartesian_product.h line 28 at r2 (raw file):

    for (const auto &item : ordered[depth]) {
      current.push_back(item);
      recurse(current, depth + 1);

Would this be simpler as a non-recursive loop? Not a big issue as this is pretty readable, just wondering


lib/utils/src/utils/containers/any_of.h line 0 at r2 (raw file):
Rename to any_of.cc?


lib/utils/test/src/utils/containers/cartesian_product.cc line 58 at r2 (raw file):

          {1, 2}, {1, 3}, {1, 3}, {1, 2}};
      CHECK(result == correct);
    }

What if one container is empty and the others aren't? Might be a good test case to add

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 63 files reviewed, 32 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/include/compiler/machine_mapping.h line 63 at r2 (raw file):

std::unordered_set<StartInvariantMachineView>
    get_allowed_start_invariant_machine_views(
        MachineSpecification const &machinespec,

Done.


lib/compiler/src/machine_mapping.cc line 378 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should not return a vector as that imposes an order on ParallelTensorShape's dims, should return either an unordered_set or a custom dtgen struct

changed to unordered_multiset (but have to sort the multiset in get_candidate_machine_views, since I need the ordering to be coherent between stride and dimension).


lib/compiler/src/machine_mapping.cc line 385 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

A blank line here or there would also help readability--you can use them to break up flows of steps into logical pieces

Done.


lib/compiler/src/machine_mapping.cc line 393 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This returns a multiset, right?

yes, since different dimensions might have the same degree (we want to see if there exists a mapping between the 2 (multi)sets of dimensions, so order does not matter)


lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

                                ParallelTensorShape const &shape) {

  auto candidate_strides = [](std::vector<int> tensor_dims, int total_devices) {

Done.


lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I think a return type annotation here would really improve readability

Done.


lib/compiler/src/machine_mapping.cc line 404 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This is a bit complicated, I feel like a one-line comment would help me figure out what this dense piece of arithmetic is actually doing (or break up the nested expressions and give them each names. While composition of a bunch of functions can be concise, having intermediate variables can really help make code more readable as you can describe each intermediate value/step along the way. Conciseness is good when it benefits readability, but sometimes it gets in the way more than it helps)

I've simplified, but it's still a bit tricky to explain:
Naively, we could think that, given a (2,3) stride for example, it would result in 3*2=6 as many tiles occupied with respect to having all strides be 1, so we could say max_stride_product = num_total_devices/num_devices_used_by_tensor (where num_devices_used_by_tensor is the product of the parallel dims) and so we could simply say that the max stride across any dimension is max_stride_product, so we do the cartesian product and then filter downstream.
This however, doesn't quite work: consider, for example, a 2D MachineView with 2x2 points, and stride 2 across each dimension, and suppose there are 9 total devices. While the "volume" of the MachineView is technically 4x4, it can really fit into a 3x3 and thus we could fit it with the existing devices (since the outer volume can be cut without messing with any of the devices). To address this, I find not the number of total devices used by the tensor, but, the total number of "inner" devices, essentially the ones such that they have associated with them a full stride "volume". So I find the max stride for these using the naive procedure (which works since they all have full stride volume) and I know that if a given stride is too large for them then surely it's too large for the full set of devices, which essentially contains them. (Note that this solutions is incredibly underoptimized, if the generation turns out to be to slow I'll fix it).


lib/compiler/src/machine_mapping.cc line 405 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should this be a stride_size_t?

fixed. is there a way to cast a container containing type X to the same container but with type Y (so not using transform and casting within the lambda)?


lib/compiler/src/machine_mapping.cc line 410 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like a common pattern, probably good to pull out a separate function?

Done.


lib/compiler/src/machine_mapping.cc line 413 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just have a get_num_devices or something function for MachineSpecification?

Done.


lib/compiler/src/machine_mapping.cc line 414 at r2 (raw file):

  std::unordered_set<MachineView> machine_views;
  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;
  for (std::vector<int> stride :

Done.


lib/compiler/src/machine_mapping.cc line 416 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like this loop is intuitively creating StartInvariantMachineViews and then looping over possible start degrees? If so, could this be written to do that? Using more descriptive data structures instead of things like std::vector<int> or std::vector<std::unordered_set<int>> is super helpful for readability. This would also let you break up this function into more intuitive pieces (generating all start-invariant machine views, and then generating all possible start dims for them)

fixed it up a bit, is this enough? I'm a bit hesitant using dtgen since I don't want to give too much visibility to components that are only used within a very narrow piece of the codebase. I could do something like using StrideVec = std::vector<stride_t> to make it more readable, though I'm not sure it would make it more understandable.


lib/compiler/src/machine_mapping.cc line 437 at r2 (raw file):

      get_candidate_machine_views(machinespec, shape);
  views = filter(views, [&](MachineView const &view) {
    return is_valid_machine_view(view, shape);

Done.


lib/compiler/test/src/machine_mapping.cc line 9 at r2 (raw file):

TEST_SUITE(FF_TEST_SUITE) {

  TEST_CASE("get_allowed_machine_view") {

Done.


lib/compiler/test/src/machine_mapping.cc line 12 at r2 (raw file):

    SUBCASE("1 degree of parallelism") {
      MachineSpecification ms = MachineSpecification(5, 1, 1, 0, 0);

Done.


lib/pcg/include/pcg/device_coordinates.struct.toml line 18 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think FFOrdered should be used here, since there's no correlation between MachineView dim ordering and tensor dimension ordering. FFOrdered represents things that are in the same order as the dims in a TensorShape (or the shard dims in a ParallelTensorShape)

Got it, so FFOrdered is only for Tensor dims?


lib/pcg/include/pcg/machine_view.h line 20 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is get_size here? Name is a bit unclear, might be good to either clarify the name or just add a quick doxygen docstring explaining what this does (I think I know what it is and I'm not sure I can think of a better name, so that would imply a doxygen docstring. That said, I'm not sure this function is actually useful anywhere?)

Obsolete and removed (originally it was the "volume", so product of all the sizes of the sides, which I thought I could use for doing bounds check, but it's now useless since the devices themselves can occupy a space smaller than that (e.g. imagine a 2D machine view with stride 2 and 2 devices per side, it technically occupies a 4x4 area but can also really fit into a 3x3)


lib/pcg/include/pcg/start_invariant_machine_view.h line 9 at r2 (raw file):

namespace FlexFlow {

MachineView to_start_dependent(StartInvariantMachineView const &mv,

Done.


lib/pcg/include/pcg/start_invariant_machine_view.h line 11 at r2 (raw file):

MachineView to_start_dependent(StartInvariantMachineView const &mv,
                               device_id_t const &start_id);
StartInvariantMachineView to_start_invariant(MachineView const &mv);

Done.


lib/pcg/include/pcg/strided_rectangle.h line 15 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't remember if there are reserved id issues with underscore prefixes, but convention-wise we don't usually use them

Done.


lib/pcg/include/pcg/strided_rectangle.h line 28 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

For pure getters (things where the result is actually a field) it's safe to return a const &. That said, if you're ever in doubt just return a value type, as that's always safe 🙂

Done.


lib/pcg/include/pcg/strided_rectangle.h line 29 at r2 (raw file):

  std::vector<StridedRectangleSide> get_sides() const;
};

Done.


lib/pcg/src/pcg/strided_rectangle.cc line 21 at r2 (raw file):

bool StridedRectangle::operator==(StridedRectangle const &other) const {
  return std::tie(this->_sides) == std::tie(other._sides);

Done.


lib/pcg/src/pcg/strided_rectangle.cc line 86 at r2 (raw file):

                x.get_sides()) +
            0x9e3779b9 + (result << 6) + (result >> 2);
  return result;

Done.


lib/utils/include/utils/containers/cartesian_product.h line 13 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like this is always going to be called over a std::vector<SomeContainer>, so maybe do that specialization to simplify this code a bit?

Yes you're right, fixed (not it's vector<Containers> -> unordered_multiset<vector<...>>


lib/utils/include/utils/containers/cartesian_product.h line 26 at r2 (raw file):

    }

    for (const auto &item : ordered[depth]) {

Done.


lib/utils/include/utils/containers/cartesian_product.h line 28 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would this be simpler as a non-recursive loop? Not a big issue as this is pretty readable, just wondering

I like the recursive version and found it simpler to implement.


lib/utils/test/src/utils/containers/cartesian_product.cc line 58 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What if one container is empty and the others aren't? Might be a good test case to add

Done.


lib/utils/src/utils/containers/any_of.h line at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Rename to any_of.cc?

Done.

Pietro Max Marsella added 2 commits August 13, 2024 12:33
Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 66 files reviewed, 33 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/src/compiler/allowed_machine_views.cc line 124 at r3 (raw file):

                   start_invariant_from_machine_view);
}

Unsure where to put these functions, will we be using parallel_tensor_dim_idx as the main way of indexing the ParallelTensorShape going forwards or is it just for doing the machine view to tensor mapping?

wmdi
wmdi previously approved these changes Aug 21, 2024
Copy link
Collaborator

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 36 of 36 files at r3, all commit messages.
Reviewable status: all files reviewed, 33 unresolved discussions (waiting on @lockshaw)

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 36 of 36 files at r3, all commit messages.
Reviewable status: all files reviewed, 50 unresolved discussions (waiting on @Marsella8)


lib/compiler/include/compiler/tensor_to_machine_view_injection.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "TensorToMachineViewInjection"

Why is this not a surjection? Seems like all of the MachineView dims need to be mapped to?


lib/compiler/include/compiler/tensor_to_machine_view_injection.struct.toml line 14 at r3 (raw file):

  "utils/bidict/bidict.h",
  "utils/hash/unordered_map.h"
]

Suggestion:

includes = [
  "pcg/machine_view_dim_idx.dtg.h",
  "op-attrs/parallel_tensor_dim_idx.dtg.h",
  "utils/bidict/bidict.h",
]

lib/compiler/src/compiler/allowed_machine_views.cc line 36 at r3 (raw file):

}

bool is_valid_machine_view(MachineView const &mv,

I don't see this function declared in allowed_machine_view.h?


lib/compiler/src/compiler/allowed_machine_views.cc line 124 at r3 (raw file):

Previously, Marsella8 wrote…

Unsure where to put these functions, will we be using parallel_tensor_dim_idx as the main way of indexing the ParallelTensorShape going forwards or is it just for doing the machine view to tensor mapping?

For now let's put get_parallel_dim_at_idx and get_parallel_tensor_indices into parallel_tensor_dim_idx.h


lib/compiler/src/compiler/allowed_machine_views.cc line 152 at r3 (raw file):

std::unordered_set<machine_view_dim_idx>
    get_machine_view_indices(MachineView const &mv) {

Move to machine_view_dim_idx.h


lib/compiler/src/compiler/allowed_machine_views.cc line 157 at r3 (raw file):

}

bool is_valid_injection(TensorToMachineViewInjection const &injection,

Probably better to move this and the functions below to an tensor_to_machine_view_injection.h file


lib/compiler/src/compiler/allowed_machine_views.cc line 175 at r3 (raw file):

  std::unordered_set<parallel_tensor_dim_idx> shape_indices =
      get_parallel_tensor_indices(shape);
  shape_indices = filter(shape_indices, [&](auto const idx) {

Suggestion:

(parallel_tensor_dim_idx const &idx)

lib/compiler/src/compiler/allowed_machine_views.cc line 183 at r3 (raw file):

       permutations(shape_indices)) {
    TensorToMachineViewInjection injection =
        TensorToMachineViewInjection(bidict(zip(sorted(mv_indices), p)));

Suggestion:

  std::vector<machine_view_dim_idx> machine_view_dim_ordering = sorted(mv_indices);
  std::unordered_set<TensorToMachineViewInjection> result;
  for (std::vector<parallel_tensor_dim_idx> const &tensor_dim_ordering :
       permutations(shape_indices)) {
    TensorToMachineViewInjection injection =
        TensorToMachineViewInjection(bidict(zip(machine_view_dim_ordering, tensor_dim_ordering)));

lib/compiler/test/src/allowed_machine_ views.cc line 62 at r3 (raw file):

          views.insert(mv);
        }
        return views;

Suggestion:

        return unordered_set_of(transform(count(num_starts), 
                                          [&](int start) {
                                             return MachineView{
                                               device_id_t{gpu_id_t{start}},
                                               StridedRectangle{{
                                                 StridedRectangleSide{num_points_t{2}, stride_t{stride1}},
                                                 StridedRectangleSide{num_points_t{3}, stride_t{stride2}}
                                               }},
                                             };
                                           }));

lib/compiler/test/src/allowed_machine_ views.cc line 64 at r3 (raw file):

        return views;
      };
      std::unordered_set<MachineView> correct;

Is it possible to instead consider a smaller MachineSpecification where the set of MachineViews is small enough that you don't need all this infrastructure to generate it? This testcase is a bit complicated and might be rather annoying to debug


lib/compiler/test/src/allowed_machine_ views.cc line 136 at r3 (raw file):

          make_2d_view(/*stride1*/ 3, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 3),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 4)};

Suggestion:

      std::unordered_set<StartInvariantMachineView> correct = {
          make_2d_view(/*stride1*/ 1, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 2, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 2),
          make_2d_view(/*stride1*/ 3, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 3),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 4),
      };

lib/compiler/test/src/allowed_machine_ views.cc line 140 at r3 (raw file):

      std::unordered_set<StartInvariantMachineView> result =
          get_allowed_start_invariant_machine_views(ms, shape);
      CHECK(result == correct);

Suggestion:

      std::unordered_set<StartInvariantMachineView> result =
          get_allowed_start_invariant_machine_views(ms, shape);
          
      CHECK(result == correct);

lib/compiler/test/src/allowed_machine_ views.cc line 163 at r3 (raw file):

                         StridedRectangleSide{num_points_t(2), stride_t(4)},
                         StridedRectangleSide{num_points_t(3), stride_t(1)}},
                    }};

Suggestion:

  TEST_CASE("get_all_tensor_to_machine_view_injections") {
    ParallelTensorShape shape = ParallelTensorShape{
        ParallelTensorDims{
            FFOrdered<ShardParallelDim>{
                ShardParallelDim{10, 3},
            },
            ReplicaParallelDimSet{
                SumDegree{2},
                DiscardCopyDegree{2},
            },
        },
        DataType::FLOAT,
    };
    
    MachineView view =
        MachineView{
          device_id_from_index(0, DeviceType::GPU),
          StridedRectangle{{
            StridedRectangleSide{num_points_t(2), stride_t(1)},
            StridedRectangleSide{num_points_t(2), stride_t(4)},
            StridedRectangleSide{num_points_t(3), stride_t(1)},
          }},
        };

lib/compiler/test/src/allowed_machine_ views.cc line 178 at r3 (raw file):

    std::unordered_set<TensorToMachineViewInjection> result =
        get_all_tensor_to_machine_view_injections(view, shape);
    CHECK(correct == result);

Suggestion:

    std::unordered_set<TensorToMachineViewInjection> correct = {
        TensorToMachineViewInjection{b1}, TensorToMachineViewInjection{b2}};
    std::unordered_set<TensorToMachineViewInjection> result =
        get_all_tensor_to_machine_view_injections(view, shape);
        
    CHECK(correct == result);

lib/compiler/test/src/machine_mapping.cc line 7 at r3 (raw file):

TEST_SUITE(FF_TEST_SUITE) {

  // TEST_CASE("MachineMapping::combine") {

What's happening with this file?


lib/op-attrs/include/op-attrs/parallel_tensor_dim_idx.variant.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "parallel_tensor_dim_idx"

Naming convention-wise we usually either do ParallelTensorDimIdx or parallel_tensor_dim_idx_t, generally to allow parallel_tensor_dim_idx to be a variable name

Suggestion:

name = "parallel_tensor_dim_idx_t"

lib/op-attrs/include/op-attrs/parallel_tensor_dim_idx.variant.toml line 10 at r3 (raw file):

  "fmt",
]
explicit_constructors = false

Prefer explicit_constructors = true unless there's a good reason. What's the reason here?


lib/pcg/include/pcg/device_coordinates.struct.toml line 18 at r2 (raw file):

Previously, Marsella8 wrote…

Got it, so FFOrdered is only for Tensor dims?

Yes, to distinguish between the FlexFlow/pytorch ordering of dimensions and the flipped legion ordering of dimensions


lib/pcg/include/pcg/device_coordinates.struct.toml line 17 at r3 (raw file):

  "utils/fmt/vector.h",
  
]

Suggestion:

includes = [
  "<vector>",
]

src_includes = [
  "utils/hash/vector.h",
  "utils/fmt/vector.h",
]

lib/pcg/include/pcg/device_coordinates.struct.toml line 20 at r3 (raw file):

[[fields]]
name = "coords"

Suggestion:

name = "raw_coords"

lib/pcg/include/pcg/machine_view_dim_idx.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "machine_view_dim_idx"

For convention we usually either use MachineViewDimIdx or machine_view_dim_idx_t so that machine_view_dim_idx can be a variable name

Suggestion:

name = "machine_view_dim_idx_t"

lib/pcg/include/pcg/strided_rectangle.h line 17 at r3 (raw file):

  std::vector<StridedRectangleSide> sides;
  std::tuple<std::vector<StridedRectangleSide> const &> tie() const;
  friend struct std::hash<StridedRectangle>;

Usually we put member variables (i.e., sides) into their own private: or public: section for readability

Suggestion:

private: 
  std::tuple<std::vector<StridedRectangleSide> const &> tie() const;
  friend struct std::hash<StridedRectangle>;

lib/pcg/src/pcg/device_id.cc line 36 at r3 (raw file):

    default:
      throw mk_runtime_error(
          fmt::format("Unsupported DeviceType {}", get_device_type(device_id)));

Why remove logging of the device_id_t?


lib/pcg/src/pcg/device_id.cc line 43 at r3 (raw file):

  switch (device_type) {
    case DeviceType::GPU:
      return device_id_t(gpu_id_t(idx));

Prefer {...} initialization for consistency

Suggestion:

      return device_id_t{gpu_id_t{idx}};

lib/pcg/src/pcg/machine_view.cc line 33 at r3 (raw file):

      sum(transform(zip(coefficients, as_vector(point.coords)),
                    [](auto const pair) { return pair.first * pair.second; })) +
      get_raw_id(mv.start);

Suggestion:

  size_t coord_offset = sum(transform(zip(coefficients, as_vector(point.coords)),
                    [](auto const pair) { return pair.first * pair.second; }));
  size_t raw_id = get_raw_id(mv.start) + coord_offset;

lib/pcg/src/pcg/machine_view.cc line 37 at r3 (raw file):

  return ((get_device_type(mv) == DeviceType::CPU)
              ? device_id_t(cpu_id_t(raw_id))
              : device_id_t(gpu_id_t(raw_id)));

Suggestion:

  return device_id_from_index(raw_id, get_device_type(mv));

lib/pcg/src/pcg/machine_view.cc line 43 at r3 (raw file):

  std::vector<std::vector<int>> ranges =
      transform(mv.rect.get_sides(), [](StridedRectangleSide const &side) {
        return range(0, get_side_size(side).unwrapped, side.stride.unwrapped);

Pull the lambda out into a separate function (get_points or something) in strided_rectangle_side.h

Probably also worth pulling out a get_coords or something in strided_rectangle.h

Code quote:

      transform(mv.rect.get_sides(), [](StridedRectangleSide const &side) {
        return range(0, get_side_size(side).unwrapped, side.stride.unwrapped);

lib/pcg/src/pcg/machine_view.cc line 47 at r3 (raw file):

  std::unordered_set<DeviceCoordinates> devices_as_points = unordered_set_of(
      transform(cartesian_product(ranges),
                [](auto const &point) { return DeviceCoordinates(point); }));

Prefer concrete type

Suggestion:

(std::vector<int> const &point) 

lib/pcg/src/pcg/machine_view.cc line 55 at r3 (raw file):

}

device_id_t get_last_device_id(MachineView const &mv) {

Suggestion:

device_id_t get_maximum_device_id(MachineView const &mv) {

lib/pcg/src/pcg/machine_view.cc line 59 at r3 (raw file):

  //     transform(mv.rect.get_sides(), [](StridedRectangleSide const &s) {
  //       return s.stride.unwrapped;
  //     }));

Delete comment

Code quote:

  // DeviceCoordinates last_device = DeviceCoordinates(
  //     transform(mv.rect.get_sides(), [](StridedRectangleSide const &s) {
  //       return s.stride.unwrapped;
  //     }));

lib/pcg/src/pcg/machine_view.cc line 67 at r3 (raw file):

}

std::vector<num_points_t> get_num_devices_per_dim(MachineView const &mv) {

Do MachineViews have ordered sides?

Code quote:

std::vector<num_points_t>

lib/pcg/src/pcg/machine_view.cc line 73 at r3 (raw file):

}

std::vector<side_size_t> get_side_size_per_dim(MachineView const &mv) {

Do MachineViews have ordered sides?

Code quote:

std::vector<side_size_t>

lib/pcg/src/pcg/strided_rectangle.cc line 24 at r3 (raw file):

std::tuple<std::vector<StridedRectangleSide> const &>
    StridedRectangle::tie() const {
  return std::tie(sides);

Suggestion:

  return std::tie(this->sides);

lib/pcg/src/pcg/strided_rectangle.cc line 52 at r3 (raw file):

std::vector<StridedRectangleSide> const &StridedRectangle::get_sides() const {
  return sides;

Always use explicit this to reduce ambiguity

Suggestion:

  return this->sides;

lib/pcg/test/src/pcg/machine_view.cc line 56 at r3 (raw file):

      }};
      gpu_id_t start(0);
      MachineView mv{device_id_t{start}, rect};

Suggestion:

      MachineView mv = MachineView{
        gpu_id_t{0},
        StridedRectangle{{
          StridedRectangleSide(num_points_t(2), stride_t{3}),
          StridedRectangleSide(num_points_t(2), stride_t{2}),
        }},
      };

lib/pcg/test/src/pcg/machine_view.cc line 59 at r3 (raw file):

      SUBCASE("get_device_ids") {
        std::unordered_set<device_id_t> expected =
            make_gpu_device_ids({0, 2, 12, 14});

Why aren't the device ids here {0, 2, 3, 5}? Or at the very least not {0, 2, 6, 8}? I feel like I don't understand the calculation here


lib/pcg/test/src/pcg/start_invariant_machine_view.cc line 16 at r3 (raw file):

    MachineView input = MachineView{start, rect};

    MachineView result = machine_view_from_start_invariant(

This is a good property for a rapidcheck test, but it would be good to also have a concrete test of each of the directions that isn't just testing invertibility


lib/pcg/test/src/pcg/strided_rectangle.cc line 17 at r3 (raw file):

      StridedRectangle r1 = StridedRectangle{{s1, s0}};
      CHECK(r0 == r1);
      CHECK(r1.get_sides() == std::vector<StridedRectangleSide>{s0, s1});

How much do we care that the sides are sorted vs just caring that there is a canonical order?


lib/utils/include/utils/bidict/bidict.h line 25 at r3 (raw file):

  }

  bidict(std::vector<std::pair<L, R>> init)

Any particular reason for adding this over the existing constructors?


lib/utils/include/utils/bidict/bidict.h line 26 at r3 (raw file):

  bidict(std::vector<std::pair<L, R>> init)
      : bidict(init.begin(), init.end()) {}

Suggestion:

  explicit bidict(std::vector<std::pair<L, R>> init)
      : bidict(init.begin(), init.end()) {}

lib/utils/include/utils/containers/cartesian_product.h line 17 at r3 (raw file):

  std::function<void(V &, size_t)> recurse = [&](V &current, size_t depth) {
    if (depth == containers.size()) {

Suggestion:

template <typename C, typename Elem = typename C::value_type>
std::unordered_multiset<std::vector<Elem>> 
  cartesian_product(std::vector<C> const &containers) {
  
  std::unordered_multiset<std::vector<Elem>> result;

  std::function<void(std::vector<Elem> &, size_t)> recurse = [&](std::vector<Elem> &current, size_t depth) {
    if (depth == containers.size()) {

lib/utils/include/utils/containers/cartesian_product.h line 22 at r3 (raw file):

    }

    for (const auto &item : containers.at(depth)) {

Suggestion:

    for (Elem const &item : containers.at(depth)) {

lib/utils/include/utils/containers/filter.h line 53 at r3 (raw file):

  std::copy_if(m.cbegin(), m.cend(), std::inserter(result, result.begin()), f);
  return result;
}

Suggestion:

template <typename Elem, typename F>
std::unordered_multiset<Elem> filter(std::unordered_multiset<Elem> const &m,
                                     F const &f) {
  std::unordered_multiset<Elem> result;
  std::copy_if(m.cbegin(), m.cend(), std::inserter(result, result.begin()), f);
  return result;
}

lib/utils/include/utils/containers/permutations.h line 13 at r3 (raw file):

template <typename C, typename V = std::vector<typename C::value_type>>
auto permutations(C const &container) {

FYI there's also a lazy implementation of permutations on the substitutions-fix branch at https://github.com/lockshaw/FlexFlow/blob/substitutions-fix/lib/utils/include/utils/containers/get_all_permutations.h (note that having this version too is fine, just wanted to make you aware)


lib/utils/include/utils/containers/permutations.h line 13 at r3 (raw file):

template <typename C, typename V = std::vector<typename C::value_type>>
auto permutations(C const &container) {

Suggestion:

template <typename C, typename Elem = typename C::value_type>
std::unordered_set<std::vector<Elem>> permutations(C const &container) {

lib/utils/include/utils/containers/replicate.h line 9 at r1 (raw file):

Previously, Marsella8 wrote…

Ahh makes sense thanks.

Not seeing this fix?


lib/utils/test/src/utils/containers/cartesian_product.cc line 10 at r1 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("cartesian_product") {
    SUBCASE("empty") {

Why remove this test case?


lib/utils/test/src/utils/containers/permutations.cc line 5 at r3 (raw file):

#include <doctest/doctest.h>
#include <unordered_set>
#include <vector>

Suggestion:

#include <unordered_set>
#include <vector>
#include "utils/fmt/vector.h"
#include "utils/fmt/unordered_set.h"

lib/compiler/src/machine_mapping.cc line 393 at r2 (raw file):

Previously, Marsella8 wrote…

yes, since different dimensions might have the same degree (we want to see if there exists a mapping between the 2 (multi)sets of dimensions, so order does not matter)

Might be good to rename that function to multiset_of to remove ambiguity


lib/compiler/src/machine_mapping.cc line 404 at r2 (raw file):

Previously, Marsella8 wrote…

I've simplified, but it's still a bit tricky to explain:
Naively, we could think that, given a (2,3) stride for example, it would result in 3*2=6 as many tiles occupied with respect to having all strides be 1, so we could say max_stride_product = num_total_devices/num_devices_used_by_tensor (where num_devices_used_by_tensor is the product of the parallel dims) and so we could simply say that the max stride across any dimension is max_stride_product, so we do the cartesian product and then filter downstream.
This however, doesn't quite work: consider, for example, a 2D MachineView with 2x2 points, and stride 2 across each dimension, and suppose there are 9 total devices. While the "volume" of the MachineView is technically 4x4, it can really fit into a 3x3 and thus we could fit it with the existing devices (since the outer volume can be cut without messing with any of the devices). To address this, I find not the number of total devices used by the tensor, but, the total number of "inner" devices, essentially the ones such that they have associated with them a full stride "volume". So I find the max stride for these using the naive procedure (which works since they all have full stride volume) and I know that if a given stride is too large for them then surely it's too large for the full set of devices, which essentially contains them. (Note that this solutions is incredibly underoptimized, if the generation turns out to be to slow I'll fix it).

Ah okay, yeah I'd add that explanation in as a comment. The code makes a lot more sense with that bit of context


lib/compiler/src/machine_mapping.cc line 405 at r2 (raw file):

Previously, Marsella8 wrote…

fixed. is there a way to cast a container containing type X to the same container but with type Y (so not using transform and casting within the lambda)?

I don't think so, and generally we avoid casting anyway in favor of explicit construction


lib/compiler/src/machine_mapping.cc line 416 at r2 (raw file):

I'm a bit hesitant using dtgen since I don't want to give too much visibility to components that are only used within a very narrow piece of the codebase

That makes sense, but I think having a MultiDimensionalStride or something like that would be fine, especially as it's a pretty simple/obvious concept and could be used elsewhere e.g., to move get_strided_rectangle out into its own function that constructs a StridedRectangle from a starting point, a number of repetitions in each dim, and a MultiDimensionalStride

fixed it up a bit, is this enough?

It's definitely a big improvement! I do think breaking it up into stages that produce different named data structures (i.e., first generating the candidate MultiDimensionalStrides, then using them to generate the StartInvariantMachineViews, and then using them to generate the MachineViews) would help, along with a brief comment explaining the stride generation logic which is inherently a bit complicated. It might also be a good idea to add a brief comment to the candidate generation functions explaining the guarantees (if any) they provide on their outputs: for example, are each of the outputs guaranteed to have at least one valid MachineView, or can there be outputs for which no valid MachineViews exist?

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 87 of 102 files reviewed, 38 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/pcg/include/pcg/machine_view.struct.toml line 29 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd go with one vector due to the "make invalid states unrepresentable" reasoning (one vector prevents the two from having different lengths). I'd add a custom well-named (maybe MachineDimensionStride? or StridedMachineDimension? or even something as basic as MachineDimensionWithStride? Feel free to use something else if you can come up with something, I'm not exactly in love with any of the above names) .struct.toml datatype to represent the combination of the two

done

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 15 files at r11, all commit messages.
Reviewable status: 100 of 102 files reviewed, 7 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/compiler/test/src/allowed_machine_views.cc line 45 at r10 (raw file):

Previously, Marsella8 wrote…

Done.

Doesn't appear to have changed?


lib/compiler/test/src/allowed_machine_views.cc line 49 at r10 (raw file):

Previously, Marsella8 wrote…

Machine space is the 2D space described by machine specification. (!= Task Space, which is N-dimensional, with N dimension of the parallel tensor). So we have a vector of strides, a vector of dimensions to which to map each dimension of the task, and then a 2D coordinate (start_x, start_y) which is the node and device indices.

Then why not just call it start_node_idx and start_device_idx?


lib/pcg/test/src/pcg/machine_specification.cc line 35 at r10 (raw file):

Previously, Marsella8 wrote…

Done.

I think you're missing "out of bounds" in the new name?


lib/pcg/test/src/pcg/machine_view.cc line 49 at r10 (raw file):

Previously, Marsella8 wrote…

made a docstring for each case that explains it pretty extensively.

My concern was more that the names weren't really justifying why that point was included for testing, but as it appears that all of the possible points are tested exhaustively I guess this is fine


lib/pcg/test/src/pcg/machine_view.cc line 245 at r11 (raw file):

        /**
         * The tasks will thus be distributed like this:

proj format appears to have butchered your nice diagram--there should be a clang-format comment you can insert to disable clang-format for these lines


lib/pcg/test/src/pcg/operator_task_space.cc line 9 at r10 (raw file):

Previously, Marsella8 wrote…

Returns a single empty coordinate (vacuously makes sense?). Not sure what a zero dimensional TaskSpaceOperator would represent, whether something with no parallelism or a straight up wrong data-type (so something with no parallelism would be a TaskSpaceOperator with 1 or more ones. Let me know what you think is best.

Yeah this is fine, I was mostly just curious and it's a bit nice to have edgecases like this documented

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 100 of 102 files reviewed, 7 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/test/src/allowed_machine_views.cc line 45 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Doesn't appear to have changed?

Done.


lib/compiler/test/src/allowed_machine_views.cc line 49 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Then why not just call it start_node_idx and start_device_idx?

Done.


lib/pcg/test/src/pcg/machine_specification.cc line 35 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I think you're missing "out of bounds" in the new name?

Done.


lib/pcg/test/src/pcg/machine_view.cc line 245 at r11 (raw file):

Previously, lockshaw (Colin Unger) wrote…

proj format appears to have butchered your nice diagram--there should be a clang-format comment you can insert to disable clang-format for these lines

fixed

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r11, 3 of 3 files at r12, all commit messages.
Reviewable status: 101 of 102 files reviewed, 9 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/pcg/include/pcg/machine_view_dimension.struct.toml line 20 at r12 (raw file):

[[fields]]
name = "stride"
type = "::FlexFlow::stride_t"

Any particular reason to keep this as a stride_t rather than just dropping it to an int? Not a big issue, just curious


lib/pcg/src/pcg/machine_view.cc line 33 at r12 (raw file):

static std::vector<int>
    get_projection_indices(MachineView const &mv,

Suggestion:

    get_projection_indices_for_dimension(MachineView const &mv,

lib/pcg/src/pcg/machine_view.cc line 39 at r12 (raw file):

  std::vector<MachineSpecificationDimension> projections = get_projections(mv);
  for (size_t i = 0; i < projections.size(); ++i) {
    if (projections[i] == dimension) {

Suggestion:

projections.at(i)

lib/pcg/src/pcg/machine_view.cc line 44 at r12 (raw file):

  }
  return projection_indices;
}

Suggestion:

  std::vector<int> projection_indices
    = filter(count(projections.size()),
             [&](int idx) {
               return projections.at(idx) == dimension;
             });

  return projection_indices;
}

lib/pcg/src/pcg/machine_view.cc line 46 at r12 (raw file):

}

static int compute_index(int start_idx,

Try to aim for clearer function names even if they're static, especially if they take this number of parameters (in fact, arguably the sheer number of shared parameters imply that this should probably just be a lambda in get_machine_space_coordinate as it doesn't really seem to stand alone


lib/pcg/src/pcg/machine_view.cc line 62 at r12 (raw file):

    sizes.push_back(dim_size);
    coord_points.push_back(coord.raw_coord[i]);
    strides.push_back(mv_strides[i].unwrapped);

Code-style wise probably cleaner to generate sizes, coord_points, and strides as three separate transform expressions rather than one big combined loop

Code quote:

  for (int i : projection_indices) {
    int dim_size = task.degrees[i] * mv_strides[i].unwrapped;
    sizes.push_back(dim_size);
    coord_points.push_back(coord.raw_coord[i]);
    strides.push_back(mv_strides[i].unwrapped);

lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

                                 MachineView const &mv,
                                 TaskSpaceCoordinate const &coord,
                                 MachineSpecification const &ms) {

What about splitting this into a function that takes a MachineView and generates a DeduplicatedMachineView (which has all of the strides multiplied so you can't produce redundant points), and then a corresponding function that takes a DeduplicatedMachineView which then can do a much simpler computation to find the MachineSpaceCoordinate. Thoughts?

Overall one of the best way you can make code more readable is to introduce (sane and conceptually clear) intermediate types, and then break a complicated function that does A -> D into A -> B -> C -> D

Code quote:

std::optional<MachineSpaceCoordinate>
    get_machine_space_coordinate(OperatorTaskSpace const &task,
                                 MachineView const &mv,
                                 TaskSpaceCoordinate const &coord,
                                 MachineSpecification const &ms) {

lib/pcg/src/pcg/machine_view.cc line 96 at r12 (raw file):

  return ms_coord;
}
std::unordered_set<MachineSpaceCoordinate>

Is it possible that this should be an std::unordered_multiset, or is MachineView defined such as to make this impossible?


lib/pcg/src/pcg/machine_view.cc line 103 at r12 (raw file):

  return transform(
      get_task_space_coordinates(task), [&](TaskSpaceCoordinate const &c) {
        return get_machine_space_coordinate(task, mv, c, ms).value();

Try not to abbreviate these variables too much or the code starts to become rather hard to skim as I keep having to chase down definitions unnecessarily

Suggestion:

       return get_machine_space_coordinate(task, machine_view, coord, machine_spec).value();

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r11.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @Marsella8)


lib/compiler/src/compiler/allowed_machine_views.cc line 67 at r6 (raw file):

Previously, Marsella8 wrote…

I've tried to split it up a bit more, but it's probably not enough. Not super sure how to express the candidate_strides computation more cleanly otherwise.

No worries, the key thing is that the high-level flow is clear. As long as you name get_candidate_strides clearly so I know what it does I can just ignore the details of it and still get a good idea of how the overall code works. Breaking things into separate lambdas, naming things clearly, having solid intermediate types, and having a clear logical flow (generate candidate values, generate an overapproximation of the space, then filter) like you did here is great and much more important than low-level style.


lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r12 (raw file):

                           OperatorTaskSpace const &task,
                           MachineSpecification const &ms) {
  std::optional<MachineSpaceCoordinate> maximum_device_coords =

Suggestion:

  std::optional<MachineSpaceCoordinate> maximum_device_coord 

lib/compiler/src/compiler/allowed_machine_views.cc line 36 at r12 (raw file):

 * The returned set includes all valid machine views, and might contain
 invalid ones. This function should never be used externally (see
 * `get_allowed_partial_machine_view_mappings` instead). There is no

Is this the right name?

Code quote:

`get_allowed_partial_machine_view_mappings

lib/compiler/src/compiler/allowed_machine_views.cc line 40 at r12 (raw file):

 its possible for all
 * `MachineView`s to be invalid)
 */

Suggestion:

/* Generates a set of candidate `MachineView`s
 * The returned set includes all valid machine views, and might contain
 * invalid ones. This function should not be used externally (see
 * `get_allowed_partial_machine_view_mappings` instead). There is no
 * guarantee that a non-empty returned set contains a valid machine view (i.e.
 * it's possible for all the returned `MachineView`s to be invalid)
 */

lib/compiler/src/compiler/allowed_machine_views.cc line 61 at r12 (raw file):

    std::vector<stride_t> single_stride_range =
        transform(range(1, max_stride_upper_bound + 1),
                  [](int stride) { return stride_t(stride); });

for consistency

Suggestion:

                  [](int stride) { return stride_t{stride}; });

lib/compiler/src/compiler/allowed_machine_views.cc line 75 at r12 (raw file):

    std::unordered_set<MachineSpaceCoordinate> result;
    for (int i : range(ms.num_nodes)) {
      for (int j : range(get_num_devices_per_node(ms, device_type))) {

Suggestion:

    for (int node_idx : range(ms.num_nodes)) {
      for (int device_idx : range(get_num_devices_per_node(ms, device_type))) {

lib/compiler/src/compiler/allowed_machine_views.cc line 99 at r12 (raw file):

         candidate_starts(machine_spec, device_type)) {
      for (std::vector<MachineSpecificationDimension> const &proj :
           candidate_projections(task)) {

FYI the whole projection vs machine-spec-dim is naming-wise a bit inconsistent and weird. Not a big issue as at least neither term is overloaded to also mean something else (you just have two words for one concept), but it can be a bit confusing at times

Code quote:

      for (std::vector<MachineSpecificationDimension> const &proj :
           candidate_projections(task)) {

lib/compiler/src/compiler/allowed_machine_views.cc line 103 at r12 (raw file):

            transform(zip(strides.raw_strides, proj), [&](auto const &p) {
              return MachineViewDimension{p.first, p.second};
            });

I would add a function to MachineView for constructing a MachineView from a start and separate lists of strides and projections and then use it here to simplify this code a bit

Code quote:

        std::vector<MachineViewDimension> dimensions =
            transform(zip(strides.raw_strides, proj), [&](auto const &p) {
              return MachineViewDimension{p.first, p.second};
            });

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @lockshaw)


lib/compiler/src/compiler/allowed_machine_views.cc line 67 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

No worries, the key thing is that the high-level flow is clear. As long as you name get_candidate_strides clearly so I know what it does I can just ignore the details of it and still get a good idea of how the overall code works. Breaking things into separate lambdas, naming things clearly, having solid intermediate types, and having a clear logical flow (generate candidate values, generate an overapproximation of the space, then filter) like you did here is great and much more important than low-level style.

great tip, will keep it in mind in the future


lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r12 (raw file):

                           OperatorTaskSpace const &task,
                           MachineSpecification const &ms) {
  std::optional<MachineSpaceCoordinate> maximum_device_coords =

Done.


lib/compiler/src/compiler/allowed_machine_views.cc line 36 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this the right name?

changed to the correct one


lib/compiler/src/compiler/allowed_machine_views.cc line 40 at r12 (raw file):

 its possible for all
 * `MachineView`s to be invalid)
 */

Done.


lib/compiler/src/compiler/allowed_machine_views.cc line 75 at r12 (raw file):

    std::unordered_set<MachineSpaceCoordinate> result;
    for (int i : range(ms.num_nodes)) {
      for (int j : range(get_num_devices_per_node(ms, device_type))) {

Done.


lib/compiler/src/compiler/allowed_machine_views.cc line 99 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

FYI the whole projection vs machine-spec-dim is naming-wise a bit inconsistent and weird. Not a big issue as at least neither term is overloaded to also mean something else (you just have two words for one concept), but it can be a bit confusing at times

agreed, replaced projection since it's not only unclear but a bit of a misnomer, since the operation we are doing is not really projection as much as it is flattening an N-dim space into a 2D one.


lib/compiler/src/compiler/allowed_machine_views.cc line 103 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would add a function to MachineView for constructing a MachineView from a start and separate lists of strides and projections and then use it here to simplify this code a bit

done


lib/pcg/include/pcg/machine_view_dimension.struct.toml line 20 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any particular reason to keep this as a stride_t rather than just dropping it to an int? Not a big issue, just curious

no strong reason / preference, I don't mind it since it makes things a bit more explicit / clear. I can change it if needed


lib/pcg/src/pcg/machine_view.cc line 33 at r12 (raw file):

static std::vector<int>
    get_projection_indices(MachineView const &mv,

Done.


lib/pcg/src/pcg/machine_view.cc line 39 at r12 (raw file):

  std::vector<MachineSpecificationDimension> projections = get_projections(mv);
  for (size_t i = 0; i < projections.size(); ++i) {
    if (projections[i] == dimension) {

Done.


lib/pcg/src/pcg/machine_view.cc line 44 at r12 (raw file):

  }
  return projection_indices;
}

Done.


lib/pcg/src/pcg/machine_view.cc line 46 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Try to aim for clearer function names even if they're static, especially if they take this number of parameters (in fact, arguably the sheer number of shared parameters imply that this should probably just be a lambda in get_machine_space_coordinate as it doesn't really seem to stand alone

yeah, put into a lambda, much clearer now


lib/pcg/src/pcg/machine_view.cc line 62 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Code-style wise probably cleaner to generate sizes, coord_points, and strides as three separate transform expressions rather than one big combined loop

done


lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What about splitting this into a function that takes a MachineView and generates a DeduplicatedMachineView (which has all of the strides multiplied so you can't produce redundant points), and then a corresponding function that takes a DeduplicatedMachineView which then can do a much simpler computation to find the MachineSpaceCoordinate. Thoughts?

Overall one of the best way you can make code more readable is to introduce (sane and conceptually clear) intermediate types, and then break a complicated function that does A -> D into A -> B -> C -> D

not sure I fully agree, what would DeduplicatedMachineView represent exactly (the partial strides /degrees for each dimension)? I think the current logic is decently clear, at least at the top level. Cleaned up a bit, let me know if this is better


lib/pcg/src/pcg/machine_view.cc line 96 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is it possible that this should be an std::unordered_multiset, or is MachineView defined such as to make this impossible?

each TaskSpace coordinate maps to a unique MachineSpaceCoordinate, so it's good.


lib/pcg/src/pcg/machine_view.cc line 103 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Try not to abbreviate these variables too much or the code starts to become rather hard to skim as I keep having to chase down definitions unnecessarily

Done.

lockshaw
lockshaw previously approved these changes Oct 9, 2024
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r13, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Marsella8)


lib/pcg/include/pcg/start_invariant_machine_view.h line 39 at r13 (raw file):

                                 MachineSpecification const &ms);

std::unordered_set<MachineSpaceCoordinate>

Seems a bit odd to return a MachineSpaceCoordinate here rather than some other type like MachineSpaceOffset or something to represent the fact that StartInvariantMachineView doesn't actually have a start


lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

Previously, Marsella8 wrote…

not sure I fully agree, what would DeduplicatedMachineView represent exactly (the partial strides /degrees for each dimension)? I think the current logic is decently clear, at least at the top level. Cleaned up a bit, let me know if this is better

My understanding of how TaskSpaceCoordinate -> MachineSpaceCoordinate works is that for the most part it's a pretty simple calculation: given a machine view, you have machine_space_coord_x = task_space_coord_a * stride_a + start_x. The complexity comes in when you start to have multiple task space dimensions projected to the same machine space dimension, so you end up with, e.g., machine_space_coord_x = task_space_coord_a * stride_a + task_space_coord_b * stride_b + start_x, and then you have an issue where you can have two non-equal pairs (task_space_coord_a, task_space_coord_b) and (task_space_coord_a', task_space_coord_b') such that task_space_coord_a * stride_a + task_space_coord_b * stride_b == task_space_coord_a' * stride_a + task_space_coord_b' * stride_b, leading to multiple TaskSpaceCoordinates mapping to the same device. Correct me if I'm wrong, but the way we fix this is by introducing an order within the task space dimensions that map to a single machine space dimension (e.g., [a, b] here), and then we change the computation to task_space_coord_a * stride_a * (task_space_size_b * stride_b) + task_space_coord_b * stride_b + start_x, which makes the mapping one-to-one. As the logic for computing this is more complex, one way to simplify things would be to argue that stride_a * task_space_size_b * stride_b simply functions as a "normalized"/"deduplicated" stride stride_a', and then you could divide the computation into MachineView -> NormalizedMachineView, and then use the simpler computation for transforming a TaskSpaceCoordinate into a MachineSpaceCoordinatethrough aNormalizedMachineView`

The current code is reasonably nice so this isn't critical, just throwing it out there as a potential improvement as I don't think you'll be able to get the existing code much nicer via only style changes

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Marsella8)


lib/pcg/include/pcg/start_invariant_machine_view.h line 39 at r13 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems a bit odd to return a MachineSpaceCoordinate here rather than some other type like MachineSpaceOffset or something to represent the fact that StartInvariantMachineView doesn't actually have a start

yeah agree, added new dtgen type.


lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

My understanding of how TaskSpaceCoordinate -> MachineSpaceCoordinate works is that for the most part it's a pretty simple calculation: given a machine view, you have machine_space_coord_x = task_space_coord_a * stride_a + start_x. The complexity comes in when you start to have multiple task space dimensions projected to the same machine space dimension, so you end up with, e.g., machine_space_coord_x = task_space_coord_a * stride_a + task_space_coord_b * stride_b + start_x, and then you have an issue where you can have two non-equal pairs (task_space_coord_a, task_space_coord_b) and (task_space_coord_a', task_space_coord_b') such that task_space_coord_a * stride_a + task_space_coord_b * stride_b == task_space_coord_a' * stride_a + task_space_coord_b' * stride_b, leading to multiple TaskSpaceCoordinates mapping to the same device. Correct me if I'm wrong, but the way we fix this is by introducing an order within the task space dimensions that map to a single machine space dimension (e.g., [a, b] here), and then we change the computation to task_space_coord_a * stride_a * (task_space_size_b * stride_b) + task_space_coord_b * stride_b + start_x, which makes the mapping one-to-one. As the logic for computing this is more complex, one way to simplify things would be to argue that stride_a * task_space_size_b * stride_b simply functions as a "normalized"/"deduplicated" stride stride_a', and then you could divide the computation into MachineView -> NormalizedMachineView, and then use the simpler computation for transforming a TaskSpaceCoordinate into a MachineSpaceCoordinatethrough aNormalizedMachineView`

The current code is reasonably nice so this isn't critical, just throwing it out there as a potential improvement as I don't think you'll be able to get the existing code much nicer via only style changes

aah ok got it know thanks! I think I'll leave it like this for now, but will take this into consideration when going over that CPU kernel tests PR you tagged me in to see if it makes sense to have a centralized set of functions for working with these indices transformation.

lockshaw
lockshaw previously approved these changes Oct 9, 2024
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r14, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Marsella8)

@lockshaw lockshaw self-requested a review October 9, 2024 23:15
@lockshaw lockshaw enabled auto-merge (squash) October 9, 2024 23:15
@lockshaw lockshaw changed the title Unordered StridedRectangle, get_allowed_machine_views New MachineView representation Oct 9, 2024
@lockshaw lockshaw merged commit 65c3911 into flexflow:repo-refactor Oct 9, 2024
7 of 8 checks passed
@victorli2002 victorli2002 mentioned this pull request Mar 12, 2025
@victorli2002 victorli2002 mentioned this pull request Apr 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants