Local execution e2e training #1472
base: repo-refactor
Conversation
Reviewed 1 of 2 files at r1, 49 of 50 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
LIST_DIRECTORIES False src/*.cc src/cuda/cuda_helper.cu
Why is this not just src/cuda/*.cu
?
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
std::optional<std::size_t> at_maybe(ff_dim_t) const; ArrayShape sub_shape(legion_dim_t start, ff_dim_t end) const;
legion_dim_t
for start and ff_dim_t
for end? Is this signature intended? It's not illogical, just a bit weird, and I wanted to check it's not a typo 🙂
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
ArrayShape ArrayShape::sub_shape(std::optional<ff_dim_t> start, std::optional<ff_dim_t> end) const { std::vector<size_t> new_shape;
Can probably be replaced by slice(DimOrdered, ...)
: https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9
lib/kernels/src/device.h
line 74 at r3 (raw file):
__global__ void scale_kernel(float *ptr, size_t size, float a, float b); __global__ void scale_kernel(float *ptr, unsigned long size, float a, float b);
Why?
lib/local-execution/include/local-execution/local_training_backing.h
line 18 at r3 (raw file):
TensorBackingMap const &, RuntimeArgConfig const &, std::optional<ModelTrainingInstance> &);
Why is this passed as mutable?
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "ModelTrainingInstance"
Why isn't the model part of ModelTrainingInstance
? The members seem to have changed non-trivially since the old version:
struct ModelTrainingInstance {
ComputationGraph computation_graph;
Optimizer optimizer;
EnableProfiling enable_profiling;
TrainingPCG training_pcg;
TensorMapping tensor_map;
LegionBacking legion_backing;
};
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 15 at r3 (raw file):
"pcg/tensor_guid_t.dtg.h", "pcg/optimizer_attrs.dtg.h", ]
Suggestion:
includes = [
"op-attrs/ops/loss_attrs.dtg.h",
"pcg/tensor_guid_t.dtg.h",
"pcg/optimizer_attrs.dtg.h",
]
lib/local-execution/include/local-execution/optimizer.h
line 16 at r3 (raw file):
TaskInvocation get_update_invocation(OptimizerAttrs const &, tensor_guid_t const &weight, std::vector<tensor_guid_t> const &);
Some parameter names would be helpful here
Code quote:
std::vector<tensor_guid_t> const &);
lib/local-execution/include/local-execution/optimizer.h
line 22 at r3 (raw file):
TaskInvocation sgd_update(SGDOptimizerAttrs const &, tensor_guid_t const &weight, tensor_guid_t const &);
Some parameter names would be helpful here
lib/local-execution/include/local-execution/optimizer.h
line 29 at r3 (raw file):
tensor_guid_t const &weight, tensor_guid_t const &, tensor_guid_t const &);
Some parameter names would be helpful here
Code quote:
tensor_guid_t const &,
tensor_guid_t const &);
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 22 at r3 (raw file):
[[values]] type = "::FlexFlow::FwdBwdTaskImplFunction" key = "fwd_bwd_task_impl_function"
Would these be better renamed InitOpTaskImplFunction
and FwdBwdOpTaskImplFunction
, or do they make sense outside of an op context?
Code quote:
[[values]]
type = "::FlexFlow::InitTaskImplFunction"
key = "init_task_impl_function"
[[values]]
type = "::FlexFlow::FwdBwdTaskImplFunction"
key = "fwd_bwd_task_impl_function"
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
task_id_t task_id; TaskBinding binding; };
Can be moved over to dtgen
Code quote:
struct TaskInvocation {
public:
TaskInvocation() = delete;
TaskInvocation(task_id_t task_id, TaskBinding const &binding)
: task_id(task_id), binding(binding) {}
public:
task_id_t task_id;
TaskBinding binding;
};
lib/local-execution/include/local-execution/task_signature.h
line 4 at r3 (raw file):
#define _FLEXFLOW_LOCAL_EXECUTION_TASK_SIGNATURE_H // #include "local-execution/tensor_guid_slot_spec.dtg.h"
Delete?
lib/local-execution/include/local-execution/task_signature.h
line 41 at r3 (raw file):
} // adds arg_slot without checking is_serializable, used for arguments that are
Change to proper doxygen docstring?
lib/local-execution/include/local-execution/task_signature.h
line 49 at r3 (raw file):
// adds arg_slot without checking is_serializable, used for arguments that are // deviceSpecific
Change to proper doxygen docstring?
lib/local-execution/include/local-execution/task_signature.struct.toml
line 5 at r3 (raw file):
features = [ "eq", "fmt",
Why not "hash"
?
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
includes = [ "local-execution/tensor_guid_slot_spec.dtg.h", "utils/type_index.h",
Suggestion:
"<typeindex>",
lib/local-execution/include/local-execution/task_signature.struct.toml
line 12 at r3 (raw file):
"utils/type_index.h", "utils/optional.h" ]
Suggestion:
includes = [
"local-execution/tensor_guid_slot_spec.dtg.h",
"utils/type_index.h",
"<optional>",
]
lib/local-execution/include/local-execution/task_signature.struct.toml
line 29 at r3 (raw file):
[[fields]] name = "tensor_guid_slots" type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"
Why not a map of slot_id_t -> something
?
Code quote:
type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "TensorGuidSpec"
This name seems a bit ambiguous? Maybe something clearer?
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
tensor_backing_map, this->runtime_arg_config, model_training_instance);
Would it make more sense to disaggregate LocalTrainingBacking
rather than having an optional
field for the training instance?
Code quote:
std::optional<ModelTrainingInstance> model_training_instance = std::nullopt;
LocalTrainingBacking local_backing(allocator,
cg_builder.computation_graph,
tensor_backing_map,
this->runtime_arg_config,
model_training_instance);
lib/local-execution/src/local_slots_backing.cc
line 61 at r3 (raw file):
sig.tensor_guid_slots.size() - 2; // ignore 2 (weight and weight_grad) std::vector<tensor_guid_t> buffer_tensors = get_new_tensor_guids_for_layer_without_graph_insertion(
Not a fan of this, come up with some new tensor id type
lib/local-execution/src/local_slots_backing.cc
line 63 at r3 (raw file):
get_new_tensor_guids_for_layer_without_graph_insertion( cg, weight_layer, num_buffer_tensors); for (auto const &tensor_guid : buffer_tensors) {
Suggestion:
for (tensor_guid_t const &tensor_guid : buffer_tensors)
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
OpTaskBinding const &binding, layer_guid_t const &op_guid) const { TensorSlotsBacking mapping; int num_inputs = 0;
Either use filter
or add a count_if
function to containers
or something like it
lib/local-execution/src/local_slots_backing.cc
line 126 at r3 (raw file):
default: throw mk_runtime_error( fmt::format("Invalid TensorRole")); // inserting role yields
It shouldn't, can you try again?
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
IsGrad is_grad = slot_grad_id.is_grad; GenericTensorAccessorW tensor_backing = this->get_tensor_backing( tensor_guids.at(weight_adjusted_idx + tensor_spec.idx), is_grad);
What's going on here with weight_adjusted_idx
?
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
}, [](ConcreteArgSpec const &s) { return s; }, })});
Isn't this just map_values
?
Code quote:
for (auto const &arg_binding : binding.get_arg_bindings()) {
slot_id_t arg_slot = arg_binding.first;
TaskArgSpec task_arg_spec = arg_binding.second;
mapping.insert({arg_slot,
task_arg_spec.visit<ConcreteArgSpec>(overload{
[&](RuntimeArgRefSpec const &s) {
return this->resolve_runtime_arg_ref_spec(s);
},
[](ConcreteArgSpec const &s) { return s; },
})});
lib/local-execution/src/local_training_backing.cc
line 26 at r3 (raw file):
for (layer_guid_t const &node : topological_ordering(this->computation_graph)) {
Much appreciated 🙂
lib/local-execution/src/local_training_backing.cc
line 165 at r3 (raw file):
std::vector<tensor_guid_t> buffer_tensors = this->local_slots_backing.weight_optimizer_tensor_guids.at( weight_tensor);
Why not key this by the layer_guid_t
of the WeightAttrs
?
lib/local-execution/src/local_training_backing.cc
line 179 at r3 (raw file):
} this->training_instance = next(this->training_instance.value());
I'm not convinced that the next()
abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance
changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance
stays the same
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
add_slot(sig, LOGIT, IsGrad::NO); add_slot(sig, LABEL, IsGrad::NO); add_slot(sig, LOGIT, IsGrad::YES);
Why is LOGIT
added twice? Does this mean pass both the gradient and the non-gradient into the function?
lib/local-execution/src/loss_functions.cc
line 40 at r2 (raw file):
b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::NO}); b.bind(LABEL, TensorGuidSpec{label, IsGrad::NO}); b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::YES});
Same here, why is LOGIT
bound twice?
lib/local-execution/src/loss_functions.cc
line 64 at r2 (raw file):
if (loss_type == LossFunction::SPARSE_CATEGORICAL_CROSSENTROPY) { // label shape is [parallel dim(?), batch dim, 1]
Label shape is an ArrayShape
, so it shouldn't have any parallel dims, right?
Code quote:
// label shape is [parallel dim(?), batch dim, 1]
lib/local-execution/src/model_training_instance.cc
line 9 at r2 (raw file):
AdamOptimizerAttrs old = old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>(); double new_beta1_t = old.beta_t * old.beta1;
Should probably get pulled out of the function at least. Also, does it make sense to actually modify the ModelTrainingInstance
or would we want a separate optimizer state object/field to hold these mutable values?
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
new_attrs}; } return old_training_instance;
Might be shorter to just copy the
?
Suggestion:
if (old_training_instance.optimizer_attrs.has<AdamOptimizerAttrs>()) {
old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>() = AdamOptimizerAttrs{
...
};
} else {
return old_training_instance;
}
lib/local-execution/src/optimizer.cc
line 39 at r2 (raw file):
return {task_id_t::SGD_UPD_NCCL_TASK_ID, b}; } return {task_id_t::SGD_UPD_PS_TASK_ID, b};
Bit more readable
Suggestion:
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
b.bind_arg(HANDLE, ff_handle());
return {task_id_t::SGD_UPD_NCCL_TASK_ID, b};
} else {
return {task_id_t::SGD_UPD_PS_TASK_ID, b};
}
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b}; } return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
Suggestion:
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
b.bind_arg(HANDLE, ff_handle());
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b};
} else {
return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
}
lib/local-execution/src/optimizer.cc
line 189 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_signature(); }});
Suggestion:
TaskSignature get_update_signature(OptimizerAttrs const &attrs) {
  return attrs.visit<TaskSignature>(overload{
      [&](SGDOptimizerAttrs const &) { return get_sgd_update_signature(); },
      [&](AdamOptimizerAttrs const &) {
        return get_adam_update_signature();
      }});
}
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
get_update_invocation(OptimizerAttrs const &attrs, tensor_guid_t const &weight, std::vector<tensor_guid_t> const &buffer_tensors) {
Is buffer
usually the word for these?
Code quote:
&buffer_tensors)
lib/local-execution/src/optimizer.cc
line 211 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_task_impl(); }});
Suggestion:
TaskImplFunction get_update_task_impl(OptimizerAttrs const &attrs) {
  return attrs.visit<TaskImplFunction>(overload{
      [&](SGDOptimizerAttrs const &) { return get_sgd_update_task_impl(); },
      [&](AdamOptimizerAttrs const &) {
        return get_adam_update_task_impl();
      }});
}
lib/local-execution/src/task_invocation.cc
line 46 at r2 (raw file):
bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) { // TODO: implement signature checking return true;
Suggestion:
NOT_IMPLEMENTED();
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
tensor_backing_map.insert({input_tensor, input_backing}); SUBCASE("SparseCategoricalCrossEntropyLossAttrs") {
The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves, and then all that is needed is the one integration test to make sure loss functions in general work with local execution.
It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries
lib/local-execution/test/src/test_update_e2e.cc
line 0 at r3 (raw file):
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well
lib/op-attrs/include/op-attrs/ops/loss_functions.h
line 8 at r3 (raw file):
#include "loss_function.dtg.h" #include "other_loss_attrs.dtg.h" #include "sparse_categorical_ce_loss_attrs.dtg.h"
Suggestion:
#include "op-attrs/ops/core.h"
#include "op-attrs/ops/loss_attrs.dtg.h"
#include "op-attrs/ops/loss_function.dtg.h"
#include "op-attrs/ops/other_loss_attrs.dtg.h"
#include "op-attrs/ops/sparse_categorical_ce_loss_attrs.dtg.h"
lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "OtherLossAttrs"
Maybe NonconfigurableLossAttrs
or NonparameterizedLossAttrs
? OtherLossAttrs
is just a bit ambiguous
Code quote:
name = "OtherLossAttrs"
lib/pcg/include/pcg/computation_graph.h
line 36 at r3 (raw file):
std::vector<tensor_guid_t> get_new_tensor_guids_for_layer_without_graph_insertion(
Remove
lib/pcg/include/pcg/optimizer_attrs.h
line 8 at r3 (raw file):
namespace FlexFlow { OptimizerAttrs make_empty_sgd_attrs();
What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
TaskSignature get_loss_bwd_signature(); TaskInvocation backward(LossAttrs const &, tensor_guid_t logit, tensor_guid_t label);
Why change the signature of backward
?
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @lockshaw)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this not just
src/cuda/*.cu
?
There are a couple of files in that folder that haven't been refactored/don't build: metrics_kernels.cu
and embedding_kernels.cu
. I think maybe Dylan was going to tackle those?
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
legion_dim_t
for start and ff_dim_t
for end? Is this signature intended? It's not illogical, just a bit weird and wanted to check it's not a typo 🙂
It's called like this in concat_kernels.cu
:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements();
where axis
is an ff_dim_t
I've left it as not implemented for now because I thought it was weird too.
lib/kernels/src/device.h
line 74 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
Was getting a linker error, but I realized it was for a different reason. This is removed and fixed in cuda_helper.cu
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 15 at r3 (raw file):
"pcg/tensor_guid_t.dtg.h", "pcg/optimizer_attrs.dtg.h", ]
Done.
lib/local-execution/include/local-execution/optimizer.h
line 16 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/optimizer.h
line 22 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/optimizer.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 22 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would these be better renamed
InitOpTaskImplFunction
and FwdBwdOpTaskImplFunction
, or do they make sense outside of an op context?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 4 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 41 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to proper doxygen docstring?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 49 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to proper doxygen docstring?
Done.
lib/local-execution/include/local-execution/task_signature.struct.toml
line 5 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not
"hash"
?
Done.
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
includes = [ "local-execution/tensor_guid_slot_spec.dtg.h", "utils/type_index.h",
This is fine, but I have to move the deleted part to the src_includes
since the fmt is defined in that file
lib/local-execution/include/local-execution/task_signature.struct.toml
line 12 at r3 (raw file):
"utils/type_index.h", "utils/optional.h" ]
This is fine, but I have to move the deleted part to the src_includes
since the fmt is defined in that file
lib/local-execution/src/local_slots_backing.cc
line 63 at r3 (raw file):
get_new_tensor_guids_for_layer_without_graph_insertion( cg, weight_layer, num_buffer_tensors); for (auto const &tensor_guid : buffer_tensors) {
Done.
lib/local-execution/src/local_slots_backing.cc
line 126 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It shouldn't, can you try again?
Done.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What's going on here with
weight_adjusted_idx
?
This is the support I had previously for getting the right weight tensor. But this will be superseded by the new weights/inputs reps PR, so this will get removed.
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Isn't this just
map_values
?
That's right. Fixed. Didn't even know that existed.
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
LOGIT
added twice? Does this mean pass both the gradient and the non-gradient into the function?
Yes.
lib/local-execution/src/loss_functions.cc
line 40 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Same here, why is
LOGIT
bound twice?
Same as above, the function needs both gradient and non-gradient tensor. It's the same as when we use bind_grad
for op tasks, but I didn't feel like that specification was necessary for non-op tasks.
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be shorted to just copy the
old_training_instance
?
Will be superseded by changes to the
and ModelTrainingInstance
abstraction
lib/local-execution/src/optimizer.cc
line 39 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bit more readable
Done.
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b}; } return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
Done.
lib/local-execution/src/optimizer.cc
line 189 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_signature(); }});
Done.
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is
buffer
usually the word for these?
I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses momentum_buffer
though I guess buffer itself is broad. I can make this grad_buffer_tensors
for more clarity.
lib/local-execution/src/optimizer.cc
line 211 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_task_impl(); }});
Done.
lib/local-execution/src/task_invocation.cc
line 46 at r2 (raw file):
bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) { // TODO: implement signature checking return true;
Done.
lib/op-attrs/include/op-attrs/ops/loss_functions.h
line 8 at r3 (raw file):
#include "loss_function.dtg.h" #include "other_loss_attrs.dtg.h" #include "sparse_categorical_ce_loss_attrs.dtg.h"
Done.
lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe
NonconfigurableLossAttrs
or NonparameterizedLossAttrs
? OtherLossAttrs
is just a bit ambiguous
Done.
lib/pcg/include/pcg/optimizer_attrs.h
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.
That's fair. Removed.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why change the signature of
backward
?
I thought local execution wouldn't deal with tensor_guid_t
Reviewed 49 of 49 files at r4, all commit messages.
Reviewable status: all files reviewed, 24 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
There are a couple of files in that folder that haven't been refactored/don't build:
metrics_kernels.cu
andembedding_kernels.cu
. I think maybe Dylan was going to tackle those?
Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
It's called like this in
concat_kernels.cu
:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements();
whereaxis
is anff_dim_t
I've left it as not implemented for now because I thought it was weird too.
Probably better to instead use the function that converts between an ff_dim_t
and a legion_dim_t
using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered
alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t)
overload
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is fine, but I have to move the deleted part to the
src_includes
since the fmt is defined in that file
That's good, the more things we can put in src_includes
instead of includes
the better for build times as things in src_includes
won't get propagated through #include
s as they're only in the .dtg.cc
file and not the .dtg.h
file
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is the support I had previously for getting the right weight tensor. But this will be superseded by the new weights/inputs reps PR, so this will get removed.
So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
That's right. Fixed. Didn't even know that existed.
🙂
lib/local-execution/src/local_slots_backing.cc
line 157 at r4 (raw file):
ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing( OpTaskBinding const &binding, layer_guid_t const &op_guid) const { return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {
Prefer explicit type names over auto
Suggestion:
return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {
lib/local-execution/src/local_slots_backing.cc
line 171 at r4 (raw file):
ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing( TaskBinding const &binding) const { return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {
Prefer explicit type names over auto
Suggestion:
return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Yes.
Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
Thanks--I usually tend to reserve the if (...) { ...; return ...; } ...; return ...;
pattern for cases like error handling etc. where the behavior is more "early termination" or (probably equivalently) where the else
case is very large
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses
momentum_buffer
though I guess buffer itself is broad. I can make this grad_buffer_tensors
for more clarity.
Sure, that sounds good. I have no issue with buffer
as long as it's standard terminology, which it sounds like it is
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I thought local execution won't deal with
tensor_guid_t
You mean parallel_tensor_guid_t
?
Reviewable status: 65 of 82 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back
Done.
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably better to instead use the function that converts between an
ff_dim_t
and a legion_dim_t
using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered
alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t)
overload
Done.
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can probably be replaced by
slice(DimOrdered, ...)
: https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9
Done.
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can be moved over to
dtgen
Done. Just added eq
for now, created an issue for hash
and fmt
#1503
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Either use
filter
or add acount_if
function tocontainers
or something like it
This will be superseded by pcg input/weight differentiation, since then we don't need to count inputs.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR
Now that PCG input/weight is merged, I can fix in this PR itself.
lib/local-execution/src/local_slots_backing.cc
line 157 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit type names over auto
Done. (It's OpArgSpec
here btw)
lib/local-execution/src/local_slots_backing.cc
line 171 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit type names over auto
Done.
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)
Functionally, a slot = a tensor, so I think it's ok to refer to multiple slots with the same name (the name is just a concept) as long as they have different specifications (e.g. grad and non-grad). Sometimes you don't want the grad version, so I don't think the wrapper function would be too helpful. I actually think the current interface is the most clear.
lib/local-execution/src/loss_functions.cc
line 64 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Label shape is an ArrayShape, so it shouldn't have any parallel dims, right?
I think you're right, yes.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You mean parallel_tensor_guid_t?
Yes.
Reviewable status: 54 of 83 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/local-execution/include/local-execution/local_training_backing.h
line 18 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this passed as mutable?
Done.
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why isn't the model part of ModelTrainingInstance? The members seem to have changed non-trivially since the old version:

struct ModelTrainingInstance {
  ComputationGraph computation_graph;
  Optimizer optimizer;
  EnableProfiling enable_profiling;
  TrainingPCG training_pcg;
  TensorMapping tensor_map;
  LegionBacking legion_backing;
};
It's been kind of reversed and disaggregated. You can think of the backing itself as the driver and the training instance as just the fields necessary for training (logit, label, loss, optimizer). This makes it a lot easier to test.
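The disaggregation described above (the backing as the driver, the training instance holding only the fields training needs) might look roughly like this. The field types here are illustrative stand-ins, not the repo's actual tensor_guid_t/LossAttrs/OptimizerAttrs definitions:

```cpp
#include <cassert>
#include <string>

// Illustrative stand-ins; the real field types live elsewhere in the repo.
using TensorGuid = int;
struct LossAttrs { std::string loss_function; };
struct OptimizerAttrs { double learning_rate; };

// The training instance carries only what training itself needs
// (logit, label, loss, optimizer); the backing acts as the driver.
struct ModelTrainingInstance {
  TensorGuid logit_tensor;
  TensorGuid label_tensor;
  LossAttrs loss;
  OptimizerAttrs optimizer;
};
```

Components that don't train (e.g. a cost estimator) can then construct the backing without ever building one of these.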
lib/local-execution/include/local-execution/task_signature.struct.toml
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not a map of slot_id_t -> something?
Done.
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This name seems a bit ambiguous? Maybe something clearer?
Hm, but it is literally a specification for a tensor. Like the tensor_guid and the grad will tell you exactly which tensor to access.
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it make more sense to disaggregate LocalTrainingBacking rather than having an optional field for the training instance?
I was thinking about this, but to me it makes sense to have the training backing be the driver and then the optional instance provides fields that are only necessary for training. For example, when I'm testing the local cost estimator, I don't need the loss/optimizer stuff.
lib/local-execution/src/local_slots_backing.cc
line 61 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not a fan of this, come up with some new tensor id type
Done. It's now a non_graph_tensor_guid_t and not handled by the computation graph.
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This will be superseded by PCG input/weight differentiation, since then we don't need to count inputs.
Done.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Now that PCG input/weight is merged, I can fix in this PR itself.
Done.
lib/local-execution/src/local_training_backing.cc
line 165 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not key this by the layer_guid_t of the WeightAttrs?
Done.
lib/local-execution/src/local_training_backing.cc
line 179 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm not convinced that the next() abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance stays the same.
Done.
lib/local-execution/src/model_training_instance.cc
line 9 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should probably get pulled out of the function at least. Also, does it make sense to actually modify the ModelTrainingInstance, or would we want a separate optimizer state object/field to hold these mutable values?
Done. Now, next only changes the optimizer attrs.
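To make the "next only changes the optimizer attrs" idea concrete, here is a minimal sketch of next() as a pure function over a hypothetical Adam-attrs struct (field names here are illustrative, not the repo's actual OptimizerAttrs layout):

```cpp
#include <cassert>
#include <cmath>

// Illustrative stand-in for Adam-style optimizer attrs; the real type
// in lib/local-execution has more members.
struct AdamAttrsStub {
  double alpha;   // base learning rate
  double beta1;
  double beta2;
  double beta1_t; // running product of beta1, advanced each step
  double beta2_t; // running product of beta2, advanced each step
  double alpha_t; // bias-corrected learning rate for the current step
};

// Sketch of next(): returns the attrs for the following step instead of
// mutating the whole ModelTrainingInstance.
AdamAttrsStub next(AdamAttrsStub const &a) {
  AdamAttrsStub result = a;
  result.beta1_t *= a.beta1;
  result.beta2_t *= a.beta2;
  result.alpha_t =
      a.alpha * std::sqrt(1.0 - result.beta2_t) / (1.0 - result.beta1_t);
  return result;
}
```

Because the function is pure, the per-step mutable state stays confined to the optimizer attrs, and the rest of the training instance can remain immutable across iterations.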
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Will be superseded by changes to the next and ModelTrainingInstance abstraction.
Done. Now, next only changes the optimizer attrs.
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves and then all this is need is the one integration test to make sure loss functions in general work with local execution.
It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries
I agree generally, but in this case (for loss function), the real test is just the CPU kernels one. It doesn't really make sense to test the backing without just running a full forward+backward pass (otherwise it's just like testing that an operator works).
lib/local-execution/test/src/test_update_e2e.cc
line at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well
Same principle here, but there's a lot more interaction with the mechanics of the local backing for update (different tensor storage location, access, etc.) that I can add tests for in the respective local slots backing / local task arg accessor tests. Working on that now.
lib/pcg/include/pcg/computation_graph.h
line 36 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove
Done.
Description of changes:
Implement end-to-end training loop in local execution.
Related Issues:
Linked Issues:
Issues closed by this PR: