Local execution e2e training #1472
base: repo-refactor
Conversation
Reviewed 1 of 2 files at r1, 49 of 50 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
LIST_DIRECTORIES False src/*.cc src/cuda/cuda_helper.cu
Why is this not just src/cuda/*.cu
?
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
std::optional<std::size_t> at_maybe(ff_dim_t) const; ArrayShape sub_shape(legion_dim_t start, ff_dim_t end) const;
legion_dim_t
for start and ff_dim_t
for end? Is this signature intended? It's not illogical, just a bit weird, and I wanted to check it's not a typo 🙂
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
ArrayShape ArrayShape::sub_shape(std::optional<ff_dim_t> start, std::optional<ff_dim_t> end) const { std::vector<size_t> new_shape;
Can probably be replaced by slice(DimOrdered, ...)
: https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9
lib/kernels/src/device.h
line 74 at r3 (raw file):
__global__ void scale_kernel(float *ptr, size_t size, float a, float b); __global__ void scale_kernel(float *ptr, unsigned long size, float a, float b);
Why?
lib/local-execution/include/local-execution/local_training_backing.h
line 18 at r3 (raw file):
TensorBackingMap const &, RuntimeArgConfig const &, std::optional<ModelTrainingInstance> &);
Why is this passed as mutable?
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "ModelTrainingInstance"
Why isn't the model part of ModelTrainingInstance
? The members seem to have changed non-trivially since the old version:
struct ModelTrainingInstance {
ComputationGraph computation_graph;
Optimizer optimizer;
EnableProfiling enable_profiling;
TrainingPCG training_pcg;
TensorMapping tensor_map;
LegionBacking legion_backing;
};
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 15 at r3 (raw file):
"pcg/tensor_guid_t.dtg.h", "pcg/optimizer_attrs.dtg.h", ]
Suggestion:
includes = [
"op-attrs/ops/loss_attrs.dtg.h",
"pcg/tensor_guid_t.dtg.h",
"pcg/optimizer_attrs.dtg.h",
]
lib/local-execution/include/local-execution/optimizer.h
line 16 at r3 (raw file):
TaskInvocation get_update_invocation(OptimizerAttrs const &, tensor_guid_t const &weight, std::vector<tensor_guid_t> const &);
Some parameter names would be helpful here
Code quote:
std::vector<tensor_guid_t> const &);
lib/local-execution/include/local-execution/optimizer.h
line 22 at r3 (raw file):
TaskInvocation sgd_update(SGDOptimizerAttrs const &, tensor_guid_t const &weight, tensor_guid_t const &);
Some parameter names would be helpful here
lib/local-execution/include/local-execution/optimizer.h
line 29 at r3 (raw file):
tensor_guid_t const &weight, tensor_guid_t const &, tensor_guid_t const &);
Some parameter names would be helpful here
Code quote:
tensor_guid_t const &,
tensor_guid_t const &);
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 22 at r3 (raw file):
[[values]] type = "::FlexFlow::FwdBwdTaskImplFunction" key = "fwd_bwd_task_impl_function"
Would these be better renamed InitOpTaskImplFunction
and FwdBwdOpTaskImplFunction
, or do they make sense outside of an op context?
Code quote:
[[values]]
type = "::FlexFlow::InitTaskImplFunction"
key = "init_task_impl_function"
[[values]]
type = "::FlexFlow::FwdBwdTaskImplFunction"
key = "fwd_bwd_task_impl_function"
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
task_id_t task_id; TaskBinding binding; };
Can be moved over to dtgen
Code quote:
struct TaskInvocation {
public:
TaskInvocation() = delete;
TaskInvocation(task_id_t task_id, TaskBinding const &binding)
: task_id(task_id), binding(binding) {}
public:
task_id_t task_id;
TaskBinding binding;
};
lib/local-execution/include/local-execution/task_signature.h
line 4 at r3 (raw file):
#define _FLEXFLOW_LOCAL_EXECUTION_TASK_SIGNATURE_H // #include "local-execution/tensor_guid_slot_spec.dtg.h"
Delete?
lib/local-execution/include/local-execution/task_signature.h
line 41 at r3 (raw file):
} // adds arg_slot without checking is_serializable, used for arguments that are
Change to proper doxygen docstring?
lib/local-execution/include/local-execution/task_signature.h
line 49 at r3 (raw file):
// adds arg_slot without checking is_serializable, used for arguments that are // deviceSpecific
Change to proper doxygen docstring?
lib/local-execution/include/local-execution/task_signature.struct.toml
line 5 at r3 (raw file):
features = [ "eq", "fmt",
Why not "hash"
?
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
includes = [ "local-execution/tensor_guid_slot_spec.dtg.h", "utils/type_index.h",
Suggestion:
"<typeindex>",
lib/local-execution/include/local-execution/task_signature.struct.toml
line 12 at r3 (raw file):
"utils/type_index.h", "utils/optional.h" ]
Suggestion:
includes = [
"local-execution/tensor_guid_slot_spec.dtg.h",
"utils/type_index.h",
"<optional>",
]
lib/local-execution/include/local-execution/task_signature.struct.toml
line 29 at r3 (raw file):
[[fields]] name = "tensor_guid_slots" type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"
Why not a map of slot_id_t -> something
?
Code quote:
type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "TensorGuidSpec"
This name seems a bit ambiguous? Maybe something clearer?
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
tensor_backing_map, this->runtime_arg_config, model_training_instance);
Would it make more sense to disaggregate LocalTrainingBacking
rather than having an optional
field for the training instance?
Code quote:
std::optional<ModelTrainingInstance> model_training_instance = std::nullopt;
LocalTrainingBacking local_backing(allocator,
cg_builder.computation_graph,
tensor_backing_map,
this->runtime_arg_config,
model_training_instance);
lib/local-execution/src/local_slots_backing.cc
line 61 at r3 (raw file):
sig.tensor_guid_slots.size() - 2; // ignore 2 (weight and weight_grad) std::vector<tensor_guid_t> buffer_tensors = get_new_tensor_guids_for_layer_without_graph_insertion(
Not a fan of this, come up with some new tensor id type
lib/local-execution/src/local_slots_backing.cc
line 63 at r3 (raw file):
get_new_tensor_guids_for_layer_without_graph_insertion( cg, weight_layer, num_buffer_tensors); for (auto const &tensor_guid : buffer_tensors) {
Suggestion:
for (tensor_guid_t const &tensor_guid : buffer_tensors)
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
OpTaskBinding const &binding, layer_guid_t const &op_guid) const { TensorSlotsBacking mapping; int num_inputs = 0;
Either use filter
or add a count_if
function to containers
or something like it
lib/local-execution/src/local_slots_backing.cc
line 126 at r3 (raw file):
default: throw mk_runtime_error( fmt::format("Invalid TensorRole")); // inserting role yields
It shouldn't, can you try again?
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
IsGrad is_grad = slot_grad_id.is_grad; GenericTensorAccessorW tensor_backing = this->get_tensor_backing( tensor_guids.at(weight_adjusted_idx + tensor_spec.idx), is_grad);
What's going on here with weight_adjusted_idx
?
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
}, [](ConcreteArgSpec const &s) { return s; }, })});
Isn't this just map_values
?
Code quote:
for (auto const &arg_binding : binding.get_arg_bindings()) {
slot_id_t arg_slot = arg_binding.first;
TaskArgSpec task_arg_spec = arg_binding.second;
mapping.insert({arg_slot,
task_arg_spec.visit<ConcreteArgSpec>(overload{
[&](RuntimeArgRefSpec const &s) {
return this->resolve_runtime_arg_ref_spec(s);
},
[](ConcreteArgSpec const &s) { return s; },
})});
lib/local-execution/src/local_training_backing.cc
line 26 at r3 (raw file):
for (layer_guid_t const &node : topological_ordering(this->computation_graph)) {
Much appreciated 🙂
lib/local-execution/src/local_training_backing.cc
line 165 at r3 (raw file):
std::vector<tensor_guid_t> buffer_tensors = this->local_slots_backing.weight_optimizer_tensor_guids.at( weight_tensor);
Why not key this by the layer_guid_t
of the WeightAttrs
?
lib/local-execution/src/local_training_backing.cc
line 179 at r3 (raw file):
} this->training_instance = next(this->training_instance.value());
I'm not convinced that the next()
abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance
changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance
stays the same
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
add_slot(sig, LOGIT, IsGrad::NO); add_slot(sig, LABEL, IsGrad::NO); add_slot(sig, LOGIT, IsGrad::YES);
Why is LOGIT
added twice? Does this mean pass both the gradient and the non-gradient into the function?
lib/local-execution/src/loss_functions.cc
line 40 at r2 (raw file):
b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::NO}); b.bind(LABEL, TensorGuidSpec{label, IsGrad::NO}); b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::YES});
Same here, why is LOGIT
bound twice?
lib/local-execution/src/loss_functions.cc
line 64 at r2 (raw file):
if (loss_type == LossFunction::SPARSE_CATEGORICAL_CROSSENTROPY) { // label shape is [parallel dim(?), batch dim, 1]
Label shape is an ArrayShape
, so it shouldn't have any parallel dims, right?
Code quote:
// label shape is [parallel dim(?), batch dim, 1]
lib/local-execution/src/model_training_instance.cc
line 9 at r2 (raw file):
AdamOptimizerAttrs old = old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>(); double new_beta1_t = old.beta_t * old.beta1;
Should probably get pulled out of the function at least. Also, does it make sense to actually modify the ModelTrainingInstance
or would we want a separate optimizer state object/field to hold these mutable values?
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
new_attrs}; } return old_training_instance;
Might be shorter to just copy the
?
Suggestion:
if (old_training_instance.optimizer_attrs.has<AdamOptimizerAttrs>()) {
old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>() = AdamOptimizerAttrs{
...
};
} else {
return old_training_instance;
}
lib/local-execution/src/optimizer.cc
line 39 at r2 (raw file):
return {task_id_t::SGD_UPD_NCCL_TASK_ID, b}; } return {task_id_t::SGD_UPD_PS_TASK_ID, b};
Bit more readable
Suggestion:
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
b.bind_arg(HANDLE, ff_handle());
return {task_id_t::SGD_UPD_NCCL_TASK_ID, b};
} else {
return {task_id_t::SGD_UPD_PS_TASK_ID, b};
}
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b}; } return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
Suggestion:
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
b.bind_arg(HANDLE, ff_handle());
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b};
} else {
return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
}
lib/local-execution/src/optimizer.cc
line 189 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_signature(); }});
Suggestion:
TaskSignature get_update_signature(OptimizerAttrs const &attrs) {
  return attrs.visit<TaskSignature>(overload{
      [&](SGDOptimizerAttrs const &) { return get_sgd_update_signature(); },
      [&](AdamOptimizerAttrs const &) {
        return get_adam_update_signature();
      }});
}
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
get_update_invocation(OptimizerAttrs const &attrs, tensor_guid_t const &weight, std::vector<tensor_guid_t> const &buffer_tensors) {
Is buffer
usually the word for these?
Code quote:
&buffer_tensors)
lib/local-execution/src/optimizer.cc
line 211 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_task_impl(); }});
Suggestion:
TaskImplFunction get_update_task_impl(OptimizerAttrs const &attrs) {
  return attrs.visit<TaskImplFunction>(overload{
      [&](SGDOptimizerAttrs const &) { return get_sgd_update_task_impl(); },
      [&](AdamOptimizerAttrs const &) {
        return get_adam_update_task_impl();
      }});
}
lib/local-execution/src/task_invocation.cc
line 46 at r2 (raw file):
bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) { // TODO: implement signature checking return true;
Suggestion:
NOT_IMPLEMENTED();
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
tensor_backing_map.insert({input_tensor, input_backing}); SUBCASE("SparseCategoricalCrossEntropyLossAttrs") {
The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves, and then all that is needed is the one integration test to make sure loss functions in general work with local execution.
It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries
lib/local-execution/test/src/test_update_e2e.cc
line 0 at r3 (raw file):
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well
lib/op-attrs/include/op-attrs/ops/loss_functions.h
line 8 at r3 (raw file):
#include "loss_function.dtg.h" #include "other_loss_attrs.dtg.h" #include "sparse_categorical_ce_loss_attrs.dtg.h"
Suggestion:
#include "op-attrs/ops/core.h"
#include "op-attrs/ops/loss_attrs.dtg.h"
#include "op-attrs/ops/loss_function.dtg.h"
#include "op-attrs/ops/other_loss_attrs.dtg.h"
#include "op-attrs/ops/sparse_categorical_ce_loss_attrs.dtg.h"
lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "OtherLossAttrs"
Maybe NonconfigurableLossAttrs
or NonparameterizedLossAttrs
? OtherLossAttrs
is just a bit ambiguous
Code quote:
name = "OtherLossAttrs"
lib/pcg/include/pcg/computation_graph.h
line 36 at r3 (raw file):
std::vector<tensor_guid_t> get_new_tensor_guids_for_layer_without_graph_insertion(
Remove
lib/pcg/include/pcg/optimizer_attrs.h
line 8 at r3 (raw file):
namespace FlexFlow { OptimizerAttrs make_empty_sgd_attrs();
What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
TaskSignature get_loss_bwd_signature(); TaskInvocation backward(LossAttrs const &, tensor_guid_t logit, tensor_guid_t label);
Why change the signature of backward
?
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @lockshaw)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this not just
src/cuda/*.cu
?
There are a couple of files in that folder that haven't been refactored/don't build: metrics_kernels.cu
and embedding_kernels.cu
. I think maybe Dylan was going to tackle those?
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
legion_dim_t
for start and ff_dim_t
for end? Is this signature intended? It's not illogical, just a bit weird and wanted to check it's not a typo 🙂
It's called like this in concat_kernels.cu
:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements();
where axis
is an ff_dim_t
I've left it as not implemented for now because I thought it was weird too.
lib/kernels/src/device.h
line 74 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
Was getting a linker error, but I realized it was for a different reason. This is removed and fixed in cuda_helper.cu
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 15 at r3 (raw file):
"pcg/tensor_guid_t.dtg.h", "pcg/optimizer_attrs.dtg.h", ]
Done.
lib/local-execution/include/local-execution/optimizer.h
line 16 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/optimizer.h
line 22 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/optimizer.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 22 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would these be better renamed
InitOpTaskImplFunction
and FwdBwdOpTaskImplFunction
, or do they make sense outside of an op context?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 4 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 41 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to proper doxygen docstring?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 49 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to proper doxygen docstring?
Done.
lib/local-execution/include/local-execution/task_signature.struct.toml
line 5 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not
"hash"
?
Done.
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
includes = [ "local-execution/tensor_guid_slot_spec.dtg.h", "utils/type_index.h",
This is fine, but I have to move the deleted part to the src_includes
since the fmt is defined in that file
lib/local-execution/include/local-execution/task_signature.struct.toml
line 12 at r3 (raw file):
"utils/type_index.h", "utils/optional.h" ]
This is fine, but I have to move the deleted part to the src_includes
since the fmt is defined in that file
lib/local-execution/src/local_slots_backing.cc
line 63 at r3 (raw file):
get_new_tensor_guids_for_layer_without_graph_insertion( cg, weight_layer, num_buffer_tensors); for (auto const &tensor_guid : buffer_tensors) {
Done.
lib/local-execution/src/local_slots_backing.cc
line 126 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It shouldn't, can you try again?
Done.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What's going on here with
weight_adjusted_idx
?
This is the support I had previously for getting the right weight tensor. But this will be superseded by the new weights/inputs reps PR, so this will get removed.
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Isn't this just
map_values
?
That's right. Fixed. Didn't even know that existed.
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
LOGIT
added twice? Does this mean pass both the gradient and the non-gradient into the function?
Yes.
lib/local-execution/src/loss_functions.cc
line 40 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Same here, why is
LOGIT
bound twice?
Same as above, the function needs both gradient and non-gradient tensor. It's the same as when we use bind_grad
for op tasks, but I didn't feel like that specification was necessary for non-op tasks.
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be shorted to just copy the
old_training_instance
?
Will be superseded by changes to the
and ModelTrainingInstance
abstraction
lib/local-execution/src/optimizer.cc
line 39 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bit more readable
Done.
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b}; } return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
Done.
lib/local-execution/src/optimizer.cc
line 189 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_signature(); }});
Done.
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is
buffer
usually the word for these?
I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses momentum_buffer
though I guess buffer itself is broad. I can make this grad_buffer_tensors
for more clarity.
lib/local-execution/src/optimizer.cc
line 211 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_task_impl(); }});
Done.
lib/local-execution/src/task_invocation.cc
line 46 at r2 (raw file):
bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) { // TODO: implement signature checking return true;
Done.
lib/op-attrs/include/op-attrs/ops/loss_functions.h
line 8 at r3 (raw file):
#include "loss_function.dtg.h" #include "other_loss_attrs.dtg.h" #include "sparse_categorical_ce_loss_attrs.dtg.h"
Done.
lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe
NonconfigurableLossAttrs
or NonparameterizedLossAttrs
? OtherLossAttrs
is just a bit ambiguous
Done.
lib/pcg/include/pcg/optimizer_attrs.h
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.
That's fair. Removed.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why change the signature of
backward
?
I thought local execution wouldn't deal with tensor_guid_t
Reviewed 49 of 49 files at r4, all commit messages.
Reviewable status: all files reviewed, 24 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
There are a couple of files in that folder that haven't been refactored/don't build:
metrics_kernels.cu
andembedding_kernels.cu
. I think maybe Dylan was going to tackle those?
Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
It's called like this in
concat_kernels.cu
:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements();
whereaxis
is anff_dim_t
I've left it as not implemented for now because I thought it was weird too.
Probably better to instead use the function that converts between an ff_dim_t
and a legion_dim_t
using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered
alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t)
overload
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is fine, but I have to move the deleted part to the
src_includes
since the fmt is defined in that file
That's good, the more things we can put in src_includes
instead of includes
the better for build times as things in src_includes
won't get propagated through #include
s as they're only in the .dtg.cc
file and not the .dtg.h
file
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is the support I had previously for getting the right weight tensor. But this will be superseded by the new weights/inputs reps PR, so this will get removed.
So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
That's right. Fixed. Didn't even know that existed.
🙂
lib/local-execution/src/local_slots_backing.cc
line 157 at r4 (raw file):
ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing( OpTaskBinding const &binding, layer_guid_t const &op_guid) const { return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {
Prefer explicit type names over auto
Suggestion:
return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {
lib/local-execution/src/local_slots_backing.cc
line 171 at r4 (raw file):
ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing( TaskBinding const &binding) const { return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {
Prefer explicit type names over auto
Suggestion:
return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Yes.
Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
Thanks--I usually tend to reserve the if (...) { ...; return ...; } ...; return ...;
pattern for cases like error handling etc. where the behavior is more "early termination" or (probably equivalently) where the else
case is very large
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses
momentum_buffer
though I guess buffer itself is broad. I can make this grad_buffer_tensors
for more clarity.
Sure, that sounds good. I have no issue with buffer
as long as it's standard terminology, which it sounds like it is
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I thought local execution won't deal with
tensor_guid_t
You mean parallel_tensor_guid_t
?
Reviewable status: 65 of 82 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back
Done.
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably better to instead use the function that converts between an
ff_dim_t
and a legion_dim_t
using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered
alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t)
overload
Done.
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can probably be replaced by
slice(DimOrdered, ...)
: https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9
Done.
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can be moved over to
dtgen
Done. Just added eq
for now, created an issue for hash
and fmt
#1503
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Either use
filter
or add acount_if
function tocontainers
or something like it
This will be superseded by pcg input/weight differentiation, since then we don't need to count inputs.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR
Now that PCG input/weight is merged, I can fix in this PR itself.
lib/local-execution/src/local_slots_backing.cc
line 157 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit type names over auto
Done. (It's OpArgSpec
here btw)
lib/local-execution/src/local_slots_backing.cc
line 171 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit type names over auto
Done.
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)
Functionally, a slot = a tensor, so I think it's ok to refer to multiple slots with the same name (the name is just a concept) as long as they have different specifications (e.g. grad and non-grad). Sometimes you don't want the grad version, so I don't think the wrapper function would be too helpful. I actually think the current interface is the most clear.
lib/local-execution/src/loss_functions.cc
line 64 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Label shape is an ArrayShape, so it shouldn't have any parallel dims, right?
I think you're right, yes.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You mean parallel_tensor_guid_t?
Yes.
Reviewable status: 54 of 83 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/local-execution/include/local-execution/local_training_backing.h
line 18 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this passed as mutable?
Done.
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why isn't the model part of ModelTrainingInstance? The members seem to have changed non-trivially since the old version:

struct ModelTrainingInstance {
  ComputationGraph computation_graph;
  Optimizer optimizer;
  EnableProfiling enable_profiling;
  TrainingPCG training_pcg;
  TensorMapping tensor_map;
  LegionBacking legion_backing;
};
It's been kind of reversed and disaggregated. You can think of the backing itself as the driver and the training instance as just the fields necessary for training (logit, label, loss, optimizer). This makes it a lot easier to test.
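The disaggregation described above (the backing as the driver, the training instance holding only the fields training needs) might look roughly like this. The field types here are illustrative stand-ins, not the repo's actual tensor_guid_t/LossAttrs/OptimizerAttrs definitions:

```cpp
#include <cassert>
#include <string>

// Illustrative stand-ins; the real field types live elsewhere in the repo.
using TensorGuid = int;
struct LossAttrs { std::string loss_function; };
struct OptimizerAttrs { double learning_rate; };

// The training instance carries only what training itself needs
// (logit, label, loss, optimizer); the backing acts as the driver.
struct ModelTrainingInstance {
  TensorGuid logit_tensor;
  TensorGuid label_tensor;
  LossAttrs loss;
  OptimizerAttrs optimizer;
};
```

Components that don't train (e.g. a cost estimator) can then construct the backing without ever building one of these.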
lib/local-execution/include/local-execution/task_signature.struct.toml
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not a map of slot_id_t -> something?
Done.
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This name seems a bit ambiguous? Maybe something clearer?
Hm, but it is literally a specification for a tensor. Like the tensor_guid and the grad will tell you exactly which tensor to access.
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it make more sense to disaggregate LocalTrainingBacking rather than having an optional field for the training instance?
I was thinking about this, but to me it makes sense to have the training backing be the driver and then the optional instance provides fields that are only necessary for training. For example, when I'm testing the local cost estimator, I don't need the loss/optimizer stuff.
lib/local-execution/src/local_slots_backing.cc
line 61 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not a fan of this, come up with some new tensor id type
Done. It's now a non_graph_tensor_guid_t and not handled by the computation graph.
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This will be superseded by PCG input/weight differentiation, since then we don't need to count inputs.
Done.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Now that PCG input/weight is merged, I can fix in this PR itself.
Done.
lib/local-execution/src/local_training_backing.cc
line 165 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not key this by the layer_guid_t of the WeightAttrs?
Done.
lib/local-execution/src/local_training_backing.cc
line 179 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm not convinced that the next() abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance stays the same.
Done.
lib/local-execution/src/model_training_instance.cc
line 9 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should probably get pulled out of the function at least. Also, does it make sense to actually modify the ModelTrainingInstance, or would we want a separate optimizer state object/field to hold these mutable values?
Done. Now, next only changes the optimizer attrs.
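To make the "next only changes the optimizer attrs" idea concrete, here is a minimal sketch of next() as a pure function over a hypothetical Adam-attrs struct (field names here are illustrative, not the repo's actual OptimizerAttrs layout):

```cpp
#include <cassert>
#include <cmath>

// Illustrative stand-in for Adam-style optimizer attrs; the real type
// in lib/local-execution has more members.
struct AdamAttrsStub {
  double alpha;   // base learning rate
  double beta1;
  double beta2;
  double beta1_t; // running product of beta1, advanced each step
  double beta2_t; // running product of beta2, advanced each step
  double alpha_t; // bias-corrected learning rate for the current step
};

// Sketch of next(): returns the attrs for the following step instead of
// mutating the whole ModelTrainingInstance.
AdamAttrsStub next(AdamAttrsStub const &a) {
  AdamAttrsStub result = a;
  result.beta1_t *= a.beta1;
  result.beta2_t *= a.beta2;
  result.alpha_t =
      a.alpha * std::sqrt(1.0 - result.beta2_t) / (1.0 - result.beta1_t);
  return result;
}
```

Because the function is pure, the per-step mutable state stays confined to the optimizer attrs, and the rest of the training instance can remain immutable across iterations.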
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Will be superseded by changes to the next and ModelTrainingInstance abstraction.
Done. Now, next only changes the optimizer attrs.
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves and then all this is need is the one integration test to make sure loss functions in general work with local execution.
It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries
I agree generally, but in this case (for loss function), the real test is just the CPU kernels one. It doesn't really make sense to test the backing without just running a full forward+backward pass (otherwise it's just like testing that an operator works).
lib/local-execution/test/src/test_update_e2e.cc
line at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well
Same principle here, but there's a lot more interaction with the mechanics of the local backing for update (different tensor storage location, access, etc.) that I can add tests for in the respective local slots backing / local task arg accessor tests. Working on that now.
lib/pcg/include/pcg/computation_graph.h
line 36 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove
Done.
Description of changes:
Implement end-to-end training loop in local execution.
Related Issues:
Linked Issues:
Issues closed by this PR: