
Training Schedule / Curriculum #17

@voegtlel

Description

Allow defining a training schedule, i.e. shifting the weights of datasets per step during training.

As an example visualization of this feature, the resulting sample distribution of batches along the training steps could look like this:

[Figure: example sample distribution of batches over training steps]

One example using blend could be:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        path: ds1
      - weight:
          linear:  # Maybe "linear" or "step"?
            0: 100  # At iteration 0 (i.e. 0 items yielded on each rank), the weight is 100
            100: 10  # At iteration 100, the weight is 10
            1000: 0  # At iteration 1000 (and onwards), the weight is 0
        path: ds2
```

Here, the weight specifies the ratio in which samples from ds1 and ds2 are drawn at a given point in time (for what "time" means here, see the discussion below).
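
For illustration, such a piecewise-linear schedule could be evaluated like this. This is a minimal sketch; `linear_weight` and its signature are assumptions for illustration, not the actual energon API:

```python
# Hypothetical sketch: evaluate a piecewise-linear weight schedule given as
# {iteration: weight} breakpoints, mirroring the "linear:" mapping above.
def linear_weight(points: dict[int, float], iteration: int) -> float:
    keys = sorted(points)
    if iteration <= keys[0]:
        return points[keys[0]]
    if iteration >= keys[-1]:
        return points[keys[-1]]  # Weight stays constant past the last breakpoint
    for lo, hi in zip(keys, keys[1:]):
        if lo <= iteration <= hi:
            frac = (iteration - lo) / (hi - lo)
            return points[lo] + frac * (points[hi] - points[lo])

schedule = {0: 100, 100: 10, 1000: 0}
assert linear_weight(schedule, 50) == 55.0   # Halfway between 100 and 10
assert linear_weight(schedule, 2000) == 0.0  # Clamped after iteration 1000
```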

Also for epochized_blend:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    epochized_blend:
      - repetitions: 1
        path: ds1
      - repetitions: 2
        weight:  # Combination with "weight" to fade in / out a dataset? The outer repetitions still hold (except when the weight becomes 0).
          linear:  # Maybe "linear" or "step"?
            0: 100  # At iteration 0 (i.e. 0 items yielded on each rank), the weight is 100
            100: 10  # At iteration 100, the weight is 10
            1000: 0  # At iteration 1000 (and onwards), the weight is 0
        path: ds2
```

Issues here: In plain epochized blend, the whole epoch should be filled proportionally with samples from ds1 and ds2. (There is already an issue with the end of an epoch, where only a single dataset may remain due to the random sampling, but that needs to be solved separately.) Two requirements have to be reconciled:

  • To be intuitive, a weight should be relative to the weights of the other datasets at the same step.
  • The total area under each weight curve must equal that dataset's total number of samples, i.e. each curve must be multiplied by a constant scaling factor relative to the other datasets.

These two requirements can contradict each other: e.g. what if both datasets reduce their weight at the same steps, but keep it high at all other steps? Thus the relative weight must be computed at each step to obtain the sampling ratio. Concretely, the following constraints should hold:

$$ \text{totallen} = \text{len}(\text{ds}_1) \cdot \text{repetitions}(\text{ds}_1) + \text{len}(\text{ds}_2) \cdot \text{repetitions}(\text{ds}_2) $$

$$ \forall\, \text{ds} \in \{ \text{ds}_1, \text{ds}_2 \}: \quad \sum_{i=0}^{\text{totallen}} k_{\text{ds}} \cdot \frac{\text{weight}(\text{ds}, i)}{\text{weight}(\text{ds}_1, i) + \text{weight}(\text{ds}_2, i)} = \text{len}(\text{ds}) \cdot \text{repetitions}(\text{ds}) $$

This system must be solved for the scaling constant $k_{\text{ds}}$ of each dataset.
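
One way to obtain these constants numerically would be a fixed-point iteration in the style of iterative proportional fitting. The following is a sketch under the assumption that the schedules are materialized as per-step weight arrays; none of these names exist in energon:

```python
import numpy as np

def solve_scaling(weights: np.ndarray, targets: np.ndarray, iters: int = 100) -> np.ndarray:
    """weights: (num_datasets, totallen) array of weight(ds, i);
    targets: (num_datasets,) desired totals len(ds) * repetitions(ds).
    Returns k_ds such that the expected per-dataset sample counts match targets.
    Note: the targets must sum to totallen, otherwise no solution exists."""
    k = np.ones(len(targets))
    for _ in range(iters):
        scaled = k[:, None] * weights
        # Sampling probability of each dataset at each step i.
        probs = scaled / scaled.sum(axis=0, keepdims=True)
        expected = probs.sum(axis=1)  # Expected number of samples per dataset
        k *= targets / expected       # Multiplicative correction toward the targets
    return k

# Example: ds1 with constant weight, ds2 fading out as in the YAML above.
totallen = 1200
w1 = np.full(totallen, 1.0)
w2 = np.interp(np.arange(totallen), [0, 100, 1000], [100.0, 10.0, 0.0]).clip(min=1e-9)
k = solve_scaling(np.stack([w1, w2]), targets=np.array([1000.0, 200.0]))
```

Whether such a fixed point always exists (e.g. when a weight is exactly 0 over a long range) would need to be checked; the clipping above is only a crude guard.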

An outer schedule would correspond to "concat", but with the samples inside each stage being randomized:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    sequential_schedule:  # Or just "schedule", just "sequential", or "curriculum"?
      # Does this need an option to stop iterating at the end of a stage? Otherwise, the shuffle buffer will mix stages.
      # Inside, we probably cannot handle "blend", but only "epochized_blend" or a dataset directly.
      # This is stage 1 of the training, until the repetitions are done.
      - epochized_blend:  # Blend the first part consisting of these datasets
          - repetitions: 1
            path: ds1
          - repetitions: 2
            path: ds2
      # This is stage 2 of the training, until the repetitions are done.
      - path: ds3
```
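
Conceptually, the sequential stage iteration could look like the following sketch (not the energon API; the stage objects are assumed to be finite iterables). A real implementation would additionally have to flush or restart the shuffle buffer at each stage boundary, per the comment in the YAML above:

```python
from itertools import chain
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def sequential_schedule(stages: Iterable[Iterable[T]]) -> Iterator[T]:
    """Exhaust each stage fully before starting the next one."""
    return chain.from_iterable(stages)

# Usage: stage 1 is the epochized blend of ds1/ds2, stage 2 is plain ds3.
# for sample in sequential_schedule([stage1_blend, stage2_ds3]): ...
```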

Discussion:

  • The schedule depends on the number of dataset iterations. This may not equal the number of gradient updates, e.g. with gradient accumulation. Should gradacc / steps_per_iter be made configurable?
  • Maybe use `type: linear` rather than the `linear:` and `step:` keys? This should be unified with the syntax of typical lr-schedulers.
