
Training Schedule / Curriculum #17

@voegtlel

Description

Allow defining a training schedule, i.e. shifting the weights of datasets per step during training.

As an example visualization of this feature, the resulting sample distribution of batches along the training steps could look like this:

[Figure: example sample distribution of batches over training steps]

One example using blend could be:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        path: ds1
      - weight:
          linear:  # Maybe "linear" or "step"?
            0: 100  # At iteration 0 (i.e. 0 items yielded on each rank), the weight is 100
            100: 10  # At iteration 100, the weight is 10
            1000: 0  # At iteration 1000 (and onwards), the weight is 0
        path: ds2
```

Here, the weight specifies the ratio in which samples from ds1 and ds2 are drawn at a given point in time (for what "time" means here, see the discussion below).
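
For illustration, such a piecewise-linear schedule could be evaluated like this. This is a minimal sketch; `linear_weight` and its signature are assumptions for illustration, not the actual energon API:

```python
# Hypothetical sketch: evaluate a piecewise-linear weight schedule given as
# {iteration: weight} breakpoints, mirroring the "linear:" mapping above.
def linear_weight(points: dict[int, float], iteration: int) -> float:
    keys = sorted(points)
    if iteration <= keys[0]:
        return points[keys[0]]
    if iteration >= keys[-1]:
        return points[keys[-1]]  # Weight stays constant past the last breakpoint
    for lo, hi in zip(keys, keys[1:]):
        if lo <= iteration <= hi:
            frac = (iteration - lo) / (hi - lo)
            return points[lo] + frac * (points[hi] - points[lo])

schedule = {0: 100, 100: 10, 1000: 0}
assert linear_weight(schedule, 50) == 55.0   # Halfway between 100 and 10
assert linear_weight(schedule, 2000) == 0.0  # Clamped after iteration 1000
```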

Also for epochized_blend:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    epochized_blend:
      - repetitions: 1
        path: ds1
      - repetitions: 2
        weight:  # Combination with "weight" to fade in / out a dataset? The outer repetitions still hold (except when the weight becomes 0).
          linear:  # Maybe "linear" or "step"?
            0: 100  # At iteration 0 (i.e. 0 items yielded on each rank), the weight is 100
            100: 10  # At iteration 100, the weight is 10
            1000: 0  # At iteration 1000 (and onwards), the weight is 0
        path: ds2
```

Issues here: In plain epochized blend, the whole epoch should be filled proportionally with samples from ds1 and ds2. (There is already an issue with the end of an epoch, where only a single dataset may remain due to the random sampling, but that needs to be solved separately.) Two requirements have to be reconciled:

  • To be intuitive, a weight should be relative to the weights of the other datasets at the same step.
  • The total area under each weight curve must equal that dataset's total number of samples, i.e. each curve must be multiplied by a constant scaling factor relative to the other datasets.

These two requirements can contradict each other: e.g. what if both datasets reduce their weight at the same steps, but keep it high at all other steps? Thus the relative weight must be computed at each step to obtain the sampling ratio. Concretely, the following constraints should hold:

$$ \text{totallen} = \text{len}(\text{ds}_1) \cdot \text{repetitions}(\text{ds}_1) + \text{len}(\text{ds}_2) \cdot \text{repetitions}(\text{ds}_2) $$

$$ \forall\, \text{ds} \in \{ \text{ds}_1, \text{ds}_2 \}: \quad \sum_{i=0}^{\text{totallen}} k_{\text{ds}} \cdot \frac{\text{weight}(\text{ds}, i)}{\text{weight}(\text{ds}_1, i) + \text{weight}(\text{ds}_2, i)} = \text{len}(\text{ds}) \cdot \text{repetitions}(\text{ds}) $$

This system must be solved for the scaling constant $k_{\text{ds}}$ of each dataset.
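
One way to obtain these constants numerically would be a fixed-point iteration in the style of iterative proportional fitting. The following is a sketch under the assumption that the schedules are materialized as per-step weight arrays; none of these names exist in energon:

```python
import numpy as np

def solve_scaling(weights: np.ndarray, targets: np.ndarray, iters: int = 100) -> np.ndarray:
    """weights: (num_datasets, totallen) array of weight(ds, i);
    targets: (num_datasets,) desired totals len(ds) * repetitions(ds).
    Returns k_ds such that the expected per-dataset sample counts match targets.
    Note: the targets must sum to totallen, otherwise no solution exists."""
    k = np.ones(len(targets))
    for _ in range(iters):
        scaled = k[:, None] * weights
        # Sampling probability of each dataset at each step i.
        probs = scaled / scaled.sum(axis=0, keepdims=True)
        expected = probs.sum(axis=1)  # Expected number of samples per dataset
        k *= targets / expected       # Multiplicative correction toward the targets
    return k

# Example: ds1 with constant weight, ds2 fading out as in the YAML above.
totallen = 1200
w1 = np.full(totallen, 1.0)
w2 = np.interp(np.arange(totallen), [0, 100, 1000], [100.0, 10.0, 0.0]).clip(min=1e-9)
k = solve_scaling(np.stack([w1, w2]), targets=np.array([1000.0, 200.0]))
```

Whether such a fixed point always exists (e.g. when a weight is exactly 0 over a long range) would need to be checked; the clipping above is only a crude guard.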

An outer schedule would correspond to "concat", but with the samples inside each stage being randomized:

```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    sequential_schedule:  # Or just "schedule", just "sequential", or "curriculum"?
      # Does this need an option to stop iterating at the end of a stage? Otherwise, the shuffle buffer will mix stages.
      # Inside, we probably cannot handle "blend", but only "epochized_blend" or a dataset directly.
      # This is stage 1 of the training, until the repetitions are done.
      - epochized_blend:  # Blend the first part consisting of these datasets
          - repetitions: 1
            path: ds1
          - repetitions: 2
            path: ds2
      # This is stage 2 of the training, until the repetitions are done.
      - path: ds3
```
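
Conceptually, the sequential stage iteration could look like the following sketch (not the energon API; the stage objects are assumed to be finite iterables). A real implementation would additionally have to flush or restart the shuffle buffer at each stage boundary, per the comment in the YAML above:

```python
from itertools import chain
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def sequential_schedule(stages: Iterable[Iterable[T]]) -> Iterator[T]:
    """Exhaust each stage fully before starting the next one."""
    return chain.from_iterable(stages)

# Usage: stage 1 is the epochized blend of ds1/ds2, stage 2 is plain ds3.
# for sample in sequential_schedule([stage1_blend, stage2_ds3]): ...
```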

Discussion:

  • The schedule depends on the number of dataset iterations. This may not equal the number of gradient updates, e.g. with gradient accumulation. Should gradacc / steps_per_iter be made configurable?
  • Maybe use `type: linear` rather than the `linear:` and `step:` keys? This should be unified with the syntax of typical lr-schedulers.
