Commit acf222b

Merge pull request #96 from NVIDIA/develop

Release 5.3.0

2 parents a8d4894 + 486c0b3
File tree

109 files changed: +10605 -4713 lines

.github/actions/setupjust/action.yml (+16, new file)

@@ -0,0 +1,16 @@
+name: 'Setup just'
+author: 'Ross MacArthur'
+description: 'Install the just command runner'
+branding:
+  icon: 'play'
+  color: 'blue'
+inputs:
+  just-version:
+    description: 'A valid semver specifier of the just version to install'
+  github-token:
+    description: 'Github token to use to authenticate downloads'
+    required: false
+    default: ${{ github.token }}
+runs:
+  using: 'node20'
+  main: 'index.js'

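The workflows changed below consume this local action via `uses: ./.github/actions/setupjust`. As a minimal sketch (not part of this commit), a caller could also pin a release through the `just-version` input declared above; the version value here is purely illustrative:

```yaml
steps:
  - name: Checkout code
    uses: actions/checkout@v4

  - name: Install just
    uses: ./.github/actions/setupjust
    with:
      # A semver specifier for the `just-version` input; '1.25.2' is only an example.
      just-version: '1.25.2'
```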
.github/actions/setupjust/index.js (+9)

Some generated files are not rendered by default.

.github/workflows/documentation.yml (+14 -8)

@@ -20,21 +20,27 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - name: Checkout
+      - name: Checkout code
         uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-        with:
-          python-version: '3.10'
-      - name: Setup Pages
-        uses: actions/configure-pages@v5
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
+
       - name: Install dependencies
         run: |
-          pip install -U sphinx-rtd-theme sphinx sphinxcontrib-napoleon myst-parser sphinx-click
+          just dev-sync
+
       - name: Sphinx build
         run: |
-          sphinx-build -b html docs/source _site
+          just docs
+
       - name: Upload artifact
         uses: actions/upload-pages-artifact@v3
+        with:
+          path: docs/build
 
   # Deployment job
   deploy:

.github/workflows/isort.yml (-31)

This file was deleted.

.github/workflows/release.yml (+13 -9)

@@ -14,18 +14,22 @@ jobs:
     permissions:
       id-token: write # This permission is mandatory for trusted publishing
     steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.8'
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
+
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install build
+          just dev-sync
+
       - name: Build package
         run: |
-          python -m build -w
-          python -m build -s
+          just build
+
       - name: Publish package
         uses: pypa/gh-action-pypi-publish@release/v1

@@ -1,4 +1,4 @@
-name: black formatting
+name: ruff checks
 
 on:
   push:
@@ -10,22 +10,23 @@ on:
       - develop
 
 jobs:
-  black:
+  ruff:
     runs-on: ubuntu-latest
 
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
 
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.10'
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
 
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install black
+          just dev-sync
 
-      - name: Run Black
-        run: black --check .
+      - name: Check code
+        run: |
+          just check

.github/workflows/tests.yml (+21 -12)

@@ -12,16 +12,25 @@ on:
 jobs:
   unittest:
     runs-on: ubuntu-latest
+
     steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.9'
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          python -m pip install -e .[transforms]
-      - name: Run unit tests
-        run: |
-          python -m unittest discover -s tests
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
+
+      - name: Install minimum supported python version
+        run: |
+          uv python pin 3.9
+
+      - name: Install dependencies
+        run: |
+          just dev-sync
+
+      - name: Run unit tests
+        run: |
+          just test

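Across these workflows, the former pip/setup-python steps are replaced by `uv` plus `just` recipes: `dev-sync`, `docs`, `build`, `check`, and `test`. The recipe bodies are not visible in this commit view; the justfile below is only a sketch of what they plausibly map to, inferred from the commands they replace (every `uv` invocation here is an assumption):

```just
# Hypothetical justfile; the recipe names appear in the workflows above,
# the bodies are assumptions inferred from the commands they replaced.

# Sync the development environment (replaces the various `pip install` steps)
dev-sync:
    uv sync --all-extras

# Build the HTML documentation; the Pages workflow uploads docs/build
docs:
    uv run sphinx-build -b html docs/source docs/build

# Build wheel and sdist (replaces `python -m build -w` / `python -m build -s`)
build:
    uv build

# Lint checks (the workflow was renamed from "black formatting" to "ruff checks")
check:
    uv run ruff check .

# Run the unit tests (previously `python -m unittest discover -s tests`)
test:
    uv run python -m unittest discover -s tests
```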
README.md (+2 -4)

@@ -53,7 +53,7 @@ pip install git+https://github.com/NVIDIA/Megatron-Energon.git
 
 **NOTE**: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
 
-For more details on installing this package, see [here](https://nvidia.github.io/Megatron-Energon/installation.html).
+For more details on installing this package, see [here](https://nvidia.github.io/Megatron-Energon/intro/installation.html).
 
 ## Usage of command line tool
 
@@ -82,7 +82,7 @@ simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)
 
 train_ds = get_train_dataset(
     '/my/dataset/path',
-    batch_size=32,
+    batch_size=2,
     shuffle_buffer_size=None,
     max_samples_per_sequence=None,
     worker_config=simple_worker_config,
@@ -97,5 +97,3 @@ for batch in train_loader:
 ```
 
 For more details, read the [documentation](https://nvidia.github.io/Megatron-Energon/).
-
-Most likely, you'll need your own [task encoder](https://nvidia.github.io/Megatron-Energon/task_encoders.html).

docs/source/advanced/joining_datasets.md (+65 -2)

@@ -43,7 +43,7 @@ ds2
 ```
 
 In this example, `ds1/shard1.tar` and `ds2/shard1.tar` contain files with the same base names / keys.
-When joining, the files from both datasets are joined based on their keys, which must be in the same order. Each dataset must be [prepared](../basic/data_prep) (i.e. .nv-meta created).
+When joining, the files from both datasets are joined based on their keys, which must be unique within each dataset across all the shards. Each dataset must be [prepared](../basic/data_prep) (i.e. .nv-meta created) with a recent version of energon.
 
 
 ## Minimal Example
@@ -63,7 +63,69 @@ splits:
       __module__: megatron.energon
       __class__: CaptioningSample
 ```
-Note that the blending is omitted here, as it is optional. The joining is of course also optional.
+Note that the blending is omitted here, as it is optional. Joining is of course also optional.
+
+```{warning}
+All metadatasets that contain a `join` must be prepared with the `energon prepare` command.
+This will compute the join index and store it next to the metadataset in a folder with a similar name.
+```
+
+
+## Join Modes
+
+When joining two datasets, it may happen that the first dataset (the primary dataset) has more or fewer samples than the secondary dataset(s).
+In this case, we have to decide how to handle the samples that do not match.
+
+The primary dataset always serves as the reference, and the join result will never contain more samples than the primary dataset. However, if a primary sample has no match in a secondary dataset, it may be skipped as explained below.
+
+For each of the secondary datasets, the user can specify a `nonmatch` setting.
+With one of the following options, the user can decide what happens if a sample from the primary dataset is not found in the given secondary dataset:
+
+* `error` (default): An error is raised
+* `skip`: The whole sample is skipped
+* `none`: The column for the current secondary dataset is filled with `None` if there's no match
+
+Example `metadataset_nomatch.yaml`:
+
+```yaml
+__module__: megatron.energon
+__class__: MetadatasetV2
+splits:
+  train:
+    join:
+      - path: ds1
+      - path: ds2
+        nonmatch: skip
+      - path: ds3
+        nonmatch: none
+    joiner:
+      __module__: megatron.energon
+      __class__: CaptioningSample
+```
+
+To illustrate the effect, let's look at some example data:
+
+* `ds1` samples: `s1`, `s2`, `s3`, `s5`, `s6`
+* `ds2` samples: `s1`, `s3`, `s4`, `s6`, `s7`
+* `ds3` samples: `s1`, `s2`, `s3`, `s100`
+
+The resulting joined data would contain the following samples, one in each row:
+
+| ds1 | ds2 | ds3  |
+| --- | --- | ---- |
+| s1  | s1  | s1   |
+| s3  | s3  | s3   |
+| s6  | s6  | None |
+
+Explanation:
+
+* The sample key `s1` is available in all datasets.
+* `s2` is missing from `ds2` and `nonmatch` is set to `skip`, so the sample will not appear in the result.
+* `s3` is available in all datasets.
+* `s4` is not in the primary dataset. Only samples from the primary dataset will be included.
+* `s5` is missing from `ds2` again, and this time also from `ds3`, so it is skipped as well.
+* `s6` is missing from `ds3`, and `ds3` has `nonmatch` set to `none`, so the sample is not skipped, but the column for `ds3` is set to `None`.
+
 
 ## Extensive Example
 
@@ -72,6 +134,7 @@ Here is a more extensive example that shows multiple things at once:
 * Joining can be used inside blending
 * The datasets to be joined can have custom subflavors or dataset yamls specified
 * A custom "joiner" can be specified to define how samples are joined and what the resulting type is
+* The `nonmatch` setting is not included here, but would work just like shown above
 
 `metadataset_extended.yaml`:
 ```yaml

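As a hedged illustration (not part of this commit), the `metadataset_nomatch.yaml` from the new Join Modes section could be consumed like any other dataset path, following the `get_train_dataset` call shape shown in this commit's README; with `nonmatch: none`, the fields joined from `ds3` may be `None` and should be handled downstream:

```python
from megatron.energon import WorkerConfig, get_train_dataset

worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

# Per the new warning in the docs, a metadataset containing a `join`
# must first be prepared, e.g.: energon prepare metadataset_nomatch.yaml
train_ds = get_train_dataset(
    'metadataset_nomatch.yaml',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=worker_config,
)
```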
docs/source/advanced/remote_dataset.md (+17 -22)

@@ -3,44 +3,39 @@ SPDX-License-Identifier: BSD-3-Clause -->
 
 # Remote Dataset
 
-Since version 2.0.0, Megatron Energon supports the use of remote datasets that are stored in an object store bucket with an S3-compatible interface.
-This means you can train or validate with your data right from that object store by simply swapping the dataset path for a so-called _rclone URL_.
-
-```{admonition} Note
-:class: important
-We are planning to migrate to a more advanced backend for remote datasets. Please contact us if you are planning to use this feature.
-```
+Since version 2.0.0, Megatron Energon supports the use of remote datasets. Since version >5.2.0, Energon file access is based on the [Multi Storage Client (MSC)](https://github.com/NVIDIA/multi-storage-client).
+This means you can train or validate with your data right from any storage by simply swapping the dataset path for a so-called _MSC URL_.
 
 ## Prerequisites
 
-To use this feature, you need to set up an [Rclone](https://rclone.org/) configuration. Rclone is an open source software to manage files on cloud storage. While Energon does not depend on the Rclone software itself, we rely on the same configuration mechanism.
+To use a remote dataset, install one or more of the extras:
+* `s3`
+* `aistore`
+* `azure-blob-storage`
+* `google-cloud-storage`
+* `oci`
 
-So if you don't like to install or use Rclone, that's fine, but you will need to set up a config file that is compatible. We still recommend using Rclone, since it's a great tool.
+like this:
+```sh
+pip install megatron-energon[s3,oci]
+```
 
-Once you set up your config at `~/.config/rclone/rclone.conf`, it may look like this:
+Set up the MSC config as described in the [Multi Storage Client documentation](https://nvidia.github.io/multi-storage-client/).
 
-```
-[coolstore]
-type = s3
-provider = Other
-access_key_id = MY_ACCESS_KEY_ID
-secret_access_key = MY_SECRET_ACCESS_KEY
-region = us-east-1
-endpoint = pdx.s8k.io
-```
+You can also use an rclone config with MSC, as was described prior to 5.2.0.
 
 ## The URL syntax
 
 The syntax is as simple as
 
 ```
-rclone://RCLONE_NAME/BUCKET/PATH
+msc://CONFIG_NAME/PATH
 ```
 
 For example:
 
 ```
-rclone://coolstore/mainbucket/datasets/somedata
+msc://coolstore/mainbucket/datasets/somedata
 ```
 
 You can use this URL instead of paths to datasets in
@@ -53,7 +48,7 @@ Example usage:
 
 ```python
 ds = get_train_dataset(
-    'rclone://coolstore/mainbucket/datasets/somedata',
+    'msc://coolstore/mainbucket/datasets/somedata',
     batch_size=1,
     shuffle_buffer_size=100,
     max_samples_per_sequence=100,

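For orientation, MSC profiles are defined in a YAML config file. The sketch below assumes an S3-compatible store; the profile name `coolstore` mirrors the URL example above, while the exact schema and option names should be verified against the Multi Storage Client documentation linked in the diff:

```yaml
# Hypothetical MSC config (e.g. ~/.msc_config.yaml); the schema and all
# values are assumptions to check against the MSC documentation.
profiles:
  coolstore:
    storage_provider:
      type: s3
      options:
        base_path: mainbucket  # placeholder bucket name
```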