Commit acf222b

Merge pull request #96 from NVIDIA/develop

Release 5.3.0

2 parents a8d4894 + 486c0b3
File tree

109 files changed: +10605 -4713 lines

.github/actions/setupjust/action.yml (+16, new file)

@@ -0,0 +1,16 @@
+name: 'Setup just'
+author: 'Ross MacArthur'
+description: 'Install the just command runner'
+branding:
+  icon: 'play'
+  color: 'blue'
+inputs:
+  just-version:
+    description: 'A valid semver specifier of the just version to install'
+  github-token:
+    description: 'Github token to use to authenticate downloads'
+    required: false
+    default: ${{ github.token }}
+runs:
+  using: 'node20'
+  main: 'index.js'

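The workflows changed below consume this local action via `uses: ./.github/actions/setupjust`. As a minimal sketch (not part of this commit), a caller could also pin a release through the `just-version` input declared above; the version value here is purely illustrative:

```yaml
steps:
  - name: Checkout code
    uses: actions/checkout@v4

  - name: Install just
    uses: ./.github/actions/setupjust
    with:
      # A semver specifier for the `just-version` input; '1.25.2' is only an example.
      just-version: '1.25.2'
```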
.github/actions/setupjust/index.js (+9)

Some generated files are not rendered by default.

.github/workflows/documentation.yml (+14 -8)

@@ -20,21 +20,27 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - name: Checkout
+      - name: Checkout code
         uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-        with:
-          python-version: '3.10'
-      - name: Setup Pages
-        uses: actions/configure-pages@v5
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
+
       - name: Install dependencies
         run: |
-          pip install -U sphinx-rtd-theme sphinx sphinxcontrib-napoleon myst-parser sphinx-click
+          just dev-sync
+
       - name: Sphinx build
         run: |
-          sphinx-build -b html docs/source _site
+          just docs
+
       - name: Upload artifact
         uses: actions/upload-pages-artifact@v3
+        with:
+          path: docs/build
 
   # Deployment job
   deploy:

.github/workflows/isort.yml (-31)

This file was deleted.

.github/workflows/release.yml (+13 -9)

@@ -14,18 +14,22 @@ jobs:
     permissions:
       id-token: write # This permission is mandatory for trusted publishing
     steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.8'
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
+
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install build
+          just dev-sync
+
       - name: Build package
         run: |
-          python -m build -w
-          python -m build -s
+          just build
+
       - name: Publish package
         uses: pypa/gh-action-pypi-publish@release/v1

@@ -1,4 +1,4 @@
-name: black formatting
+name: ruff checks
 
 on:
   push:
@@ -10,22 +10,23 @@ on:
       - develop
 
 jobs:
-  black:
+  ruff:
     runs-on: ubuntu-latest
 
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
 
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.10'
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
 
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install black
+          just dev-sync
 
-      - name: Run Black
-        run: black --check .
+      - name: Check code
+        run: |
+          just check

.github/workflows/tests.yml (+21 -12)

@@ -12,16 +12,25 @@ on:
 jobs:
   unittest:
     runs-on: ubuntu-latest
+
     steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.9'
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          python -m pip install -e .[transforms]
-      - name: Run unit tests
-        run: |
-          python -m unittest discover -s tests
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install just
+        uses: ./.github/actions/setupjust
+
+      - name: Install minimum supported python version
+        run: |
+          uv python pin 3.9
+
+      - name: Install dependencies
+        run: |
+          just dev-sync
+
+      - name: Run unit tests
+        run: |
+          just test

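Across these workflows, the former pip/setup-python steps are replaced by `uv` plus `just` recipes: `dev-sync`, `docs`, `build`, `check`, and `test`. The recipe bodies are not visible in this commit view; the justfile below is only a sketch of what they plausibly map to, inferred from the commands they replace (every `uv` invocation here is an assumption):

```just
# Hypothetical justfile; the recipe names appear in the workflows above,
# the bodies are assumptions inferred from the commands they replaced.

# Sync the development environment (replaces the various `pip install` steps)
dev-sync:
    uv sync --all-extras

# Build the HTML documentation; the Pages workflow uploads docs/build
docs:
    uv run sphinx-build -b html docs/source docs/build

# Build wheel and sdist (replaces `python -m build -w` / `python -m build -s`)
build:
    uv build

# Lint checks (the workflow was renamed from "black formatting" to "ruff checks")
check:
    uv run ruff check .

# Run the unit tests (previously `python -m unittest discover -s tests`)
test:
    uv run python -m unittest discover -s tests
```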
README.md (+2 -4)

@@ -53,7 +53,7 @@ pip install git+https://github.com/NVIDIA/Megatron-Energon.git
 
 **NOTE**: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
 
-For more details on installing this package, see [here](https://nvidia.github.io/Megatron-Energon/installation.html).
+For more details on installing this package, see [here](https://nvidia.github.io/Megatron-Energon/intro/installation.html).
 
 ## Usage of command line tool
 
@@ -82,7 +82,7 @@ simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)
 
 train_ds = get_train_dataset(
     '/my/dataset/path',
-    batch_size=32,
+    batch_size=2,
     shuffle_buffer_size=None,
     max_samples_per_sequence=None,
     worker_config=simple_worker_config,
@@ -97,5 +97,3 @@ for batch in train_loader:
 ```
 
 For more details, read the [documentation](https://nvidia.github.io/Megatron-Energon/).
-
-Most likely, you'll need your own [task encoder](https://nvidia.github.io/Megatron-Energon/task_encoders.html).

docs/source/advanced/joining_datasets.md (+65 -2)

@@ -43,7 +43,7 @@ ds2
 ```
 
 In this example, `ds1/shard1.tar` and `ds2/shard1.tar` contain files with the same base names / keys.
-When joining, the files from both datasets are joined based on their keys, which must be in the same order. Each dataset must be [prepared](../basic/data_prep) (i.e. .nv-meta created).
+When joining, the files from both datasets are joined based on their keys, which must be unique within each dataset across all the shards. Each dataset must be [prepared](../basic/data_prep) (i.e. .nv-meta created) with a recent version of energon.
 
 
 ## Minimal Example
@@ -63,7 +63,69 @@ splits:
       __module__: megatron.energon
       __class__: CaptioningSample
 ```
-Note that the blending is omitted here, as it is optional. The joining is of course also optional.
+Note that the blending is omitted here, as it is optional. Joining is of course also optional.
+
+```{warning}
+All metadatasets that contain a `join` must be prepared with the `energon prepare` command.
+This will compute the join index and store it next to the metadataset in a folder with a similar name.
+```
+
+
+## Join Modes
+
+When joining two datasets, it may happen that the first dataset (the primary dataset) has more or fewer samples than the secondary dataset(s).
+In this case, we have to decide how to handle the samples that do not match.
+
+The primary dataset always serves as the reference, and the join result will never contain more samples than the primary dataset. However, if a primary sample has no match in a secondary dataset, it may be skipped as explained below.
+
+For each of the secondary datasets, the user can specify a `nonmatch` setting.
+With one of the following options, the user can decide what happens if a sample from the primary dataset is not found in the given secondary dataset:
+
+* `error` (default): An error is raised
+* `skip`: The whole sample is skipped
+* `none`: The column for the current secondary dataset is filled with `None` if there's no match
+
+Example `metadataset_nomatch.yaml`:
+
+```yaml
+__module__: megatron.energon
+__class__: MetadatasetV2
+splits:
+  train:
+    join:
+      - path: ds1
+      - path: ds2
+        nonmatch: skip
+      - path: ds3
+        nonmatch: none
+    joiner:
+      __module__: megatron.energon
+      __class__: CaptioningSample
+```
+
+To illustrate the effect, let's look at some example data:
+
+* `ds1` samples: `s1`, `s2`, `s3`, `s5`, `s6`
+* `ds2` samples: `s1`, `s3`, `s4`, `s6`, `s7`
+* `ds3` samples: `s1`, `s2`, `s3`, `s100`
+
+The resulting joined data would contain the following samples, one in each row:
+
+| ds1 | ds2 | ds3  |
+| --- | --- | ---- |
+| s1  | s1  | s1   |
+| s3  | s3  | s3   |
+| s6  | s6  | None |
+
+Explanation:
+
+* The sample key `s1` is available in all datasets.
+* `s2` is missing from `ds2` and `nonmatch` is set to `skip`, so the sample will not appear in the result.
+* `s3` is available in all datasets.
+* `s4` is not in the primary dataset. Only samples from the primary dataset will be included.
+* `s5` is missing from `ds2` again, and this time also from `ds3`, so it is skipped as well.
+* `s6` is missing from `ds3`, and `ds3` has `nonmatch` set to `none`, so the sample is not skipped, but the column for `ds3` is set to `None`.
+
 
 ## Extensive Example
 
@@ -72,6 +134,7 @@ Here is a more extensive example that shows multiple things at once:
 * Joining can be used inside blending
 * The datasets to be joined can have custom subflavors or dataset yamls specified
 * A custom "joiner" can be specified to define how samples are joined and what the resulting type is
+* The `nonmatch` setting is not included here, but would work just like shown above
 
 `metadataset_extended.yaml`:
 ```yaml

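As a hedged illustration (not part of this commit), the `metadataset_nomatch.yaml` from the new Join Modes section could be consumed like any other dataset path, following the `get_train_dataset` call shape shown in this commit's README; with `nonmatch: none`, the fields joined from `ds3` may be `None` and should be handled downstream:

```python
from megatron.energon import WorkerConfig, get_train_dataset

worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

# Per the new warning in the docs, a metadataset containing a `join`
# must first be prepared, e.g.: energon prepare metadataset_nomatch.yaml
train_ds = get_train_dataset(
    'metadataset_nomatch.yaml',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=worker_config,
)
```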
docs/source/advanced/remote_dataset.md (+17 -22)

@@ -3,44 +3,39 @@ SPDX-License-Identifier: BSD-3-Clause -->
 
 # Remote Dataset
 
-Since version 2.0.0, Megatron Energon supports the use of remote datasets that are stored in an object store bucket with an S3-compatible interface.
-This means you can train or validate with your data right from that object store by simply swapping the dataset path for a so-called _rclone URL_.
-
-```{admonition} Note
-:class: important
-We are planning to migrate to a more advanced backend for remote datasets. Please contact us if you are planning to use this feature.
-```
+Since version 2.0.0, Megatron Energon supports the use of remote datasets. Since version >5.2.0, Energon file access is based on the [Multi Storage Client (MSC)](https://github.com/NVIDIA/multi-storage-client).
+This means you can train or validate with your data right from any storage by simply swapping the dataset path for a so-called _MSC URL_.
 
 ## Prerequisites
 
-To use this feature, you need to set up an [Rclone](https://rclone.org/) configuration. Rclone is an open source software to manage files on cloud storage. While Energon does not depend on the Rclone software itself, we rely on the same configuration mechanism.
+To use a remote dataset, install one or more of the extras:
+* `s3`
+* `aistore`
+* `azure-blob-storage`
+* `google-cloud-storage`
+* `oci`
 
-So if you don't like to install or use Rclone, that's fine, but you will need to set up a config file that is compatible. We still recommend using Rclone, since it's a great tool.
+like this:
+```sh
+pip install megatron-energon[s3,oci]
+```
 
-Once you set up your config at `~/.config/rclone/rclone.conf`, it may look like this:
+Set up the MSC config as described in the [Multi Storage Client documentation](https://nvidia.github.io/multi-storage-client/).
 
-```
-[coolstore]
-type = s3
-provider = Other
-access_key_id = MY_ACCESS_KEY_ID
-secret_access_key = MY_SECRET_ACCESS_KEY
-region = us-east-1
-endpoint = pdx.s8k.io
-```
+You can also use an rclone config with MSC, as was described prior to 5.2.0.
 
 ## The URL syntax
 
 The syntax is as simple as
 
 ```
-rclone://RCLONE_NAME/BUCKET/PATH
+msc://CONFIG_NAME/PATH
 ```
 
 For example:
 
 ```
-rclone://coolstore/mainbucket/datasets/somedata
+msc://coolstore/mainbucket/datasets/somedata
 ```
 
 You can use this URL instead of paths to datasets in
@@ -53,7 +48,7 @@ Example usage:
 
 ```python
 ds = get_train_dataset(
-    'rclone://coolstore/mainbucket/datasets/somedata',
+    'msc://coolstore/mainbucket/datasets/somedata',
     batch_size=1,
     shuffle_buffer_size=100,
     max_samples_per_sequence=100,

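For orientation, MSC profiles are defined in a YAML config file. The sketch below assumes an S3-compatible store; the profile name `coolstore` mirrors the URL example above, while the exact schema and option names should be verified against the Multi Storage Client documentation linked in the diff:

```yaml
# Hypothetical MSC config (e.g. ~/.msc_config.yaml); the schema and all
# values are assumptions to check against the MSC documentation.
profiles:
  coolstore:
    storage_provider:
      type: s3
      options:
        base_path: mainbucket  # placeholder bucket name
```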