
Conversation

@hanaol (Collaborator) commented Dec 11, 2025

Motivation

The current training pipeline assumes either

  • uniformly sized data across all samples in a batch, or
  • single-sample processing when dimensions vary

This restricts the ability to work with datasets that naturally contain heterogeneous shapes (e.g., grids at different resolutions), forcing users to manually pad or preprocess data.

Solution

This PR introduces support for training on heterogeneous batches by:

  • Implementing a dynamic collate mechanism that groups and prepares samples of varying sizes at runtime.
  • Updating the dataloader and relevant training components (e.g., model forward pass or loss function, where applicable) to handle variable-shaped inputs safely.
  • Ensuring training remains stable and consistent even when batch elements differ in their spatial dimensions.

These updates allow the training loop to handle variable-sized input data seamlessly, reducing the need for intrusive or manual preprocessing.
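
As a rough sketch of how these pieces can fit together (the fallback's return structure and the DataLoader wiring are illustrative assumptions, and the import assumes a PyTorch version that exports default_collate publicly; the actual snippets are discussed in the review comments below):

import torch
from torch.utils.data import DataLoader, default_collate

def collate_fn(batch):
    """Stack uniformly shaped samples; otherwise return lists of per-sample tensors."""
    try:
        # Fast path: every sample in the batch has the same shape.
        return default_collate(batch)
    except Exception:
        # Shapes differ, so stacking fails: keep each sample as its own tensor.
        # (The breadth of this except clause is discussed in the review below.)
        x, y = zip(*batch)
        return list(x), list(y)

# Toy example: two grids at different resolutions in the same batch.
samples = [(torch.zeros(1, 32, 32), torch.zeros(1, 32, 32)),
           (torch.zeros(1, 48, 48), torch.zeros(1, 48, 48))]
loader = DataLoader(samples, batch_size=2, collate_fn=collate_fn)
x, y = next(iter(loader))  # x and y are lists of differently sized tensors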

Notes

Handling heterogeneous batches may introduce additional overhead compared to fully vectorized uniform-data training. However, enabling this flexibility is valuable for datasets where variable-sized samples are inherent to the problem rather than avoidable.

@hanaol requested a review from forklady42 on Dec 11, 2025 at 20:53
def collate_fn(batch):
    try:
        return default_collate(batch)
    except Exception:

Collaborator:

This catches all possible exceptions, which is very broad. Some of them could be real errors that should propagate. Is there a particular class of exceptions that you want to catch here?
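
For illustration (not the PR's code): default_collate fails with a RuntimeError when torch.stack cannot stack tensors of unequal size, so catching that narrower class would let unrelated errors surface, assuming the size mismatch is the only case the fallback is meant to handle:

from torch.utils.data import default_collate

def collate_fn(batch):
    try:
        return default_collate(batch)
    except RuntimeError:
        # Raised by torch.stack inside default_collate when sample shapes
        # differ; anything else (e.g. a TypeError from an unsupported
        # element type) still propagates.
        x, y = zip(*batch)
        return list(x), list(y)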

out1 = self.conv1(x)
out = self.res_blocks(out1)
out2 = self.conv2(out)
out = torch.add(out1, out2)

Collaborator:

Out of curiosity, how is this relevant to the variable-sized batch changes?

    loss = torch.stack(losses).mean()
else:
    pred = self(x)
    loss = self.loss_fn(pred, y)

Collaborator:

This logic is a duplicate of that in training_step above. It would be good to create a separate function, e.g. _loss_calculation(), that each of these functions calls. That way, the next time we update this code, we won't accidentally update one and miss the other, causing them to drift.
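
Something along these lines, as a sketch (the isinstance check for spotting a list-collated batch is an assumption about how the fallback batches arrive, and torch is assumed to be imported in the module):

def _loss_calculation(self, x, y):
    """Shared by training_step and validation_step so the two paths stay in sync."""
    if isinstance(x, (list, tuple)):
        # Heterogeneous batch from the fallback collate: score each sample
        # individually and average the per-sample losses.
        losses = [self.loss_fn(self(xi), yi) for xi, yi in zip(x, y)]
        return torch.stack(losses).mean()
    # Homogeneous batch: one vectorized forward pass.
    pred = self(x)
    return self.loss_fn(pred, y)

Both training_step and validation_step could then reduce to loss = self._loss_calculation(x, y).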

    return default_collate(batch)
except Exception:
    # Separate and return as lists of tensors
    x, y = zip(*batch, strict=False)

Collaborator:

Setting strict=False means that data will be silently dropped if x and y somehow end up with different lengths. Can you say more about why this is safe and preferable to strict=True?
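
For reference, this is roughly the difference on mismatched lengths (the strict= keyword requires Python 3.10+):

pairs = zip([1, 2, 3], ["a", "b"], strict=False)
print(list(pairs))   # [(1, 'a'), (2, 'b')]: the extra element 3 is silently dropped

pairs = zip([1, 2, 3], ["a", "b"], strict=True)
print(list(pairs))   # raises ValueError because the iterables have different lengths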

from src.electrai.dataloader.registry import get_data
from src.electrai.lightning import LightningGenerator
from torch.utils.data import DataLoader
from torch.utils.data._utils.collate import default_collate

Collaborator:

What's the motivation for importing default_collate from torch.utils.data._utils.collate rather than torch.utils.data?
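
For reference, recent PyTorch releases (1.11 and later, if I remember correctly) re-export it publicly, so the private path shouldn't be needed:

from torch.utils.data import default_collate  # public API, same function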
