Conversation
tests/test_attributor.py
Outdated
    cfg = IndexConfig(run_path=str(tmp_path))
    cfg.skip_preconditioners = True
    ...
    kwargs = {
If you can't see the keys for the input args to functions in your IDE, something is broken; they should show up in grey.
    assert result.returncode == 0
    # Print the output to see what's failing
    if result.returncode != 0:
I fixed this in the other query PR; I think it's a bit different from this fix. Basically, returncode is 0 as long as the subprocess itself ran successfully, so you parse the output text unconditionally.
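A minimal sketch of that testing pattern (the child script here is a made-up stand-in, not one of the project's real scripts): the subprocess exits 0 whenever it runs to completion, so the test inspects stdout unconditionally instead of gating the parse on returncode.

```python
import subprocess
import sys

# Child process that "fails" logically but still exits cleanly, so
# returncode alone cannot tell us whether the run succeeded.
result = subprocess.run(
    [sys.executable, "-c", "print('loss=1.25'); print('status=failed')"],
    capture_output=True,
    text=True,
)
assert result.returncode == 0  # the subprocess itself was happy

# Parse the output text unconditionally to detect logical failures.
fields = dict(line.split("=") for line in result.stdout.splitlines())
failed = fields.get("status") == "failed"
```

Here `failed` ends up True even though `returncode` is 0, which is exactly why checking the exit code alone is not enough.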
tests/test_build.py
Outdated
    cfg = IndexConfig(run_path=str(tmp_path))
    # This build hangs in pytest with preconditioners enabled.
    # It works when run directly so it may be a pytest issue.
    cfg.skip_preconditioners = True
tests/test_build.py
Outdated
    # This build hangs in pytest with preconditioners enabled.
    # It works when run directly so it may be a pytest issue.
    cfg.skip_preconditioners = True
    kwargs = {
kwargs are undesirable when not strictly necessary because they add a layer of indirection. If the SWE we're working with wants this, I'm open to other lines of reasoning.
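To illustrate the point, a small sketch (build_index here is a hypothetical stand-in, not bergson's real signature) contrasting a kwargs dict with explicit keyword arguments at the call site:

```python
# Hypothetical stand-in for the function under test; not bergson's real API.
def build_index(run_path, skip_preconditioners=False):
    return run_path, skip_preconditioners

run_path = "/tmp/run"

# Indirect: the call site no longer shows which parameters are passed,
# and IDEs cannot surface the parameter names inline.
kwargs = {"run_path": run_path, "skip_preconditioners": True}
indirect = build_index(**kwargs)

# Direct: every argument is visible and type-checkable at the call site.
direct = build_index(run_path=run_path, skip_preconditioners=True)

assert indirect == direct
```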
bergson/build.py
Outdated
        ds = assert_type(Dataset, Dataset.from_json(data_str))
    else:
        try:
            ds = load_dataset(data_str, split=cfg.data.split, streaming=cfg.streaming)
Need to add back the lost subset arg.
By the way, LM eval harness uses a nice pattern where the CLI takes --model_kwargs "device='cuda',streaming=True,subset='hello'" --dataset_kwargs "..." so that they don't need to update their library whenever HF adds a new model or dataset kwarg. We could consider doing this too.
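A minimal sketch of how such a flag could be parsed, assuming a hypothetical parse_cli_kwargs helper (neither bergson's nor lm-eval-harness's actual implementation): it splits the argument string on commas and evaluates each value with ast.literal_eval, so new HF kwargs need no CLI changes.

```python
import ast

def parse_cli_kwargs(arg_string: str) -> dict:
    """Parse "device='cuda',streaming=True,subset='hello'" into a dict.

    Hypothetical helper for illustration. Note the naive comma split:
    values that themselves contain commas (e.g. tuples) would need a
    smarter parser.
    """
    kwargs = {}
    for item in arg_string.split(","):
        key, _, value = item.partition("=")
        kwargs[key.strip()] = ast.literal_eval(value.strip())
    return kwargs

parsed = parse_cli_kwargs("device='cuda',streaming=True,subset='hello'")
```

The resulting dict can then be forwarded directly, e.g. `load_dataset(path, **parsed)`, without the CLI enumerating every supported kwarg.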
bergson/build.py
Outdated
    @@ -91,7 +139,6 @@ def worker(
        device_map=device_map,
        quantization_config=quantization_config,
        dtype=dtype,
Need to add back the lost revision arg.
bergson/build.py
Outdated
        # Add a barrier to ensure all processes reach this point
        dist.barrier()
    except Exception:
        pass  # Ignore barrier failures during cleanup
I'm removing this unless we have a good reason; I don't think we should suppress errors (?)
I did this for the .part rename call and it was a mistake; I'm going to remove mine.
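One hedged alternative to swallowing the exception, as a small sketch (finalize and sync are hypothetical names; sync stands in for dist.barrier): log context during cleanup so the failure is debuggable, but still re-raise rather than `pass`.

```python
import logging

logger = logging.getLogger("bergson.cleanup")

def finalize(sync):
    """Run a cleanup-time synchronization without hiding failures.

    Hypothetical helper: `sync` stands in for dist.barrier(). If it
    raises, we log context for debugging and then re-raise, instead of
    suppressing the error with `except Exception: pass`.
    """
    try:
        sync()
    except Exception:
        logger.exception("synchronization failed during cleanup")
        raise

# Success path: the callable runs normally and nothing is swallowed.
calls = []
finalize(lambda: calls.append("synced"))
```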
bergson/build.py
Outdated
    # Write index config to json
    def distributed_computing(
        cfg: IndexConfig,
        worker_fn: Callable,
Currently distributed_computing is not generic enough to be used in more than one place (it contains build_index-specific logic, like not having a QueryConfig parameter), and worker_fn is only ever set to collect_gradients, so it doesn't currently make sense to add a layer of indirection through an abstracted name there either.
worker_wrapper also has a generic name but does something specific (device-specific artifact setup and orchestration). I think the use of datasets is omnipresent in our workflows and we should embrace that as something to keep concrete. So we can rename worker_wrapper to something like run_dataset_on_worker, and then we're not as surprised to find it's doing dataset processing.
Because both of our functions get the dataset in the main process, we can maybe remove this conditional for now and add it back later if we need it:
    # Do all the data loading and preprocessing on the main process
    if setup_data:
        ds = setup_data_pipeline(cfg)
    else:
        # Create empty dataset for compatibility
        ds = assert_type(Dataset, Dataset.from_list([]))
Then we may also choose to reduce the conceptual nesting of our functions by having a clear function called build that does

    ds = setup_data_pipeline(cfg)
    distributed_computing(worker_fn=collect_gradients, constant_worker_args=[cfg, ds], process_name="build")

and a query function that does

    ds = setup_data_pipeline(index_cfg)
    distributed_computing(worker_fn=collect_gradients, constant_worker_args=[index_cfg, query_cfg, ds], process_name="query")
Then distributed_computing will be truly generic, simply running
    args = {
        i: (i, world_size, *constant_worker_args)
        for i in range(world_size)
    }
If we ever need to use something other than collect_gradients it can become one of the constant worker args.
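The proposed generic launcher can be sketched as follows. This is a toy sketch, not bergson's implementation: it uses threads purely to stay self-contained (real code would spawn processes, e.g. via torch.multiprocessing), and the worker is a hypothetical stand-in for collect_gradients.

```python
from concurrent.futures import ThreadPoolExecutor

def distributed_computing(worker_fn, constant_worker_args, world_size):
    """Generic launcher sketch: each rank receives
    (rank, world_size, *constant_worker_args) and nothing else.

    Threads stand in for spawned processes to keep the sketch runnable.
    """
    args = {
        i: (i, world_size, *constant_worker_args)
        for i in range(world_size)
    }
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futures = [pool.submit(worker_fn, *args[i]) for i in range(world_size)]
        return [f.result() for f in futures]

# Toy worker standing in for collect_gradients: just reports its rank
# and the dataset size it was handed.
def collect(rank, world_size, cfg, ds):
    return (rank, len(ds))

results = distributed_computing(
    collect, constant_worker_args=[{"run": "x"}, [1, 2, 3]], world_size=2
)
```

Because all build/query-specific state travels through constant_worker_args, the launcher itself never needs to know about IndexConfig or QueryConfig.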
I'm going to draft this up
45708b2 to cf481d8
I'm going to yolo merge this; happy to raise another PR if there are issues or we want to add the kwargs pattern back to the tests or something like that. Thanks for working on this!!
This PR rewrites distributed setup to be more flexible.