
Add on-the-fly queries#47

Merged
luciaquirke merged 12 commits into main from query-2
Oct 16, 2025
Conversation

luciaquirke (Collaborator) commented Oct 13, 2025

  • Add on the fly queries
  • Precompute query dataset if not already available
  • Use .part extension for in-progress runs (closes Use .part extension for in-progress index runs #49)
  • You can now technically torch.compile the model when projection_dim=0 and save_index=False, but this slows down the build, and projection_dim=0 is very memory-hungry
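The `.part` convention can be sketched as writing to a temporary path and renaming on completion; the helper name and call site below are assumptions for illustration, not the PR's actual code:

```python
import os


def write_index_shard(path: str, data: bytes) -> None:
    """Write an index shard via a .part file so an interrupted run never
    leaves behind a file that looks complete."""
    tmp_path = path + ".part"
    with open(tmp_path, "wb") as f:
        f.write(data)
    # Atomic on POSIX: readers see either no file or the finished file,
    # and a crashed run leaves only the .part file behind.
    os.replace(tmp_path, path)
```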

TODO

  • Extract the query dataset assembly into an example script, and consider making it a more official tool in the future

Notes

If keeping the extra gradients in VRAM before the query callback causes problems, we can add something like

offload_to_cpu: bool = False
"""If True, keep value gradients on CPU until the query callback is called."""

But the extra VRAM usage is only a small increase.
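A minimal sketch of how such a flag might gate the gradient stash (the function and dict names are hypothetical, not the PR's code):

```python
import torch


def stash_value_grad(
    grads: dict,
    name: str,
    g: torch.Tensor,
    offload_to_cpu: bool = False,
) -> None:
    """Hypothetical sketch: optionally park value gradients on CPU until queried."""
    if offload_to_cpu:
        # Device-to-host copy frees VRAM, at the cost of a copy back at query time.
        grads[name] = g.to(device="cpu", non_blocking=True)
    else:
        # Default: keep the gradient in VRAM (a small memory increase, per the note).
        grads[name] = g
```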

@@ -0,0 +1,454 @@
import os

luciaquirke (Collaborator, Author):
modified copy of build.py

luciaquirke changed the base branch from query to main on October 13, 2025
luciaquirke force-pushed the query-2 branch 2 times, most recently from 0b2edc9 to 07f6af8 on October 14, 2025

# Asynchronously move the gradient to CPU and convert to fp16
mod_grads[name] = g.to(device="cpu", dtype=dtype, non_blocking=True)
if save_index:

luciaquirke (Collaborator, Author):
Avoid the round trip to cpu
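One caveat worth noting about the `non_blocking` copy in this hunk: a device-to-host transfer only truly overlaps with compute when the destination is pinned (page-locked) memory; otherwise PyTorch falls back to an effectively synchronous copy. A sketch of an explicitly pinned staging buffer (not the PR's code):

```python
import torch


def offload_to_pinned(g: torch.Tensor, dtype: torch.dtype = torch.float16) -> torch.Tensor:
    """Copy a gradient to a pinned CPU buffer so non_blocking can overlap compute."""
    pin = torch.cuda.is_available()  # pinning requires a CUDA runtime
    out = torch.empty(g.shape, dtype=dtype, pin_memory=pin)
    out.copy_(g.to(dtype), non_blocking=True)
    return out
```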


precision: Literal["auto", "bf16", "fp16", "fp32", "int4", "int8"] = "auto"
"""Precision to use for the model parameters."""
"""Precision (dtype) to use for the model parameters."""

luciaquirke (Collaborator, Author):
improve searchability
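For illustration, the float variants of this option could resolve to torch dtypes roughly as follows. This is a sketch: `"auto"` and the int modes are assumed to be handled elsewhere (e.g. by the checkpoint or a quantization backend), which is not shown in the diff.

```python
import torch

# Assumed mapping for the float precision modes only.
FLOAT_DTYPES = {
    "bf16": torch.bfloat16,
    "fp16": torch.float16,
    "fp32": torch.float32,
}


def resolve_precision(precision: str):
    """Return an explicit dtype, or None for modes resolved elsewhere
    ('auto', 'int4', 'int8')."""
    return FLOAT_DTYPES.get(precision)
```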

luciaquirke force-pushed the query-2 branch 3 times, most recently from 9fe9bff to 353a1d7 on October 16, 2025
dtype=dtype,
fill_value=0.0,
)
per_doc_scores = torch.full(

luciaquirke (Collaborator, Author):
Only support one score per doc, i.e. don't support computing module scores separately for now
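The comment above can be illustrated with a toy version of the buffer: a single aggregated score per (document, query) pair, into which every module's contribution is summed. Shapes and names here are assumptions, not the PR's actual values.

```python
import torch

num_docs, num_queries = 100, 4  # toy sizes

# One preallocated buffer for all modules, rather than one buffer per module.
per_doc_scores = torch.full(
    (num_docs, num_queries),
    fill_value=0.0,
    dtype=torch.float32,
)

# Each module accumulates into the same buffer:
module_scores = torch.ones(num_docs, num_queries)
per_doc_scores += module_scores
```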

"""Number of examples to use for estimating processor statistics."""

- drop_columns: bool = False
+ drop_columns: bool = True

luciaquirke (Collaborator, Author):
Prevent duplicating entire dataset on disk by default
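A toy illustration of the trade-off (no real tokenizer; the row schema is made up): keeping the source columns means every mapped record carries a second copy of the text, which is what ends up duplicated on disk.

```python
def tokenize_rows(rows, drop_columns=True):
    """Map text rows to toy 'token ids', optionally dropping the source text."""
    out = []
    for row in rows:
        new = {"input_ids": [len(w) for w in row["text"].split()]}
        if not drop_columns:
            new.update(row)  # keeps "text" alongside "input_ids"
        out.append(new)
    return out
```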

luciaquirke changed the title from [WIP] Add on-the-fly queries to Add on-the-fly queries on October 16, 2025
luciaquirke force-pushed the query-2 branch 3 times, most recently from 6405510 to 9de0ecb on October 16, 2025
luciaquirke merged commit d3dba3b into main on October 16, 2025
3 checks passed
luciaquirke deleted the query-2 branch on November 17, 2025


Development

Successfully merging this pull request may close these issues.

Use .part extension for in-progress index runs

1 participant