
[Large Refactor] Enable gradient preprocessing in reduce and in score#153

Merged
luciaquirke merged 21 commits into main from claude/issue-142-20260217-0932
Mar 3, 2026

Conversation

@luciaquirke
Collaborator

@luciaquirke luciaquirke commented Feb 17, 2026

Closes #154
Closes #158

  • Support split preconditioning when unit normalize is true
  • This will provide more flexibility around where preprocessing is applied and improve the documentation of gradient preprocessing

@luciaquirke
Collaborator Author

@claude can you finish this off? change accumulate_grads to accumulation (which can be 'mean', 'sum', or 'none') and ensure the two config flags are in the query_config

@claude

claude bot commented Feb 17, 2026

Claude finished @luciaquirke's task in 2s


Changes complete

  • Read current state of changed files
  • Add accumulation: Literal["mean", "sum", "none"] | None = None to ScoreConfig in config.py
  • Add both accumulation and normalize_accumulated_grad flags to QueryConfig in config.py
  • Update score_dataset() in score.py to use score_cfg.accumulation (falling back to deriving from score_cfg.score when None)
  • Tests pass (39 passed, 48 skipped)
  • Committed and pushed

Summary of changes:

bergson/config.py:

  • Added accumulation: Literal["mean", "sum", "none"] | None = None to ScoreConfig — explicit control over query gradient accumulation strategy
  • Kept normalize_accumulated_grad: bool | None = None in ScoreConfig
  • Added both accumulation and normalize_accumulated_grad to QueryConfig

bergson/score/score.py:

  • score_dataset() now uses score_cfg.accumulation if set, falling back to "mean" when score == "mean" and "none" otherwise (preserving existing default behavior)

@luciaquirke luciaquirke changed the title from "Issue #142: Changes from Claude" to "Expose accumulated grad normalization as a separate argument" on Feb 17, 2026
unit_normalize: bool = False
"""Whether to unit normalize the gradients before computing the scores."""

accumulation: Literal["mean", "sum", "none"] | None = None
Collaborator Author

@claude we don't need a string "none" if we already have a real None

@claude claude bot Feb 17, 2026

Claude finished @luciaquirke's task in 2s


Changes complete

  • Gather context
  • Read config.py and score.py
  • Remove "none" from Literal type in accumulation (both QueryConfig and ScoreConfig)
  • Update docstrings to remove references to "none" string
  • Update score_dataset() fallback: "none" → None
  • Update preprocess_grads() signature and internal checks: "none" → None
  • Tests pass (39 passed, 52 skipped)
  • Committed and pushed

Summary: Replaced Literal["mean", "sum", "none"] with Literal["mean", "sum"] | None in both QueryConfig and ScoreConfig. The "none" string option was redundant since None already means "no accumulation". Updated preprocess_grads() and score_dataset() accordingly.

@luciaquirke luciaquirke changed the title from "Expose accumulated grad normalization as a separate argument" to "Move query processing from score to reduce" on Feb 17, 2026
"""Config for reducing the gradients."""
"""Config for reducing a dataset into a standalone query."""

method: Literal["mean", "sum"] = "mean"
Collaborator Author

@luciaquirke luciaquirke Feb 17, 2026

maybe we can add a "none" option here, rename this to PreprocessConfig, remove the reduce script so it's just build script, and let users pass the PreprocessConfig into build if they want?

  • reduce CLI calls build under the hood
  • build CLI calls build ...
  • preconditioners CLI calls build

Contributor

I like it! Let's put it in the collector refactor

)
score_device = torch.device(f"cuda:{rank}")

# Compute preconditioner for index-side application
Collaborator Author

@luciaquirke luciaquirke Feb 24, 2026

TODO: index_cfg.skip_preconditioners usage looks sus. We should apply the preconditioner at ^-0.5 if preconditioner_path is provided and unit normalize is true, and otherwise not apply it.

@torch.inference_mode()
def score(self, mod_grads: dict[str, torch.Tensor]) -> torch.Tensor:
"""Compute scores for a batch of gradients."""
# Apply per-module preconditioners to index grads if provided
Collaborator Author

@luciaquirke luciaquirke Feb 24, 2026

TODO: ensure only ^-1/2 preconditioners are provided, and only when unit normalize is True

@luciaquirke luciaquirke force-pushed the claude/issue-142-20260217-0932 branch from 54e6f32 to 52e8066 on February 24, 2026 06:48
@luciaquirke luciaquirke changed the title from "Move query processing from score to reduce" to "Enable gradient preprocessing ~everywhere" on Feb 24, 2026
@luciaquirke luciaquirke changed the title from "Enable gradient preprocessing ~everywhere" to "Enable gradient preprocessing in reduce and in score" on Feb 24, 2026
@luciaquirke luciaquirke force-pushed the claude/issue-142-20260217-0932 branch from 52e8066 to 33ebc4a on February 24, 2026 08:01
@luciaquirke
Collaborator Author

@claude can you add a docs page discussing query and index preprocessing and when each of them should happen? Give case studies of how you would handle a full-gradients cosine similarity example with an optimizer normalizer, a full-gradients inner product example with an optimizer normalizer, a randomly projected gradients example with reduce and score, and a randomly projected gradients example with unit normalize, preconditioners, build, and score.

@claude

claude bot commented Feb 24, 2026

Claude finished @luciaquirke's task in 6m 38s


Changes complete

  • Gather context
  • Read relevant source files (config.py, process_grads.py, score.py, reduce.py, build.py, scorer.py, main.py)
  • Find docs structure
  • Write docs page with case studies
  • Add to index.rst
  • Commit and push

Added docs/preprocessing.md with:

  • Operations overview — optimizer normalization, unit normalization, preconditioning, and what each does
  • Query vs index gradient table — which operations require symmetric application (unit norm and preconditioners with unit norm do; optimizer normalization and one-sided preconditioning don't)
  • 4 case studies:
    1. Cosine similarity + optimizer normalizer (full gradients): reduce --normalizer adafactor → score --normalizer adafactor --unit_normalize
    2. Inner product + optimizer normalizer (full gradients): same but no --unit_normalize; explains when magnitude matters vs direction only
    3. Random projection + reduce + score: reduce --projection_dim 4096 → score --projection_dim 4096; notes on preprocessing order and why normalizer must be applied at collection time, not after mean-reduction
    4. Random projection + unit normalization + preconditioners + build + score: full 4-step TrackStar pipeline; explains why H^(-1/2) must be applied to both sides (unit norm undoes one-sided application), and the --mixing_coefficient parameter
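The claim in case study 4 that H^(-1/2) must hit both sides rests on the identity (H^(-1/2) q)·(H^(-1/2) g) = qᵀ H⁻¹ g. A toy torch check, with an arbitrary SPD matrix standing in for the Hessian/Fisher (not bergson code):

```python
import torch

torch.manual_seed(0)
d = 6
A = torch.randn(d, d)
H = A @ A.T + torch.eye(d)  # SPD stand-in for the Hessian/Fisher

# Symmetric inverse square root via eigendecomposition.
evals, evecs = torch.linalg.eigh(H)
H_inv_sqrt = evecs @ torch.diag(evals.rsqrt()) @ evecs.T

q, g = torch.randn(d), torch.randn(d)
two_sided = (H_inv_sqrt @ q) @ (H_inv_sqrt @ g)  # applied to both sides
kernel = q @ torch.linalg.solve(H, g)            # q^T H^{-1} g
assert torch.allclose(two_sided, kernel, atol=1e-4)
```

Applying H^(-1/2) to only one side and then unit-normalizing breaks this equivalence, which is why the symmetric application matters under unit normalization.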

@luciaquirke
Collaborator Author

@claude it should be in a readthedocs/sphinx compatible file

@claude

claude bot commented Feb 24, 2026

Claude finished @luciaquirke's task in 2m 46s


Changes complete

  • Read existing docs structure (myst_parser configured; index.rst references preprocessing)
  • Convert preprocessing.md to preprocessing.rst (native Sphinx RST format)
  • Remove preprocessing.md
  • Commit and push

Summary: Replaced docs/preprocessing.md with docs/preprocessing.rst. The index.rst toctree entry (preprocessing) already resolves to .rst files by priority, so no changes to index.rst were needed.

RST improvements over the previous markdown:

  • .. list-table:: directive for the query vs index operations table
  • :math: and .. math:: directives for formulas (:math:`H^{-1}`, :math:`H^{-1/2}`, the influence function, the mixing coefficient equation)
  • .. note:: directive for the preprocessing order callout
  • .. code-block:: bash directives for all shell examples
  • .. code-block:: bash directives for all shell examples

@luciaquirke
Collaborator Author

@claude in the docs can you move the experiment walkthroughs and gradient preprocessing sections above the automatically generated docstrings sections

@claude

claude bot commented Feb 24, 2026

Claude finished @luciaquirke's task in 39s


Changes complete

  • Gather context
  • Read docs files to understand current structure
  • Reorder sections in docs/index.rst — moved Preprocessing and Experiments above API Reference
  • Commit and push changes

Summary: In docs/index.rst, moved the Preprocessing and Experiments toctree sections above the API Reference section (which contains the auto-generated CLI, API, and utils docstrings).

@luciaquirke
Collaborator Author

@claude could you update the docs to make the normalizer definitions and explanations in the case study correct - the Adam normalizer divides by sqrt(avg_sq) + eps, where avg_sq is the second moment (mean of squared gradients), not the variance/standard deviation, and doesn't use a moving average like Adam does during training. So like:

Scales each gradient element by the inverse root-mean-square (RMS) of that parameter's gradient history — i.e., divides by sqrt(E[g²]) + eps.

"Inverse standard deviation" would imply centering (subtracting the mean), which Adam doesn't do. It's the raw second moment, not the variance. The distinction matters when the gradient mean is nonzero.

@claude

claude bot commented Feb 24, 2026

Claude finished @luciaquirke's task in 2m 53s


Changes complete

  • Gather context
  • Read docs/preprocessing.rst to find incorrect normalizer descriptions
  • Fix normalizer definitions (second moment, not variance/std dev; no moving average)
  • Commit and push

Summary of changes to docs/preprocessing.rst:

  • Operations section: Changed "Scales each gradient element by an estimate of the inverse standard deviation of that parameter's gradient distribution. Applied elementwise during gradient collection using Adam or Adafactor running statistics." → "Scales each gradient element by the inverse root-mean-square (RMS) of that parameter's gradient history — i.e., divides by √E[g²] + ε, where E[g²] is the mean of squared gradients across the dataset. Applied elementwise during gradient collection. Unlike the Adam optimizer used during training, this uses a simple mean over the dataset rather than an exponential moving average."

  • Cosine similarity case study: Changed "where v is an exponential moving average of squared gradients" → "where v = E[g²] is the mean of squared gradients across the dataset", and updated the surrounding description to use "inverse RMS" language consistently.
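The corrected definition can be sketched in a few lines; rms_normalize is a hypothetical name and this is not bergson's actual implementation:

```python
import torch


def rms_normalize(grads: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Divide each gradient element by sqrt(E[g^2]) + eps, where E[g^2]
    is a plain mean over the dataset: the raw second moment, not a
    centered variance and not an exponential moving average."""
    # grads: [num_examples, num_params]
    avg_sq = grads.pow(2).mean(dim=0)  # raw second moment E[g^2]
    return grads / (avg_sq.sqrt() + eps)


# With a nonzero gradient mean, E[g^2] differs from the variance,
# which is why "inverse standard deviation" would be the wrong label.
g = torch.tensor([[1.0, 2.0], [3.0, 2.0]])
assert not torch.allclose(g.pow(2).mean(0), g.var(0, unbiased=False))
```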

@luciaquirke
Collaborator Author

@claude

Random projections (Johnson-Lindenstrauss) approximately preserve inner products and cosine similarities while reducing gradient dimensionality by orders of magnitude.
Should be
Random projections approximately preserve inner products and cosine similarities (Johnson-Lindenstrauss) while reducing gradient dimensionality by orders of magnitude.

reduce aggregates all query gradients into a single vector (mean or sum) without storing any per-example gradients.
Should be
reduce aggregates all query gradients into a single vector (mean or sum) without storing per-example gradients.

Optimizer normalization must be applied during gradient collection (set --normalizer at both reduce and score time). It cannot be applied after the mean-reduction in reduce, since applying the normalizer to the mean gradient is not the same as normalizing each gradient then taking the mean.
Should be
Optimizer normalization must be applied during gradient collection (set --normalizer at both reduce and score time). It cannot be applied after the mean-reduction in reduce - the normalizer is non-linear so applying it to the mean gradient is not the same as normalizing each gradient then taking the mean.

When query and index datasets come from different distributions, --mixing_coefficient (default 0.99) interpolates between their second moment matrices:
Should be
When query and index datasets come from different distributions, --mixing_coefficient (default 0.99) interpolates between their second moment matrices (i.e. the empirical Fisher information matrices):

Adjust this when the query dataset is small (causing noisy 𝐻query estimates) or when the query and index distributions diverge significantly.
Should be
Adjust this according to the guidelines in https://arxiv.org/abs/2410.17413

Otherwise good
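The Johnson-Lindenstrauss point in the first correction can be illustrated numerically; dimensions, the projection scaling, and the tolerance below are arbitrary choices, not bergson defaults:

```python
import torch

torch.manual_seed(0)
d, k = 8192, 1024
P = torch.randn(d, k) / k ** 0.5  # scaled Gaussian random projection

x = torch.randn(d)
y = x + 0.3 * torch.randn(d)  # a correlated second "gradient"

cos = torch.nn.functional.cosine_similarity
exact = cos(x, y, dim=0)
approx = cos(x @ P, y @ P, dim=0)

# Cosine similarity survives the 8x dimensionality reduction
# up to O(1/sqrt(k)) distortion.
assert (exact - approx).abs() < 0.1
```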

@claude

claude bot commented Feb 24, 2026

Claude finished @luciaquirke's task in 43s


Changes complete

  • Gather context
  • Read docs/preprocessing.rst
  • Apply text changes
  • Commit and push

Applied all five corrections to docs/preprocessing.rst:

  1. Moved (Johnson-Lindenstrauss) after "inner products and cosine similarities"
  2. Removed "any" from "without storing any per-example gradients"
  3. Changed "since" → " - the normalizer is non-linear so" in the optimizer normalization ordering note
  4. Added "(i.e. the empirical Fisher information matrices)" after "second moment matrices"
  5. Replaced the mixing coefficient adjustment guidance with a reference to https://arxiv.org/abs/2410.17413

modules: list[str] = field(default_factory=list)
"""Modules to use for the query. If empty, all modules will be used."""

normalize_reduced_grad: bool = False
Collaborator Author

@claude do we need this arg in ReduceConfig now we have the PreprocessConfig?

@claude claude bot Feb 24, 2026

Claude finished @luciaquirke's task in 1m 42s


  • Read bergson/config.py to analyze ReduceConfig and PreprocessConfig
  • Determine if the arg at line 324 is redundant

normalize_reduced_grad (ReduceConfig, line 324) and PreprocessConfig.unit_normalize do different things and don't overlap:

  • PreprocessConfig.unit_normalize — normalizes each per-example gradient individually before accumulation (affects how examples are weighted in the mean/sum)
  • normalize_reduced_grad — normalizes the final flat concatenated gradient vector after reduction (just affects the stored artifact's magnitude)

So they're not redundant with each other. However, normalize_reduced_grad may be mostly pointless anyway:

  1. Its own docstring says it "has no effect on future score rankings"
  2. In score_dataset (score.py line 348), normalize_accumulated_grad is hardcoded to True when score == "mean" — meaning the score step already re-normalizes the query gradient regardless of what was stored by reduce

The only scenario where normalize_reduced_grad would have any effect is if the reduced artifact is used outside the score pipeline, or with score != "mean". So it's not made redundant by PreprocessConfig specifically, but it is arguably superseded by the hardcoded normalization in score_dataset. It could safely be removed unless there's a use case for the stored artifact being pre-normalized.
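The distinction drawn in this analysis can be demonstrated with a toy example (simplified; not the actual config plumbing):

```python
import torch
import torch.nn.functional as F

# Two per-example gradients with very different magnitudes.
g = torch.tensor([[3.0, 0.0],
                  [0.0, 1.0]])

# unit_normalize-style: normalize each example, then reduce.
# Every example contributes equally to the mean.
per_example_then_mean = F.normalize(g, dim=1).mean(dim=0)

# normalize_reduced_grad-style: reduce first, then rescale the
# final flat vector. Large-magnitude examples still dominate.
mean_then_normalize = F.normalize(g.mean(dim=0), dim=0)

# The two orderings produce genuinely different vectors.
assert not torch.allclose(per_example_then_mean, mean_then_normalize)
```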

@luciaquirke
Collaborator Author

luciaquirke commented Feb 24, 2026

@LouisYRYJ early cc for any high level feedback, will probably need another day to get this ready. I've added a documentation section for the TrackStar preconditioners. I should probably add EK-FAC/K-FAC there too.

@LouisYRYJ
Contributor

LouisYRYJ commented Feb 24, 2026

will this subsume #158?
I also noticed we have a lot of redundancy across the different collectors, which should probably be simplified. Not sure if it's too late, but I think it would be more convenient in the future to prioritize that

Comment on lines +297 to +300
attribute_tokens: bool = False,
path: Path | None = None,
reduce_cfg: ReduceConfig | None = None,
preprocess_cfg: PreprocessConfig | None = None,
Contributor

@LouisYRYJ LouisYRYJ Mar 2, 2026

these seem unused? (other builders also specify unused args)

Comment on lines +481 to +485
* ``path`` given + ``attribute_tokens`` → :class:`TokenBuilder`
* ``path`` given → :class:`SequenceBuilder`
* no ``path`` + ``attribute_tokens`` → :class:`InMemoryTokenBuilder`
* no ``path`` → :class:`InMemorySequenceBuilder`
"""
Contributor

nice, this is pretty helpful

@LouisYRYJ
Contributor

This looks pretty solid 🚀
I think after we go through the current round of comments, we can merge it!

@luciaquirke
Collaborator Author

did you benchmark the compile vs no compile? In my experience, it sometimes is actually slower

I used Claude to benchmark it and didn't check the results properly; compile is indeed slower. I think the new code is faster.

@luciaquirke luciaquirke force-pushed the claude/issue-142-20260217-0932 branch from c85e9ea to 655d0c8 on March 3, 2026 00:37
luciaquirke and others added 20 commits March 3, 2026 00:39
…d, reduce, score

- Create PreprocessConfig with unit_normalize, preconditioner paths, and
  mixing_coefficient fields extracted from ReduceConfig and ScoreConfig
- Fix broken imports in process_grads.py (gradient_processor -> gradients,
  utils.utils -> utils.math)
- Fix accumulate_grads -> aggregate_grads variable name bug in process_grads.py
- Add compute_preconditioner() returning H^(-1/2) for unit_normalize or
  H^(-1) otherwise, and normalize_flat_grad() for flat tensors
- Fix data.py import from nonexistent .reduce.process_query_grads
- Fix SequenceBuilder/InMemorySequenceBuilder missing h_inv computation
  and broken normalize_grad calls with undefined device
- Add preconditioner support to Scorer for index-side H^(-1/2) application
- Thread PreprocessConfig through build, reduce, score, and collection
- Fix accumulate_grads -> aggregate_grads in huggingface.py callback
- Add tests for compute_preconditioner and Scorer preconditioner support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Replace preprocessing.md with preprocessing.rst using native RST format:
- list-table directive for the query vs index operations table
- math directives for formulas (H^{-1}, H^{-1/2}, influence function)
- note directive for the preprocessing order callout
- code-block directives for bash examples

Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
- Replace 'inverse standard deviation' with 'inverse root-mean-square (RMS)'
- Replace 'exponential moving average' with 'mean of squared gradients across the dataset'
- Clarify that bergson's normalizer uses a simple dataset mean, not EMA like Adam during training
- Update case study formula: 1/(sqrt(v) + eps) where v = E[g²]

Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
…ction

Preconditioner mixing was previously done inline at scoring time inside
compute_preconditioner(). This extracts it into a standalone
mix_preconditioners() that saves a single mixed preconditioner to disk,
so PreprocessConfig just points at one preconditioner_path with no
mixing logic in downstream consumers.

- Replace query_preconditioner_path, index_preconditioner_path, and
  mixing_coefficient on PreprocessConfig with single preconditioner_path
- Add mix_preconditioners() to process_grads.py with provenance metadata
- Simplify compute_preconditioner() to load from a single path
- Move mixing_coefficient to TrackstarConfig
- Update Trackstar pipeline to 5 steps with explicit mixing step
- Update examples and tests for new API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move preconditioner application in Scorer.score() to operate on slices
of the already-concatenated gradient tensor instead of per-module before
concatenation. This avoids holding both the original and preconditioned
module tensors on device simultaneously.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename unit_normalize -> apply_rsqrt in compute_preconditioner to
  describe what the arg does rather than referencing unit normalize
- Update all call sites to use keyword argument form for clarity
- Update test to use keyword form: apply_rsqrt=False
- compute_scores_with_bergson now accepts query_preconditioner_path and
  index_preconditioner_path separately, mixing internally when both given
- compute_metrics updated to pass separate preconditioner paths through

Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Scorer now owns the full preconditioning lifecycle: loading from disk,
applying to query grads (once at init), and applying to index grads
(per-batch in score(), split mode only). This removes preconditioner
concerns from score_worker() and score_dataset(), and decouples
preconditioner application from skip_preconditioners which only controls
whether to estimate statistics during build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename accumulate_grads -> aggregate_grads in preprocess_grads
- Apply query-side preconditioning after cat with offset for VRAM
- Warn and skip on zero gradient norm instead of asserting
- Revert config docstrings to original wording

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
torch.from_numpy doesn't support ml_dtypes bfloat16. Use the existing
numpy_to_tensor helper which handles it via uint16 reinterpretation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@luciaquirke luciaquirke force-pushed the claude/issue-142-20260217-0932 branch from 655d0c8 to 3b3f17a on March 3, 2026 00:40
@luciaquirke luciaquirke merged commit b9b767e into main Mar 3, 2026
8 checks passed
@luciaquirke luciaquirke changed the title from "Enable gradient preprocessing in reduce and in score" to "[Large Refactor] Enable gradient preprocessing in reduce and in score" on Mar 6, 2026

Development

Successfully merging this pull request may close these issues.

Enable grad preprocessing in reduce

2 participants