[Large Refactor] Enable gradient preprocessing in reduce and in score #153

luciaquirke merged 21 commits into main from …
Conversation
@claude can you finish this off? Change accumulate_grads to accumulation (which can be 'mean', 'sum', or 'none') and ensure the two config flags are in the query_config.
Claude finished @luciaquirke's task in 2s. Changes complete.
Summary of changes:
bergson/config.py (outdated):

    unit_normalize: bool = False
    """Whether to unit normalize the gradients before computing the scores."""

    accumulation: Literal["mean", "sum", "none"] | None = None
@claude we don't need a string "none" if we already have a real None
Claude finished @luciaquirke's task in 2s. Changes complete.

- Gather context
- Read config.py and score.py
- Remove "none" from the Literal type in accumulation (both QueryConfig and ScoreConfig)
- Update docstrings to remove references to the "none" string
- Update score_dataset() fallback: "none" → None
- Update preprocess_grads() signature and internal checks: "none" → None
- Tests pass (39 passed, 52 skipped)
- Committed and pushed
Summary: Replaced Literal["mean", "sum", "none"] with Literal["mean", "sum"] | None in both QueryConfig and ScoreConfig. The "none" string option was redundant since None already means "no accumulation". Updated preprocess_grads() and score_dataset() accordingly.
bergson/config.py (outdated):

    - """Config for reducing the gradients."""
    + """Config for reducing a dataset into a standalone query."""

    method: Literal["mean", "sum"] = "mean"
maybe we can add a "none" option here, rename this to PreprocessConfig, remove the reduce script so it's just build script, and let users pass the PreprocessConfig into build if they want?
- reduce CLI calls build under the hood
- build CLI calls build ...
- preconditioners CLI calls build
I like it! Let's put it in the collector refactor
bergson/score/score.py (outdated):

    )
    score_device = torch.device(f"cuda:{rank}")

    # Compute preconditioner for index-side application
TODO: the index_cfg.skip_preconditioners usage looks sus. We should apply the preconditioner at ^-0.5 if preconditioner_path is provided and unit_normalize is true, and otherwise not.
bergson/score/scorer.py (outdated):

    @torch.inference_mode()
    def score(self, mod_grads: dict[str, torch.Tensor]) -> torch.Tensor:
        """Compute scores for a batch of gradients."""
        # Apply per-module preconditioners to index grads if provided
TODO: ensure only ^-1/2 preconditioners are provided, and only when unit_normalize is True.
Force-pushed 54e6f32 to 52e8066
Force-pushed 52e8066 to 33ebc4a
@claude can you add a docs page discussing query and index preprocessing and when each of them should happen? Give case studies of how you would handle a full-gradients cosine similarity example with an optimizer normalizer, a full-gradients inner product example with an optimizer normalizer, a randomly projected gradients example with reduce and score, and a randomly projected gradients example with unit normalize, preconditioners, build, and score.
Claude finished @luciaquirke's task in 6m 38s. Changes complete.
Added
@claude it should be in a readthedocs/sphinx-compatible file
Claude finished @luciaquirke's task in 2m 46s. Changes complete.
Summary: Replaced the markdown page with RST. Improvements over the previous markdown:
@claude in the docs can you move the experiment walkthroughs and gradient preprocessing sections above the automatically generated docstrings sections
Claude finished @luciaquirke's task in 39s. Changes complete.
Summary: In
@claude could you update the docs to make the normalizer definitions and explanations in the case study correct? The Adam normalizer divides by sqrt(avg_sq) + eps, where avg_sq is the second moment (mean of squared gradients), not the variance/standard deviation, and it doesn't use a moving average like Adam does during training. So like:
"Inverse standard deviation" would imply centering (subtracting the mean), which Adam doesn't do. It's the raw second moment, not the variance. The distinction matters when the gradient mean is nonzero.
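The distinction can be made concrete with a small sketch (`adam_style_normalizer` is a hypothetical illustration, not bergson's code):

```python
import torch


def adam_style_normalizer(grads: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize by the inverse root-mean-square: 1 / (sqrt(E[g^2]) + eps).

    E[g^2] is the raw second moment over the dataset, NOT the variance:
    there is no mean-centering, and (unlike Adam's training-time update)
    no exponential moving average, just a plain mean over examples."""
    avg_sq = grads.pow(2).mean(dim=0)  # second moment E[g^2], per coordinate
    return grads / (avg_sq.sqrt() + eps)
```

For a coordinate with per-example gradients [1, 3], the second moment is E[g²] = 5 while the (centered) variance is 1, so the RMS normalizer and an inverse-standard-deviation normalizer genuinely disagree whenever the gradient mean is nonzero.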
Claude finished @luciaquirke's task in 2m 53s. Changes complete.
Summary of changes to
Otherwise good
Claude finished @luciaquirke's task in 43s. Changes complete.
Applied all five corrections to
bergson/config.py (outdated):

    modules: list[str] = field(default_factory=list)
    """Modules to use for the query. If empty, all modules will be used."""

    normalize_reduced_grad: bool = False
@claude do we need this arg in ReduceConfig now we have the PreprocessConfig?
Claude finished @luciaquirke's task in 1m 42s.

- Read bergson/config.py to analyze ReduceConfig and PreprocessConfig
- Determine if the arg at line 324 is redundant

normalize_reduced_grad (ReduceConfig, line 324) and PreprocessConfig.unit_normalize do different things and don't overlap:

- PreprocessConfig.unit_normalize normalizes each per-example gradient individually before accumulation (affects how examples are weighted in the mean/sum)
- normalize_reduced_grad normalizes the final flat concatenated gradient vector after reduction (just affects the stored artifact's magnitude)
So they're not redundant with each other. However, normalize_reduced_grad may be mostly pointless anyway:

- Its own docstring says it "has no effect on future score rankings"
- In score_dataset() (score.py line 348), normalize_accumulated_grad is hardcoded to True when score == "mean", meaning the score step already re-normalizes the query gradient regardless of what was stored by reduce
The only scenario where normalize_reduced_grad would have any effect is if the reduced artifact is used outside the score pipeline, or with score != "mean". So it's not made redundant by PreprocessConfig specifically, but it is arguably superseded by the hardcoded normalization in score_dataset. It could safely be removed unless there's a use case for the stored artifact being pre-normalized.
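A toy sketch of why the two flags are independent (`reduce_query` is hypothetical; the real reduce pipeline is more involved):

```python
import torch


def reduce_query(
    grads: torch.Tensor,       # [num_examples, dim]
    unit_normalize: bool,      # per-example, BEFORE the mean
    normalize_reduced: bool,   # final vector, AFTER the mean
) -> torch.Tensor:
    """unit_normalize rescales each per-example gradient before averaging,
    changing how examples are weighted; normalize_reduced only rescales
    the final vector, leaving all downstream score rankings unchanged."""
    if unit_normalize:
        grads = grads / grads.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    reduced = grads.mean(dim=0)
    if normalize_reduced:
        reduced = reduced / reduced.norm().clamp_min(1e-12)
    return reduced
```

With examples [2, 0] and [0, 1], unit normalization changes the *direction* of the mean (from [1.0, 0.5] to [0.5, 0.5]), while normalizing the reduced vector only changes its length.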
@LouisYRYJ early cc for any high-level feedback, will probably need another day to get this ready. I've added a documentation section for the TrackStar preconditioners. I should probably add EK-FAC/K-FAC there too.
will this subsume #158? |
bergson/builders.py (outdated):

    attribute_tokens: bool = False,
    path: Path | None = None,
    reduce_cfg: ReduceConfig | None = None,
    preprocess_cfg: PreprocessConfig | None = None,
these seem unused? (other builders also specify unused args)
    * ``path`` given + ``attribute_tokens`` → :class:`TokenBuilder`
    * ``path`` given → :class:`SequenceBuilder`
    * no ``path`` + ``attribute_tokens`` → :class:`InMemoryTokenBuilder`
    * no ``path`` → :class:`InMemorySequenceBuilder`
    """
nice, this is pretty helpful
This looks pretty solid 🚀
I used Claude to benchmark it and didn't check the results properly; it says it's slower, but I think the new code is faster.
Force-pushed c85e9ea to 655d0c8
…d, reduce, score

- Create PreprocessConfig with unit_normalize, preconditioner paths, and mixing_coefficient fields extracted from ReduceConfig and ScoreConfig
- Fix broken imports in process_grads.py (gradient_processor -> gradients, utils.utils -> utils.math)
- Fix accumulate_grads -> aggregate_grads variable name bug in process_grads.py
- Add compute_preconditioner() returning H^(-1/2) for unit_normalize or H^(-1) otherwise, and normalize_flat_grad() for flat tensors
- Fix data.py import from nonexistent .reduce.process_query_grads
- Fix SequenceBuilder/InMemorySequenceBuilder missing h_inv computation and broken normalize_grad calls with undefined device
- Add preconditioner support to Scorer for index-side H^(-1/2) application
- Thread PreprocessConfig through build, reduce, score, and collection
- Fix accumulate_grads -> aggregate_grads in huggingface.py callback
- Add tests for compute_preconditioner and Scorer preconditioner support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Replace preprocessing.md with preprocessing.rst using native RST format:
- list-table directive for the query vs index operations table
- math directives for formulas (H^{-1}, H^{-1/2}, influence function)
- note directive for the preprocessing order callout
- code-block directives for bash examples
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
- Replace 'inverse standard deviation' with 'inverse root-mean-square (RMS)'
- Replace 'exponential moving average' with 'mean of squared gradients across the dataset'
- Clarify that bergson's normalizer uses a simple dataset mean, not EMA like Adam during training
- Update case study formula: 1/(sqrt(v) + eps) where v = E[g²]

Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
…ction

Preconditioner mixing was previously done inline at scoring time inside compute_preconditioner(). This extracts it into a standalone mix_preconditioners() that saves a single mixed preconditioner to disk, so PreprocessConfig just points at one preconditioner_path with no mixing logic in downstream consumers.

- Replace query_preconditioner_path, index_preconditioner_path, and mixing_coefficient on PreprocessConfig with a single preconditioner_path
- Add mix_preconditioners() to process_grads.py with provenance metadata
- Simplify compute_preconditioner() to load from a single path
- Move mixing_coefficient to TrackstarConfig
- Update Trackstar pipeline to 5 steps with explicit mixing step
- Update examples and tests for new API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
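A minimal sketch of the mixing step, assuming a convex combination of the two second-moment statistics (the exact formula and the on-disk provenance format are not shown in this PR excerpt, so treat this as illustrative):

```python
import torch


def mix_preconditioners(
    query_precond: torch.Tensor,
    index_precond: torch.Tensor,
    mixing_coefficient: float,
) -> torch.Tensor:
    """Blend query- and index-side preconditioner statistics into one tensor.

    Assumes a simple convex combination controlled by mixing_coefficient;
    the real bergson version also writes the result plus provenance
    metadata to disk so downstream consumers load a single path."""
    lam = mixing_coefficient
    return lam * query_precond + (1.0 - lam) * index_precond
```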
Move preconditioner application in Scorer.score() to operate on slices of the already-concatenated gradient tensor instead of per-module before concatenation. This avoids holding both the original and preconditioned module tensors on device simultaneously. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename unit_normalize -> apply_rsqrt in compute_preconditioner to describe what the arg does rather than referencing unit normalize
- Update all call sites to use keyword argument form for clarity
- Update test to use keyword form: apply_rsqrt=False
- compute_scores_with_bergson now accepts query_preconditioner_path and index_preconditioner_path separately, mixing internally when both are given
- compute_metrics updated to pass separate preconditioner paths through

Co-authored-by: Lucia Quirke <luciaquirke@users.noreply.github.com>
Scorer now owns the full preconditioning lifecycle: loading from disk, applying to query grads (once at init), and applying to index grads (per-batch in score(), split mode only). This removes preconditioner concerns from score_worker() and score_dataset(), and decouples preconditioner application from skip_preconditioners which only controls whether to estimate statistics during build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename accumulate_grads -> aggregate_grads in preprocess_grads
- Apply query-side preconditioning after cat with offset for VRAM
- Warn and skip on zero gradient norm instead of asserting
- Revert config docstrings to original wording

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
torch.from_numpy doesn't support ml_dtypes bfloat16. Use the existing numpy_to_tensor helper which handles it via uint16 reinterpretation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lder to a standalone file
Force-pushed 655d0c8 to 3b3f17a
Closes #154
Closes #158