26 changes: 22 additions & 4 deletions CONTRIBUTING.md
@@ -25,6 +25,7 @@ as much information as you can. Details like these are incredibly useful:


## Contributing via Pull Requests

Contributions via pull requests are much appreciated. Before sending us a pull
request, please ensure that:

@@ -45,31 +46,48 @@ To send us a pull request, please:
not merge any PR which does not pass them. Use `./black.sh` and `./flake.sh`
on your code to make sure it complies.
4. Ensure that all local tests pass. Quite a few tests run on GPU devices only.
If you have the resources to run them, we would appreciate that.
5. Commit to your fork using clear commit messages.
6. Send us a pull request, answering any default questions in the pull request
5. If you have GPU resources to run all the tests, we would appreciate it if
you did so. Otherwise, we may need read access to your branch in order to run
them on our side.
6. Commit to your fork using clear commit messages.
7. Send us a pull request, answering any default questions in the pull request
interface.
7. Pay attention to any automated CI failures reported in the pull request, and
8. Pay attention to any automated CI failures reported in the pull request, and
stay involved in the conversation.
9. If you used AI to create your PR, we would love to know how you steered
it. Please add a summary of your conversation and prompts to the `ai_dev`
directory.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Using Artificial Intelligence

You are encouraged to use AI to improve your contributions. If you do, please
document your prompts in the `ai_dev` directory. We are especially keen on
instructions or skills that work well for our code base. If possible, link
your AI documentation in the pull request, so the reviewer can have a look at
it.


## Finding contributions to work on

Looking at the existing issues is a great way to find something to contribute
on. As our projects, by default, use the default GitHub issue labels
(enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at
any 'help wanted' issues is a great place to start.


## Code of Conduct

This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications

If you discover a potential security issue in this project we ask that you notify
AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/).
Please do **not** create a public GitHub issue.
281 changes: 281 additions & 0 deletions ai_dev/tests_for_fused.md
@@ -0,0 +1,281 @@
# Writing unit tests for `keys_values/fused` operators with Claude Code

## Initial prompt

Please write comparison unit tests for a number of fused operators:

- `keys_values.fused.fused_rmsnorm.fused_rmsnorm`
- `keys_values.fused.fused_rope.fused_apply_rope`
- `keys_values.fused.fused_swiglu.fused_swiglu`

Each represents a PyTorch expression, and the test should compare the output of
the fused operator with the expression.

- I need tests for `forward` (output of operator) and `backward` (gradients
with respect to inputs). For `backward`, create a scalar function as a random
linear combination of the outputs and compare gradients of that.
- The tests should iterate over a number of different input shapes.
- Please write the tests in the style of tests in `tests`.
Good examples are `tests.attention.test_flex_attention.test_comparison` or
all tests in `tests.kvcache.test_sdpa_op`.

Put the new tests into the module `tests.fused`.

I have created a new branch `add_fused_tests`. Please work in this branch.

Please use the virtual environment `~/virtenvs/keysvals` for running Python
code.

## Comment

You implemented the baseline comparisons as functions `_rmsnorm_eager`,
`_rope_eager`, `_swiglu_eager`. This is OK, but I'd also like to compare against
the existing code that is used if the fused operators are not active. This would
be:

- `keys_values.model.RMSNorm`
- `litgpt.model.apply_rope`
- `keys_values.lora.LLaMAMLP` and `litgpt.model.LLaMAMLP`

Can you add these to the comparisons?

### Claude Code answer

What was added to each test:

* `test_fused_rmsnorm_forward`/`backward`: now also compare against `kv_model.RMSNorm` and `litgpt.model.RMSNorm` module instances (the weight is copied in so all three see identical parameters). Both forward and backward loop over all three references.

* `test_fused_apply_rope_forward`/`backward` (3-D and 4-D): now also compare against `litgpt.model.apply_rope`. A small helper `_rope_via_litgpt` reshapes the `(T, D)` cos/sin to the `(1, T, D)` form that `apply_rope` requires.

* `test_fused_swiglu_forward`/`backward`: structurally unchanged (the `fused_swiglu` op is a pure activation with no module state to compare against).

* Two new test functions, `test_fused_swiglu_llamamlp_forward` and `test_fused_swiglu_llamamlp_backward`: run a real `litgpt.model.LLaMAMLP` and `kv_lora.LLaMAMLP` forward pass with `set_fused_swiglu_enabled(True)` and compare the output/gradients against the same module with the fused kernel off. This is the end-to-end module comparison you asked for. The flag is always restored in a `finally` block to avoid test pollution.
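
The "restore the flag in a `finally` block" pattern from the last bullet can be packaged as a context manager. The sketch below is only an illustration under stated assumptions: it uses a module-level boolean flag, and `set_fused_enabled`, `is_fused_enabled`, and `fused_enabled` are hypothetical stand-ins, not the project's `set_fused_swiglu_enabled` API, whose signature is not shown in this log.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a global "use fused kernel" switch.
_FUSED_ENABLED = False


def is_fused_enabled() -> bool:
    """Return the current value of the (hypothetical) global flag."""
    return _FUSED_ENABLED


def set_fused_enabled(value: bool) -> bool:
    """Set the flag and return the previous value so callers can restore it."""
    global _FUSED_ENABLED
    previous = _FUSED_ENABLED
    _FUSED_ENABLED = value
    return previous


@contextmanager
def fused_enabled(value: bool):
    """Temporarily set the flag; the old value is restored even on failure."""
    previous = set_fused_enabled(value)
    try:
        yield
    finally:
        set_fused_enabled(previous)


# Usage: the flag is only active inside the block.
with fused_enabled(True):
    assert is_fused_enabled()
assert not is_fused_enabled()
```

Because the restore happens in `finally`, a failing assertion inside the `with` block does not leak the enabled flag into later tests.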

## Comment

I ran the tests on a GPU instance and obtained many failures.

Let us start with `test_fused_apply_rope_forward`.

I ran:
```bash
pytest test/fused/test_fused_ops.py -k test_fused_apply_rope_forward
```

Two configurations failed:
```text
_________________________________________________ test_fused_apply_rope_forward[8-32-128-dtype5] __________________________________________________

BH = 8, T = 32, D = 128, dtype = torch.bfloat16

    @_RunIf(min_cuda_gpus=1)
    @pytest.mark.parametrize("BH, T, D, dtype", _ROPE_PARAMS_3D)
    def test_fused_apply_rope_forward(BH, T, D, dtype):
        seed = 31415927
        torch.manual_seed(seed)
        device = torch.device("cuda", 0)

        x = torch.randn(BH, T, D, device=device, dtype=dtype)
        cos = torch.randn(T, D, device=device, dtype=dtype)
        sin = torch.randn(T, D, device=device, dtype=dtype)

        out_fused = fused_apply_rope(x, cos, sin)

        atol, rtol = _tolerances(dtype)
        references = {
            "eager": _rope_eager(x, cos, sin),
            "litgpt.apply_rope": _rope_via_litgpt(x, cos, sin),
        }
        for name, out_ref in references.items():
            print(f"rope fwd 3D: BH={BH}, T={T}, D={D}, dtype={dtype}, ref={name}")
>           torch.testing.assert_close(out_fused, out_ref, atol=atol, rtol=rtol)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 2 / 32768 (0.0%)
E Greatest absolute difference: 0.0146484375 at index (2, 4, 125) (up to 0.01 allowed)
E Greatest relative difference: 0.1953125 at index (6, 4, 68) (up to 0.01 allowed)

test/fused/test_fused_ops.py:259: AssertionError
-------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------
rope fwd 3D: BH=8, T=32, D=128, dtype=torch.bfloat16, ref=eager
rope fwd 3D: BH=8, T=32, D=128, dtype=torch.bfloat16, ref=litgpt.apply_rope
_________________________________________________ test_fused_apply_rope_forward[2-64-256-dtype8] __________________________________________________

BH = 2, T = 64, D = 256, dtype = torch.bfloat16

    @_RunIf(min_cuda_gpus=1)
    @pytest.mark.parametrize("BH, T, D, dtype", _ROPE_PARAMS_3D)
    def test_fused_apply_rope_forward(BH, T, D, dtype):
        seed = 31415927
        torch.manual_seed(seed)
        device = torch.device("cuda", 0)

        x = torch.randn(BH, T, D, device=device, dtype=dtype)
        cos = torch.randn(T, D, device=device, dtype=dtype)
        sin = torch.randn(T, D, device=device, dtype=dtype)

        out_fused = fused_apply_rope(x, cos, sin)

        atol, rtol = _tolerances(dtype)
        references = {
            "eager": _rope_eager(x, cos, sin),
            "litgpt.apply_rope": _rope_via_litgpt(x, cos, sin),
        }
        for name, out_ref in references.items():
            print(f"rope fwd 3D: BH={BH}, T={T}, D={D}, dtype={dtype}, ref={name}")
>           torch.testing.assert_close(out_fused, out_ref, atol=atol, rtol=rtol)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 2 / 32768 (0.0%)
E Greatest absolute difference: 0.015625 at index (0, 36, 247) (up to 0.01 allowed)
E Greatest relative difference: 0.05419921875 at index (1, 6, 144) (up to 0.01 allowed)

test/fused/test_fused_ops.py:259: AssertionError
-------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------
rope fwd 3D: BH=2, T=64, D=256, dtype=torch.bfloat16, ref=eager
rope fwd 3D: BH=2, T=64, D=256, dtype=torch.bfloat16, ref=litgpt.apply_rope
```

One issue I noticed in your code: `sin` and `cos` are supposed to be sines and
cosines, but you sample them independently using `torch.randn`. You can check
`litgpt.model.build_rope_cache` to see how they are created.

Can you please fix the tests? Do not simply relax the required tolerance. Absolute
differences of 0.015 are too large.

### Claude Code answer

The root cause was that the Triton kernel and the eager reference both compute in float32 internally, but with random independent `cos`/`sin` (which can be large, e.g. O(1) from `randn`), the intermediate products `x * cos` and `rot_x * sin` can be large enough that adding them produces bf16 rounding differences of ~0.015. Since real RoPE always satisfies cos² + sin² = 1, the values are bounded in [-1, 1], keeping all intermediate products close to the magnitude of `x` and eliminating the cancellation. The fix replaces the `torch.randn` `cos`/`sin` with proper unit-circle values built from random angles.
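
A minimal sketch of what "unit-circle values built from random angles" could look like is given below. It is not the helper from the actual test module: `build_cos_sin` and `rope_eager` are hypothetical names, and the code assumes the rotate-half RoPE layout in which both halves of the last dimension share the same angles.

```python
import math

import torch


def build_cos_sin(T, D, dtype=torch.float32, seed=0):
    """Hypothetical helper: cos/sin tables of shape (T, D) from random angles."""
    gen = torch.Generator().manual_seed(seed)
    # One random angle in [0, 2*pi) per (position, frequency) pair.
    half = torch.rand(T, D // 2, generator=gen) * (2 * math.pi)
    # Assumption: rotate-half layout, so both halves share the same angles.
    angles = torch.cat([half, half], dim=-1)
    return angles.cos().to(dtype), angles.sin().to(dtype)


def rope_eager(x, cos, sin):
    """Rotate-half RoPE reference, computed in float32 and cast back."""
    d = x.shape[-1]
    x32 = x.float()
    rotated = torch.cat((-x32[..., d // 2 :], x32[..., : d // 2]), dim=-1)
    return (x32 * cos.float() + rotated * sin.float()).to(x.dtype)


torch.manual_seed(0)
x = torch.randn(8, 32, 128)
cos, sin = build_cos_sin(32, 128)
out = rope_eager(x, cos, sin)

# With unit-circle inputs RoPE is a pure rotation, so row norms are preserved.
torch.testing.assert_close(cos**2 + sin**2, torch.ones_like(cos))
torch.testing.assert_close(out.norm(dim=-1), x.norm(dim=-1), atol=1e-4, rtol=1e-4)
```

For reference, `torch.finfo(torch.bfloat16).eps` is 2^-7, so bf16 values in [2, 4) are spaced 2^-6 = 0.015625 apart. That spacing equals the largest absolute difference reported in the second failure above.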

## Comment

Next I ran:
```bash
pytest test/fused/test_fused_ops.py -k test_fused_apply_rope_backward
```

All configurations fail. For example:
```text
_________________________________________________ test_fused_apply_rope_backward[4-33-64-dtype18] _________________________________________________

BH = 4, T = 33, D = 64, dtype = torch.float32

    @_RunIf(min_cuda_gpus=1)
    @pytest.mark.parametrize("BH, T, D, dtype", _ROPE_PARAMS_3D)
    def test_fused_apply_rope_backward(BH, T, D, dtype):
        seed = 31415927
        torch.manual_seed(seed)
        device = torch.device("cuda", 0)

        _x = torch.randn(BH, T, D, device=device, dtype=dtype)
        cos, sin = _build_cos_sin(T, D, device)
        cos, sin = cos.to(dtype), sin.to(dtype)

        def _run(fn):
            x = _copy_with_grad(_x)
            _random_linear_loss(fn(x)).backward()
            return x.grad

        grad_fused = _run(lambda x: fused_apply_rope(x, cos, sin))
        references = {
            "eager": _run(lambda x: _rope_eager(x, cos, sin)),
            "litgpt.apply_rope": _run(lambda x: _rope_via_litgpt(x, cos, sin)),
        }

        print(f"rope bwd 3D: BH={BH}, T={T}, D={D}, dtype={dtype}")
        atol, rtol = _tolerances(dtype)
        for name, grad_ref in references.items():
>           torch.testing.assert_close(grad_fused, grad_ref, atol=atol, rtol=rtol)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 8448 / 8448 (100.0%)
E Greatest absolute difference: 6.446865081787109 at index (3, 8, 20) (up to 0.0001 allowed)
E Greatest relative difference: 17691.923828125 at index (1, 23, 15) (up to 0.0001 allowed)

test/fused/test_fused_ops.py:300: AssertionError
-------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------
rope bwd 3D: BH=4, T=33, D=64, dtype=torch.float32
```

I think the problem may be that you call `_run` more than once, and each call
uses a different linear combination, because `_random_linear_loss` is called each
time. I think you need to fix the random combination once and reuse it for every
run. Can you try to fix this?

Note that all other "backward" tests fail as well. This could be for the same
reason. Could you rework all of them?

### Claude Code answer

Good: no remaining `_random_linear_loss` calls (the grep exit code 1 means no matches were found), and all 210 tests still collect.

The fix was applied uniformly across all backward tests: instead of calling `torch.randn_like(y)` inside the loss function on each invocation (which produces a different random tensor every time), each test now generates one `coeff = torch.randn(output_shape, device=device)` before any forward/backward run, and passes it explicitly to `_linear_loss`. Every run in the loop then differentiates exactly the same scalar function, making the gradient comparison meaningful.
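
The idea can be sketched on CPU with two mathematically equivalent implementations standing in for the fused and eager operators. This is an illustrative sketch, not the code from the test module: `linear_loss`, `grad_wrt_input`, `impl_a`, and `impl_b` are hypothetical names.

```python
import torch


def linear_loss(y, coeff):
    """Scalar loss: fixed linear combination of the outputs."""
    return (y.float() * coeff).sum()


def grad_wrt_input(fn, x0, coeff):
    """Gradient of linear_loss(fn(x), coeff) with respect to x, at x = x0."""
    x = x0.detach().clone().requires_grad_(True)
    linear_loss(fn(x), coeff).backward()
    return x.grad


def impl_a(x):
    # RMS normalization written with rsqrt (stand-in for the fused op).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-5)


def impl_b(x):
    # The same expression written with sqrt and division (stand-in reference).
    return x / torch.sqrt((x * x).mean(-1, keepdim=True) + 1e-5)


torch.manual_seed(0)
x0 = torch.randn(4, 8)
coeff = torch.randn(4, 8)  # drawn once, shared by every run

grad_a = grad_wrt_input(impl_a, x0, coeff)
grad_b = grad_wrt_input(impl_b, x0, coeff)
torch.testing.assert_close(grad_a, grad_b, atol=1e-5, rtol=1e-5)

# With fresh coefficients the scalar function changes, and so does the gradient.
grad_other = grad_wrt_input(impl_b, x0, torch.randn(4, 8))
```

Here `grad_a` and `grad_b` agree because both runs differentiate the same scalar function, while `grad_other` differs from both even though `impl_b` is numerically equivalent to `impl_a`.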

## Comment

Now, `test_fused_rmsnorm_backward` fails. Here is the output for one config:
```text
____________________________________________ test_fused_rmsnorm_backward[batch_dims41-32-dtype41-True] ____________________________________________

batch_dims = (16,), D = 32, dtype = torch.bfloat16, add_unit_offset = True

    @_RunIf(min_cuda_gpus=1)
    @pytest.mark.parametrize("batch_dims, D, dtype, add_unit_offset", _RMSNORM_PARAMS)
    def test_fused_rmsnorm_backward(batch_dims, D, dtype, add_unit_offset):
        seed = 31415927
        torch.manual_seed(seed)
        device = torch.device("cuda", 0)
        eps = 1e-5

        shape = (*batch_dims, D)
        _x = torch.randn(shape, device=device, dtype=dtype)
        _w = torch.randn(D, device=device, dtype=dtype) * 0.1
        coeff = torch.randn(shape, device=device)

        # Collect gradients from fused op and each reference
        def _run(fn):
            x = _copy_with_grad(_x)
            w = _copy_with_grad(_w)
            _linear_loss(fn(x, w), coeff).backward()
            return {"x": x.grad, "w": w.grad}

        grads_fused = _run(lambda x, w: fused_rmsnorm(x, w, eps, add_unit_offset))
        references = {
            "eager": _run(lambda x, w: _rmsnorm_eager(x, w, eps, add_unit_offset)),
            "kv_model.RMSNorm": _run(
                lambda x, w: _rmsnorm_via_module(kv_model.RMSNorm, x, w, eps, add_unit_offset)
            ),
            "litgpt.RMSNorm": _run(
                lambda x, w: _rmsnorm_via_module(litgpt_model.RMSNorm, x, w, eps, add_unit_offset)
            ),
        }

        atol, rtol = _tolerances(dtype)
        for ref_name, grads_ref in references.items():
            for param_name in ("x", "w"):
                print(
                    f"rmsnorm bwd: shape={shape}, dtype={dtype},"
                    f" add_unit_offset={add_unit_offset}, ref={ref_name}, grad[{param_name}]"
                )
>               torch.testing.assert_close(
                    grads_fused[param_name],
                    grads_ref[param_name],
                    atol=atol,
                    rtol=rtol,
                )
E AssertionError: None mismatch: tensor([ 1.6016, 2.2656, 4.3438, -4.0938, 0.4160, -1.0156, -0.9570, 1.7266,
E 3.7969, 3.5469, 6.6250, -2.6875, -2.5156, -0.0199, 6.2812, -3.9375,
E 4.5312, 4.7188, 2.4531, -4.5000, -1.4844, 4.0000, -8.5000, 1.8125,
E -1.9375, -7.9688, -4.6250, 4.8750, 5.1562, 4.7188, -5.3125, 5.0312],
E device='cuda:0', dtype=torch.bfloat16) is not None

test/fused/test_fused_ops.py:175: AssertionError
-------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------
rmsnorm bwd: shape=(16, 32), dtype=torch.bfloat16, add_unit_offset=True, ref=eager, grad[x]
rmsnorm bwd: shape=(16, 32), dtype=torch.bfloat16, add_unit_offset=True, ref=eager, grad[w]
rmsnorm bwd: shape=(16, 32), dtype=torch.bfloat16, add_unit_offset=True, ref=kv_model.RMSNorm, grad[x]
rmsnorm bwd: shape=(16, 32), dtype=torch.bfloat16, add_unit_offset=True, ref=kv_model.RMSNorm, grad[w]
```

### Claude Code answer

The root cause: `_rmsnorm_via_module` was doing `m.weight.copy_(weight.float())`, which copies data into the module's own `nn.Parameter`. During backward, gradients accumulate in `m.weight.grad`, not in `weight.grad`, so `weight.grad` stayed `None`. The fix is to delete the parameter and set `m.weight = weight.float()` as a plain tensor. Since `float()` returns a new tensor (when the dtype isn't float32), autograd will propagate the gradient through that cast back to the original `weight` tensor.
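
The difference between the two approaches can be reproduced on CPU with a toy module. The sketch below is an illustration under assumptions, not the real `_rmsnorm_via_module`: `Scale`, `run_with_copy`, and `run_with_shared_weight` are hypothetical names, and the copy is assumed to happen under `torch.no_grad()`.

```python
import torch


class Scale(torch.nn.Module):
    """Toy module with a single weight parameter (stand-in for RMSNorm)."""

    def __init__(self, size):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(size))

    def forward(self, x):
        return x * self.weight


def run_with_copy(x, weight):
    # Copies values into the module's own parameter: the autograd graph
    # ends at m.weight, so nothing flows back to `weight`.
    m = Scale(weight.shape[0])
    with torch.no_grad():
        m.weight.copy_(weight.float())
    return m(x.float())


def run_with_shared_weight(x, weight):
    # Replaces the parameter by a plain tensor derived from `weight`:
    # the cast is part of the graph, so gradients reach `weight`.
    m = Scale(weight.shape[0])
    del m.weight
    m.weight = weight.float()
    return m(x.float())


torch.manual_seed(0)
x = torch.randn(4, 8)

w_copy = torch.randn(8, dtype=torch.bfloat16, requires_grad=True)
run_with_copy(x, w_copy).sum().backward()
grad_after_copy = w_copy.grad  # stays None

w_shared = w_copy.detach().clone().requires_grad_(True)
run_with_shared_weight(x, w_shared).sum().backward()
grad_shared = w_shared.grad  # bfloat16 tensor of shape (8,)
```

The `del m.weight` step matters: assigning a plain tensor to a name that is still registered as a parameter raises a `TypeError` in `nn.Module.__setattr__`.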
13 changes: 13 additions & 0 deletions test/fused/__init__.py
@@ -0,0 +1,13 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.